From ed5b54e87e7231cb25a81d5fba16daf54e7af4db Mon Sep 17 00:00:00 2001
From: Bastien Chanot
Date: Wed, 24 Jun 2026 14:22:14 +0200
Subject: [PATCH] chore(graphify): update skill to v0.8.45

Bump 0.8.13 -> 0.8.45. Extract the SKILL.md monolith (~530 lines) into
references/ for progressive disclosure: github-and-merge, transcribe,
extraction-spec, exports, update, query, add-watch, hooks. SKILL.md now
points to each reference and loads it only on the path that needs it.

Inline fixes carried by the new version: empty-extraction guard before
any write (#1392), shrink-guard ordering so GRAPH_REPORT/analysis never
describe a graph.json that was refused (#479), root= relativization for
build/manifest parity across clones (#1361/#1417), stale-cache cleanup
and code-only semantic pre-write (#1392), edge-direction preserving
merge (#801). Adds FalkorDB export (--falkordb/--falkordb-push) and
rewrites the frontmatter description (drops the obsolete trigger: field).

Co-Authored-By: Claude Opus 4.8 (1M context)
Claude-Session: https://claude.ai/code/session_0169vjUD1sP9Nx4ZiCa8wvAw
---
 skills/graphify/.graphify_version             |   2 +-
 skills/graphify/SKILL.md                      | 597 ++----------------
 skills/graphify/references/add-watch.md       |  56 ++
 skills/graphify/references/exports.md         |  87 +++
 skills/graphify/references/extraction-spec.md |  70 ++
 .../graphify/references/github-and-merge.md   |  46 ++
 skills/graphify/references/hooks.md           |  33 +
 skills/graphify/references/query.md           | 303 +++++++++
 skills/graphify/references/transcribe.md      |  52 ++
 skills/graphify/references/update.md          | 192 ++++++
 10 files changed, 906 insertions(+), 532 deletions(-)
 create mode 100644 skills/graphify/references/add-watch.md
 create mode 100644 skills/graphify/references/exports.md
 create mode 100644 skills/graphify/references/extraction-spec.md
 create mode 100644 skills/graphify/references/github-and-merge.md
 create mode 100644 skills/graphify/references/hooks.md
 create mode 100644 skills/graphify/references/query.md
 create mode 100644 skills/graphify/references/transcribe.md
 create mode 100644 skills/graphify/references/update.md

diff --git a/skills/graphify/.graphify_version b/skills/graphify/.graphify_version
index f806745..827dae8 100644
--- a/skills/graphify/.graphify_version
+++ b/skills/graphify/.graphify_version
@@ -1 +1 @@
-0.8.13
\ No newline at end of file
+0.8.45
\ No newline at end of file
diff --git a/skills/graphify/SKILL.md b/skills/graphify/SKILL.md
index c3e39b3..6c7060a 100644
--- a/skills/graphify/SKILL.md
+++ b/skills/graphify/SKILL.md
@@ -1,7 +1,6 @@
 ---
 name: graphify
-description: "any input (code, docs, papers, images, videos) to knowledge graph. Use when user asks any question about a codebase, documents, or project content - especially if graphify-out/ exists, treat the question as a /graphify query."
-trigger: /graphify
+description: "Use for any question about a codebase, its architecture, file relationships, or project content — especially when graphify-out/ exists, where the question should be treated as a graphify query first. Turns any input (code, docs, papers, images, videos) into a persistent knowledge graph with god nodes, community detection, and query/path/explain tools."
 ---

 # /graphify
@@ -27,6 +26,8 @@ Turn any folder of files into a navigable knowledge graph with community detecti
 /graphify --graphml # export graph.graphml (Gephi, yEd)
 /graphify --neo4j # generate graphify-out/cypher.txt for Neo4j
 /graphify --neo4j-push bolt://localhost:7687 # push directly to Neo4j
+/graphify --falkordb # generate graphify-out/cypher.txt for FalkorDB
+/graphify --falkordb-push falkordb://localhost:6379 # push directly to FalkorDB
 /graphify --mcp # start MCP stdio server for agent access
 /graphify --watch # watch folder, auto-rebuild on code changes (no LLM needed)
 /graphify --wiki # build agent-crawlable wiki (index.md + one article per community)
@@ -57,48 +58,9 @@ If the path argument starts with `https://github.com/` or `http://github.com/`,

 Follow these steps in order. Do not skip steps.

-### Step 0 - Clone GitHub repo(s) (only if a GitHub URL was given)
+### Step 0 - GitHub repos and multi-path merge (only if a URL or several paths)

-**Single repo:**
-```bash
-LOCAL_PATH=$(graphify clone [--branch ])
-# Use LOCAL_PATH as the target for all subsequent steps
-```
-
-**Multiple repos (cross-repo graph):**
-```bash
-# Clone each repo, run the full pipeline on each, then merge
-graphify clone # → ~/.graphify/repos//
-graphify clone # → ~/.graphify/repos//
-# Run /graphify on each local path to produce their graph.json files
-# Then merge:
-graphify merge-graphs \
-  ~/.graphify/repos///graphify-out/graph.json \
-  ~/.graphify/repos///graphify-out/graph.json \
-  --out graphify-out/cross-repo-graph.json
-```
-
-Graphify clones into `~/.graphify/repos//` and reuses existing clones on repeat runs. Each node in the merged graph carries a `repo` attribute so you can filter by origin.
-
-**Multiple local subfolders (monorepo or multi-service layout):**
-
-The skill pipeline writes all intermediate and final outputs to `graphify-out/` in the current working directory. Running the skill on each subfolder separately will clobber the same output dir. Instead, use the CLI directly for each subfolder — it places `graphify-out/` *inside* the scanned path:
-
-```bash
-graphify extract ./core/ # → ./core/graphify-out/graph.json
-graphify extract ./service/ # → ./service/graphify-out/graph.json
-graphify extract ./platform/ # → ./platform/graphify-out/graph.json
-# Add --backend gemini|kimi|openai|deepseek|claude-cli depending on which API key you have set
-
-# Then merge at the project root:
-graphify merge-graphs \
-  ./core/graphify-out/graph.json \
-  ./service/graphify-out/graph.json \
-  ./platform/graphify-out/graph.json \
-  --out graphify-out/graph.json
-```
-
-Once `graphify-out/graph.json` exists, the fast path above takes over: any codebase question runs `graphify query` directly on the merged graph — no re-extraction, no size gate.
+Only when the path is one or more `https://github.com/...` URLs, or several local subfolders to merge. See `references/github-and-merge.md` for the clone, cross-repo merge, and monorepo flow, then continue with the resolved local path. A plain local path skips this step.

 ### Step 1 - Ensure graphify is installed

@@ -179,50 +141,9 @@ Then act on it:
 - Otherwise rank by count, show the top 5 with file counts, then ask which subfolder to run on. Wait for the user's answer before proceeding.
 - Otherwise: proceed directly to Step 2.5 if video files were detected, or Step 3 if not.

-### Step 2.5 - Transcribe video / audio files (only if video files detected)
+### Step 2.5 - Video and audio (only if video files detected)

-Skip this step entirely if `detect` returned zero `video` files.
-
-Video and audio files cannot be read directly. Transcribe them to text first, then treat the transcripts as doc files in Step 3.
-
-**Strategy:** Read the god nodes from `graphify-out/.graphify_detect.json` (or the analysis file if it exists from a previous run). You are already a language model — write a one-sentence domain hint yourself from those labels. Then pass it to Whisper as the initial prompt. No separate API call needed.
-
-**However**, if the corpus has *only* video files and no other docs/code, use the generic fallback prompt: `"Use proper punctuation and paragraph breaks."`
-
-**Step 1 - Write the Whisper prompt yourself.**
-
-Read the top god node labels from detect output or analysis, then compose a short domain hint sentence, for example:
-
-- Labels: `transformer, attention, encoder, decoder` → `"Machine learning research on transformer architectures and attention mechanisms. Use proper punctuation and paragraph breaks."`
-- Labels: `kubernetes, deployment, pod, helm` → `"DevOps discussion about Kubernetes deployments and Helm charts. Use proper punctuation and paragraph breaks."`
-
-Set it as `WHISPER_PROMPT` to use in the next command.
-
-**Step 2 - Transcribe:**
-
-```bash
-GRAPHIFY_WHISPER_MODEL=base # or whatever --whisper-model the user passed
-$(cat graphify-out/.graphify_python) -c "
-import json, os
-from pathlib import Path
-from graphify.transcribe import transcribe_all
-
-detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
-video_files = detect.get('files', {}).get('video', [])
-prompt = os.environ.get('GRAPHIFY_WHISPER_PROMPT', 'Use proper punctuation and paragraph breaks.')
-
-transcript_paths = transcribe_all(video_files, initial_prompt=prompt)
-print(json.dumps(transcript_paths, ensure_ascii=False))
-" > graphify-out/.graphify_transcripts.json
-```
-
-After transcription:
-- Read the transcript paths from `graphify-out/.graphify_transcripts.json`
-- Add them to the docs list before dispatching semantic subagents in Step 3B
-- Print how many transcripts were created: `Transcribed N video file(s) -> treating as docs`
-- If transcription fails for a file, print a warning and continue with the rest
-
-**Whisper model:** Default is `base`. If the user passed `--whisper-model `, set `GRAPHIFY_WHISPER_MODEL=` in the environment before running the command above.
+Skip this step entirely if `detect` returned zero `video` files. When the corpus has video or audio, see `references/transcribe.md` to transcribe them to text first, then treat the transcripts as doc files in Step 3.

 ### Step 3 - Extract entities and relationships

@@ -269,7 +190,15 @@ else:

 #### Part B - Semantic extraction (parallel subagents)

-**Fast path:** If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do.
+**Fast path:** If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do. **First write an empty semantic file** so Part C's merge has its input (it reads `.graphify_semantic.json` unconditionally; without this a code-only run hits `FileNotFoundError`):
+
+```bash
+$(cat graphify-out/.graphify_python) -c "
+import json
+from pathlib import Path
+Path('graphify-out/.graphify_semantic.json').write_text(json.dumps({'nodes':[],'edges':[],'hyperedges':[],'input_tokens':0,'output_tokens':0}), encoding='utf-8')
+"
+```

 **MANDATORY: You MUST use the Agent tool here. Reading files yourself one-by-one is forbidden - it is 5-10x slower. If you do not use the Agent tool you are doing this wrong.**

@@ -290,12 +219,19 @@ from graphify.cache import check_semantic_cache
 from pathlib import Path

 detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
-all_files = [f for files in detect['files'].values() for f in files]
+# Only content files go to semantic extraction. Code is already covered structurally
+# by the AST pass (Part A); flattening every category here makes subagents re-read
+# every source file (#1392). Video is transcribed to a document in Step 2.5 first.
+all_files = [f for cat in ('document', 'paper', 'image') for f in detect['files'].get(cat, [])]

 cached_nodes, cached_edges, cached_hyperedges, uncached = check_semantic_cache(all_files)

+# Always (re)write the cache file: write hits, else DELETE any leftover from a prior
+# run so Part C never merges a stale .graphify_cached.json (#1392).
 if cached_nodes or cached_edges or cached_hyperedges:
     Path('graphify-out/.graphify_cached.json').write_text(json.dumps({'nodes': cached_nodes, 'edges': cached_edges, 'hyperedges': cached_hyperedges}, ensure_ascii=False), encoding=\"utf-8\")
+else:
+    Path('graphify-out/.graphify_cached.json').unlink(missing_ok=True)
 Path('graphify-out/.graphify_uncached.txt').write_text('\n'.join(uncached), encoding=\"utf-8\")
 print(f'Cache: {len(all_files)-len(uncached)} files hit, {len(uncached)} files need extraction')
 "
@@ -325,76 +261,13 @@ Each subagent receives this exact prompt (substitute FILE_LIST, CHUNK_NUM, TOTAL
 CHUNK_PATH must be an **absolute** path — derive it before dispatching:

 ```bash
-PROJECT_ROOT=$(cat graphify-out/.graphify_root)
+PROJECT_ROOT=$(pwd) # cwd — where Part C globs graphify-out/ (NOT .graphify_root/scan dir, #1392)
 # Then for chunk N:
 CHUNK_PATH="${PROJECT_ROOT}/graphify-out/.graphify_chunk_0N.json"
 ```

 Subagent prompt template:
-```
-You are a graphify extraction subagent. Read the files listed and extract a knowledge graph fragment.
-Output ONLY valid JSON matching the schema below - no explanation, no markdown fences, no preamble.
-
-Files (chunk CHUNK_NUM of TOTAL_CHUNKS):
-FILE_LIST
-
-Rules:
-- EXTRACTED: relationship explicit in source (import, call, citation, "see §3.2")
-- INFERRED: reasonable inference (shared data structure, implied dependency)
-- AMBIGUOUS: uncertain - flag for review, do not omit
-
-Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns).
-  Do not re-extract imports - AST already has those.
-Doc/paper files: extract named concepts, entities, citations. For rationale (WHY decisions were made, trade-offs, design intent): store as a `rationale` attribute on the relevant concept node — do NOT create a separate rationale node or fragment node. Only create a node for something that is itself a named entity or concept. Use `file_type:"rationale"` for concept-like nodes (ideas, principles, mechanisms, design patterns). `file_type` MUST be one of exactly these six values: `code`, `document`, `paper`, `image`, `rationale`, `concept`. Any other value is invalid and will be rejected.
-Code files: when adding `calls` edges, source MUST be the caller (the function/class doing the calling), target MUST be the callee. Never reverse this direction.
-Image files: use vision to understand what the image IS - do not just OCR.
-  UI screenshot: layout patterns, design decisions, key elements, purpose.
-  Chart: metric, trend/insight, data source.
-  Tweet/post: claim as node, author, concepts mentioned.
-  Diagram: components and connections.
-  Research figure: what it demonstrates, method, result.
-  Handwritten/whiteboard: ideas and arrows, mark uncertain readings AMBIGUOUS.
-
-DEEP_MODE (if --mode deep was given): be aggressive with INFERRED edges - indirect deps,
-  shared assumptions, latent couplings. Mark uncertain ones AMBIGUOUS instead of omitting.
-
-Semantic similarity: if two concepts in this chunk solve the same problem or represent the same idea without any structural link (no import, no call, no citation), add a `semantically_similar_to` edge marked INFERRED with a confidence_score reflecting how similar they are (0.6-0.95). Examples:
-- Two functions that both validate user input but never call each other
-- A class in code and a concept in a paper that describe the same algorithm
-- Two error types that handle the same failure mode differently
-Only add these when the similarity is genuinely non-obvious and cross-cutting. Do not add them for trivially similar things.
-
-Hyperedges: if 3 or more nodes clearly participate together in a shared concept, flow, or pattern that is not captured by pairwise edges alone, add a hyperedge to a top-level `hyperedges` array. Examples:
-- All classes that implement a common protocol or interface
-- All functions in an authentication flow (even if they don't all call each other)
-- All concepts from a paper section that form one coherent idea
-Use sparingly — only when the group relationship adds information beyond the pairwise edges. Maximum 3 hyperedges per chunk.
-
-If a file has YAML frontmatter (--- ... ---), copy source_url, captured_at, author,
-  contributor onto every node from that file.
-
-confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
-- EXTRACTED edges: confidence_score = 1.0 always
-- INFERRED edges: pick exactly ONE value from this set — never 0.5:
-  0.95 direct structural evidence (shared data structure, named cross-file reference).
-  0.85 strong inference (clear functional alignment, no direct symbol link).
-  0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
-  0.65 weak inference (thematically related, no shape evidence).
-  0.55 speculative but plausible (surface-level co-occurrence only).
-  Models follow discrete rubrics better than continuous ranges; the bimodal
-  distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
-  range guidance is being collapsed to a binary. If no value above fits, mark
-  the edge AMBIGUOUS rather than picking 0.4 or below.
-- AMBIGUOUS edges: 0.1-0.3
-
-Node ID format: lowercase, only `[a-z0-9_]`, no dots or slashes. Format: `{stem}_{entity}` where stem is `{parent_dir}_{filename_without_ext}` (the **immediate** parent directory name + the filename stem, both lowercased with non-alphanumeric chars replaced by `_`) and entity is the symbol name similarly normalized. Only one level of parent is used — not the full path. Examples: `src/auth/session.py` + `ValidateToken` → `auth_session_validatetoken`; `lib/utils/helpers.py` + `parse_url` → `utils_helpers_parse_url`; `tests/test_foo.py` + `_helper` → `tests_test_foo_helper`. Top-level files (no parent dir, e.g. `setup.py`) use just the filename stem: `setup_my_func`. This must match the ID the AST extractor generates — using just the filename (e.g., `session_validatetoken`) or the full path (e.g., `src_auth_session_validatetoken`) will create orphan ghost-duplicate nodes. If you are re-extracting a project that had ghost duplicates under the old format, the user should run `graphify extract --force` to rebuild cleanly. CRITICAL: never append chunk numbers, sequence numbers, or any suffix to an ID (no `_c1`, `_c2`, `_chunk2`, etc.). IDs must be deterministic from the label alone — the same entity must always produce the same ID regardless of which chunk processes it.
-
-Generate the extraction JSON matching this schema exactly:
-{"nodes":[{"id":"session_validatetoken","label":"Human Readable Name","file_type":"code|document|paper|image|rationale|concept","source_file":"relative/path","source_location":null,"source_url":null,"captured_at":null,"author":null,"contributor":null}],"edges":[{"source":"node_id","target":"node_id","relation":"calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for","confidence":"EXTRACTED|INFERRED|AMBIGUOUS","confidence_score":1.0,"source_file":"relative/path","source_location":null,"weight":1.0}],"hyperedges":[{"id":"snake_case_id","label":"Human Readable Label","nodes":["node_id1","node_id2","node_id3"],"relation":"participate_in|implement|form","confidence":"EXTRACTED|INFERRED","confidence_score":0.75,"source_file":"relative/path"}],"input_tokens":0,"output_tokens":0}
-
-Then write the JSON to disk using the Write tool at this exact absolute path (no relative paths — Write resolves relative paths against an undefined cwd and the file will be silently lost):
-CHUNK_PATH
-```
+See `references/extraction-spec.md` for the exact subagent prompt (JSON schema, node-ID rules, confidence rubric, frontmatter, hyperedge, and vision rules). Load it only here, only when at least one chunk holds a doc, paper, or image; a pure-code corpus has skipped Part B and never reads it. Pass each subagent that prompt verbatim with FILE_LIST, CHUNK_NUM, TOTAL_CHUNKS, DEEP_MODE, and CHUNK_PATH substituted, and have it write the result to CHUNK_PATH.

 **Step B3 - Collect, cache, and merge**

@@ -511,7 +384,7 @@ print(f'Merged: {total} nodes, {edges} edges ({len(ast[\"nodes\"])} AST + {len(s

 ### Step 4 - Build graph, cluster, analyze, generate outputs

-**Before starting:** note whether `--directed` was given. If so, pass `directed=True` to `build_from_json()` in the code block below. This builds a `DiGraph` that preserves edge direction (source→target) instead of the default undirected `Graph`.
+**Before starting:** the code blocks below pass `directed=IS_DIRECTED` to `build_from_json()`. Replace `IS_DIRECTED` with `True` if `--directed` was given (builds a `DiGraph` preserving edge direction source→target), otherwise `False` (the default undirected `Graph`). Substitute it the same way you substitute `INPUT_PATH` — do not leave the literal `IS_DIRECTED` in the code.

 ```bash
 mkdir -p graphify-out
@@ -527,7 +400,15 @@ from pathlib import Path
 extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
 detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))

-G = build_from_json(extraction)
+# root= mirrors the --update runbook (#1361): relativize source_file to the same
+# base so the full build and incremental --update never drift apart on re-extract.
+G = build_from_json(extraction, root='INPUT_PATH', directed=IS_DIRECTED)
+# Guard BEFORE any write: an empty extraction must not clobber a good graph.json /
+# GRAPH_REPORT.md / analysis sidecar. Check immediately after build (#1392).
+if G.number_of_nodes() == 0:
+    print('ERROR: Graph is empty - extraction produced no nodes.')
+    print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
+    raise SystemExit(1)
 communities = cluster(G)
 cohesion = score_all(G, communities)
 tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
@@ -537,10 +418,17 @@ labels = {cid: 'Community ' + str(cid) for cid in communities}
 # Placeholder questions - regenerated with real labels in Step 5
 questions = suggest_questions(G, communities, labels)

+# Export FIRST and honor the #479 shrink-guard: to_json returns False (writing
+# nothing) when the new graph is smaller than the existing graph.json. Only write
+# GRAPH_REPORT.md + the analysis sidecar when the graph was actually written, so
+# they never describe a graph that graph.json doesn't contain (#1392).
+wrote = to_json(G, communities, 'graphify-out/graph.json')
+if not wrote:
+    print('ERROR: refused to shrink graphify-out/graph.json (existing graph has more nodes; #479).')
+    print('If this shrink is intentional (you deleted files), re-run a full build with --force.')
+    raise SystemExit(1)
 report = generate(G, communities, cohesion, labels, gods, surprises, detection, tokens, 'INPUT_PATH', suggested_questions=questions)
 Path('graphify-out/GRAPH_REPORT.md').write_text(report, encoding=\"utf-8\")
-to_json(G, communities, 'graphify-out/graph.json')
-
 analysis = {
     'communities': {str(k): v for k, v in communities.items()},
     'cohesion': {str(k): v for k, v in cohesion.items()},
@@ -549,10 +437,6 @@ analysis = {
     'questions': questions,
 }
 Path('graphify-out/.graphify_analysis.json').write_text(json.dumps(analysis, indent=2, ensure_ascii=False), encoding=\"utf-8\")
-if G.number_of_nodes() == 0:
-    print('ERROR: Graph is empty - extraction produced no nodes.')
-    print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
-    raise SystemExit(1)
 print(f'Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges, {len(communities)} communities')
 "
 ```
@@ -580,7 +464,8 @@ extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(en
 detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
 analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text(encoding=\"utf-8\"))

-G = build_from_json(extraction)
+# root= as in Step 4 / the --update runbook (#1361) — same base for node-key parity.
+G = build_from_json(extraction, root='INPUT_PATH', directed=IS_DIRECTED)
 communities = {int(k): v for k, v in analysis['communities'].items()}
 cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
 tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
@@ -621,73 +506,9 @@ graphify export html # auto-aggregates to community view if graph > 5000 nodes
 # or: graphify export html --no-viz
 ```

-### Step 6b - Wiki (only if --wiki flag)
+### Steps 6b-8 - Wiki, Neo4j, FalkorDB, SVG, GraphML, MCP, benchmark (only on their flags)

-**Only run this step if `--wiki` was explicitly given in the original command.**
-
-Run this before Step 9 (cleanup) so `.graphify_labels.json` is still available.
-
-```bash
-graphify export wiki
-```
-
-### Step 7 - Neo4j export (only if --neo4j or --neo4j-push flag)
-
-**If `--neo4j`** - generate a Cypher file for manual import:
-
-```bash
-graphify export neo4j
-```
-
-**If `--neo4j-push `** - push directly to a running Neo4j instance. Ask the user for credentials if not provided:
-
-```bash
-graphify export neo4j --push bolt://localhost:7687 --user neo4j --password PASSWORD
-```
-
-Default URI is `bolt://localhost:7687`, default user is `neo4j`. Uses MERGE - safe to re-run without creating duplicates.
-
-### Step 7b - SVG export (only if --svg flag)
-
-```bash
-graphify export svg
-```
-
-### Step 7c - GraphML export (only if --graphml flag)
-
-```bash
-graphify export graphml
-```
-
-### Step 7d - MCP server (only if --mcp flag)
-
-```bash
-python3 -m graphify.serve graphify-out/graph.json
-```
-
-This starts a stdio MCP server that exposes tools: `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live.
-
-To configure in Claude Desktop, add to `claude_desktop_config.json`:
-```json
-{
-  "mcpServers": {
-    "graphify": {
-      "command": "python3",
-      "args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"]
-    }
-  }
-}
-```
-
-### Step 8 - Token reduction benchmark (only if total_words > 5000)
-
-If `total_words` from `graphify-out/.graphify_detect.json` is greater than 5,000, run:
-
-```bash
-graphify benchmark
-```
-
-Print the output directly in chat. If `total_words <= 5000`, skip silently - the graph value is structural clarity, not token compression, for small corpora.
+These run only when their flag is present (`--wiki`, `--neo4j`/`--neo4j-push`, `--falkordb`/`--falkordb-push`, `--svg`, `--graphml`, `--mcp`) or, for the token-reduction benchmark, when `total_words` exceeds 5,000. A default run with no export flags skips all of them. See `references/exports.md` for each one. Run any `--wiki` export before Step 9 cleanup so `.graphify_labels.json` is still available.

 ---

@@ -704,7 +525,10 @@ from graphify.detect import save_manifest
 detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
 # In --update mode, 'all_files' carries the full corpus; 'files' is the changed
 # subset. Full-rebuild mode populates only 'files', so the fallback handles that.
-save_manifest(detect.get('all_files') or detect['files'])
+# root= relativizes the manifest keys to the scan root (same base as the build),
+# so the on-disk manifest is portable across clones/machines and a later --update
+# matches cached files instead of missing every one (#1417).
+save_manifest(detect.get('all_files') or detect['files'], root='INPUT_PATH')

 # Update cumulative cost tracker
 extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
@@ -730,10 +554,13 @@ cost_path.write_text(json.dumps(cost, indent=2, ensure_ascii=False), encoding=\"
 print(f'This run: {input_tok:,} input tokens, {output_tok:,} output tokens')
 print(f'All time: {cost[\"total_input_tokens\"]:,} input, {cost[\"total_output_tokens\"]:,} output ({len(cost[\"runs\"])} runs)')
 "
-rm -f graphify-out/.graphify_detect.json graphify-out/.graphify_extract.json graphify-out/.graphify_ast.json graphify-out/.graphify_semantic.json graphify-out/.graphify_analysis.json graphify-out/.graphify_chunk_*.json
+rm -f graphify-out/.graphify_detect.json graphify-out/.graphify_extract.json graphify-out/.graphify_ast.json graphify-out/.graphify_semantic.json graphify-out/.graphify_analysis.json
+find graphify-out -maxdepth 1 -name '.graphify_chunk_*.json' -delete 2>/dev/null
 rm -f graphify-out/.needs_update 2>/dev/null || true
 ```

+Replace INPUT_PATH with the actual path (same value used in Steps 4-5) so the manifest is relativized to the scan root.
+
 Tell the user (omit the obsidian line unless --obsidian was given):
 ```
 Graph complete. Outputs in PATH_TO_DIR/graphify-out/
@@ -783,325 +610,33 @@ if [ ! -f graphify-out/.graphify_python ]; then
 fi
 ```

-## For --update (incremental re-extraction)
+## For --update and --cluster-only

-Use when you've added or modified files since the last run. Only re-extracts changed files - saves tokens and time.
-
-```bash
-$(cat graphify-out/.graphify_python) -c "
-import sys, json
-from graphify.detect import detect_incremental, save_manifest
-from pathlib import Path
-
-result = detect_incremental(Path('INPUT_PATH'))
-new_total = result.get('new_total', 0)
-print(json.dumps(result, indent=2, ensure_ascii=False))
-Path('graphify-out/.graphify_incremental.json').write_text(json.dumps(result, ensure_ascii=False), encoding=\"utf-8\")
-deleted = list(result.get('deleted_files', []))
-if new_total == 0 and not deleted:
-    print('No files changed since last run. Nothing to update.')
-    raise SystemExit(0)
-if deleted:
-    print(f'{len(deleted)} deleted file(s) to prune.')
-if new_total > 0:
-    print(f'{new_total} new/changed file(s) to re-extract.')
-"
-```
-
-Then populate `.graphify_detect.json` so Steps 3A–6 (which read it unconditionally) see the right state for an incremental run. `files` carries the changed subset (drives Step 3A AST + Step 3B0 cache check on only what changed); `all_files` carries the full corpus for any step that needs corpus-wide context:
-
-```bash
-$(cat graphify-out/.graphify_python) -c "
-import json
-from pathlib import Path
-r = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\"))
-Path('graphify-out/.graphify_detect.json').write_text(json.dumps({
-    'files': r.get('new_files', {}),
-    'all_files': r.get('files', {}),
-    'total_files': r.get('new_total', 0),
-    'total_words': r.get('total_words', 0),
-    'skipped_sensitive': r.get('skipped_sensitive', []),
-    'needs_graph': True,
-}, ensure_ascii=False), encoding=\"utf-8\")
-"
-```
-
-If new files exist, first check whether all changed files are code files:
-
-```bash
-$(cat graphify-out/.graphify_python) -c "
-import json
-from pathlib import Path
-
-result = json.loads(open('graphify-out/.graphify_incremental.json', encoding='utf-8').read()) if Path('graphify-out/.graphify_incremental.json').exists() else {}
-code_exts = {'.py','.ts','.js','.go','.rs','.java','.cpp','.c','.rb','.swift','.kt','.cs','.scala','.php','.cc','.cxx','.hpp','.h','.kts','.lua','.toc','.f','.F','.f90','.F90','.f95','.F95','.f03','.F03','.f08','.F08'}
-new_files = result.get('new_files', {})
-all_changed = [f for files in new_files.values() for f in files]
-code_only = all(Path(f).suffix.lower() in code_exts for f in all_changed)
-print('code_only:', code_only)
-"
-```
-
-If `code_only` is True: print `[graphify update] Code-only changes detected - skipping semantic extraction (no LLM needed)`, run only Step 3A (AST) on the changed files, skip Step 3B entirely (no subagents), then go straight to merge and Steps 4–8.
-
-If `code_only` is False (any changed file is a doc/paper/image): run the full Steps 3A–3C pipeline as normal.
-
-
-If no new files exist (only deletions), create an empty extraction so the merge step can prune:
-
-```bash
-if [ ! -f graphify-out/.graphify_extract.json ]; then
-    echo '[graphify update] Only deletions -- creating empty extraction for merge.'
-    $(cat graphify-out/.graphify_python) -c "
-import json
-from pathlib import Path
-Path('graphify-out/.graphify_extract.json').write_text(json.dumps({'nodes':[],'edges':[],'hyperedges':[],'input_tokens':0,'output_tokens':0}), encoding='utf-8')
-"
-fi
-```
-
-
-Then:
-
-```bash
-$(cat graphify-out/.graphify_python) -c "
-import json
-from pathlib import Path
-from graphify.build import build_merge
-from graphify.detect import save_manifest
-
-# Load new extraction and incremental state
-new_extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
-incremental = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\"))
-deleted = list(incremental.get('deleted_files', []))
-
-# Use build_merge() — reads graph.json directly without NetworkX round-trip
-# so edge direction (calls, implements, imports) is always preserved (#801).
-G = build_merge( - [new_extraction], - graph_path='graphify-out/graph.json', - prune_sources=deleted or None, -) -print(f'[graphify update] Merged: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges') - -# Write merged result back to .graphify_extract.json so Step 4 sees the full graph -merged_out = { - 'nodes': [{'id': n, **d} for n, d in G.nodes(data=True)], - 'edges': [ - # Explicit source/target last so they win over any stale attrs in d. - {**{k: val for k, val in d.items() if k not in ('_src', '_tgt', 'source', 'target')}, - 'source': d.get('_src', u), 'target': d.get('_tgt', v)} - for u, v, d in G.edges(data=True) - ], - # G.graph["hyperedges"] holds hyperedges from both existing graph.json - # and new_extraction (build_merge combines them). Falling back to - # new_extraction only would silently drop prior-run hyperedges (#801). - 'hyperedges': list(G.graph.get('hyperedges', [])), - 'input_tokens': new_extraction.get('input_tokens', 0), - 'output_tokens': new_extraction.get('output_tokens', 0), -} -Path('graphify-out/.graphify_extract.json').write_text(json.dumps(merged_out, ensure_ascii=False), encoding=\"utf-8\") -print(f'[graphify update] Merged extraction written ({len(merged_out[\"nodes\"])} nodes, {len(merged_out[\"edges\"])} edges)') - -# Save manifest so next --update diffs against today's state, not the -# prior run's baseline (prevents ghost-node reports on subsequent updates). -save_manifest(incremental['files']) -print('[graphify update] Manifest saved.') -" -``` - -Then run Steps 4–8 on the merged graph as normal. 
- -After Step 4, show the graph diff: - -```bash -$(cat graphify-out/.graphify_python) -c " -import json -from graphify.analyze import graph_diff -from graphify.build import build_from_json -from networkx.readwrite import json_graph -import networkx as nx -from pathlib import Path - -# Load old graph (before update) from backup written before merge -old_data = json.loads(Path('graphify-out/.graphify_old.json').read_text(encoding=\"utf-8\")) if Path('graphify-out/.graphify_old.json').exists() else None -new_extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\")) -G_new = build_from_json(new_extract) - -if old_data: - G_old = json_graph.node_link_graph(old_data, edges='links') - diff = graph_diff(G_old, G_new) - print(diff['summary']) - if diff['new_nodes']: - print('New nodes:', ', '.join(n['label'] for n in diff['new_nodes'][:5])) - if diff['new_edges']: - print('New edges:', len(diff['new_edges'])) -" -``` - -Before the merge step, save the old graph: `cp graphify-out/graph.json graphify-out/.graphify_old.json` -Clean up after: `rm -f graphify-out/.graphify_old.json` - ---- - -## For --cluster-only - -Skip Steps 1–3. Re-run clustering on the existing graph: - -```bash -graphify cluster-only . -``` - -Then run Steps 5–9 as normal (label communities, generate viz, benchmark, clean up, report). +Both are non-default subcommands. `--update` re-extracts only new or changed files; `--cluster-only` reruns clustering on the existing graph. See `references/update.md` for both flows. --- ## For /graphify query -Two traversal modes - choose based on the question: - -| Mode | Flag | Best for | -|------|------|----------| -| BFS (default) | _(none)_ | "What is X connected to?" - broad context, nearest neighbors first | -| DFS | `--dfs` | "How does X reach Y?" 
- trace a specific chain or dependency path | +When `graphify-out/graph.json` already exists and the user asks a question about the corpus, answer from the graph rather than rebuilding it: ```bash -graphify query "QUESTION" -# or: graphify query "QUESTION" --dfs --budget 3000 +graphify query "" ``` -Replace `QUESTION` with the user's actual question. Answer using **only** what the graph output contains. Quote `source_location` when citing a specific fact. If the graph lacks enough information, say so - do not hallucinate edges. - -After writing the answer, save it back into the graph so it improves future queries: - -```bash -$(cat graphify-out/.graphify_python) -m graphify save-result --question "QUESTION" --answer "ANSWER" --type query --nodes NODE1 NODE2 -``` - -Replace `QUESTION` with the question, `ANSWER` with your full answer text, `SOURCE_NODES` with the list of node labels you cited. This closes the feedback loop: the next `--update` will extract this Q&A as a node in the graph. +Before traversal, expand the question against the graph's own vocabulary so a wording mismatch does not collapse the answer to noise. If the `graphify query` CLI is unavailable, fall back to an inline NetworkX traversal of `graphify-out/graph.json`. Answer using only what the graph output contains, and quote `source_location` when citing a specific fact. For that vocab-expansion step, the BFS/DFS traversal modes, the `--budget` cap, the NetworkX fallback, `save-result` feedback, and the `/graphify path` and `/graphify explain` flows, see `references/query.md`. --- -## For /graphify path +## For /graphify add and --watch -Find the shortest path between two named concepts in the graph. - -```bash -graphify path "NODE_A" "NODE_B" -``` - -Replace `NODE_A` and `NODE_B` with the actual concept names. Then explain the path in plain language - what each hop means, why it's significant. 
- -After writing the explanation, save it back: - -```bash -$(cat graphify-out/.graphify_python) -m graphify save-result --question "Path from NODE_A to NODE_B" --answer "ANSWER" --type path_query --nodes NODE_A NODE_B -``` +Neither is part of the default build. When the user runs `/graphify add ` to fetch a URL into the corpus, or passes `--watch` to auto-rebuild on file changes, see `references/add-watch.md`. --- -## For /graphify explain +## For the commit hook and native CLAUDE.md integration -Give a plain-language explanation of a single node - everything connected to it. - -```bash -graphify explain "NODE_NAME" -``` - -Replace `NODE_NAME` with the concept the user asked about. Then write a 3-5 sentence explanation of what this node is, what it connects to, and why those connections are significant. Use the source locations as citations. - -After writing the explanation, save it back: - -```bash -$(cat graphify-out/.graphify_python) -m graphify save-result --question "Explain NODE_NAME" --answer "ANSWER" --type explain --nodes NODE_NAME -``` - ---- - -## For /graphify add - -Fetch a URL and add it to the corpus, then update the graph. - -```bash -$(cat graphify-out/.graphify_python) -c " -import sys -from graphify.ingest import ingest -from pathlib import Path - -try: - out = ingest('URL', Path('./raw'), author='AUTHOR', contributor='CONTRIBUTOR') - print(f'Saved to {out}') -except ValueError as e: - print(f'error: {e}', file=sys.stderr) - sys.exit(1) -except RuntimeError as e: - print(f'error: {e}', file=sys.stderr) - sys.exit(1) -" -``` - -Replace `URL` with the actual URL, `AUTHOR` with the user's name if provided, `CONTRIBUTOR` likewise. If the command exits with an error, tell the user what went wrong - do not silently continue. After a successful save, automatically run the `--update` pipeline on `./raw` to merge the new file into the existing graph. 
- -Supported URL types (auto-detected): -- YouTube / any video URL → audio downloaded via yt-dlp, transcribed to `.txt` on next run (requires `pip install 'graphifyy[video]'`) -- Twitter/X → fetched via oEmbed, saved as `.md` with tweet text and author -- arXiv → abstract + metadata saved as `.md` -- PDF → downloaded as `.pdf` -- Images (.png/.jpg/.webp) → downloaded, Claude vision extracts on next run -- Any webpage → converted to markdown via html2text - ---- - -## For --watch - -Start a background watcher that monitors a folder and auto-updates the graph when files change. - -```bash -python3 -m graphify.watch INPUT_PATH --debounce 3 -``` - -Replace INPUT_PATH with the folder to watch. Behavior depends on what changed: - -- **Code files only (.py, .ts, .go, etc.):** re-runs AST extraction + rebuild + cluster immediately, no LLM needed. `graph.json` and `GRAPH_REPORT.md` are updated automatically. -- **Docs, papers, or images:** writes a `graphify-out/needs_update` flag and prints a notification to run `/graphify --update` (LLM semantic re-extraction required). - -Debounce (default 3s): waits until file activity stops before triggering, so a wave of parallel agent writes doesn't trigger a rebuild per file. - -Press Ctrl+C to stop. - -For agentic workflows: run `--watch` in a background terminal. Code changes from agent waves are picked up automatically between waves. If agents are also writing docs or notes, you'll need a manual `/graphify --update` after those waves. - ---- - -## For git commit hook - -Install a post-commit hook that auto-rebuilds the graph after every commit. No background process needed - triggers once per commit, works with any editor. - -```bash -graphify hook install # install -graphify hook uninstall # remove -graphify hook status # check -``` - -After every `git commit`, the hook detects which code files changed (via `git diff HEAD~1`), re-runs AST extraction on those files, and rebuilds `graph.json` and `GRAPH_REPORT.md`. 
Doc/image changes are ignored by the hook - run `/graphify --update` manually for those. - -If a post-commit hook already exists, graphify appends to it rather than replacing it. - ---- - -## For native CLAUDE.md integration - -Run once per project to make graphify always-on in Claude Code sessions: - -```bash -graphify claude install -``` - -This writes a `## graphify` section to the local `CLAUDE.md` that instructs Claude to check the graph before answering codebase questions and rebuild it after code changes. No manual `/graphify` needed in future sessions. - -```bash -graphify claude uninstall # remove the section -``` +When the user asks to install the post-commit auto-rebuild hook or wire graphify into a project's CLAUDE.md, see `references/hooks.md`. --- diff --git a/skills/graphify/references/add-watch.md b/skills/graphify/references/add-watch.md new file mode 100644 index 0000000..7784434 --- /dev/null +++ b/skills/graphify/references/add-watch.md @@ -0,0 +1,56 @@ +# graphify reference: add a URL and watch a folder + +Load this when the user ran `/graphify add ` or passed `--watch`. Neither is part of the default build. + +## For /graphify add + +Fetch a URL and add it to the corpus, then update the graph. + +```bash +$(cat graphify-out/.graphify_python) -c " +import sys +from graphify.ingest import ingest +from pathlib import Path + +try: + out = ingest('URL', Path('./raw'), author='AUTHOR', contributor='CONTRIBUTOR') + print(f'Saved to {out}') +except ValueError as e: + print(f'error: {e}', file=sys.stderr) + sys.exit(1) +except RuntimeError as e: + print(f'error: {e}', file=sys.stderr) + sys.exit(1) +" +``` + +Replace `URL` with the actual URL, `AUTHOR` with the user's name if provided, `CONTRIBUTOR` likewise. If the command exits with an error, tell the user what went wrong - do not silently continue. After a successful save, automatically run the `--update` pipeline on `./raw` to merge the new file into the existing graph. 
+ +Supported URL types (auto-detected): +- YouTube / any video URL → audio downloaded via yt-dlp, transcribed to `.txt` on next run (requires `pip install 'graphifyy[video]'`) +- Twitter/X → fetched via oEmbed, saved as `.md` with tweet text and author +- arXiv → abstract + metadata saved as `.md` +- PDF → downloaded as `.pdf` +- Images (.png/.jpg/.webp) → downloaded, Claude vision extracts on next run +- Any webpage → converted to markdown via html2text + +--- + +## For --watch + +Start a background watcher that monitors a folder and auto-updates the graph when files change. + +```bash +$(cat graphify-out/.graphify_python) -m graphify.watch INPUT_PATH --debounce 3 +``` + +Replace INPUT_PATH with the folder to watch. Behavior depends on what changed: + +- **Code files only (.py, .ts, .go, etc.):** re-runs AST extraction + rebuild + cluster immediately, no LLM needed. `graph.json` and `GRAPH_REPORT.md` are updated automatically. +- **Docs, papers, or images:** writes a `graphify-out/needs_update` flag and prints a notification to run `/graphify --update` (LLM semantic re-extraction required). + +Debounce (default 3s): waits until file activity stops before triggering, so a wave of parallel agent writes doesn't trigger a rebuild per file. + +Press Ctrl+C to stop. + +For agentic workflows: run `--watch` in a background terminal. Code changes from agent waves are picked up automatically between waves. If agents are also writing docs or notes, you'll need a manual `/graphify --update` after those waves. 
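The debounce behavior described for `--watch` can be illustrated with a minimal sketch. This is not graphify's watcher code: the `Debouncer` class, its method names, and the injectable clock are hypothetical names chosen for illustration. It shows only the idea that a rebuild fires once, after file activity has been quiet for the whole debounce window.

```python
import time


class Debouncer:
    """Hypothetical helper: collect file events and release them as one batch
    once no new event has arrived for `delay` seconds."""

    def __init__(self, delay=3.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock        # injectable so the logic can be exercised without sleeping
        self.pending = set()      # paths changed since the last released batch
        self.last_event = None    # timestamp of the most recent event

    def notify(self, path):
        """Record a change and restart the quiet-period timer."""
        self.pending.add(path)
        self.last_event = self.clock()

    def poll(self):
        """Return the sorted batch of changed paths if the quiet period has elapsed, else None."""
        if not self.pending:
            return None
        if self.clock() - self.last_event < self.delay:
            return None
        batch, self.pending = sorted(self.pending), set()
        return batch
```

A watcher loop built on this sketch would call `notify()` from its file-event callback and `poll()` on a timer, so a wave of parallel writes collapses into a single batch.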
diff --git a/skills/graphify/references/exports.md b/skills/graphify/references/exports.md new file mode 100644 index 0000000..242ff86 --- /dev/null +++ b/skills/graphify/references/exports.md @@ -0,0 +1,87 @@ +# graphify reference: extra exports and benchmark + +Load this when the user passed one of the export flags (`--wiki`, `--neo4j`, `--neo4j-push`, `--falkordb`, `--falkordb-push`, `--svg`, `--graphml`, `--mcp`), or when the corpus is large enough for the token-reduction benchmark. Each step runs only for its own flag. + +### Step 6b - Wiki (only if --wiki flag) + +**Only run this step if `--wiki` was explicitly given in the original command.** + +Run this before Step 9 (cleanup) so `.graphify_labels.json` is still available. + +```bash +graphify export wiki +``` + +### Step 7 - Neo4j export (only if --neo4j or --neo4j-push flag) + +**If `--neo4j`** - generate a Cypher file for manual import: + +```bash +graphify export neo4j +``` + +**If `--neo4j-push `** - push directly to a running Neo4j instance. Ask the user for credentials if not provided: + +```bash +graphify export neo4j --push bolt://localhost:7687 --user neo4j --password PASSWORD +``` + +Default URI is `bolt://localhost:7687`, default user is `neo4j`. Uses MERGE - safe to re-run without creating duplicates. + +### Step 7a - FalkorDB export (only if --falkordb or --falkordb-push flag) + +**If `--falkordb`** - generate a Cypher file. The statements are OpenCypher, but FalkorDB's `GRAPH.QUERY` runs one statement at a time (no bulk script import like Neo4j's `cypher-shell`), so prefer `--falkordb-push` to load a graph. Use this only when you want the portable `cypher.txt` artifact: + +```bash +graphify export falkordb +``` + +**If `--falkordb-push `** - push directly to a running FalkorDB instance. 
Credentials are optional; ask the user only if the instance requires auth: + +```bash +graphify export falkordb --push falkordb://localhost:6379 +``` + +Default URI is `falkordb://localhost:6379` (the scheme is informational - `redis://` or a bare `host:port` work too), auth is optional, and the target graph defaults to `graphify`. Uses MERGE - safe to re-run without creating duplicates. + +### Step 7b - SVG export (only if --svg flag) + +```bash +graphify export svg +``` + +### Step 7c - GraphML export (only if --graphml flag) + +```bash +graphify export graphml +``` + +### Step 7d - MCP server (only if --mcp flag) + +```bash +$(cat graphify-out/.graphify_python) -m graphify.serve graphify-out/graph.json +``` + +This starts a stdio MCP server that exposes tools: `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live. + +To configure in Claude Desktop, add to `claude_desktop_config.json`. Claude Desktop can't run `$(...)`, and under `uv tool install` the system `python3` can't import graphify — so set `command` to the **absolute interpreter path** printed by `cat graphify-out/.graphify_python`: +```json +{ + "mcpServers": { + "graphify": { + "command": "", + "args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"] + } + } +} +``` + +### Step 8 - Token reduction benchmark (only if total_words > 5000) + +If `total_words` from `graphify-out/.graphify_detect.json` is greater than 5,000, run: + +```bash +graphify benchmark +``` + +Print the output directly in chat. If `total_words <= 5000`, skip silently - the graph value is structural clarity, not token compression, for small corpora. 
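The Step 8 gate can be sketched in a few lines. The helper names `should_run_benchmark` and `load_detect` are hypothetical; only the `total_words` key, the `graphify-out/.graphify_detect.json` path, and the 5,000-word threshold come from the step itself. Treating a missing detect file as an empty corpus is an assumption of this sketch.

```python
import json
from pathlib import Path

BENCHMARK_MIN_WORDS = 5000  # threshold stated in Step 8


def should_run_benchmark(detect):
    """Return True only when the corpus is strictly larger than the threshold."""
    return detect.get("total_words", 0) > BENCHMARK_MIN_WORDS


def load_detect(path="graphify-out/.graphify_detect.json"):
    """Read the detect file; a missing file is treated as an empty corpus (assumption)."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


if __name__ == "__main__":
    if should_run_benchmark(load_detect()):
        print("run: graphify benchmark")
    # otherwise skip silently, as the step instructs
```

The comparison is strict, so a corpus of exactly 5,000 words skips the benchmark, matching the `total_words <= 5000` wording.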
diff --git a/skills/graphify/references/extraction-spec.md b/skills/graphify/references/extraction-spec.md new file mode 100644 index 0000000..2cc1919 --- /dev/null +++ b/skills/graphify/references/extraction-spec.md @@ -0,0 +1,70 @@ +# graphify reference: extraction subagent prompt + +Load this in Step 3 Part B when the corpus has at least one doc, paper, or image chunk. A pure-code corpus skips Part B and never reads this file. Each semantic subagent receives the prompt below verbatim (substitute FILE_LIST, CHUNK_NUM, TOTAL_CHUNKS, DEEP_MODE, and CHUNK_PATH). + +``` +You are a graphify extraction subagent. Read the files listed and extract a knowledge graph fragment. +Output ONLY valid JSON matching the schema below - no explanation, no markdown fences, no preamble. + +Files (chunk CHUNK_NUM of TOTAL_CHUNKS): +FILE_LIST + +Rules: +- EXTRACTED: relationship explicit in source (import, call, citation, "see §3.2") +- INFERRED: reasonable inference (shared data structure, implied dependency) +- AMBIGUOUS: uncertain - flag for review, do not omit + +Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns). + Do not re-extract imports - AST already has those. +Doc/paper files: extract named concepts, entities, citations. For rationale (WHY decisions were made, trade-offs, design intent): store as a `rationale` attribute on the relevant concept node — do NOT create a separate rationale node or fragment node. Only create a node for something that is itself a named entity or concept. Use `file_type:"rationale"` for concept-like nodes (ideas, principles, mechanisms, design patterns). `file_type` MUST be one of exactly these six values: `code`, `document`, `paper`, `image`, `rationale`, `concept`. Any other value is invalid and will be rejected. +Code files: when adding `calls` edges, source MUST be the caller (the function/class doing the calling), target MUST be the callee. Never reverse this direction. 
`calls` edges MUST stay within one language: a Python function cannot `calls` a JS/TS/Go/Rust/Java symbol and vice versa — cross-language call edges are phantom artifacts, never emit them. +Image files: use vision to understand what the image IS - do not just OCR. + UI screenshot: layout patterns, design decisions, key elements, purpose. + Chart: metric, trend/insight, data source. + Tweet/post: claim as node, author, concepts mentioned. + Diagram: components and connections. + Research figure: what it demonstrates, method, result. + Handwritten/whiteboard: ideas and arrows, mark uncertain readings AMBIGUOUS. + +DEEP_MODE (if --mode deep was given): be aggressive with INFERRED edges - indirect deps, + shared assumptions, latent couplings. Mark uncertain ones AMBIGUOUS instead of omitting. + +Semantic similarity: if two concepts in this chunk solve the same problem or represent the same idea without any structural link (no import, no call, no citation), add a `semantically_similar_to` edge marked INFERRED with a confidence_score reflecting how similar they are (0.6-0.95). Examples: +- Two functions that both validate user input but never call each other +- A class in code and a concept in a paper that describe the same algorithm +- Two error types that handle the same failure mode differently +Only add these when the similarity is genuinely non-obvious and cross-cutting. Do not add them for trivially similar things. + +Hyperedges: if 3 or more nodes clearly participate together in a shared concept, flow, or pattern that is not captured by pairwise edges alone, add a hyperedge to a top-level `hyperedges` array. Examples: +- All classes that implement a common protocol or interface +- All functions in an authentication flow (even if they don't all call each other) +- All concepts from a paper section that form one coherent idea +Use sparingly — only when the group relationship adds information beyond the pairwise edges. Maximum 3 hyperedges per chunk. 
+ +If a file has YAML frontmatter (--- ... ---), copy source_url, captured_at, author, + contributor onto every node from that file. + +confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default: +- EXTRACTED edges: confidence_score = 1.0 always +- INFERRED edges: pick exactly ONE value from this set — never 0.5: + 0.95 direct structural evidence (shared data structure, named cross-file reference). + 0.85 strong inference (clear functional alignment, no direct symbol link). + 0.75 reasonable inference (shared problem domain + similar shape, requires interpretation). + 0.65 weak inference (thematically related, no shape evidence). + 0.55 speculative but plausible (surface-level co-occurrence only). + Models follow discrete rubrics better than continuous ranges; the bimodal + distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the + range guidance is being collapsed to a binary. If no value above fits, mark + the edge AMBIGUOUS rather than picking 0.4 or below. +- AMBIGUOUS edges: 0.1-0.3 + +Node ID format: lowercase, only `[a-z0-9_]`, no dots or slashes. Format: `{stem}_{entity}` where stem is `{parent_dir}_{filename_without_ext}` (the **immediate** parent directory name + the filename stem, both lowercased with non-alphanumeric chars replaced by `_`) and entity is the symbol name similarly normalized. Only one level of parent is used — not the full path. Examples: `src/auth/session.py` + `ValidateToken` → `auth_session_validatetoken`; `lib/utils/helpers.py` + `parse_url` → `utils_helpers_parse_url`; `tests/test_foo.py` + `_helper` → `tests_test_foo_helper`. Top-level files (no parent dir, e.g. `setup.py`) use just the filename stem: `setup_my_func`. This must match the ID the AST extractor generates — using just the filename (e.g., `session_validatetoken`) or the full path (e.g., `src_auth_session_validatetoken`) will create orphan ghost-duplicate nodes. 
If you are re-extracting a project that had ghost duplicates under the old format, the user should run `graphify extract --force` to rebuild cleanly. CRITICAL: never append chunk numbers, sequence numbers, or any suffix to an ID (no `_c1`, `_c2`, `_chunk2`, etc.). IDs must be deterministic from the label alone — the same entity must always produce the same ID regardless of which chunk processes it. + +Generate the extraction JSON matching this schema exactly: +{"nodes":[{"id":"auth_session_validatetoken","label":"Human Readable Name","file_type":"code|document|paper|image|rationale|concept","source_file":"","source_location":null,"source_url":null,"captured_at":null,"author":null,"contributor":null}],"edges":[{"source":"node_id","target":"node_id","relation":"calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for","confidence":"EXTRACTED|INFERRED|AMBIGUOUS","confidence_score":1.0,"source_file":"","source_location":null,"weight":1.0}],"hyperedges":[{"id":"snake_case_id","label":"Human Readable Label","nodes":["node_id1","node_id2","node_id3"],"relation":"participate_in|implement|form","confidence":"EXTRACTED|INFERRED","confidence_score":0.75,"source_file":""}],"input_tokens":0,"output_tokens":0} + +source_file RULE (every node, edge, and hyperedge): set source_file to the path of the originating file EXACTLY as it appears in FILE_LIST — verbatim and absolute. Do NOT shorten to a basename, do NOT re-relativize, do NOT strip any directory prefix, and do NOT change separators (the engine canonicalizes separators and relativizes against the build root downstream). Copy the FILE_LIST entry character-for-character. This keeps the full build and incremental --update on the same base, so build_merge's replace-on-re-extract matches the existing node instead of accumulating a duplicate. 
+ +Then write the JSON to disk using the Write tool at this exact absolute path (no relative paths — Write resolves relative paths against an undefined cwd and the file will be silently lost): +CHUNK_PATH +``` diff --git a/skills/graphify/references/github-and-merge.md b/skills/graphify/references/github-and-merge.md new file mode 100644 index 0000000..a41ea06 --- /dev/null +++ b/skills/graphify/references/github-and-merge.md @@ -0,0 +1,46 @@ +# graphify reference: GitHub clone and cross-repo merge + +Load this when the user passed one or more `https://github.com/...` URLs, or named several local subfolders to merge into one graph. + +### Step 0 - Clone GitHub repo(s) (only if a GitHub URL was given) + +**Single repo:** +```bash +LOCAL_PATH=$(graphify clone [--branch ]) +# Use LOCAL_PATH as the target for all subsequent steps +``` + +**Multiple repos (cross-repo graph):** +```bash +# Clone each repo, run the full pipeline on each, then merge +graphify clone # → ~/.graphify/repos// +graphify clone # → ~/.graphify/repos// +# Run /graphify on each local path to produce their graph.json files +# Then merge: +graphify merge-graphs \ + ~/.graphify/repos///graphify-out/graph.json \ + ~/.graphify/repos///graphify-out/graph.json \ + --out graphify-out/cross-repo-graph.json +``` + +Graphify clones into `~/.graphify/repos//` and reuses existing clones on repeat runs. Each node in the merged graph carries a `repo` attribute so you can filter by origin. + +**Multiple local subfolders (monorepo or multi-service layout):** + +The skill pipeline writes all intermediate and final outputs to `graphify-out/` in the current working directory. Running the skill on each subfolder separately will clobber the same output dir. 
Instead, use the CLI directly for each subfolder — it places `graphify-out/` *inside* the scanned path: + +```bash +graphify extract ./core/ # → ./core/graphify-out/graph.json +graphify extract ./service/ # → ./service/graphify-out/graph.json +graphify extract ./platform/ # → ./platform/graphify-out/graph.json +# Add --backend gemini|kimi|openai|deepseek|claude-cli depending on which API key you have set + +# Then merge at the project root: +graphify merge-graphs \ + ./core/graphify-out/graph.json \ + ./service/graphify-out/graph.json \ + ./platform/graphify-out/graph.json \ + --out graphify-out/graph.json +``` + +Once `graphify-out/graph.json` exists, the fast path above takes over: any codebase question runs `graphify query` directly on the merged graph — no re-extraction, no size gate. diff --git a/skills/graphify/references/hooks.md b/skills/graphify/references/hooks.md new file mode 100644 index 0000000..438b8b1 --- /dev/null +++ b/skills/graphify/references/hooks.md @@ -0,0 +1,33 @@ +# graphify reference: commit hook and native CLAUDE.md integration + +Load this when the user asked to install the post-commit hook or wire graphify into a project's CLAUDE.md. + +## For git commit hook + +Install a post-commit hook that auto-rebuilds the graph after every commit. No background process needed - triggers once per commit, works with any editor. + +```bash +graphify hook install # install +graphify hook uninstall # remove +graphify hook status # check +``` + +After every `git commit`, the hook detects which code files changed (via `git diff HEAD~1`), re-runs AST extraction on those files, and rebuilds `graph.json` and `GRAPH_REPORT.md`. Doc/image changes are ignored by the hook - run `/graphify --update` manually for those. + +If a post-commit hook already exists, graphify appends to it rather than replacing it. 
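Appending to an existing hook rather than replacing it can be sketched as a pure text transformation. This illustrates the idea only and is not graphify's installer: the marker strings, the function name `append_hook_block`, and the placeholder command are all hypothetical.

```python
# Hypothetical markers delimiting the appended block; graphify's real markers may differ.
MARKER_START = "# >>> graphify hook (hypothetical marker) >>>"
MARKER_END = "# <<< graphify hook <<<"


def append_hook_block(existing, command):
    """Return hook script text with `command` appended inside a marked block.

    Existing content is preserved, and a second call is a no-op so that
    installing twice does not duplicate the block.
    """
    if MARKER_START in existing:
        return existing  # already installed
    # An empty hook gets a shebang; a non-empty one is kept as-is.
    base = existing if existing.strip() else "#!/bin/sh\n"
    if not base.endswith("\n"):
        base += "\n"
    return f"{base}{MARKER_START}\n{command}\n{MARKER_END}\n"
```

Delimiting the block with markers is one common way to make an uninstall step possible later, since everything between the markers can be removed without touching the user's own hook lines.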
+ +--- + +## For native CLAUDE.md integration + +Run once per project to make graphify always-on in Claude Code sessions: + +```bash +graphify claude install +``` + +This writes a `## graphify` section to the local `CLAUDE.md` that instructs Claude to check the graph before answering codebase questions and rebuild it after code changes. No manual `/graphify` needed in future sessions. + +```bash +graphify claude uninstall # remove the section +``` diff --git a/skills/graphify/references/query.md b/skills/graphify/references/query.md new file mode 100644 index 0000000..3ed5f65 --- /dev/null +++ b/skills/graphify/references/query.md @@ -0,0 +1,303 @@ +# graphify reference: query, path, explain + +Load this when the user asks a question against an existing graph, or runs `/graphify path` or `/graphify explain`. The core's query stub points here for the full traversal flow. These flows use the `graphify query` CLI when it is available and fall back to an inline NetworkX traversal otherwise. + +Two traversal modes - choose based on the question: + +| Mode | Flag | Best for | +|------|------|----------| +| BFS (default) | _(none)_ | "What is X connected to?" - broad context, nearest neighbors first | +| DFS | `--dfs` | "How does X reach Y?" - trace a specific chain or dependency path | + +First check the graph exists: +```bash +$(cat graphify-out/.graphify_python) -c " +from pathlib import Path +if not Path('graphify-out/graph.json').exists(): + print('ERROR: No graph found. Run /graphify first to build the graph.') + raise SystemExit(1) +" +``` +If it fails, stop and tell the user to run `/graphify ` first. + +### Step 0 — Constrained query expansion (REQUIRED before traversal) + +graphify's `query` CLI matches nodes via case-folded substring + IDF — there is **no stemming, no synonyms, no cross-language match** inside the binary, and the inline fallback below matches the same way. 
If the user's question uses different language or different domain vocabulary than the graph's labels (user says "обработчик" / graph says "handler"; user says "authentication" / graph says "Guardian"), the literal matcher returns 0 hits and the answer collapses to noise. + +Fix this **without inventing tokens** by expanding the query against the actual graph vocabulary first: + +1. Extract the token vocabulary from node labels: +```bash +$(cat graphify-out/.graphify_python) -c " +import json, re +from pathlib import Path +data = json.loads(Path('graphify-out/graph.json').read_text()) +vocab = set() +for n in data['nodes']: + for c in re.findall(r'[^\W\d_]+', n.get('label','') or '', re.UNICODE): + parts = re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+', c) or [c] + for p in parts: + t = p.lower() + if 3 <= len(t) <= 30: + vocab.add(t) +Path('graphify-out/.vocab.txt').write_text('\n'.join(sorted(vocab))) +print(f'vocab: {len(vocab)} tokens') +" +``` + +2. Read `graphify-out/.vocab.txt`. Then for the user's question, select **up to 12 tokens from this exact list** that semantically match the query intent. Hard constraints: + - You MUST pick only tokens present in the vocabulary file. Do NOT invent tokens. + - If a query concept has no plausible token in the vocab, skip it — do not substitute a near-synonym from training memory. + - If **no** vocab tokens match the query at all, output an empty list and tell the user the corpus has no relevant vocabulary for this question. Do not fabricate a search. + - Translate cross-language: Russian "аутентификация" → look for `auth`, `credential`, `token`, `security` IFF present in vocab. + - Morphology: "handlers" maps to `handler` IFF present; "todos" maps to `todo` IFF present. + +3. Print the selection explicitly to the user before running the query, so the expansion is auditable: +``` +Query expanded to (from graph vocab, N tokens): [token1, token2, ...] 
+``` +If the list is empty, say so plainly and stop — do not proceed to traversal. + +### Step 1 — Traversal + +Build the **expanded query string** by joining the selected tokens with spaces. Use this string as `QUESTION` below — NOT the original user question. (The original question is preserved only for `save-result` at the end.) + +Prefer the CLI when it is installed: +```bash +graphify query "QUESTION" +# or: graphify query "QUESTION" --dfs --budget 3000 +``` + +If the CLI is unavailable, load `graphify-out/graph.json` and run the traversal inline: + +1. Find the 1-3 nodes whose label best matches the expanded tokens. +2. Run the appropriate traversal from each starting node. +3. Read the subgraph - node labels, edge relations, confidence tags, source locations. +4. Answer using **only** what the graph contains. Quote `source_location` when citing a specific fact. +5. If the graph lacks enough information, say so - do not hallucinate edges. + +```bash +$(cat graphify-out/.graphify_python) -c " +import sys, json +from networkx.readwrite import json_graph +import networkx as nx +from pathlib import Path + +data = json.loads(Path('graphify-out/graph.json').read_text()) +G = json_graph.node_link_graph(data, edges='links') + +question = 'QUESTION' +mode = 'MODE' # 'bfs' or 'dfs' +terms = [t.lower() for t in question.split() if len(t) >= 3] # match the vocab threshold; keeps api/jwt/ios (#1392) + +# Find best-matching start nodes +scored = [] +for nid, ndata in G.nodes(data=True): + label = ndata.get('label', '').lower() + score = sum(1 for t in terms if t in label) + if score > 0: + scored.append((score, nid)) +scored.sort(reverse=True) +start_nodes = [nid for _, nid in scored[:3]] + +if not start_nodes: + print('No matching nodes found for query terms:', terms) + sys.exit(0) + +subgraph_nodes = set() +subgraph_edges = [] + +if mode == 'dfs': + # DFS: follow one path as deep as possible before backtracking. + # Depth-limited to 6 to avoid traversing the whole graph. 
+ visited = set() + stack = [(n, 0) for n in reversed(start_nodes)] + while stack: + node, depth = stack.pop() + if node in visited or depth > 6: + continue + visited.add(node) + subgraph_nodes.add(node) + for neighbor in G.neighbors(node): + if neighbor not in visited: + stack.append((neighbor, depth + 1)) + subgraph_edges.append((node, neighbor)) +else: + # BFS: explore all neighbors layer by layer up to depth 3. + frontier = set(start_nodes) + subgraph_nodes = set(start_nodes) + for _ in range(3): + next_frontier = set() + for n in frontier: + for neighbor in G.neighbors(n): + if neighbor not in subgraph_nodes: + next_frontier.add(neighbor) + subgraph_edges.append((n, neighbor)) + subgraph_nodes.update(next_frontier) + frontier = next_frontier + +# Token-budget aware output: rank by relevance, cut at budget (~4 chars/token) +token_budget = BUDGET # default 2000 +char_budget = token_budget * 4 + +# Score each node by term overlap for ranked output +def relevance(nid): + label = G.nodes[nid].get('label', '').lower() + return sum(1 for t in terms if t in label) + +ranked_nodes = sorted(subgraph_nodes, key=relevance, reverse=True) + +lines = [f'Traversal: {mode.upper()} | Start: {[G.nodes[n].get(\"label\",n) for n in start_nodes]} | {len(subgraph_nodes)} nodes'] +for nid in ranked_nodes: + d = G.nodes[nid] + lines.append(f' NODE {d.get(\"label\", nid)} [src={d.get(\"source_file\",\"\")} loc={d.get(\"source_location\",\"\")}]') +for u, v in subgraph_edges: + if u in subgraph_nodes and v in subgraph_nodes: + _raw = G[u][v]; d = next(iter(_raw.values()), {}) if isinstance(G, nx.MultiGraph) else _raw + lines.append(f' EDGE {G.nodes[u].get(\"label\",u)} --{d.get(\"relation\",\"\")} [{d.get(\"confidence\",\"\")}]--> {G.nodes[v].get(\"label\",v)}') + +output = '\n'.join(lines) +if len(output) > char_budget: + output = output[:char_budget] + f'\n... 
(truncated at ~{token_budget} token budget - use --budget N for more)' +print(output) +" +``` + +Replace `QUESTION` with the **expanded** query string, `MODE` with `bfs` or `dfs`, and `BUDGET` with the token budget (default `2000`, or whatever `--budget N` specifies). Then answer based on the subgraph output above, using only what the graph contains. + +After writing the answer, save it back into the graph so it improves future queries. Include the expanded tokens inside the `--answer` text (e.g. `"Expanded from original query via vocab: [tokens]. Then traversed..."`) so the next `--update` extracts the expansion history as a graph node: + +```bash +$(cat graphify-out/.graphify_python) -m graphify save-result --question "ORIGINAL_QUESTION" --answer "ANSWER" --type query --nodes NODE1 NODE2 +``` + +Replace `ORIGINAL_QUESTION` with the user's verbatim question, `ANSWER` with your full answer text (containing the expanded-token trace), `NODE1 NODE2` with the list of node labels you cited. This closes the feedback loop: the next `--update` will extract this Q&A as a node in the graph. + +--- + +## For /graphify path + +Find the shortest path between two named concepts in the graph. 
Prefer the CLI when installed: + +```bash +graphify path "NODE_A" "NODE_B" +``` + +If the CLI is unavailable, run it inline: + +```bash +$(cat graphify-out/.graphify_python) -c " +import json, sys +import networkx as nx +from networkx.readwrite import json_graph +from pathlib import Path + +data = json.loads(Path('graphify-out/graph.json').read_text()) +G = json_graph.node_link_graph(data, edges='links') + +a_term = 'NODE_A' +b_term = 'NODE_B' + +def find_node(term): + term = term.lower() + scored = sorted( + [(sum(1 for w in term.split() if w in G.nodes[n].get('label','').lower()), n) + for n in G.nodes()], + reverse=True + ) + return scored[0][1] if scored and scored[0][0] > 0 else None + +src = find_node(a_term) +tgt = find_node(b_term) + +if not src or not tgt: + print(f'Could not find nodes matching: {a_term!r} or {b_term!r}') + sys.exit(0) + +try: + path = nx.shortest_path(G, src, tgt) + print(f'Shortest path ({len(path)-1} hops):') + for i, nid in enumerate(path): + label = G.nodes[nid].get('label', nid) + if i < len(path) - 1: + _raw = G[nid][path[i+1]]; edge = next(iter(_raw.values()), {}) if isinstance(G, nx.MultiGraph) else _raw + rel = edge.get('relation', '') + conf = edge.get('confidence', '') + print(f' {label} --{rel}--> [{conf}]') + else: + print(f' {label}') +except nx.NetworkXNoPath: + print(f'No path found between {a_term!r} and {b_term!r}') +except nx.NodeNotFound as e: + print(f'Node not found: {e}') +" +``` + +Replace `NODE_A` and `NODE_B` with the actual concept names from the user. Then explain the path in plain language - what each hop means, why it's significant. + +After writing the explanation, save it back: + +```bash +$(cat graphify-out/.graphify_python) -m graphify save-result --question "Path from NODE_A to NODE_B" --answer "ANSWER" --type path_query --nodes NODE_A NODE_B +``` + +--- + +## For /graphify explain + +Give a plain-language explanation of a single node - everything connected to it. 
Prefer the CLI when installed: + +```bash +graphify explain "NODE_NAME" +``` + +If the CLI is unavailable, run it inline: + +```bash +$(cat graphify-out/.graphify_python) -c " +import json, sys +import networkx as nx +from networkx.readwrite import json_graph +from pathlib import Path + +data = json.loads(Path('graphify-out/graph.json').read_text()) +G = json_graph.node_link_graph(data, edges='links') + +term = 'NODE_NAME' +term_lower = term.lower() + +# Find best matching node +scored = sorted( + [(sum(1 for w in term_lower.split() if w in G.nodes[n].get('label','').lower()), n) + for n in G.nodes()], + reverse=True +) +if not scored or scored[0][0] == 0: + print(f'No node matching {term!r}') + sys.exit(0) + +nid = scored[0][1] +data_n = G.nodes[nid] +print(f'NODE: {data_n.get(\"label\", nid)}') +print(f' source: {data_n.get(\"source_file\",\"unknown\")}') +print(f' type: {data_n.get(\"file_type\",\"unknown\")}') +print(f' degree: {G.degree(nid)}') +print() +print('CONNECTIONS:') +for neighbor in G.neighbors(nid): + _raw = G[nid][neighbor]; edge = next(iter(_raw.values()), {}) if isinstance(G, nx.MultiGraph) else _raw + nlabel = G.nodes[neighbor].get('label', neighbor) + rel = edge.get('relation', '') + conf = edge.get('confidence', '') + src_file = G.nodes[neighbor].get('source_file', '') + print(f' --{rel}--> {nlabel} [{conf}] ({src_file})') +" +``` + +Replace `NODE_NAME` with the concept the user asked about. Then write a 3-5 sentence explanation of what this node is, what it connects to, and why those connections are significant. Use the source locations as citations. 
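
The inline fallbacks for `path` and `explain` both pick a node by counting how many words of the search term appear in each node label. A minimal stdlib-only sketch of that scoring follows. The `best_match` helper and the sample labels are hypothetical, and ties here resolve to the first label seen, whereas the inline code sorts `(score, node_id)` tuples and so resolves ties by node id:

```python
def best_match(term, labels):
    """Return the id whose label contains the most words of `term`, or None."""
    words = term.lower().split()
    best_id, best_score = None, 0
    for node_id, label in labels.items():
        # Substring match, as in the inline code: 'auth' also matches 'author'.
        score = sum(1 for w in words if w in label.lower())
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id


# Made-up labels for illustration only.
labels = {
    "n1": "AuthService",
    "n2": "JWT token validator",
    "n3": "Database pool",
}
print(best_match("jwt validator", labels))
print(best_match("kafka", labels))
```

Because matching is by substring, short or generic terms can select an unrelated node; when the reported label looks wrong, retry with a more specific name.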
+ +After writing the explanation, save it back: + +```bash +$(cat graphify-out/.graphify_python) -m graphify save-result --question "Explain NODE_NAME" --answer "ANSWER" --type explain --nodes NODE_NAME +``` diff --git a/skills/graphify/references/transcribe.md b/skills/graphify/references/transcribe.md new file mode 100644 index 0000000..b967f83 --- /dev/null +++ b/skills/graphify/references/transcribe.md @@ -0,0 +1,52 @@ +# graphify reference: transcribe video and audio + +Load this only when `detect` reported one or more `video` files. A corpus with no video never reads this. + +### Step 2.5 - Transcribe video / audio files (only if video files detected) + +Skip this step entirely if `detect` returned zero `video` files. + +Video and audio files cannot be read directly. Transcribe them to text first, then treat the transcripts as doc files in Step 3. + +**Strategy:** Read the god nodes from `graphify-out/.graphify_detect.json` (or the analysis file if it exists from a previous run). You are already a language model — write a one-sentence domain hint yourself from those labels. Then pass it to Whisper as the initial prompt. No separate API call needed. + +**However**, if the corpus has *only* video files and no other docs/code, use the generic fallback prompt: `"Use proper punctuation and paragraph breaks."` + +**Step 1 - Write the Whisper prompt yourself.** + +Read the top god node labels from detect output or analysis, then compose a short domain hint sentence, for example: + +- Labels: `transformer, attention, encoder, decoder` → `"Machine learning research on transformer architectures and attention mechanisms. Use proper punctuation and paragraph breaks."` +- Labels: `kubernetes, deployment, pod, helm` → `"DevOps discussion about Kubernetes deployments and Helm charts. 
Use proper punctuation and paragraph breaks."` + +**Export** it as `GRAPHIFY_WHISPER_PROMPT` (the exact name the transcriber reads; it must be `export`ed so the child Python process sees it) for the next command. + +**Step 2 - Transcribe:** + +```bash +export GRAPHIFY_WHISPER_MODEL=base # or whatever --whisper-model the user passed (must be exported) +export GRAPHIFY_WHISPER_PROMPT="WHISPER_PROMPT" # replace WHISPER_PROMPT with the domain hint from Step 1 +$(cat graphify-out/.graphify_python) -c " +import json, os, sys +from pathlib import Path +from graphify.transcribe import transcribe_all + +detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\")) +video_files = detect.get('files', {}).get('video', []) +prompt = os.environ.get('GRAPHIFY_WHISPER_PROMPT', 'Use proper punctuation and paragraph breaks.') + +transcript_paths = transcribe_all(video_files, initial_prompt=prompt) +# Write the JSON from Python (NOT a shell '>' redirect): transcribe_all/Whisper +# print progress to stdout, which would otherwise corrupt the JSON file (#1392). +Path('graphify-out/.graphify_transcripts.json').write_text(json.dumps(transcript_paths, ensure_ascii=False), encoding=\"utf-8\") +print(f'Transcribed {len(transcript_paths)} file(s)', file=sys.stderr) +" +``` + +After transcription: +- Read the transcript paths from `graphify-out/.graphify_transcripts.json` +- Add them to the docs list before dispatching semantic subagents in Step 3B +- Print how many transcripts were created: `Transcribed N video file(s) -> treating as docs` +- If transcription fails for a file, print a warning and continue with the rest + +**Whisper model:** Default is `base`. If the user passed `--whisper-model MODEL_NAME`, `export GRAPHIFY_WHISPER_MODEL=MODEL_NAME` (it must be exported, not just assigned) before running the command above. 
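
The note that the variables must be exported, not just assigned, can be checked directly. The sketch below assumes a POSIX `sh` is on the path and uses a hypothetical variable name, `GRAPHIFY_DEMO_MODEL`, so it does not touch the real settings. It runs the same child Python process twice, once after a plain assignment and once after `export`:

```python
import os
import shlex
import subprocess
import sys

VAR = "GRAPHIFY_DEMO_MODEL"  # hypothetical name, used only for this demo

# Child process: print the variable as the child sees it, or 'unset'.
child = shlex.join([
    sys.executable, "-c",
    f"import os; print(os.environ.get({VAR!r}, 'unset'))",
])


def run_in_shell(prefix):
    """Run `prefix`, then the child command, in a fresh POSIX shell."""
    # Drop VAR from the inherited environment so the demo starts clean.
    env = {k: v for k, v in os.environ.items() if k != VAR}
    result = subprocess.run(
        ["sh", "-c", f"{prefix}; {child}"],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout.strip()


assigned_only = run_in_shell(f"{VAR}=base")
exported = run_in_shell(f"export {VAR}=base")
print("assigned only:", assigned_only)
print("exported:", exported)
```

On a POSIX shell the assigned-only run should report `unset` and the exported run `base`, which is why a bare `GRAPHIFY_WHISPER_MODEL=base` line would leave the transcriber on its default.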
diff --git a/skills/graphify/references/update.md b/skills/graphify/references/update.md new file mode 100644 index 0000000..fa26121 --- /dev/null +++ b/skills/graphify/references/update.md @@ -0,0 +1,192 @@ +# graphify reference: incremental update and cluster-only + +Load this only when the user passed `--update` or `--cluster-only`. A first-time full build never reads this file. + +## For --update (incremental re-extraction) + +Use when you've added or modified files since the last run. Only re-extracts changed files - saves tokens and time. + +```bash +$(cat graphify-out/.graphify_python) -c " +import sys, json +from graphify.detect import detect_incremental, save_manifest +from pathlib import Path + +result = detect_incremental(Path('INPUT_PATH')) +new_total = result.get('new_total', 0) +print(json.dumps(result, indent=2, ensure_ascii=False)) +Path('graphify-out/.graphify_incremental.json').write_text(json.dumps(result, ensure_ascii=False), encoding=\"utf-8\") +deleted = list(result.get('deleted_files', [])) +if new_total == 0 and not deleted: + print('No files changed since last run. Nothing to update.') + raise SystemExit(0) +if deleted: + print(f'{len(deleted)} deleted file(s) to prune.') +if new_total > 0: + print(f'{new_total} new/changed file(s) to re-extract.') +" +``` + +Then populate `.graphify_detect.json` so Steps 3A–6 (which read it unconditionally) see the right state for an incremental run. 
`files` carries the changed subset (drives Step 3A AST + Step 3B0 cache check on only what changed); `all_files` carries the full corpus for any step that needs corpus-wide context: + +```bash +$(cat graphify-out/.graphify_python) -c " +import json +from pathlib import Path +r = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\")) +Path('graphify-out/.graphify_detect.json').write_text(json.dumps({ + 'files': r.get('new_files', {}), + 'all_files': r.get('files', {}), + 'total_files': r.get('new_total', 0), + 'total_words': r.get('total_words', 0), + 'skipped_sensitive': r.get('skipped_sensitive', []), + 'needs_graph': True, +}, ensure_ascii=False), encoding=\"utf-8\") +" +``` + +If new files exist, first check whether all changed files are code files: + +```bash +$(cat graphify-out/.graphify_python) -c " +import json +from pathlib import Path + +result = json.loads(open('graphify-out/.graphify_incremental.json', encoding='utf-8').read()) if Path('graphify-out/.graphify_incremental.json').exists() else {} +code_exts = {'.py','.ts','.js','.go','.rs','.java','.cpp','.c','.rb','.swift','.kt','.cs','.scala','.php','.cc','.cxx','.hpp','.h','.kts','.lua','.toc','.f','.F','.f90','.F90','.f95','.F95','.f03','.F03','.f08','.F08'} +new_files = result.get('new_files', {}) +all_changed = [f for files in new_files.values() for f in files] +code_only = all(Path(f).suffix.lower() in code_exts for f in all_changed) +print('code_only:', code_only) +" +``` + +If `code_only` is True: print `[graphify update] Code-only changes detected - skipping semantic extraction (no LLM needed)`, run only Step 3A (AST) on the changed files, skip Step 3B entirely (no subagents), then go straight to merge and Steps 4–8. 
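
The check above relies on `all(...)`, which returns `True` for an empty list, so its result only means something once at least one file is known to have changed. A minimal sketch with that guard made explicit follows. The `is_code_only` helper, the shortened extension set, the `code` key, and the sample paths are illustrative assumptions, not part of graphify:

```python
from pathlib import Path

# Abbreviated for the example; the check above uses a longer extension list.
CODE_EXTS = {".py", ".ts", ".js", ".go", ".rs"}


def is_code_only(new_files):
    """True only if at least one file changed and every changed file is code."""
    changed = [f for files in new_files.values() for f in files]
    if not changed:
        return False  # all([]) would be True, which is misleading here
    return all(Path(f).suffix.lower() in CODE_EXTS for f in changed)


print(is_code_only({"code": ["src/app.py", "cmd/main.go"]}))
print(is_code_only({"code": ["src/app.py"], "document": ["README.md"]}))
print(is_code_only({}))
```

Only the first call qualifies for the AST-only path; a single changed doc, or no changes at all, does not.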
+ +If `code_only` is False (any changed file is a doc/paper/image/video): **first, if any changed file is in `new_files['video']`, run `references/transcribe.md` (Step 2.5) on those files, then rewrite `.graphify_detect.json` to move the resulting transcript paths into `files['document']` and drop `files['video']`** — otherwise raw `.mp4/.mp3` paths are fed to semantic subagents as unreadable media (#1392). Then run the full Steps 3A–3C pipeline as normal. + + +If no new files exist (only deletions), create an empty extraction so the merge step can prune: + +```bash +if [ ! -f graphify-out/.graphify_extract.json ]; then + echo '[graphify update] Only deletions -- creating empty extraction for merge.' + $(cat graphify-out/.graphify_python) -c " +import json +from pathlib import Path +Path('graphify-out/.graphify_extract.json').write_text(json.dumps({'nodes':[],'edges':[],'hyperedges':[],'input_tokens':0,'output_tokens':0}), encoding='utf-8') +" +fi +``` + + +Then: + +```bash +$(cat graphify-out/.graphify_python) -c " +import json +from pathlib import Path +from graphify.build import build_merge +from graphify.detect import save_manifest + +# Load new extraction and incremental state +new_extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\")) +incremental = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\")) +deleted = list(incremental.get('deleted_files', [])) +# prune_sources is ONLY for genuinely DELETED files. Changed/re-extracted files are +# handled by build_merge's replace-on-re-extract (#1344): every source_file in +# new_chunks is dropped from the base before merge, so old/stale nodes don't survive. +# Do NOT add `changed` here: with root= passed, prune_set relativizes to the same base +# as the freshly merged nodes and would DELETE the re-extracted content (#1178 is moot +# now that replace — not the dedup pass — reconciles changed files). 
+prune = list(deleted) or None + +# Use build_merge() — reads graph.json directly without NetworkX round-trip +# so edge direction (calls, implements, imports) is always preserved (#801). +# Pass root= so prune_sources (absolute paths from detect_incremental) are +# relativized to match the graph's relative source_file values; without it +# nothing is pruned and stale nodes accumulate on every update (#1361). +# directed=IS_DIRECTED: replace IS_DIRECTED with True if --directed was given, else +# False. Without it a --directed --update silently rebuilds undirected and collapses +# reciprocal A<->B edges (#1392). +G = build_merge( + [new_extraction], + graph_path='graphify-out/graph.json', + prune_sources=prune, + root='INPUT_PATH', + directed=IS_DIRECTED, +) +print(f'[graphify update] Merged: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges') + +# Write merged result back to .graphify_extract.json so Step 4 sees the full graph +merged_out = { + 'nodes': [{'id': n, **d} for n, d in G.nodes(data=True)], + 'edges': [ + # Explicit source/target last so they win over any stale attrs in d. + {**{k: val for k, val in d.items() if k not in ('_src', '_tgt', 'source', 'target')}, + 'source': d.get('_src', u), 'target': d.get('_tgt', v)} + for u, v, d in G.edges(data=True) + ], + # G.graph["hyperedges"] holds hyperedges from both existing graph.json + # and new_extraction (build_merge combines them). Falling back to + # new_extraction only would silently drop prior-run hyperedges (#801). 
+ 'hyperedges': list(G.graph.get('hyperedges', [])), + 'input_tokens': new_extraction.get('input_tokens', 0), + 'output_tokens': new_extraction.get('output_tokens', 0), +} +Path('graphify-out/.graphify_extract.json').write_text(json.dumps(merged_out, ensure_ascii=False), encoding=\"utf-8\") +print(f'[graphify update] Merged extraction written ({len(merged_out[\"nodes\"])} nodes, {len(merged_out[\"edges\"])} edges)') + +# Save manifest so next --update diffs against today's state, not the +# prior run's baseline (prevents ghost-node reports on subsequent updates). +# root= matches the build_merge call above so the manifest keys stay relative to +# the scan root — portable across clones/machines, so --update keeps matching +# cached files instead of missing every one after a move (#1417). +save_manifest(incremental['files'], root='INPUT_PATH') +print('[graphify update] Manifest saved.') +" +``` + +Then run Steps 4–8 on the merged graph as normal. + +After Step 4, show the graph diff: + +```bash +$(cat graphify-out/.graphify_python) -c " +import json +from graphify.analyze import graph_diff +from graphify.build import build_from_json +from networkx.readwrite import json_graph +import networkx as nx +from pathlib import Path + +# Load old graph (before update) from backup written before merge +old_data = json.loads(Path('graphify-out/.graphify_old.json').read_text(encoding=\"utf-8\")) if Path('graphify-out/.graphify_old.json').exists() else None +new_extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\")) +G_new = build_from_json(new_extract, directed=IS_DIRECTED) + +if old_data: + G_old = json_graph.node_link_graph(old_data, edges='links') + diff = graph_diff(G_old, G_new) + print(diff['summary']) + if diff['new_nodes']: + print('New nodes:', ', '.join(n['label'] for n in diff['new_nodes'][:5])) + if diff['new_edges']: + print('New edges:', len(diff['new_edges'])) +" +``` + +Before the merge step, save the old graph: `cp 
graphify-out/graph.json graphify-out/.graphify_old.json` +Clean up after: `rm -f graphify-out/.graphify_old.json` + +--- + +## For --cluster-only + +Skip Steps 1–3. Re-run clustering on the existing graph: + +```bash +graphify cluster-only . +``` + +`graphify cluster-only .` is **self-contained**: it re-clusters, names communities, and regenerates `GRAPH_REPORT.md`, `graph.json`, and `graph.html` from the existing graph. **Do not re-run Steps 5–9** — they read intermediate files (`.graphify_extract.json`, `.graphify_detect.json`, `.graphify_analysis.json`) that a prior build's cleanup (Step 9) already deleted, so they raise `FileNotFoundError` (#1392). When it finishes, present the refreshed `GRAPH_REPORT.md` summary as usual.
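
To confirm that the intermediate files are gone before deciding whether a later step can run, a small existence check is enough. This is a sketch under stated assumptions: `missing_intermediates` is a hypothetical helper, not a graphify function, and the file names are the three listed above. It runs against a temporary directory here so it has no effect on a real `graphify-out/`:

```python
import tempfile
from pathlib import Path

# The intermediate files named in the paragraph above.
INTERMEDIATES = [
    ".graphify_extract.json",
    ".graphify_detect.json",
    ".graphify_analysis.json",
]


def missing_intermediates(out_dir):
    """Return the intermediate file names that do not exist in `out_dir`."""
    out = Path(out_dir)
    return [name for name in INTERMEDIATES if not (out / name).exists()]


# Demo: a directory where only the detect file survived.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".graphify_detect.json").write_text("{}", encoding="utf-8")
    missing = missing_intermediates(tmp)
    print("missing:", missing)
```

Pointing the helper at `graphify-out` after a completed build should list all three names, which matches the `FileNotFoundError` described above.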