chore(graphify): update skill to v0.8.45

Bump 0.8.13 -> 0.8.45. Extract the SKILL.md monolith (~530 lines) into
references/ for progressive disclosure: github-and-merge, transcribe,
extraction-spec, exports, update, query, add-watch, hooks. SKILL.md now
points to each reference and loads it only on the path that needs it.

Inline fixes carried by the new version: empty-extraction guard before
any write (#1392), shrink-guard ordering so GRAPH_REPORT/analysis never
describe a graph.json that was refused (#479), root= relativization for
build/manifest parity across clones (#1361/#1417), stale-cache cleanup
and code-only semantic pre-write (#1392), edge-direction preserving
merge (#801). Adds FalkorDB export (--falkordb/--falkordb-push) and
rewrites the frontmatter description (drops the obsolete trigger: field).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0169vjUD1sP9Nx4ZiCa8wvAw
This commit is contained in:
Bastien Chanot 2026-06-24 14:22:14 +02:00
parent 6516b85f0f
commit ed5b54e87e
10 changed files with 906 additions and 532 deletions

View File

@ -1 +1 @@
0.8.13
0.8.45

View File

@ -1,7 +1,6 @@
---
name: graphify
description: "any input (code, docs, papers, images, videos) to knowledge graph. Use when user asks any question about a codebase, documents, or project content - especially if graphify-out/ exists, treat the question as a /graphify query."
trigger: /graphify
description: "Use for any question about a codebase, its architecture, file relationships, or project content — especially when graphify-out/ exists, where the question should be treated as a graphify query first. Turns any input (code, docs, papers, images, videos) into a persistent knowledge graph with god nodes, community detection, and query/path/explain tools."
---
# /graphify
@ -27,6 +26,8 @@ Turn any folder of files into a navigable knowledge graph with community detecti
/graphify <path> --graphml # export graph.graphml (Gephi, yEd)
/graphify <path> --neo4j # generate graphify-out/cypher.txt for Neo4j
/graphify <path> --neo4j-push bolt://localhost:7687 # push directly to Neo4j
/graphify <path> --falkordb # generate graphify-out/cypher.txt for FalkorDB
/graphify <path> --falkordb-push falkordb://localhost:6379 # push directly to FalkorDB
/graphify <path> --mcp # start MCP stdio server for agent access
/graphify <path> --watch # watch folder, auto-rebuild on code changes (no LLM needed)
/graphify <path> --wiki # build agent-crawlable wiki (index.md + one article per community)
@ -57,48 +58,9 @@ If the path argument starts with `https://github.com/` or `http://github.com/`,
Follow these steps in order. Do not skip steps.
### Step 0 - Clone GitHub repo(s) (only if a GitHub URL was given)
### Step 0 - GitHub repos and multi-path merge (only if a URL or several paths)
**Single repo:**
```bash
LOCAL_PATH=$(graphify clone <github-url> [--branch <branch>])
# Use LOCAL_PATH as the target for all subsequent steps
```
**Multiple repos (cross-repo graph):**
```bash
# Clone each repo, run the full pipeline on each, then merge
graphify clone <url1> # → ~/.graphify/repos/<owner1>/<repo1>
graphify clone <url2> # → ~/.graphify/repos/<owner2>/<repo2>
# Run /graphify on each local path to produce their graph.json files
# Then merge:
graphify merge-graphs \
~/.graphify/repos/<owner1>/<repo1>/graphify-out/graph.json \
~/.graphify/repos/<owner2>/<repo2>/graphify-out/graph.json \
--out graphify-out/cross-repo-graph.json
```
Graphify clones into `~/.graphify/repos/<owner>/<repo>` and reuses existing clones on repeat runs. Each node in the merged graph carries a `repo` attribute so you can filter by origin.
**Multiple local subfolders (monorepo or multi-service layout):**
The skill pipeline writes all intermediate and final outputs to `graphify-out/` in the current working directory. Running the skill on each subfolder separately will clobber the same output dir. Instead, use the CLI directly for each subfolder — it places `graphify-out/` *inside* the scanned path:
```bash
graphify extract ./core/ # → ./core/graphify-out/graph.json
graphify extract ./service/ # → ./service/graphify-out/graph.json
graphify extract ./platform/ # → ./platform/graphify-out/graph.json
# Add --backend gemini|kimi|openai|deepseek|claude-cli depending on which API key you have set
# Then merge at the project root:
graphify merge-graphs \
./core/graphify-out/graph.json \
./service/graphify-out/graph.json \
./platform/graphify-out/graph.json \
--out graphify-out/graph.json
```
Once `graphify-out/graph.json` exists, the fast path above takes over: any codebase question runs `graphify query` directly on the merged graph — no re-extraction, no size gate.
Only when the path is one or more `https://github.com/...` URLs, or several local subfolders to merge. See `references/github-and-merge.md` for the clone, cross-repo merge, and monorepo flow, then continue with the resolved local path. A plain local path skips this step.
### Step 1 - Ensure graphify is installed
@ -179,50 +141,9 @@ Then act on it:
- Otherwise rank by count, show the top 5 with file counts, then ask which subfolder to run on. Wait for the user's answer before proceeding.
- Otherwise: proceed directly to Step 2.5 if video files were detected, or Step 3 if not.
### Step 2.5 - Transcribe video / audio files (only if video files detected)
### Step 2.5 - Video and audio (only if video files detected)
Skip this step entirely if `detect` returned zero `video` files.
Video and audio files cannot be read directly. Transcribe them to text first, then treat the transcripts as doc files in Step 3.
**Strategy:** Read the god nodes from `graphify-out/.graphify_detect.json` (or the analysis file if it exists from a previous run). You are already a language model — write a one-sentence domain hint yourself from those labels. Then pass it to Whisper as the initial prompt. No separate API call needed.
**However**, if the corpus has *only* video files and no other docs/code, use the generic fallback prompt: `"Use proper punctuation and paragraph breaks."`
**Step 1 - Write the Whisper prompt yourself.**
Read the top god node labels from detect output or analysis, then compose a short domain hint sentence, for example:
- Labels: `transformer, attention, encoder, decoder``"Machine learning research on transformer architectures and attention mechanisms. Use proper punctuation and paragraph breaks."`
- Labels: `kubernetes, deployment, pod, helm``"DevOps discussion about Kubernetes deployments and Helm charts. Use proper punctuation and paragraph breaks."`
Set it as `WHISPER_PROMPT` to use in the next command.
**Step 2 - Transcribe:**
```bash
GRAPHIFY_WHISPER_MODEL=base # or whatever --whisper-model the user passed
$(cat graphify-out/.graphify_python) -c "
import json, os
from pathlib import Path
from graphify.transcribe import transcribe_all
detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
video_files = detect.get('files', {}).get('video', [])
prompt = os.environ.get('GRAPHIFY_WHISPER_PROMPT', 'Use proper punctuation and paragraph breaks.')
transcript_paths = transcribe_all(video_files, initial_prompt=prompt)
print(json.dumps(transcript_paths, ensure_ascii=False))
" > graphify-out/.graphify_transcripts.json
```
After transcription:
- Read the transcript paths from `graphify-out/.graphify_transcripts.json`
- Add them to the docs list before dispatching semantic subagents in Step 3B
- Print how many transcripts were created: `Transcribed N video file(s) -> treating as docs`
- If transcription fails for a file, print a warning and continue with the rest
**Whisper model:** Default is `base`. If the user passed `--whisper-model <name>`, set `GRAPHIFY_WHISPER_MODEL=<name>` in the environment before running the command above.
Skip this step entirely if `detect` returned zero `video` files. When the corpus has video or audio, see `references/transcribe.md` to transcribe them to text first, then treat the transcripts as doc files in Step 3.
### Step 3 - Extract entities and relationships
@ -269,7 +190,15 @@ else:
#### Part B - Semantic extraction (parallel subagents)
**Fast path:** If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do.
**Fast path:** If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do. **First write an empty semantic file** so Part C's merge has its input (it reads `.graphify_semantic.json` unconditionally; without this a code-only run hits `FileNotFoundError`):
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
Path('graphify-out/.graphify_semantic.json').write_text(json.dumps({'nodes':[],'edges':[],'hyperedges':[],'input_tokens':0,'output_tokens':0}), encoding='utf-8')
"
```
**MANDATORY: You MUST use the Agent tool here. Reading files yourself one-by-one is forbidden - it is 5-10x slower. If you do not use the Agent tool you are doing this wrong.**
@ -290,12 +219,19 @@ from graphify.cache import check_semantic_cache
from pathlib import Path
detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
all_files = [f for files in detect['files'].values() for f in files]
# Only content files go to semantic extraction. Code is already covered structurally
# by the AST pass (Part A); flattening every category here makes subagents re-read
# every source file (#1392). Video is transcribed to a document in Step 2.5 first.
all_files = [f for cat in ('document', 'paper', 'image') for f in detect['files'].get(cat, [])]
cached_nodes, cached_edges, cached_hyperedges, uncached = check_semantic_cache(all_files)
# Always (re)write the cache file: write hits, else DELETE any leftover from a prior
# run so Part C never merges a stale .graphify_cached.json (#1392).
if cached_nodes or cached_edges or cached_hyperedges:
Path('graphify-out/.graphify_cached.json').write_text(json.dumps({'nodes': cached_nodes, 'edges': cached_edges, 'hyperedges': cached_hyperedges}, ensure_ascii=False), encoding=\"utf-8\")
else:
Path('graphify-out/.graphify_cached.json').unlink(missing_ok=True)
Path('graphify-out/.graphify_uncached.txt').write_text('\n'.join(uncached), encoding=\"utf-8\")
print(f'Cache: {len(all_files)-len(uncached)} files hit, {len(uncached)} files need extraction')
"
@ -325,76 +261,13 @@ Each subagent receives this exact prompt (substitute FILE_LIST, CHUNK_NUM, TOTAL
CHUNK_PATH must be an **absolute** path — derive it before dispatching:
```bash
PROJECT_ROOT=$(cat graphify-out/.graphify_root)
PROJECT_ROOT=$(pwd) # cwd — where Part C globs graphify-out/ (NOT .graphify_root/scan dir, #1392)
# Then for chunk N: CHUNK_PATH="${PROJECT_ROOT}/graphify-out/.graphify_chunk_0N.json"
```
Subagent prompt template:
```
You are a graphify extraction subagent. Read the files listed and extract a knowledge graph fragment.
Output ONLY valid JSON matching the schema below - no explanation, no markdown fences, no preamble.
Files (chunk CHUNK_NUM of TOTAL_CHUNKS):
FILE_LIST
Rules:
- EXTRACTED: relationship explicit in source (import, call, citation, "see §3.2")
- INFERRED: reasonable inference (shared data structure, implied dependency)
- AMBIGUOUS: uncertain - flag for review, do not omit
Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns).
Do not re-extract imports - AST already has those.
Doc/paper files: extract named concepts, entities, citations. For rationale (WHY decisions were made, trade-offs, design intent): store as a `rationale` attribute on the relevant concept node — do NOT create a separate rationale node or fragment node. Only create a node for something that is itself a named entity or concept. Use `file_type:"rationale"` for concept-like nodes (ideas, principles, mechanisms, design patterns). `file_type` MUST be one of exactly these six values: `code`, `document`, `paper`, `image`, `rationale`, `concept`. Any other value is invalid and will be rejected.
Code files: when adding `calls` edges, source MUST be the caller (the function/class doing the calling), target MUST be the callee. Never reverse this direction.
Image files: use vision to understand what the image IS - do not just OCR.
UI screenshot: layout patterns, design decisions, key elements, purpose.
Chart: metric, trend/insight, data source.
Tweet/post: claim as node, author, concepts mentioned.
Diagram: components and connections.
Research figure: what it demonstrates, method, result.
Handwritten/whiteboard: ideas and arrows, mark uncertain readings AMBIGUOUS.
DEEP_MODE (if --mode deep was given): be aggressive with INFERRED edges - indirect deps,
shared assumptions, latent couplings. Mark uncertain ones AMBIGUOUS instead of omitting.
Semantic similarity: if two concepts in this chunk solve the same problem or represent the same idea without any structural link (no import, no call, no citation), add a `semantically_similar_to` edge marked INFERRED with a confidence_score reflecting how similar they are (0.6-0.95). Examples:
- Two functions that both validate user input but never call each other
- A class in code and a concept in a paper that describe the same algorithm
- Two error types that handle the same failure mode differently
Only add these when the similarity is genuinely non-obvious and cross-cutting. Do not add them for trivially similar things.
Hyperedges: if 3 or more nodes clearly participate together in a shared concept, flow, or pattern that is not captured by pairwise edges alone, add a hyperedge to a top-level `hyperedges` array. Examples:
- All classes that implement a common protocol or interface
- All functions in an authentication flow (even if they don't all call each other)
- All concepts from a paper section that form one coherent idea
Use sparingly — only when the group relationship adds information beyond the pairwise edges. Maximum 3 hyperedges per chunk.
If a file has YAML frontmatter (--- ... ---), copy source_url, captured_at, author,
contributor onto every node from that file.
confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
- EXTRACTED edges: confidence_score = 1.0 always
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).
Models follow discrete rubrics better than continuous ranges; the bimodal
distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
range guidance is being collapsed to a binary. If no value above fits, mark
the edge AMBIGUOUS rather than picking 0.4 or below.
- AMBIGUOUS edges: 0.1-0.3
Node ID format: lowercase, only `[a-z0-9_]`, no dots or slashes. Format: `{stem}_{entity}` where stem is `{parent_dir}_{filename_without_ext}` (the **immediate** parent directory name + the filename stem, both lowercased with non-alphanumeric chars replaced by `_`) and entity is the symbol name similarly normalized. Only one level of parent is used — not the full path. Examples: `src/auth/session.py` + `ValidateToken``auth_session_validatetoken`; `lib/utils/helpers.py` + `parse_url``utils_helpers_parse_url`; `tests/test_foo.py` + `_helper``tests_test_foo_helper`. Top-level files (no parent dir, e.g. `setup.py`) use just the filename stem: `setup_my_func`. This must match the ID the AST extractor generates — using just the filename (e.g., `session_validatetoken`) or the full path (e.g., `src_auth_session_validatetoken`) will create orphan ghost-duplicate nodes. If you are re-extracting a project that had ghost duplicates under the old format, the user should run `graphify extract --force` to rebuild cleanly. CRITICAL: never append chunk numbers, sequence numbers, or any suffix to an ID (no `_c1`, `_c2`, `_chunk2`, etc.). IDs must be deterministic from the label alone — the same entity must always produce the same ID regardless of which chunk processes it.
Generate the extraction JSON matching this schema exactly:
{"nodes":[{"id":"session_validatetoken","label":"Human Readable Name","file_type":"code|document|paper|image|rationale|concept","source_file":"relative/path","source_location":null,"source_url":null,"captured_at":null,"author":null,"contributor":null}],"edges":[{"source":"node_id","target":"node_id","relation":"calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for","confidence":"EXTRACTED|INFERRED|AMBIGUOUS","confidence_score":1.0,"source_file":"relative/path","source_location":null,"weight":1.0}],"hyperedges":[{"id":"snake_case_id","label":"Human Readable Label","nodes":["node_id1","node_id2","node_id3"],"relation":"participate_in|implement|form","confidence":"EXTRACTED|INFERRED","confidence_score":0.75,"source_file":"relative/path"}],"input_tokens":0,"output_tokens":0}
Then write the JSON to disk using the Write tool at this exact absolute path (no relative paths — Write resolves relative paths against an undefined cwd and the file will be silently lost):
CHUNK_PATH
```
See `references/extraction-spec.md` for the exact subagent prompt (JSON schema, node-ID rules, confidence rubric, frontmatter, hyperedge, and vision rules). Load it only here, only when at least one chunk holds a doc, paper, or image; a pure-code corpus has skipped Part B and never reads it. Pass each subagent that prompt verbatim with FILE_LIST, CHUNK_NUM, TOTAL_CHUNKS, DEEP_MODE, and CHUNK_PATH substituted, and have it write the result to CHUNK_PATH.
**Step B3 - Collect, cache, and merge**
@ -511,7 +384,7 @@ print(f'Merged: {total} nodes, {edges} edges ({len(ast[\"nodes\"])} AST + {len(s
### Step 4 - Build graph, cluster, analyze, generate outputs
**Before starting:** note whether `--directed` was given. If so, pass `directed=True` to `build_from_json()` in the code block below. This builds a `DiGraph` that preserves edge direction (source→target) instead of the default undirected `Graph`.
**Before starting:** the code blocks below pass `directed=IS_DIRECTED` to `build_from_json()`. Replace `IS_DIRECTED` with `True` if `--directed` was given (builds a `DiGraph` preserving edge direction source→target), otherwise `False` (the default undirected `Graph`). Substitute it the same way you substitute `INPUT_PATH` — do not leave the literal `IS_DIRECTED` in the code.
```bash
mkdir -p graphify-out
@ -527,7 +400,15 @@ from pathlib import Path
extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
G = build_from_json(extraction)
# root= mirrors the --update runbook (#1361): relativize source_file to the same
# base so the full build and incremental --update never drift apart on re-extract.
G = build_from_json(extraction, root='INPUT_PATH', directed=IS_DIRECTED)
# Guard BEFORE any write: an empty extraction must not clobber a good graph.json /
# GRAPH_REPORT.md / analysis sidecar. Check immediately after build (#1392).
if G.number_of_nodes() == 0:
print('ERROR: Graph is empty - extraction produced no nodes.')
print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
raise SystemExit(1)
communities = cluster(G)
cohesion = score_all(G, communities)
tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
@ -537,10 +418,17 @@ labels = {cid: 'Community ' + str(cid) for cid in communities}
# Placeholder questions - regenerated with real labels in Step 5
questions = suggest_questions(G, communities, labels)
# Export FIRST and honor the #479 shrink-guard: to_json returns False (writing
# nothing) when the new graph is smaller than the existing graph.json. Only write
# GRAPH_REPORT.md + the analysis sidecar when the graph was actually written, so
# they never describe a graph that graph.json doesn't contain (#1392).
wrote = to_json(G, communities, 'graphify-out/graph.json')
if not wrote:
print('ERROR: refused to shrink graphify-out/graph.json (existing graph has more nodes; #479).')
print('If this shrink is intentional (you deleted files), re-run a full build with --force.')
raise SystemExit(1)
report = generate(G, communities, cohesion, labels, gods, surprises, detection, tokens, 'INPUT_PATH', suggested_questions=questions)
Path('graphify-out/GRAPH_REPORT.md').write_text(report, encoding=\"utf-8\")
to_json(G, communities, 'graphify-out/graph.json')
analysis = {
'communities': {str(k): v for k, v in communities.items()},
'cohesion': {str(k): v for k, v in cohesion.items()},
@ -549,10 +437,6 @@ analysis = {
'questions': questions,
}
Path('graphify-out/.graphify_analysis.json').write_text(json.dumps(analysis, indent=2, ensure_ascii=False), encoding=\"utf-8\")
if G.number_of_nodes() == 0:
print('ERROR: Graph is empty - extraction produced no nodes.')
print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
raise SystemExit(1)
print(f'Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges, {len(communities)} communities')
"
```
@ -580,7 +464,8 @@ extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(en
detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
analysis = json.loads(Path('graphify-out/.graphify_analysis.json').read_text(encoding=\"utf-8\"))
G = build_from_json(extraction)
# root= as in Step 4 / the --update runbook (#1361) — same base for node-key parity.
G = build_from_json(extraction, root='INPUT_PATH', directed=IS_DIRECTED)
communities = {int(k): v for k, v in analysis['communities'].items()}
cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
@ -621,73 +506,9 @@ graphify export html # auto-aggregates to community view if graph > 5000 nodes
# or: graphify export html --no-viz
```
### Step 6b - Wiki (only if --wiki flag)
### Steps 6b-8 - Wiki, Neo4j, FalkorDB, SVG, GraphML, MCP, benchmark (only on their flags)
**Only run this step if `--wiki` was explicitly given in the original command.**
Run this before Step 9 (cleanup) so `.graphify_labels.json` is still available.
```bash
graphify export wiki
```
### Step 7 - Neo4j export (only if --neo4j or --neo4j-push flag)
**If `--neo4j`** - generate a Cypher file for manual import:
```bash
graphify export neo4j
```
**If `--neo4j-push <uri>`** - push directly to a running Neo4j instance. Ask the user for credentials if not provided:
```bash
graphify export neo4j --push bolt://localhost:7687 --user neo4j --password PASSWORD
```
Default URI is `bolt://localhost:7687`, default user is `neo4j`. Uses MERGE - safe to re-run without creating duplicates.
### Step 7b - SVG export (only if --svg flag)
```bash
graphify export svg
```
### Step 7c - GraphML export (only if --graphml flag)
```bash
graphify export graphml
```
### Step 7d - MCP server (only if --mcp flag)
```bash
python3 -m graphify.serve graphify-out/graph.json
```
This starts a stdio MCP server that exposes tools: `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live.
To configure in Claude Desktop, add to `claude_desktop_config.json`:
```json
{
"mcpServers": {
"graphify": {
"command": "python3",
"args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"]
}
}
}
```
### Step 8 - Token reduction benchmark (only if total_words > 5000)
If `total_words` from `graphify-out/.graphify_detect.json` is greater than 5,000, run:
```bash
graphify benchmark
```
Print the output directly in chat. If `total_words <= 5000`, skip silently - the graph value is structural clarity, not token compression, for small corpora.
These run only when their flag is present (`--wiki`, `--neo4j`/`--neo4j-push`, `--falkordb`/`--falkordb-push`, `--svg`, `--graphml`, `--mcp`) or, for the token-reduction benchmark, when `total_words` exceeds 5,000. A default run with no export flags skips all of them. See `references/exports.md` for each one. Run any `--wiki` export before Step 9 cleanup so `.graphify_labels.json` is still available.
---
@ -704,7 +525,10 @@ from graphify.detect import save_manifest
detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
# In --update mode, 'all_files' carries the full corpus; 'files' is the changed
# subset. Full-rebuild mode populates only 'files', so the fallback handles that.
save_manifest(detect.get('all_files') or detect['files'])
# root= relativizes the manifest keys to the scan root (same base as the build),
# so the on-disk manifest is portable across clones/machines and a later --update
# matches cached files instead of missing every one (#1417).
save_manifest(detect.get('all_files') or detect['files'], root='INPUT_PATH')
# Update cumulative cost tracker
extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
@ -730,10 +554,13 @@ cost_path.write_text(json.dumps(cost, indent=2, ensure_ascii=False), encoding=\"
print(f'This run: {input_tok:,} input tokens, {output_tok:,} output tokens')
print(f'All time: {cost[\"total_input_tokens\"]:,} input, {cost[\"total_output_tokens\"]:,} output ({len(cost[\"runs\"])} runs)')
"
rm -f graphify-out/.graphify_detect.json graphify-out/.graphify_extract.json graphify-out/.graphify_ast.json graphify-out/.graphify_semantic.json graphify-out/.graphify_analysis.json graphify-out/.graphify_chunk_*.json
rm -f graphify-out/.graphify_detect.json graphify-out/.graphify_extract.json graphify-out/.graphify_ast.json graphify-out/.graphify_semantic.json graphify-out/.graphify_analysis.json
find graphify-out -maxdepth 1 -name '.graphify_chunk_*.json' -delete 2>/dev/null
rm -f graphify-out/.needs_update 2>/dev/null || true
```
Replace INPUT_PATH with the actual path (same value used in Steps 4-5) so the manifest is relativized to the scan root.
Tell the user (omit the obsidian line unless --obsidian was given):
```
Graph complete. Outputs in PATH_TO_DIR/graphify-out/
@ -783,325 +610,33 @@ if [ ! -f graphify-out/.graphify_python ]; then
fi
```
## For --update (incremental re-extraction)
## For --update and --cluster-only
Use when you've added or modified files since the last run. Only re-extracts changed files - saves tokens and time.
```bash
$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.detect import detect_incremental, save_manifest
from pathlib import Path
result = detect_incremental(Path('INPUT_PATH'))
new_total = result.get('new_total', 0)
print(json.dumps(result, indent=2, ensure_ascii=False))
Path('graphify-out/.graphify_incremental.json').write_text(json.dumps(result, ensure_ascii=False), encoding=\"utf-8\")
deleted = list(result.get('deleted_files', []))
if new_total == 0 and not deleted:
print('No files changed since last run. Nothing to update.')
raise SystemExit(0)
if deleted:
print(f'{len(deleted)} deleted file(s) to prune.')
if new_total > 0:
print(f'{new_total} new/changed file(s) to re-extract.')
"
```
Then populate `.graphify_detect.json` so Steps 3A6 (which read it unconditionally) see the right state for an incremental run. `files` carries the changed subset (drives Step 3A AST + Step 3B0 cache check on only what changed); `all_files` carries the full corpus for any step that needs corpus-wide context:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
r = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\"))
Path('graphify-out/.graphify_detect.json').write_text(json.dumps({
'files': r.get('new_files', {}),
'all_files': r.get('files', {}),
'total_files': r.get('new_total', 0),
'total_words': r.get('total_words', 0),
'skipped_sensitive': r.get('skipped_sensitive', []),
'needs_graph': True,
}, ensure_ascii=False), encoding=\"utf-8\")
"
```
If new files exist, first check whether all changed files are code files:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
result = json.loads(open('graphify-out/.graphify_incremental.json', encoding='utf-8').read()) if Path('graphify-out/.graphify_incremental.json').exists() else {}
code_exts = {'.py','.ts','.js','.go','.rs','.java','.cpp','.c','.rb','.swift','.kt','.cs','.scala','.php','.cc','.cxx','.hpp','.h','.kts','.lua','.toc','.f','.F','.f90','.F90','.f95','.F95','.f03','.F03','.f08','.F08'}
new_files = result.get('new_files', {})
all_changed = [f for files in new_files.values() for f in files]
code_only = all(Path(f).suffix.lower() in code_exts for f in all_changed)
print('code_only:', code_only)
"
```
If `code_only` is True: print `[graphify update] Code-only changes detected - skipping semantic extraction (no LLM needed)`, run only Step 3A (AST) on the changed files, skip Step 3B entirely (no subagents), then go straight to merge and Steps 48.
If `code_only` is False (any changed file is a doc/paper/image): run the full Steps 3A3C pipeline as normal.
If no new files exist (only deletions), create an empty extraction so the merge step can prune:
```bash
if [ ! -f graphify-out/.graphify_extract.json ]; then
echo '[graphify update] Only deletions -- creating empty extraction for merge.'
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
Path('graphify-out/.graphify_extract.json').write_text(json.dumps({'nodes':[],'edges':[],'hyperedges':[],'input_tokens':0,'output_tokens':0}), encoding='utf-8')
"
fi
```
Then:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
from graphify.build import build_merge
from graphify.detect import save_manifest
# Load new extraction and incremental state
new_extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
incremental = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\"))
deleted = list(incremental.get('deleted_files', []))
# Use build_merge() — reads graph.json directly without NetworkX round-trip
# so edge direction (calls, implements, imports) is always preserved (#801).
G = build_merge(
[new_extraction],
graph_path='graphify-out/graph.json',
prune_sources=deleted or None,
)
print(f'[graphify update] Merged: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges')
# Write merged result back to .graphify_extract.json so Step 4 sees the full graph
merged_out = {
'nodes': [{'id': n, **d} for n, d in G.nodes(data=True)],
'edges': [
# Explicit source/target last so they win over any stale attrs in d.
{**{k: val for k, val in d.items() if k not in ('_src', '_tgt', 'source', 'target')},
'source': d.get('_src', u), 'target': d.get('_tgt', v)}
for u, v, d in G.edges(data=True)
],
# G.graph["hyperedges"] holds hyperedges from both existing graph.json
# and new_extraction (build_merge combines them). Falling back to
# new_extraction only would silently drop prior-run hyperedges (#801).
'hyperedges': list(G.graph.get('hyperedges', [])),
'input_tokens': new_extraction.get('input_tokens', 0),
'output_tokens': new_extraction.get('output_tokens', 0),
}
Path('graphify-out/.graphify_extract.json').write_text(json.dumps(merged_out, ensure_ascii=False), encoding=\"utf-8\")
print(f'[graphify update] Merged extraction written ({len(merged_out[\"nodes\"])} nodes, {len(merged_out[\"edges\"])} edges)')
# Save manifest so next --update diffs against today's state, not the
# prior run's baseline (prevents ghost-node reports on subsequent updates).
save_manifest(incremental['files'])
print('[graphify update] Manifest saved.')
"
```
Then run Steps 48 on the merged graph as normal.
After Step 4, show the graph diff:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from graphify.analyze import graph_diff
from graphify.build import build_from_json
from networkx.readwrite import json_graph
import networkx as nx
from pathlib import Path
# Load old graph (before update) from backup written before merge
old_data = json.loads(Path('graphify-out/.graphify_old.json').read_text(encoding=\"utf-8\")) if Path('graphify-out/.graphify_old.json').exists() else None
new_extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
G_new = build_from_json(new_extract)
if old_data:
G_old = json_graph.node_link_graph(old_data, edges='links')
diff = graph_diff(G_old, G_new)
print(diff['summary'])
if diff['new_nodes']:
print('New nodes:', ', '.join(n['label'] for n in diff['new_nodes'][:5]))
if diff['new_edges']:
print('New edges:', len(diff['new_edges']))
"
```
Before the merge step, save the old graph: `cp graphify-out/graph.json graphify-out/.graphify_old.json`
Clean up after: `rm -f graphify-out/.graphify_old.json`
---
## For --cluster-only
Skip Steps 13. Re-run clustering on the existing graph:
```bash
graphify cluster-only .
```
Then run Steps 59 as normal (label communities, generate viz, benchmark, clean up, report).
Both are non-default subcommands. `--update` re-extracts only new or changed files; `--cluster-only` reruns clustering on the existing graph. See `references/update.md` for both flows.
---
## For /graphify query
Two traversal modes - choose based on the question:
| Mode | Flag | Best for |
|------|------|----------|
| BFS (default) | _(none)_ | "What is X connected to?" - broad context, nearest neighbors first |
| DFS | `--dfs` | "How does X reach Y?" - trace a specific chain or dependency path |
When `graphify-out/graph.json` already exists and the user asks a question about the corpus, answer from the graph rather than rebuilding it:
```bash
graphify query "QUESTION"
# or: graphify query "QUESTION" --dfs --budget 3000
graphify query "<question>"
```
Replace `QUESTION` with the user's actual question. Answer using **only** what the graph output contains. Quote `source_location` when citing a specific fact. If the graph lacks enough information, say so - do not hallucinate edges.
After writing the answer, save it back into the graph so it improves future queries:
```bash
$(cat graphify-out/.graphify_python) -m graphify save-result --question "QUESTION" --answer "ANSWER" --type query --nodes NODE1 NODE2
```
Replace `QUESTION` with the question, `ANSWER` with your full answer text, `SOURCE_NODES` with the list of node labels you cited. This closes the feedback loop: the next `--update` will extract this Q&A as a node in the graph.
Before traversal, expand the question against the graph's own vocabulary so a wording mismatch does not collapse the answer to noise. If the `graphify query` CLI is unavailable, fall back to an inline NetworkX traversal of `graphify-out/graph.json`. Answer using only what the graph output contains, and quote `source_location` when citing a specific fact. For that vocab-expansion step, the BFS/DFS traversal modes, the `--budget` cap, the NetworkX fallback, `save-result` feedback, and the `/graphify path` and `/graphify explain` flows, see `references/query.md`.
---
## For /graphify path
## For /graphify add and --watch
Find the shortest path between two named concepts in the graph.
```bash
graphify path "NODE_A" "NODE_B"
```
Replace `NODE_A` and `NODE_B` with the actual concept names. Then explain the path in plain language - what each hop means, why it's significant.
After writing the explanation, save it back:
```bash
$(cat graphify-out/.graphify_python) -m graphify save-result --question "Path from NODE_A to NODE_B" --answer "ANSWER" --type path_query --nodes NODE_A NODE_B
```
Neither is part of the default build. When the user runs `/graphify add <url>` to fetch a URL into the corpus, or passes `--watch` to auto-rebuild on file changes, see `references/add-watch.md`.
---
## For /graphify explain
## For the commit hook and native CLAUDE.md integration
Give a plain-language explanation of a single node - everything connected to it.
```bash
graphify explain "NODE_NAME"
```
Replace `NODE_NAME` with the concept the user asked about. Then write a 3-5 sentence explanation of what this node is, what it connects to, and why those connections are significant. Use the source locations as citations.
After writing the explanation, save it back:
```bash
$(cat graphify-out/.graphify_python) -m graphify save-result --question "Explain NODE_NAME" --answer "ANSWER" --type explain --nodes NODE_NAME
```
---
## For /graphify add
Fetch a URL and add it to the corpus, then update the graph.
```bash
$(cat graphify-out/.graphify_python) -c "
import sys
from graphify.ingest import ingest
from pathlib import Path
try:
out = ingest('URL', Path('./raw'), author='AUTHOR', contributor='CONTRIBUTOR')
print(f'Saved to {out}')
except ValueError as e:
print(f'error: {e}', file=sys.stderr)
sys.exit(1)
except RuntimeError as e:
print(f'error: {e}', file=sys.stderr)
sys.exit(1)
"
```
Replace `URL` with the actual URL, `AUTHOR` with the user's name if provided, `CONTRIBUTOR` likewise. If the command exits with an error, tell the user what went wrong - do not silently continue. After a successful save, automatically run the `--update` pipeline on `./raw` to merge the new file into the existing graph.
Supported URL types (auto-detected):
- YouTube / any video URL → audio downloaded via yt-dlp, transcribed to `.txt` on next run (requires `pip install 'graphifyy[video]'`)
- Twitter/X → fetched via oEmbed, saved as `.md` with tweet text and author
- arXiv → abstract + metadata saved as `.md`
- PDF → downloaded as `.pdf`
- Images (.png/.jpg/.webp) → downloaded, Claude vision extracts on next run
- Any webpage → converted to markdown via html2text
---
## For --watch
Start a background watcher that monitors a folder and auto-updates the graph when files change.
```bash
python3 -m graphify.watch INPUT_PATH --debounce 3
```
Replace INPUT_PATH with the folder to watch. Behavior depends on what changed:
- **Code files only (.py, .ts, .go, etc.):** re-runs AST extraction + rebuild + cluster immediately, no LLM needed. `graph.json` and `GRAPH_REPORT.md` are updated automatically.
- **Docs, papers, or images:** writes a `graphify-out/needs_update` flag and prints a notification to run `/graphify --update` (LLM semantic re-extraction required).
Debounce (default 3s): waits until file activity stops before triggering, so a wave of parallel agent writes doesn't trigger a rebuild per file.
Press Ctrl+C to stop.
For agentic workflows: run `--watch` in a background terminal. Code changes from agent waves are picked up automatically between waves. If agents are also writing docs or notes, you'll need a manual `/graphify --update` after those waves.
---
## For git commit hook
Install a post-commit hook that auto-rebuilds the graph after every commit. No background process needed - triggers once per commit, works with any editor.
```bash
graphify hook install # install
graphify hook uninstall # remove
graphify hook status # check
```
After every `git commit`, the hook detects which code files changed (via `git diff HEAD~1`), re-runs AST extraction on those files, and rebuilds `graph.json` and `GRAPH_REPORT.md`. Doc/image changes are ignored by the hook - run `/graphify --update` manually for those.
If a post-commit hook already exists, graphify appends to it rather than replacing it.
---
## For native CLAUDE.md integration
Run once per project to make graphify always-on in Claude Code sessions:
```bash
graphify claude install
```
This writes a `## graphify` section to the local `CLAUDE.md` that instructs Claude to check the graph before answering codebase questions and rebuild it after code changes. No manual `/graphify` needed in future sessions.
```bash
graphify claude uninstall # remove the section
```
When the user asks to install the post-commit auto-rebuild hook or wire graphify into a project's CLAUDE.md, see `references/hooks.md`.
---

View File

@ -0,0 +1,56 @@
# graphify reference: add a URL and watch a folder
Load this when the user ran `/graphify add <url>` or passed `--watch`. Neither is part of the default build.
## For /graphify add
Fetch a URL and add it to the corpus, then update the graph.
```bash
$(cat graphify-out/.graphify_python) -c "
import sys
from graphify.ingest import ingest
from pathlib import Path
try:
out = ingest('URL', Path('./raw'), author='AUTHOR', contributor='CONTRIBUTOR')
print(f'Saved to {out}')
except ValueError as e:
print(f'error: {e}', file=sys.stderr)
sys.exit(1)
except RuntimeError as e:
print(f'error: {e}', file=sys.stderr)
sys.exit(1)
"
```
Replace `URL` with the actual URL, `AUTHOR` with the user's name if provided, `CONTRIBUTOR` likewise. If the command exits with an error, tell the user what went wrong - do not silently continue. After a successful save, automatically run the `--update` pipeline on `./raw` to merge the new file into the existing graph.
Supported URL types (auto-detected):
- YouTube / any video URL → audio downloaded via yt-dlp, transcribed to `.txt` on next run (requires `pip install 'graphifyy[video]'`)
- Twitter/X → fetched via oEmbed, saved as `.md` with tweet text and author
- arXiv → abstract + metadata saved as `.md`
- PDF → downloaded as `.pdf`
- Images (.png/.jpg/.webp) → downloaded, Claude vision extracts on next run
- Any webpage → converted to markdown via html2text
---
## For --watch
Start a background watcher that monitors a folder and auto-updates the graph when files change.
```bash
$(cat graphify-out/.graphify_python) -m graphify.watch INPUT_PATH --debounce 3
```
Replace INPUT_PATH with the folder to watch. Behavior depends on what changed:
- **Code files only (.py, .ts, .go, etc.):** re-runs AST extraction + rebuild + cluster immediately, no LLM needed. `graph.json` and `GRAPH_REPORT.md` are updated automatically.
- **Docs, papers, or images:** writes a `graphify-out/needs_update` flag and prints a notification to run `/graphify --update` (LLM semantic re-extraction required).
Debounce (default 3s): waits until file activity stops before triggering, so a wave of parallel agent writes doesn't trigger a rebuild per file.
Press Ctrl+C to stop.
For agentic workflows: run `--watch` in a background terminal. Code changes from agent waves are picked up automatically between waves. If agents are also writing docs or notes, you'll need a manual `/graphify --update` after those waves.

View File

@ -0,0 +1,87 @@
# graphify reference: extra exports and benchmark
Load this when the user passed one of the export flags (`--wiki`, `--neo4j`, `--neo4j-push`, `--falkordb`, `--falkordb-push`, `--svg`, `--graphml`, `--mcp`), or when the corpus is large enough for the token-reduction benchmark. Each step runs only for its own flag.
### Step 6b - Wiki (only if --wiki flag)
**Only run this step if `--wiki` was explicitly given in the original command.**
Run this before Step 9 (cleanup) so `.graphify_labels.json` is still available.
```bash
graphify export wiki
```
### Step 7 - Neo4j export (only if --neo4j or --neo4j-push flag)
**If `--neo4j`** - generate a Cypher file for manual import:
```bash
graphify export neo4j
```
**If `--neo4j-push <uri>`** - push directly to a running Neo4j instance. Ask the user for credentials if not provided:
```bash
graphify export neo4j --push bolt://localhost:7687 --user neo4j --password PASSWORD
```
Default URI is `bolt://localhost:7687`, default user is `neo4j`. Uses MERGE - safe to re-run without creating duplicates.
### Step 7a - FalkorDB export (only if --falkordb or --falkordb-push flag)
**If `--falkordb`** - generate a Cypher file. The statements are OpenCypher, but FalkorDB's `GRAPH.QUERY` runs one statement at a time (no bulk script import like Neo4j's `cypher-shell`), so prefer `--falkordb-push` to load a graph. Use this only when you want the portable `cypher.txt` artifact:
```bash
graphify export falkordb
```
**If `--falkordb-push <uri>`** - push directly to a running FalkorDB instance. Credentials are optional; ask the user only if the instance requires auth:
```bash
graphify export falkordb --push falkordb://localhost:6379
```
Default URI is `falkordb://localhost:6379` (the scheme is informational - `redis://` or a bare `host:port` work too), auth is optional, and the target graph defaults to `graphify`. Uses MERGE - safe to re-run without creating duplicates.
### Step 7b - SVG export (only if --svg flag)
```bash
graphify export svg
```
### Step 7c - GraphML export (only if --graphml flag)
```bash
graphify export graphml
```
### Step 7d - MCP server (only if --mcp flag)
```bash
$(cat graphify-out/.graphify_python) -m graphify.serve graphify-out/graph.json
```
This starts a stdio MCP server that exposes tools: `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live.
To configure in Claude Desktop, add to `claude_desktop_config.json`. Claude Desktop can't run `$(...)`, and under `uv tool install` the system `python3` can't import graphify — so set `command` to the **absolute interpreter path** printed by `cat graphify-out/.graphify_python`:
```json
{
"mcpServers": {
"graphify": {
"command": "<absolute path from: cat graphify-out/.graphify_python>",
"args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"]
}
}
}
```
### Step 8 - Token reduction benchmark (only if total_words > 5000)
If `total_words` from `graphify-out/.graphify_detect.json` is greater than 5,000, run:
```bash
graphify benchmark
```
Print the output directly in chat. If `total_words <= 5000`, skip silently - the graph value is structural clarity, not token compression, for small corpora.

View File

@ -0,0 +1,70 @@
# graphify reference: extraction subagent prompt
Load this in Step 3 Part B when the corpus has at least one doc, paper, or image chunk. A pure-code corpus skips Part B and never reads this file. Each semantic subagent receives the prompt below verbatim (substitute FILE_LIST, CHUNK_NUM, TOTAL_CHUNKS, DEEP_MODE, and CHUNK_PATH).
```
You are a graphify extraction subagent. Read the files listed and extract a knowledge graph fragment.
Output ONLY valid JSON matching the schema below - no explanation, no markdown fences, no preamble.
Files (chunk CHUNK_NUM of TOTAL_CHUNKS):
FILE_LIST
Rules:
- EXTRACTED: relationship explicit in source (import, call, citation, "see §3.2")
- INFERRED: reasonable inference (shared data structure, implied dependency)
- AMBIGUOUS: uncertain - flag for review, do not omit
Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns).
Do not re-extract imports - AST already has those.
Doc/paper files: extract named concepts, entities, citations. For rationale (WHY decisions were made, trade-offs, design intent): store as a `rationale` attribute on the relevant concept node — do NOT create a separate rationale node or fragment node. Only create a node for something that is itself a named entity or concept. Use `file_type:"rationale"` for concept-like nodes (ideas, principles, mechanisms, design patterns). `file_type` MUST be one of exactly these six values: `code`, `document`, `paper`, `image`, `rationale`, `concept`. Any other value is invalid and will be rejected.
Code files: when adding `calls` edges, source MUST be the caller (the function/class doing the calling), target MUST be the callee. Never reverse this direction. `calls` edges MUST stay within one language: a Python function cannot `calls` a JS/TS/Go/Rust/Java symbol and vice versa — cross-language call edges are phantom artifacts, never emit them.
Image files: use vision to understand what the image IS - do not just OCR.
UI screenshot: layout patterns, design decisions, key elements, purpose.
Chart: metric, trend/insight, data source.
Tweet/post: claim as node, author, concepts mentioned.
Diagram: components and connections.
Research figure: what it demonstrates, method, result.
Handwritten/whiteboard: ideas and arrows, mark uncertain readings AMBIGUOUS.
DEEP_MODE (if --mode deep was given): be aggressive with INFERRED edges - indirect deps,
shared assumptions, latent couplings. Mark uncertain ones AMBIGUOUS instead of omitting.
Semantic similarity: if two concepts in this chunk solve the same problem or represent the same idea without any structural link (no import, no call, no citation), add a `semantically_similar_to` edge marked INFERRED with a confidence_score reflecting how similar they are (0.6-0.95). Examples:
- Two functions that both validate user input but never call each other
- A class in code and a concept in a paper that describe the same algorithm
- Two error types that handle the same failure mode differently
Only add these when the similarity is genuinely non-obvious and cross-cutting. Do not add them for trivially similar things.
Hyperedges: if 3 or more nodes clearly participate together in a shared concept, flow, or pattern that is not captured by pairwise edges alone, add a hyperedge to a top-level `hyperedges` array. Examples:
- All classes that implement a common protocol or interface
- All functions in an authentication flow (even if they don't all call each other)
- All concepts from a paper section that form one coherent idea
Use sparingly — only when the group relationship adds information beyond the pairwise edges. Maximum 3 hyperedges per chunk.
If a file has YAML frontmatter (--- ... ---), copy source_url, captured_at, author,
contributor onto every node from that file.
confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
- EXTRACTED edges: confidence_score = 1.0 always
- INFERRED edges: pick exactly ONE value from this set — never 0.5:
0.95 direct structural evidence (shared data structure, named cross-file reference).
0.85 strong inference (clear functional alignment, no direct symbol link).
0.75 reasonable inference (shared problem domain + similar shape, requires interpretation).
0.65 weak inference (thematically related, no shape evidence).
0.55 speculative but plausible (surface-level co-occurrence only).
Models follow discrete rubrics better than continuous ranges; the bimodal
distribution observed in production (>50% at 0.5, >40% at 0.85+) shows the
range guidance is being collapsed to a binary. If no value above fits, mark
the edge AMBIGUOUS rather than picking 0.4 or below.
- AMBIGUOUS edges: 0.1-0.3
Node ID format: lowercase, only `[a-z0-9_]`, no dots or slashes. Format: `{stem}_{entity}` where stem is `{parent_dir}_{filename_without_ext}` (the **immediate** parent directory name + the filename stem, both lowercased with non-alphanumeric chars replaced by `_`) and entity is the symbol name similarly normalized. Only one level of parent is used — not the full path. Examples: `src/auth/session.py` + `ValidateToken``auth_session_validatetoken`; `lib/utils/helpers.py` + `parse_url``utils_helpers_parse_url`; `tests/test_foo.py` + `_helper``tests_test_foo_helper`. Top-level files (no parent dir, e.g. `setup.py`) use just the filename stem: `setup_my_func`. This must match the ID the AST extractor generates — using just the filename (e.g., `session_validatetoken`) or the full path (e.g., `src_auth_session_validatetoken`) will create orphan ghost-duplicate nodes. If you are re-extracting a project that had ghost duplicates under the old format, the user should run `graphify extract --force` to rebuild cleanly. CRITICAL: never append chunk numbers, sequence numbers, or any suffix to an ID (no `_c1`, `_c2`, `_chunk2`, etc.). IDs must be deterministic from the label alone — the same entity must always produce the same ID regardless of which chunk processes it.
Generate the extraction JSON matching this schema exactly:
{"nodes":[{"id":"auth_session_validatetoken","label":"Human Readable Name","file_type":"code|document|paper|image|rationale|concept","source_file":"<FILE_LIST path verbatim>","source_location":null,"source_url":null,"captured_at":null,"author":null,"contributor":null}],"edges":[{"source":"node_id","target":"node_id","relation":"calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for","confidence":"EXTRACTED|INFERRED|AMBIGUOUS","confidence_score":1.0,"source_file":"<FILE_LIST path verbatim>","source_location":null,"weight":1.0}],"hyperedges":[{"id":"snake_case_id","label":"Human Readable Label","nodes":["node_id1","node_id2","node_id3"],"relation":"participate_in|implement|form","confidence":"EXTRACTED|INFERRED","confidence_score":0.75,"source_file":"<FILE_LIST path verbatim>"}],"input_tokens":0,"output_tokens":0}
source_file RULE (every node, edge, and hyperedge): set source_file to the path of the originating file EXACTLY as it appears in FILE_LIST — verbatim and absolute. Do NOT shorten to a basename, do NOT re-relativize, do NOT strip any directory prefix, and do NOT change separators (the engine canonicalizes separators and relativizes against the build root downstream). Copy the FILE_LIST entry character-for-character. This keeps the full build and incremental --update on the same base, so build_merge's replace-on-re-extract matches the existing node instead of accumulating a duplicate.
Then write the JSON to disk using the Write tool at this exact absolute path (no relative paths — Write resolves relative paths against an undefined cwd and the file will be silently lost):
CHUNK_PATH
```

View File

@ -0,0 +1,46 @@
# graphify reference: GitHub clone and cross-repo merge
Load this when the user passed one or more `https://github.com/...` URLs, or named several local subfolders to merge into one graph.
### Step 0 - Clone GitHub repo(s) (only if a GitHub URL was given)
**Single repo:**
```bash
LOCAL_PATH=$(graphify clone <github-url> [--branch <branch>])
# Use LOCAL_PATH as the target for all subsequent steps
```
**Multiple repos (cross-repo graph):**
```bash
# Clone each repo, run the full pipeline on each, then merge
graphify clone <url1> # → ~/.graphify/repos/<owner1>/<repo1>
graphify clone <url2> # → ~/.graphify/repos/<owner2>/<repo2>
# Run /graphify on each local path to produce their graph.json files
# Then merge:
graphify merge-graphs \
~/.graphify/repos/<owner1>/<repo1>/graphify-out/graph.json \
~/.graphify/repos/<owner2>/<repo2>/graphify-out/graph.json \
--out graphify-out/cross-repo-graph.json
```
Graphify clones into `~/.graphify/repos/<owner>/<repo>` and reuses existing clones on repeat runs. Each node in the merged graph carries a `repo` attribute so you can filter by origin.
**Multiple local subfolders (monorepo or multi-service layout):**
The skill pipeline writes all intermediate and final outputs to `graphify-out/` in the current working directory. Running the skill on each subfolder separately will clobber the same output dir. Instead, use the CLI directly for each subfolder — it places `graphify-out/` *inside* the scanned path:
```bash
graphify extract ./core/ # → ./core/graphify-out/graph.json
graphify extract ./service/ # → ./service/graphify-out/graph.json
graphify extract ./platform/ # → ./platform/graphify-out/graph.json
# Add --backend gemini|kimi|openai|deepseek|claude-cli depending on which API key you have set
# Then merge at the project root:
graphify merge-graphs \
./core/graphify-out/graph.json \
./service/graphify-out/graph.json \
./platform/graphify-out/graph.json \
--out graphify-out/graph.json
```
Once `graphify-out/graph.json` exists, the fast path above takes over: any codebase question runs `graphify query` directly on the merged graph — no re-extraction, no size gate.

View File

@ -0,0 +1,33 @@
# graphify reference: commit hook and native CLAUDE.md integration
Load this when the user asked to install the post-commit hook or wire graphify into a project's CLAUDE.md.
## For git commit hook
Install a post-commit hook that auto-rebuilds the graph after every commit. No background process needed - triggers once per commit, works with any editor.
```bash
graphify hook install # install
graphify hook uninstall # remove
graphify hook status # check
```
After every `git commit`, the hook detects which code files changed (via `git diff HEAD~1`), re-runs AST extraction on those files, and rebuilds `graph.json` and `GRAPH_REPORT.md`. Doc/image changes are ignored by the hook - run `/graphify --update` manually for those.
If a post-commit hook already exists, graphify appends to it rather than replacing it.
---
## For native CLAUDE.md integration
Run once per project to make graphify always-on in Claude Code sessions:
```bash
graphify claude install
```
This writes a `## graphify` section to the local `CLAUDE.md` that instructs Claude to check the graph before answering codebase questions and rebuild it after code changes. No manual `/graphify` needed in future sessions.
```bash
graphify claude uninstall # remove the section
```

View File

@ -0,0 +1,303 @@
# graphify reference: query, path, explain
Load this when the user asks a question against an existing graph, or runs `/graphify path` or `/graphify explain`. The core's query stub points here for the full traversal flow. These flows use the `graphify query` CLI when it is available and fall back to an inline NetworkX traversal otherwise.
Two traversal modes - choose based on the question:
| Mode | Flag | Best for |
|------|------|----------|
| BFS (default) | _(none)_ | "What is X connected to?" - broad context, nearest neighbors first |
| DFS | `--dfs` | "How does X reach Y?" - trace a specific chain or dependency path |
First check the graph exists:
```bash
$(cat graphify-out/.graphify_python) -c "
from pathlib import Path
if not Path('graphify-out/graph.json').exists():
print('ERROR: No graph found. Run /graphify <path> first to build the graph.')
raise SystemExit(1)
"
```
If it fails, stop and tell the user to run `/graphify <path>` first.
### Step 0 — Constrained query expansion (REQUIRED before traversal)
graphify's `query` CLI matches nodes via case-folded substring + IDF — there is **no stemming, no synonyms, no cross-language match** inside the binary, and the inline fallback below matches the same way. If the user's question uses different language or different domain vocabulary than the graph's labels (user says "обработчик" / graph says "handler"; user says "authentication" / graph says "Guardian"), the literal matcher returns 0 hits and the answer collapses to noise.
Fix this **without inventing tokens** by expanding the query against the actual graph vocabulary first:
1. Extract the token vocabulary from node labels:
```bash
$(cat graphify-out/.graphify_python) -c "
import json, re
from pathlib import Path
data = json.loads(Path('graphify-out/graph.json').read_text())
vocab = set()
for n in data['nodes']:
for c in re.findall(r'[^\W\d_]+', n.get('label','') or '', re.UNICODE):
parts = re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+', c) or [c]
for p in parts:
t = p.lower()
if 3 <= len(t) <= 30:
vocab.add(t)
Path('graphify-out/.vocab.txt').write_text('\n'.join(sorted(vocab)))
print(f'vocab: {len(vocab)} tokens')
"
```
2. Read `graphify-out/.vocab.txt`. Then for the user's question, select **up to 12 tokens from this exact list** that semantically match the query intent. Hard constraints:
- You MUST pick only tokens present in the vocabulary file. Do NOT invent tokens.
- If a query concept has no plausible token in the vocab, skip it — do not substitute a near-synonym from training memory.
- If **no** vocab tokens match the query at all, output an empty list and tell the user the corpus has no relevant vocabulary for this question. Do not fabricate a search.
- Translate cross-language: Russian "аутентификация" → look for `auth`, `credential`, `token`, `security` IFF present in vocab.
- Morphology: "handlers" maps to `handler` IFF present; "todos" maps to `todo` IFF present.
3. Print the selection explicitly to the user before running the query, so the expansion is auditable:
```
Query expanded to (from graph vocab, N tokens): [token1, token2, ...]
```
If the list is empty, say so plainly and stop — do not proceed to traversal.
### Step 1 — Traversal
Build the **expanded query string** by joining the selected tokens with spaces. Use this string as `QUESTION` below — NOT the original user question. (The original question is preserved only for `save-result` at the end.)
Prefer the CLI when it is installed:
```bash
graphify query "QUESTION"
# or: graphify query "QUESTION" --dfs --budget 3000
```
If the CLI is unavailable, load `graphify-out/graph.json` and run the traversal inline:
1. Find the 1-3 nodes whose label best matches the expanded tokens.
2. Run the appropriate traversal from each starting node.
3. Read the subgraph - node labels, edge relations, confidence tags, source locations.
4. Answer using **only** what the graph contains. Quote `source_location` when citing a specific fact.
5. If the graph lacks enough information, say so - do not hallucinate edges.
```bash
$(cat graphify-out/.graphify_python) -c "
import sys, json
from networkx.readwrite import json_graph
import networkx as nx
from pathlib import Path
data = json.loads(Path('graphify-out/graph.json').read_text())
G = json_graph.node_link_graph(data, edges='links')
question = 'QUESTION'
mode = 'MODE' # 'bfs' or 'dfs'
terms = [t.lower() for t in question.split() if len(t) >= 3] # match the vocab threshold; keeps api/jwt/ios (#1392)
# Find best-matching start nodes
scored = []
for nid, ndata in G.nodes(data=True):
label = ndata.get('label', '').lower()
score = sum(1 for t in terms if t in label)
if score > 0:
scored.append((score, nid))
scored.sort(reverse=True)
start_nodes = [nid for _, nid in scored[:3]]
if not start_nodes:
print('No matching nodes found for query terms:', terms)
sys.exit(0)
subgraph_nodes = set()
subgraph_edges = []
if mode == 'dfs':
# DFS: follow one path as deep as possible before backtracking.
# Depth-limited to 6 to avoid traversing the whole graph.
visited = set()
stack = [(n, 0) for n in reversed(start_nodes)]
while stack:
node, depth = stack.pop()
if node in visited or depth > 6:
continue
visited.add(node)
subgraph_nodes.add(node)
for neighbor in G.neighbors(node):
if neighbor not in visited:
stack.append((neighbor, depth + 1))
subgraph_edges.append((node, neighbor))
else:
# BFS: explore all neighbors layer by layer up to depth 3.
frontier = set(start_nodes)
subgraph_nodes = set(start_nodes)
for _ in range(3):
next_frontier = set()
for n in frontier:
for neighbor in G.neighbors(n):
if neighbor not in subgraph_nodes:
next_frontier.add(neighbor)
subgraph_edges.append((n, neighbor))
subgraph_nodes.update(next_frontier)
frontier = next_frontier
# Token-budget aware output: rank by relevance, cut at budget (~4 chars/token)
token_budget = BUDGET # default 2000
char_budget = token_budget * 4
# Score each node by term overlap for ranked output
def relevance(nid):
label = G.nodes[nid].get('label', '').lower()
return sum(1 for t in terms if t in label)
ranked_nodes = sorted(subgraph_nodes, key=relevance, reverse=True)
lines = [f'Traversal: {mode.upper()} | Start: {[G.nodes[n].get(\"label\",n) for n in start_nodes]} | {len(subgraph_nodes)} nodes']
for nid in ranked_nodes:
d = G.nodes[nid]
lines.append(f' NODE {d.get(\"label\", nid)} [src={d.get(\"source_file\",\"\")} loc={d.get(\"source_location\",\"\")}]')
for u, v in subgraph_edges:
if u in subgraph_nodes and v in subgraph_nodes:
_raw = G[u][v]; d = next(iter(_raw.values()), {}) if isinstance(G, nx.MultiGraph) else _raw
lines.append(f' EDGE {G.nodes[u].get(\"label\",u)} --{d.get(\"relation\",\"\")} [{d.get(\"confidence\",\"\")}]--> {G.nodes[v].get(\"label\",v)}')
output = '\n'.join(lines)
if len(output) > char_budget:
output = output[:char_budget] + f'\n... (truncated at ~{token_budget} token budget - use --budget N for more)'
print(output)
"
```
Replace `QUESTION` with the **expanded** query string, `MODE` with `bfs` or `dfs`, and `BUDGET` with the token budget (default `2000`, or whatever `--budget N` specifies). Then answer based on the subgraph output above, using only what the graph contains.
After writing the answer, save it back into the graph so it improves future queries. Include the expanded tokens inside the `--answer` text (e.g. `"Expanded from original query via vocab: [tokens]. Then traversed..."`) so the next `--update` extracts the expansion history as a graph node:
```bash
$(cat graphify-out/.graphify_python) -m graphify save-result --question "ORIGINAL_QUESTION" --answer "ANSWER" --type query --nodes NODE1 NODE2
```
Replace `ORIGINAL_QUESTION` with the user's verbatim question, `ANSWER` with your full answer text (containing the expanded-token trace), `NODE1 NODE2` with the list of node labels you cited. This closes the feedback loop: the next `--update` will extract this Q&A as a node in the graph.
---
## For /graphify path
Find the shortest path between two named concepts in the graph. Prefer the CLI when installed:
```bash
graphify path "NODE_A" "NODE_B"
```
If the CLI is unavailable, run it inline:
```bash
$(cat graphify-out/.graphify_python) -c "
import json, sys
import networkx as nx
from networkx.readwrite import json_graph
from pathlib import Path
data = json.loads(Path('graphify-out/graph.json').read_text())
G = json_graph.node_link_graph(data, edges='links')
a_term = 'NODE_A'
b_term = 'NODE_B'
def find_node(term):
term = term.lower()
scored = sorted(
[(sum(1 for w in term.split() if w in G.nodes[n].get('label','').lower()), n)
for n in G.nodes()],
reverse=True
)
return scored[0][1] if scored and scored[0][0] > 0 else None
src = find_node(a_term)
tgt = find_node(b_term)
if not src or not tgt:
print(f'Could not find nodes matching: {a_term!r} or {b_term!r}')
sys.exit(0)
try:
path = nx.shortest_path(G, src, tgt)
print(f'Shortest path ({len(path)-1} hops):')
for i, nid in enumerate(path):
label = G.nodes[nid].get('label', nid)
if i < len(path) - 1:
_raw = G[nid][path[i+1]]; edge = next(iter(_raw.values()), {}) if isinstance(G, nx.MultiGraph) else _raw
rel = edge.get('relation', '')
conf = edge.get('confidence', '')
print(f' {label} --{rel}--> [{conf}]')
else:
print(f' {label}')
except nx.NetworkXNoPath:
print(f'No path found between {a_term!r} and {b_term!r}')
except nx.NodeNotFound as e:
print(f'Node not found: {e}')
"
```
Replace `NODE_A` and `NODE_B` with the actual concept names from the user. Then explain the path in plain language - what each hop means, why it's significant.
After writing the explanation, save it back:
```bash
$(cat graphify-out/.graphify_python) -m graphify save-result --question "Path from NODE_A to NODE_B" --answer "ANSWER" --type path_query --nodes NODE_A NODE_B
```
---
## For /graphify explain
Give a plain-language explanation of a single node - everything connected to it. Prefer the CLI when installed:
```bash
graphify explain "NODE_NAME"
```
If the CLI is unavailable, run it inline:
```bash
$(cat graphify-out/.graphify_python) -c "
import json, sys
import networkx as nx
from networkx.readwrite import json_graph
from pathlib import Path
data = json.loads(Path('graphify-out/graph.json').read_text())
G = json_graph.node_link_graph(data, edges='links')
term = 'NODE_NAME'
term_lower = term.lower()
# Find best matching node
scored = sorted(
[(sum(1 for w in term_lower.split() if w in G.nodes[n].get('label','').lower()), n)
for n in G.nodes()],
reverse=True
)
if not scored or scored[0][0] == 0:
print(f'No node matching {term!r}')
sys.exit(0)
nid = scored[0][1]
data_n = G.nodes[nid]
print(f'NODE: {data_n.get(\"label\", nid)}')
print(f' source: {data_n.get(\"source_file\",\"unknown\")}')
print(f' type: {data_n.get(\"file_type\",\"unknown\")}')
print(f' degree: {G.degree(nid)}')
print()
print('CONNECTIONS:')
for neighbor in G.neighbors(nid):
_raw = G[nid][neighbor]; edge = next(iter(_raw.values()), {}) if isinstance(G, nx.MultiGraph) else _raw
nlabel = G.nodes[neighbor].get('label', neighbor)
rel = edge.get('relation', '')
conf = edge.get('confidence', '')
src_file = G.nodes[neighbor].get('source_file', '')
print(f' --{rel}--> {nlabel} [{conf}] ({src_file})')
"
```
Replace `NODE_NAME` with the concept the user asked about. Then write a 3-5 sentence explanation of what this node is, what it connects to, and why those connections are significant. Use the source locations as citations.
After writing the explanation, save it back:
```bash
$(cat graphify-out/.graphify_python) -m graphify save-result --question "Explain NODE_NAME" --answer "ANSWER" --type explain --nodes NODE_NAME
```

View File

@ -0,0 +1,52 @@
# graphify reference: transcribe video and audio
Load this only when `detect` reported one or more `video` files. A corpus with no video never reads this.
### Step 2.5 - Transcribe video / audio files (only if video files detected)
Skip this step entirely if `detect` returned zero `video` files.
Video and audio files cannot be read directly. Transcribe them to text first, then treat the transcripts as doc files in Step 3.
**Strategy:** Read the god nodes from `graphify-out/.graphify_detect.json` (or the analysis file if it exists from a previous run). You are already a language model — write a one-sentence domain hint yourself from those labels. Then pass it to Whisper as the initial prompt. No separate API call needed.
**However**, if the corpus has *only* video files and no other docs/code, use the generic fallback prompt: `"Use proper punctuation and paragraph breaks."`
**Step 1 - Write the Whisper prompt yourself.**
Read the top god node labels from detect output or analysis, then compose a short domain hint sentence, for example:
- Labels: `transformer, attention, encoder, decoder``"Machine learning research on transformer architectures and attention mechanisms. Use proper punctuation and paragraph breaks."`
- Labels: `kubernetes, deployment, pod, helm``"DevOps discussion about Kubernetes deployments and Helm charts. Use proper punctuation and paragraph breaks."`
**Export** it as `GRAPHIFY_WHISPER_PROMPT` (the exact name the transcriber reads — and it must be `export`ed so the child Python process sees it) for the next command.
**Step 2 - Transcribe:**
```bash
export GRAPHIFY_WHISPER_MODEL=base # or whatever --whisper-model the user passed (must be exported)
export GRAPHIFY_WHISPER_PROMPT="<the one-sentence domain hint you composed in Step 1>"
$(cat graphify-out/.graphify_python) -c "
import json, os, sys
from pathlib import Path
from graphify.transcribe import transcribe_all
detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text(encoding=\"utf-8\"))
video_files = detect.get('files', {}).get('video', [])
prompt = os.environ.get('GRAPHIFY_WHISPER_PROMPT', 'Use proper punctuation and paragraph breaks.')
transcript_paths = transcribe_all(video_files, initial_prompt=prompt)
# Write the JSON from Python (NOT a shell '>' redirect): transcribe_all/Whisper
# print progress to stdout, which would otherwise corrupt the JSON file (#1392).
Path('graphify-out/.graphify_transcripts.json').write_text(json.dumps(transcript_paths, ensure_ascii=False), encoding=\"utf-8\")
print(f'Transcribed {len(transcript_paths)} file(s)', file=sys.stderr)
"
```
After transcription:
- Read the transcript paths from `graphify-out/.graphify_transcripts.json`
- Add them to the docs list before dispatching semantic subagents in Step 3B
- Print how many transcripts were created: `Transcribed N video file(s) -> treating as docs`
- If transcription fails for a file, print a warning and continue with the rest
**Whisper model:** Default is `base`. If the user passed `--whisper-model <name>`, `export GRAPHIFY_WHISPER_MODEL=<name>` (it must be exported, not just assigned) before running the command above.

View File

@ -0,0 +1,192 @@
# graphify reference: incremental update and cluster-only
Load this only when the user passed `--update` or `--cluster-only`. A first-time full build never reads this file.
## For --update (incremental re-extraction)
Use when you've added or modified files since the last run. Only re-extracts changed files - saves tokens and time.
```bash
$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.detect import detect_incremental, save_manifest
from pathlib import Path
result = detect_incremental(Path('INPUT_PATH'))
new_total = result.get('new_total', 0)
print(json.dumps(result, indent=2, ensure_ascii=False))
Path('graphify-out/.graphify_incremental.json').write_text(json.dumps(result, ensure_ascii=False), encoding=\"utf-8\")
deleted = list(result.get('deleted_files', []))
if new_total == 0 and not deleted:
print('No files changed since last run. Nothing to update.')
raise SystemExit(0)
if deleted:
print(f'{len(deleted)} deleted file(s) to prune.')
if new_total > 0:
print(f'{new_total} new/changed file(s) to re-extract.')
"
```
Then populate `.graphify_detect.json` so Steps 3A6 (which read it unconditionally) see the right state for an incremental run. `files` carries the changed subset (drives Step 3A AST + Step 3B0 cache check on only what changed); `all_files` carries the full corpus for any step that needs corpus-wide context:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
r = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\"))
Path('graphify-out/.graphify_detect.json').write_text(json.dumps({
'files': r.get('new_files', {}),
'all_files': r.get('files', {}),
'total_files': r.get('new_total', 0),
'total_words': r.get('total_words', 0),
'skipped_sensitive': r.get('skipped_sensitive', []),
'needs_graph': True,
}, ensure_ascii=False), encoding=\"utf-8\")
"
```
If new files exist, first check whether all changed files are code files:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
result = json.loads(open('graphify-out/.graphify_incremental.json', encoding='utf-8').read()) if Path('graphify-out/.graphify_incremental.json').exists() else {}
code_exts = {'.py','.ts','.js','.go','.rs','.java','.cpp','.c','.rb','.swift','.kt','.cs','.scala','.php','.cc','.cxx','.hpp','.h','.kts','.lua','.toc','.f','.F','.f90','.F90','.f95','.F95','.f03','.F03','.f08','.F08'}
new_files = result.get('new_files', {})
all_changed = [f for files in new_files.values() for f in files]
code_only = all(Path(f).suffix.lower() in code_exts for f in all_changed)
print('code_only:', code_only)
"
```
If `code_only` is True: print `[graphify update] Code-only changes detected - skipping semantic extraction (no LLM needed)`, run only Step 3A (AST) on the changed files, skip Step 3B entirely (no subagents), then go straight to merge and Steps 48.
If `code_only` is False (any changed file is a doc/paper/image/video): **first, if any changed file is in `new_files['video']`, run `references/transcribe.md` (Step 2.5) on those files, then rewrite `.graphify_detect.json` to move the resulting transcript paths into `files['document']` and drop `files['video']`** — otherwise raw `.mp4/.mp3` paths are fed to semantic subagents as unreadable media (#1392). Then run the full Steps 3A3C pipeline as normal.
If no new files exist (only deletions), create an empty extraction so the merge step can prune:
```bash
if [ ! -f graphify-out/.graphify_extract.json ]; then
echo '[graphify update] Only deletions -- creating empty extraction for merge.'
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
Path('graphify-out/.graphify_extract.json').write_text(json.dumps({'nodes':[],'edges':[],'hyperedges':[],'input_tokens':0,'output_tokens':0}), encoding='utf-8')
"
fi
```
Then:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
from graphify.build import build_merge
from graphify.detect import save_manifest
# Load new extraction and incremental state
new_extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
incremental = json.loads(Path('graphify-out/.graphify_incremental.json').read_text(encoding=\"utf-8\"))
deleted = list(incremental.get('deleted_files', []))
# prune_sources is ONLY for genuinely DELETED files. Changed/re-extracted files are
# handled by build_merge's replace-on-re-extract (#1344): every source_file in
# new_chunks is dropped from the base before merge, so old/stale nodes don't survive.
# Do NOT add `changed` here: with root= passed, prune_set relativizes to the same base
# as the freshly merged nodes and would DELETE the re-extracted content (#1178 is moot
# now that replace — not the dedup pass — reconciles changed files).
prune = list(deleted) or None
# Use build_merge() — reads graph.json directly without NetworkX round-trip
# so edge direction (calls, implements, imports) is always preserved (#801).
# Pass root= so prune_sources (absolute paths from detect_incremental) are
# relativized to match the graph's relative source_file values; without it
# nothing is pruned and stale nodes accumulate on every update (#1361).
# directed=IS_DIRECTED: replace IS_DIRECTED with True if --directed was given, else
# False. Without it a --directed --update silently rebuilds undirected and collapses
# reciprocal A<->B edges (#1392).
G = build_merge(
[new_extraction],
graph_path='graphify-out/graph.json',
prune_sources=prune,
root='INPUT_PATH',
directed=IS_DIRECTED,
)
print(f'[graphify update] Merged: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges')
# Write merged result back to .graphify_extract.json so Step 4 sees the full graph
merged_out = {
'nodes': [{'id': n, **d} for n, d in G.nodes(data=True)],
'edges': [
# Explicit source/target last so they win over any stale attrs in d.
{**{k: val for k, val in d.items() if k not in ('_src', '_tgt', 'source', 'target')},
'source': d.get('_src', u), 'target': d.get('_tgt', v)}
for u, v, d in G.edges(data=True)
],
# G.graph["hyperedges"] holds hyperedges from both existing graph.json
# and new_extraction (build_merge combines them). Falling back to
# new_extraction only would silently drop prior-run hyperedges (#801).
'hyperedges': list(G.graph.get('hyperedges', [])),
'input_tokens': new_extraction.get('input_tokens', 0),
'output_tokens': new_extraction.get('output_tokens', 0),
}
Path('graphify-out/.graphify_extract.json').write_text(json.dumps(merged_out, ensure_ascii=False), encoding=\"utf-8\")
print(f'[graphify update] Merged extraction written ({len(merged_out[\"nodes\"])} nodes, {len(merged_out[\"edges\"])} edges)')
# Save manifest so next --update diffs against today's state, not the
# prior run's baseline (prevents ghost-node reports on subsequent updates).
# root= matches the build_merge call above so the manifest keys stay relative to
# the scan root — portable across clones/machines, so --update keeps matching
# cached files instead of missing every one after a move (#1417).
save_manifest(incremental['files'], root='INPUT_PATH')
print('[graphify update] Manifest saved.')
"
```
Then run Steps 48 on the merged graph as normal.
After Step 4, show the graph diff:
```bash
$(cat graphify-out/.graphify_python) -c "
import json
from graphify.analyze import graph_diff
from graphify.build import build_from_json
from networkx.readwrite import json_graph
import networkx as nx
from pathlib import Path
# Load old graph (before update) from backup written before merge
old_data = json.loads(Path('graphify-out/.graphify_old.json').read_text(encoding=\"utf-8\")) if Path('graphify-out/.graphify_old.json').exists() else None
new_extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text(encoding=\"utf-8\"))
G_new = build_from_json(new_extract, directed=IS_DIRECTED)
if old_data:
G_old = json_graph.node_link_graph(old_data, edges='links')
diff = graph_diff(G_old, G_new)
print(diff['summary'])
if diff['new_nodes']:
print('New nodes:', ', '.join(n['label'] for n in diff['new_nodes'][:5]))
if diff['new_edges']:
print('New edges:', len(diff['new_edges']))
"
```
Before the merge step, save the old graph: `cp graphify-out/graph.json graphify-out/.graphify_old.json`
Clean up after: `rm -f graphify-out/.graphify_old.json`
---
## For --cluster-only
Skip Steps 13. Re-run clustering on the existing graph:
```bash
graphify cluster-only .
```
`graphify cluster-only .` is **self-contained**: it re-clusters, names communities, and regenerates `GRAPH_REPORT.md`, `graph.json`, and `graph.html` from the existing graph. **Do not re-run Steps 59** — they read intermediate files (`.graphify_extract.json`, `.graphify_detect.json`, `.graphify_analysis.json`) that a prior build's cleanup (Step 9) already deleted, so they raise `FileNotFoundError` (#1392). When it finishes, present the refreshed `GRAPH_REPORT.md` summary as usual.