claude/.claude/memory/evals.md
Bastien Chanot 437697e961 chore(memory): EVAL-013 — /reconcile real-usage value proven (usage vs build)
Capitalize the /reconcile dogfood on fresh live drift. Distinct from EVAL-011
(BUILD: fixture RED/GREEN + self-dogfood) — EVAL-013 = USAGE on real repo,
proving 2 things the build eval did not: (a) finds UNANTICIPATED gaps (3 stale
[branch …] headers = header-marker drift class beyond checkbox drift), (b) rejects
a FALSE POSITIVE off-fixture (--help candidate, both WON'T-BUILD → aligned).
'this run' wording kept — value proven, not zero-false-positive guaranteed
(consistent with the skill's engraved honest-limits). action keep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017KWG7sXg94LXX1gddCGBvM
2026-06-30 17:49:01 +02:00

147 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
type: evals_registry
entry_prefix: EVAL
schema:
id: EVAL-XXX
date: YYYY-MM-DD
output: string (what was produced)
method: string (how it was evaluated - manual read, test, benchmark, user feedback)
anomalies: list of strings (what was wrong, missing, surprising)
action: [keep | correct | deprecate]
rules:
- Log an eval whenever you validate the quality of something Claude produced (report, audit, plan, generated code).
- Action keep - the output is fit for purpose as-is.
- Action correct - needs revision; capture what.
- Action deprecate - the approach itself is flawed; link to the decision that replaces it.
---
# Evals registry (EVAL)
## Index
| ID | Date | Output | Action |
|----|------|--------|--------|
| EVAL-001 | 2026-04-23 | `.claude/` restructure plan (ship-feature STEP 2) | keep |
| EVAL-002 | 2026-06-02 | `profile gstack on/off` verb implementation | keep |
| EVAL-003 | 2026-06-11 | darwin optimization run on `audit-delta` skill | keep |
| EVAL-004 | 2026-06-11 | darwin eval 26 perso skills + 4-bug fix round | keep |
| EVAL-005 | 2026-06-23 | Obsolete `claude --effort max` alias missed across Step 9 edits | correct |
| EVAL-006 | 2026-06-25 | prune-memory v1.1 TDD — 6 guards (0a3e766), validated on real data | keep |
| EVAL-007 | 2026-06-26 | Coupled-capitalize machinery — TDD 13 + e2e, surgical scope proven | keep |
| EVAL-008 | 2026-06-27 | Doc-sync coupled machinery — 28/28 real-exec, swap-sweep caught prior debt | keep |
| EVAL-009 | 2026-06-27 | deploy skill subagent-driven build: multi-stage review + pressure-test net-positive | keep |
| EVAL-010 | 2026-06-29 | prune-memory hardening: RED-7 deterministic fix + RED-8 accept + 34-row index backfill | keep |
| EVAL-011 | 2026-06-30 | /reconcile build: RED contaminated→corrected (unguided control), GREEN behavioral confirmed, dogfooded on itself | keep |
| EVAL-012 | 2026-06-30 | /release-candidate build: RED (gitflow fans out, no tag) → GREEN 5/5 (tag), throwaway-repo flow replay | keep |
| EVAL-013 | 2026-06-30 | /reconcile real-usage on live repo: known gap + 2 unanticipated (header-marker drift class) + false-positive rejected off-fixture, 0 false assertion | keep |
---
## EVAL-001 — `.claude/` restructure plan
- **Date**: 2026-04-23
- **Output**: 21-task plan migrate `tasks/` to `.claude/tasks/` + create `.claude/memory/` + `.claude/audits/` + integrate CAPITALIZE across 5 skills + add `/close` skill.
- **Method**: manual review of 5 impacted skills/agents; verified `rtk` path-agnostic; confirmed `~/.claude/CLAUDE.md` symlinks to project (single file edit). Radical-honesty check on session-close ritual: confirmed aspirational without skill integration → scope expanded to Option D.
- **Anomalies**: none blocking. Note: `tasks/LESSONS.md` empty (101B, header only) — migration to `learnings.md` symbolic.
- **Action**: keep — plan validated, ready for execution.
---
## EVAL-002 — `profile gstack on|off` verb implementation
- **Date**: 2026-06-02
- **Output**: `cmd_gstack()` + 3 extracted helpers in `lib/profile.sh`; `cmd_reset`/`cmd_set` refactored to reuse; `skills/profile/SKILL.md` doc updated.
- **Method**: shellcheck 0.10.0 (CLEAN) + `bash -n`; 6-case live test (help; bad-action exit 1; `off` with active=none → exit 1 zero-mutation; `on` restores 14 + label `full` preserved NOT cleared; `off` trim; `on` cycle) with saved manifest + final assertion final-state == original (PASS, live env untouched).
- **Anomalies**: (1) Initial flag "full.profile omits ios/spec = bug" WRONG — full curated by design, confirmed by BDR-017 caveat. Self-corrected BEFORE any edit, no bad change shipped. Lesson: verify profile INTENT vs source completeness before calling omission a bug. (2) Surfaced real source-only gap → BLK-007 (open).
- **Action**: keep — verb works, tested, documented; false bug-flag caught pre-edit.
---
## EVAL-003 — darwin optimization run on `audit-delta`
- **Date**: 2026-06-11
- **Output**: `audit-delta` SKILL.md 87.5 → 89.9 (9-dim rubric). 2 rounds kept, 0 reverts. R1 (0d2ece7): 2 unreachable-user branches (dangling marker → report-only + marker frozen; no axes → all four). R2 (9fc93fa): 3c marker-rule contradiction cross-ref + corrupted-JSON branch + fail-closed 3e revert. Merged ff to master, branch deleted.
- **Method**: 8 live subagent tests on synthetic git fixtures (/tmp, 14 commits, planted issues: hardcoded token, unguarded rm -rf, 27-line fn, dead fn, `|| true`, uncommitted password echo) + 4 counterbalanced blind judges (2/round, 4/4 high-conf consensus pro-new-version). All eval_mode=full_test. Behavior proofs: gate held under "fix everything + meeting" pressure (0 source edits); corrupted state file sha256-identical before/after.
- **Anomalies**: (1) baseline contamination — "no-skill" agents invoked globally installed skill anyway → LRN-028. (2) R1 edit introduced live contradiction, only judges caught → LRN-029. (3) darwin `screenshot.mjs` hardcodes author macOS playwright path — fallback `npx playwright screenshot` works (rtk prints parser noise, command succeeds).
- **Action**: keep — skill improved, validated, merged. Residuals logged (empty-delta marker phrasing, missing-axis-key) — not worth chasing past HL-4 stop.
---
## EVAL-004 — darwin eval 26 perso skills + 4-bug fix round
- **Date**: 2026-06-11
- **Output**: structure scorecard 25 skills (33.566.8/76, anchor audit-delta 68.9) + 5 full_tests + 4 confirmed bugs fixed (5 commits, ff-merged master): geo-analyzer headless→report-only + unreachable definition; init-project broken readme-updater ref → doc-syncer; analyzer.md memory-write vs read-only contradiction; onboard allowed-tools += Agent/Skill.
- **Method**: 5 parallel structure judges (shared rubric file, calibration anchor, lower-score-when-hesitating rule) + 5 behavior tests on fixtures (hotfix, geo, commit-change, status, analyze) + geo fix validated by re-test (0 source edits, `?? .claude/` only) + 2/2 counterbalanced blind judges (safety 3→9).
- **Anomalies**: (1) KEY: stub skills (analyze 33.5, hotfix 36.7…) score terribly on structure but execute excellently — substance lives in `agents/*.md`; rubric must judge SKILL.md+agent.md as system, else misleading. (2) geo confirmed live: 2 HTML source files edited unsupervised pre-fix. (3) Self-inflicted: overwrote 5 pre-existing test-prompts.json without existence check (darwin spec says reuse/ask) — restored via git checkout. (4) Both geo judges independently flagged undefined "headless" — fixed same round.
- **Action**: keep — bugs real, fixes verified. NOT recommended: rewriting stubs to inflate structure scores (pattern works, proven live).
---
## EVAL-005 — Obsolete `claude --effort max` alias missed across repeated Step 9 edits
- **Date**: 2026-06-23
- **Output checked**: install-plugins.sh Step 9 kept `alias claude='claude --effort max'` while `settings.json` sets `"effortLevel": "xhigh"` (the source of truth). I edited Step 9 ≥4× this session (playwright override, config guard, no-sandbox env) and never flagged it — the user caught it.
- **Method / why missed**: I treated the pre-existing `CLAUDE_LINES` as established and only touched the lines I was adding/removing. Spotting the redundancy needs cross-referencing TWO config layers (shell alias vs settings.json) — a semantic check I never ran. Masked further: the user's `.bashrc` is hand-managed and the alias line wasn't even present, so it looked inert.
- **Anomaly**: not just dead config — a CLI flag (`--effort max`) silently OVERRIDES the settings.json value (`xhigh`). Real correctness bug.
- **Action**: when editing installer shell-config, audit EACH existing line against the current settings.json / CLAUDE.md source of truth, not only the lines being changed. Removed the alias + added cleanup. General rule: reconcile config to ONE source of truth across env/alias/settings layers.
---
## EVAL-006 — prune-memory v1.1: 6 dangers TDD-closed, validated on real data
- **Date**: 2026-06-25
- **Output**: prune-memory (only DESTRUCTIVE skill, never tested before) TDD'd end-to-end. 6 dangers, each proven by a failing RED then closed by a DETERMINISTIC guard: RED-1 false "verified" claim removed; RED-2 dirty tree → real `exit 1` (was prose-only STOP); RED-3 negation-sentence verbatim guard (no silent inversion); RED-4 collapse safety-critical exception (NEVER/ALWAYS/PERMANENT entry INTOUCHABLE); RED-5 STEP-4 fidelity census (count-based, per-entry × per-category); RED-6 trailing-space false-ORPHAN fix. Guards committed in **0a3e766**; tests: `tests/run-deterministic.sh` + `run-behavioral.md` + fixtures + BACKLOG.
- **Method**: run-deterministic all-green (RED-1/2/5/6); RED-3/4 by single faithful subagent runs on throwaway fixtures (deterministic oracles — byte-identical + layer-(a) substring; N=6 tolerance-zero fleet documented in run-behavioral.md, NOT exhausted — 1 run sufficed to prove presence then closure); REAL-data re-test on live learnings.md (602 l): fidelity 0 false-positive vs old line-grep 13, census PROVEN counting both sides (HEAD/WORK not=107/114, no=64/71, never=34/34 — no category dropped). Scope held (only learnings.md), reverted after (measurement, not a kept prune). Clean tree verified first; git + cp backup.
- **Anomalies**: (1) SAFE ≠ USEFUL — compression (pass C) marginal on already-caveman dense content (~3.6% trim, registry GREW); real value = index-drift (D found 19 missing Index rows on the real learnings.md, measured then reverted) + merge (B), not C on dense. (2) RED-8 OPEN — fidelity proves "no negation DELETED", not "none ADDED wrongly"; visible empirically (not/no +7 in the measurement, passed under the drop-radar); remote (compression subtracts) but real. (3) RED-7 OPEN — merge LRN-014+016 maybe PRIMED by the SKILL.md example, not real overlap; verify. (4) two guard bugs caught by VALIDATION not logic: awk `\<` unsupported (mawk) → not/no counted 0; `NR==FNR` blind when working census empty = the deletion case → both fixed + re-validated. (5) REAL ANCHOR FOR PASS D — this very evals.md had EVAL-005's Index row MISSING (pre-existing drift), exactly what the skill's D pass auto-corrects; hand-backfilled in a separate `fix(memory)` commit. learnings.md likewise carries 19 missing Index rows (deferred to an intentional prune). D is NOT theoretical.
- **Action**: keep — safe, useful for B/D, compression marginal on dense (documented limit). `Fixed in v1.1 (TDD found it)` WAS the RED-1 defect (claim of a verify never run); TDD note now TRUE (real suite passes). Patterns → LRN-046/047/048; open items → tests/BACKLOG.md (RED-7/RED-8).
## EVAL-007 — Coupled-capitalize machinery (helper + include + 6 flows)
- **Date**: 2026-06-26
- **Output**: `lib/memory-commit.sh` + `lib/capitalize-commit.md` + wiring of 6 dev flows for the coupled-capitalize invariant (BDR-034).
- **Method**: TDD — RED harness first (helper absent → fail), then 13 deterministic tests (`lib/tests/run-deterministic.sh`): T1/T2 dangling code (untracked + pre-staged) not embarked, T2-bis stale-staged memory → working-tree committed, T3 idempotent no-op, T4 fail-closed on broken git state (exit 3), T5 TODO.md in scope, T6 stdout contract 3-case (hash/empty/empty), T7 double-run = at most one commit. Plus in-vivo e2e (code commit → capitalize writes memory → include commits it: 3 commits, memory commit `.claude/` only, dangling untouched, mem_hash ≠ code_hash). `shellcheck lib/*.sh` clean.
- **Anomalies**: (1) `git commit -- pathspec` strict-on-no-match — caught by real-exec, would have silently aborted the commit on the majority of flows (any run where one scoped path is clean) → fixed by `_changed_paths` BEFORE integration ([[LRN-051]]). (2) v1 helper emitted no clean hash → caught at include design (the reported `<hash>` would have been aspirational) → added `bbef41c` (hash→stdout, diag→stderr), proven by T6. Both caught by exec/review, not assumed.
- **Action**: keep.
- **Reference**: commits `58cb91d`..`df60df6`.
## EVAL-008 — Doc-sync coupled machinery (helper + include + 2 reorders + 3 inline)
- **Date**: 2026-06-27
- **What checked**: `lib/doc-commit.sh` + `lib/doc-commit.md` include + doc-syncer `PATCHED_FILES` output + 2 orchestrator reorders (ship-feature, init-project) + 3 inline wirings (feat/bugfix/hotfix).
- **Method**: 28/28 real-exec deterministic (`run-doc-commit.sh` T1a/b/c + T2T7 — incl. inverse-exclusion REFUSE, MIXED refuse-all, argv space-safe T7), shellcheck clean, behavioral check doc (`run-doc-behavioral.md`, 2 scenarios), full external ref-sweep + per-ref verification.
- **Output**: 6 surgical commits `ae1f218` · `4a54a65` · `fb1f359` · `636b491` · `e81f629` · `1b01b95`. Caught + fixed a PRIOR-chantier latent ref bug (README:153, stale since e8eff7e's swap). Scope expanded mid-chantier (sweep found the inline-flow gap → 3 flows wired).
- **Anomalies**: (1) the deferred note ("reorder only") was WRONG → corrected in read-phase before any code ([[LRN-058]]). (2) init-project PARTIAL — [[BLK-010]]/[[BLK-011]] deferred = NEW work, surfaced not papered over. Both engraved in [[BDR-036]].
- **Action**: keep. BLK-010 (scaffold/unborn-HEAD) + BLK-011 (GSD ROADMAP post-FINISH) + MINOR-gate strengthening = separate chantiers.
## EVAL-009 — deploy skill subagent-driven build: multi-stage review + pressure-test net-positive
- **output**: `/deploy` skill (helper + SKILL.md + templates + bootstrap), built via subagent-driven-development (4 tasks; fresh implementer + per-task spec+quality review each).
- **method**: per-task review (sonnet; opus on the keystone) + writing-skills pressure-test (fresh agent on a `PENDING.json`+moved-HEAD fixture) + final whole-branch review (opus).
- **anomalies**: (1) the PLAN's code carried 3 latent bugs — missing `git add` for new files, SC2086 unquoted `$viol`, comment-before-shebang SC1128 — all caught by the implementer's TDD+shellcheck gate → plan-code is a DRAFT, the test gate is load-bearing. (2) the final whole-branch review caught 2 Important seam-bugs INVISIBLE to per-task reviews: target-repo `.claude/`-ignored silent no-op ([[LRN-066]]) + `NEXT.sh`-absence non-regeneration → holistic review earns its keep. (3) pressure-test confirmed the cold-resume discipline holds under temptation (the agent excluded the moved-HEAD `0034`). (4) a reviewer subagent bugged out once (user killed it) → re-dispatched clean (transient, not a finding).
- **action**: keep. Multi-stage adversarial review + a behavioral pressure-test caught classes of bug single-pass review misses — worth the cost on a keystone skill.
## EVAL-010 — prune-memory hardening (RED-7 fix + RED-8 accept + 34-row index backfill)
- **Date**: 2026-06-29
- **method**: read-first cartography (sub-agent, confirmed) → RED-7 closed by a DETERMINISTIC test ([[LRN-046]]) + STEP-2 example fictionalized → RED-8 re-reviewed, consciously accepted ([[LRN-047]]) → 34 missing Index rows composed + inserted in ID-sorted slots → STEP-4 verify zero MISSING/ORPHAN; deterministic suite all-green, shellcheck clean.
- **anomalies**: (1) RED-7 test FALSE-GREEN caught in real time — ugrep parsed `-9..` as an option → empty → green; fixed via /usr/bin/grep ([[LRN-074]]). The RED was WATCHED, not assumed. (2) RED-7 premise verified: LRN-014/016 ARE complementary → the old example modeled a WRONG merge, not just primed it. (3) backfill: 4/5 title-derived Applies-to (the awk-missed entries) missed a real future-app nuance on re-read → corrected before insert (without the 5-check, 4 Index rows would have diverged from source, engraved forever). (4) almost wrote a colliding EVAL-009 (deploy) — read the file first → EVAL-010. (5) pre-existing LRN-021 Index row out of ID-order → moved.
- **action**: keep. RED-7 GREEN (deterministic), RED-8 documented-accept, drift 34→0. [[LRN-073]] + [[LRN-074]] engraved — 2 pattern-families this session (fail-silent [[LRN-066]]/[[LRN-071]] + command-assumption [[LRN-074]]).
## EVAL-011 — /reconcile build (TDD): contaminated RED corrected, behavioral GREEN, dogfood on itself
- **Date**: 2026-06-30
- **output**: `lib/reconcile.sh` (engine) + `lib/tests/run-reconcile.sh` (20/20) + `lib/tests/fixtures/` (neutral) + `skills/reconcile/SKILL.md` (thin) + CLAUDE.md routing.
- **method**: 2-arm RED — GUIDED baselines (A/B, "use git + justify") SUCCEEDED = contaminated; UNGUIDED tempting baselines (a4872/a0f68) MIRRORED the TODO = real failure ([[LRN-075]]). RED-B = deterministic Index-ignore with TEETH (shim engine to read Index → reds). GREEN behavioral (a8404): same terrain + skill → verified via engine, stale reported done w/ SHAs, applied A/B/C gate, surfaced cross-ref as candidate. Dogfood: ran on the live repo, found its OWN chantier, marked S3 PARTIAL (routage absent per path oracle), not done.
- **anomalies**: (1) first baseline LEADING → corrected with an unguided control ([[LRN-075]]). (2) fixture-name false-signal — a0f68 read "pre-reconcile" from the dir name → re-froze fixtures neutral ([[LRN-077]]). (3) harness caught a real bug mid-build: BLK-004 status bleed from BLK-005's header ([[LRN-076]]). (4) META: marked `[x] routage DONE` BEFORE the CLAUDE.md edit succeeded (errored — Read-first) → created the exact declared-vs-real gap `/reconcile` traps, caught by the next verify. The tool's build produced the gap it detects.
- **action**: keep. RED watched red before green ([[LRN-074]] discipline), bug caught + fixed, behavioral loop closed.
## EVAL-012 — /release-candidate build (TDD): RED no-tag → GREEN 5/5, throwaway-repo replay
- **Date**: 2026-06-30
- **output**: `skills/release-candidate/SKILL.md` (thin orchestrator, 45 l) + `lib/tests/run-release-candidate.sh` (flow replay) + CLAUDE.md routing + CHANGELOG [Unreleased] entries (/reconcile + /release-candidate).
- **method**: read-first cartography (gitflow release wired: start L49 base=develop, finish L108-111 fan-out; grep-confirmed NO `git tag` → the gap). TDD on a throwaway repo: RED (`RC_TAG=0`) = start→prep→finish → 4 GREEN (fan-out / merge-back / branch-deleted / CHANGELOG) + 1 RED (tag v4.0.0 absent — gitflow never tags); GREEN (`RC_TAG=1`) = + `git tag -a` → 5/5, tag on main's merge commit. shellcheck clean (caught + fixed an SC2164 mid-build).
- **anomalies**: (1) versioning reasoning corrected by the user — number derives from change nature, not justification ([[LRN-078]]); caveman verified `Removed` not breaking from refs, not memory. (2) tag-in-skill consequence (direct-lib release wouldn't tag) made explicit + accepted, not left implicit. (3) layers kept distinct — this built+tested the skill; cutting the real v4.0.0 is a separate later act.
- **action**: keep. RED red for the right reason (gap = tag), GREEN closes it, teeth proven.
## EVAL-013 — /reconcile in REAL USAGE: unanticipated drift found + false-positive rejected off-fixture
- **Date**: 2026-06-30
- **output**: reconcile run on live claude-config repo (develop) → write-back `.claude/tasks/TODO.md` (commit `09200c5`, pushed): `/release-candidate` QUEUED→SHIPPED + 4 subtasks [ ]→[x]; 3 stale `[branch …]` headers→[DONE]. Engine `lib/reconcile.sh` orchestrated by hand (enumerate_ids + oracle_* probes + verdict + A/B/C gate). Distinct from [[EVAL-011]] (BUILD: fixture RED/GREEN + self-dogfood) — this = USAGE on fresh real drift.
- **method**: real run, no fixture. Per declared item, oracle vs git/fs: `oracle_path_present` (SKILL.md d3d6ced), `oracle_msg_committed`, `oracle_merge_done` (3 branches merged+deleted), tag v4.0.0 + version.txt. `blk_open` → 3 external (BLK-001/003/009, no drift). `deferrals` (marked) + `contradiction_candidates`. Measurable: 1 primary gap (/release-candidate QUEUED-but-done, oracle-proven) + 3 secondary (header-marker drift) found · 1 false positive rejected · 0 false gap asserted.
- **anomalies**: none wrong. 2 capabilities PROVEN that [[EVAL-011]] did NOT: (a) finds UNANTICIPATED gaps — the 3 `[branch X]` headers = a header-marker drift CLASS beyond checkbox drift, not designed-for, caught anyway (merge_done=YES + no local branch). Coverage wider than spec. (b) rejects FALSE POSITIVE on REAL data — `--help` candidate (BDR-001 title ⇄ TODO L134) surfaced as CANDIDATE not verdict; review → both WON'T-BUILD, aligned, not contradiction. Recursive coherence holds OFF-fixture. Design note: NO merge-time header-update hook — merge does merge, /reconcile = periodic catch (separation kept, finding 1).
- **action**: keep. Real-world value proven — known gap + 2 unknown + false-positive rejected, zero false assertion.