Capitalize the /reconcile dogfood on fresh live drift. Distinct from EVAL-011 (BUILD: fixture RED/GREEN + self-dogfood) — EVAL-013 = USAGE on real repo, proving 2 things the build eval did not: (a) finds UNANTICIPATED gaps (3 stale [branch …] headers = header-marker drift class beyond checkbox drift), (b) rejects a FALSE POSITIVE off-fixture (--help candidate, both WON'T-BUILD → aligned). 'this run' wording kept — value proven, not zero-false-positive guaranteed (consistent with the skill's engraved honest-limits). action keep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017KWG7sXg94LXX1gddCGBvM
147 lines
20 KiB
Markdown
147 lines
20 KiB
Markdown
---
|
||
type: evals_registry
|
||
entry_prefix: EVAL
|
||
schema:
|
||
id: EVAL-XXX
|
||
date: YYYY-MM-DD
|
||
output: string (what was produced)
|
||
method: string (how it was evaluated - manual read, test, benchmark, user feedback)
|
||
anomalies: list of strings (what was wrong, missing, surprising)
|
||
action: [keep | correct | deprecate]
|
||
rules:
|
||
- Log an eval whenever you validate the quality of something Claude produced (report, audit, plan, generated code).
|
||
- Action keep - the output is fit for purpose as-is.
|
||
- Action correct - needs revision; capture what.
|
||
- Action deprecate - the approach itself is flawed; link to the decision that replaces it.
|
||
---
|
||
|
||
# Evals registry (EVAL)
|
||
|
||
## Index
|
||
|
||
| ID | Date | Output | Action |
|
||
|----|------|--------|--------|
|
||
| EVAL-001 | 2026-04-23 | `.claude/` restructure plan (ship-feature STEP 2) | keep |
|
||
| EVAL-002 | 2026-06-02 | `profile gstack on/off` verb implementation | keep |
|
||
| EVAL-003 | 2026-06-11 | darwin optimization run on `audit-delta` skill | keep |
|
||
| EVAL-004 | 2026-06-11 | darwin eval 26 perso skills + 4-bug fix round | keep |
|
||
| EVAL-005 | 2026-06-23 | Obsolete `claude --effort max` alias missed across Step 9 edits | correct |
|
||
| EVAL-006 | 2026-06-25 | prune-memory v1.1 TDD — 6 guards (0a3e766), validated on real data | keep |
|
||
| EVAL-007 | 2026-06-26 | Coupled-capitalize machinery — TDD 13 + e2e, surgical scope proven | keep |
|
||
| EVAL-008 | 2026-06-27 | Doc-sync coupled machinery — 28/28 real-exec, swap-sweep caught prior debt | keep |
|
||
| EVAL-009 | 2026-06-27 | deploy skill subagent-driven build: multi-stage review + pressure-test net-positive | keep |
|
||
| EVAL-010 | 2026-06-29 | prune-memory hardening: RED-7 deterministic fix + RED-8 accept + 34-row index backfill | keep |
|
||
| EVAL-011 | 2026-06-30 | /reconcile build: RED contaminated→corrected (unguided control), GREEN behavioral confirmed, dogfooded on itself | keep |
|
||
| EVAL-012 | 2026-06-30 | /release-candidate build: RED (gitflow fans out, no tag) → GREEN 5/5 (tag), throwaway-repo flow replay | keep |
|
||
| EVAL-013 | 2026-06-30 | /reconcile real-usage on live repo: known gap + 2 unanticipated (header-marker drift class) + false-positive rejected off-fixture, 0 false assertion | keep |
|
||
|
||
---
|
||
|
||
## EVAL-001 — `.claude/` restructure plan
|
||
|
||
- **Date**: 2026-04-23
|
||
- **Output**: 21-task plan migrate `tasks/` to `.claude/tasks/` + create `.claude/memory/` + `.claude/audits/` + integrate CAPITALIZE across 5 skills + add `/close` skill.
|
||
- **Method**: manual review of 5 impacted skills/agents; verified `rtk` path-agnostic; confirmed `~/.claude/CLAUDE.md` symlinks to project (single file edit). Radical-honesty check on session-close ritual: confirmed aspirational without skill integration → scope expanded to Option D.
|
||
- **Anomalies**: none blocking. Note: `tasks/LESSONS.md` empty (101B, header only) — migration to `learnings.md` symbolic.
|
||
- **Action**: keep — plan validated, ready for execution.
|
||
|
||
---
|
||
|
||
## EVAL-002 — `profile gstack on|off` verb implementation
|
||
|
||
- **Date**: 2026-06-02
|
||
- **Output**: `cmd_gstack()` + 3 extracted helpers in `lib/profile.sh`; `cmd_reset`/`cmd_set` refactored to reuse; `skills/profile/SKILL.md` doc updated.
|
||
- **Method**: shellcheck 0.10.0 (CLEAN) + `bash -n`; 6-case live test (help; bad-action exit 1; `off` with active=none → exit 1 zero-mutation; `on` restores 14 + label `full` preserved NOT cleared; `off` trim; `on` cycle) with saved manifest + final assertion final-state == original (PASS, live env untouched).
|
||
- **Anomalies**: (1) Initial flag "full.profile omits ios/spec = bug" WRONG — full curated by design, confirmed by BDR-017 caveat. Self-corrected BEFORE any edit, no bad change shipped. Lesson: verify profile INTENT vs source completeness before calling omission a bug. (2) Surfaced real source-only gap → BLK-007 (open).
|
||
- **Action**: keep — verb works, tested, documented; false bug-flag caught pre-edit.
|
||
---
|
||
|
||
## EVAL-003 — darwin optimization run on `audit-delta`
|
||
|
||
- **Date**: 2026-06-11
|
||
- **Output**: `audit-delta` SKILL.md 87.5 → 89.9 (9-dim rubric). 2 rounds kept, 0 reverts. R1 (0d2ece7): 2 unreachable-user branches (dangling marker → report-only + marker frozen; no axes → all four). R2 (9fc93fa): 3c marker-rule contradiction cross-ref + corrupted-JSON branch + fail-closed 3e revert. Merged ff to master, branch deleted.
|
||
- **Method**: 8 live subagent tests on synthetic git fixtures (/tmp, 14 commits, planted issues: hardcoded token, unguarded rm -rf, 27-line fn, dead fn, `|| true`, uncommitted password echo) + 4 counterbalanced blind judges (2/round, 4/4 high-conf consensus pro-new-version). All eval_mode=full_test. Behavior proofs: gate held under "fix everything + meeting" pressure (0 source edits); corrupted state file sha256-identical before/after.
|
||
- **Anomalies**: (1) baseline contamination — "no-skill" agents invoked globally installed skill anyway → LRN-028. (2) R1 edit introduced live contradiction, only judges caught → LRN-029. (3) darwin `screenshot.mjs` hardcodes author macOS playwright path — fallback `npx playwright screenshot` works (rtk prints parser noise, command succeeds).
|
||
- **Action**: keep — skill improved, validated, merged. Residuals logged (empty-delta marker phrasing, missing-axis-key) — not worth chasing past HL-4 stop.
|
||
|
||
---
|
||
|
||
## EVAL-004 — darwin eval 26 perso skills + 4-bug fix round
|
||
|
||
- **Date**: 2026-06-11
|
||
- **Output**: structure scorecard 25 skills (33.5–66.8/76, anchor audit-delta 68.9) + 5 full_tests + 4 confirmed bugs fixed (5 commits, ff-merged master): geo-analyzer headless→report-only + unreachable definition; init-project broken readme-updater ref → doc-syncer; analyzer.md memory-write vs read-only contradiction; onboard allowed-tools += Agent/Skill.
|
||
- **Method**: 5 parallel structure judges (shared rubric file, calibration anchor, lower-score-when-hesitating rule) + 5 behavior tests on fixtures (hotfix, geo, commit-change, status, analyze) + geo fix validated by re-test (0 source edits, `?? .claude/` only) + 2/2 counterbalanced blind judges (safety 3→9).
|
||
- **Anomalies**: (1) KEY: stub skills (analyze 33.5, hotfix 36.7…) score terribly on structure but execute excellently — substance lives in `agents/*.md`; rubric must judge SKILL.md+agent.md as system, else misleading. (2) geo confirmed live: 2 HTML source files edited unsupervised pre-fix. (3) Self-inflicted: overwrote 5 pre-existing test-prompts.json without existence check (darwin spec says reuse/ask) — restored via git checkout. (4) Both geo judges independently flagged undefined "headless" — fixed same round.
|
||
- **Action**: keep — bugs real, fixes verified. NOT recommended: rewriting stubs to inflate structure scores (pattern works, proven live).
|
||
|
||
---
|
||
|
||
## EVAL-005 — Obsolete `claude --effort max` alias missed across repeated Step 9 edits
|
||
|
||
- **Date**: 2026-06-23
|
||
- **Output checked**: install-plugins.sh Step 9 kept `alias claude='claude --effort max'` while `settings.json` sets `"effortLevel": "xhigh"` (the source of truth). I edited Step 9 ≥4× this session (playwright override, config guard, no-sandbox env) and never flagged it — the user caught it.
|
||
- **Method / why missed**: I treated the pre-existing `CLAUDE_LINES` as established and only touched the lines I was adding/removing. Spotting the redundancy needs cross-referencing TWO config layers (shell alias vs settings.json) — a semantic check I never ran. Masked further: the user's `.bashrc` is hand-managed and the alias line wasn't even present, so it looked inert.
|
||
- **Anomaly**: not just dead config — a CLI flag (`--effort max`) silently OVERRIDES the settings.json value (`xhigh`). Real correctness bug.
|
||
- **Action**: when editing installer shell-config, audit EACH existing line against the current settings.json / CLAUDE.md source of truth, not only the lines being changed. Removed the alias + added cleanup. General rule: reconcile config to ONE source of truth across env/alias/settings layers.
|
||
|
||
---
|
||
|
||
## EVAL-006 — prune-memory v1.1: 6 dangers TDD-closed, validated on real data
|
||
|
||
- **Date**: 2026-06-25
|
||
- **Output**: prune-memory (only DESTRUCTIVE skill, never tested before) TDD'd end-to-end. 6 dangers, each proven by a failing RED then closed by a DETERMINISTIC guard: RED-1 false "verified" claim removed; RED-2 dirty tree → real `exit 1` (was prose-only STOP); RED-3 negation-sentence verbatim guard (no silent inversion); RED-4 collapse safety-critical exception (NEVER/ALWAYS/PERMANENT entry INTOUCHABLE); RED-5 STEP-4 fidelity census (count-based, per-entry × per-category); RED-6 trailing-space false-ORPHAN fix. Guards committed in **0a3e766**; tests: `tests/run-deterministic.sh` + `run-behavioral.md` + fixtures + BACKLOG.
|
||
- **Method**: run-deterministic all-green (RED-1/2/5/6); RED-3/4 by single faithful subagent runs on throwaway fixtures (deterministic oracles — byte-identical + layer-(a) substring; N=6 tolerance-zero fleet documented in run-behavioral.md, NOT exhausted — 1 run sufficed to prove presence then closure); REAL-data re-test on live learnings.md (602 l): fidelity 0 false-positive vs old line-grep 13, census PROVEN counting both sides (HEAD/WORK not=107/114, no=64/71, never=34/34 — no category dropped). Scope held (only learnings.md), reverted after (measurement, not a kept prune). Clean tree verified first; git + cp backup.
|
||
- **Anomalies**: (1) SAFE ≠ USEFUL — compression (pass C) marginal on already-caveman dense content (~3.6% trim, registry GREW); real value = index-drift (D found 19 missing Index rows on the real learnings.md, measured then reverted) + merge (B), not C on dense. (2) RED-8 OPEN — fidelity proves "no negation DELETED", not "none ADDED wrongly"; visible empirically (not/no +7 in the measurement, passed under the drop-radar); remote (compression subtracts) but real. (3) RED-7 OPEN — merge LRN-014+016 maybe PRIMED by the SKILL.md example, not real overlap; verify. (4) two guard bugs caught by VALIDATION not logic: awk `\<` unsupported (mawk) → not/no counted 0; `NR==FNR` blind when working census empty = the deletion case → both fixed + re-validated. (5) REAL ANCHOR FOR PASS D — this very evals.md had EVAL-005's Index row MISSING (pre-existing drift), exactly what the skill's D pass auto-corrects; hand-backfilled in a separate `fix(memory)` commit. learnings.md likewise carries 19 missing Index rows (deferred to an intentional prune). D is NOT theoretical.
|
||
- **Action**: keep — safe, useful for B/D, compression marginal on dense (documented limit). `Fixed in v1.1 (TDD found it)` WAS the RED-1 defect (claim of a verify never run); TDD note now TRUE (real suite passes). Patterns → LRN-046/047/048; open items → tests/BACKLOG.md (RED-7/RED-8).
|
||
|
||
## EVAL-007 — Coupled-capitalize machinery (helper + include + 6 flows)
|
||
|
||
- **Date**: 2026-06-26
|
||
- **Output**: `lib/memory-commit.sh` + `lib/capitalize-commit.md` + wiring of 6 dev flows for the coupled-capitalize invariant (BDR-034).
|
||
- **Method**: TDD — RED harness first (helper absent → fail), then 13 deterministic tests (`lib/tests/run-deterministic.sh`): T1/T2 dangling code (untracked + pre-staged) not embarked, T2-bis stale-staged memory → working-tree committed, T3 idempotent no-op, T4 fail-closed on broken git state (exit 3), T5 TODO.md in scope, T6 stdout contract 3-case (hash/empty/empty), T7 double-run = at most one commit. Plus in-vivo e2e (code commit → capitalize writes memory → include commits it: 3 commits, memory commit `.claude/` only, dangling untouched, mem_hash ≠ code_hash). `shellcheck lib/*.sh` clean.
|
||
- **Anomalies**: (1) `git commit -- pathspec` strict-on-no-match — caught by real-exec, would have silently aborted the commit on the majority of flows (any run where one scoped path is clean) → fixed by `_changed_paths` BEFORE integration ([[LRN-051]]). (2) v1 helper emitted no clean hash → caught at include design (the reported `<hash>` would have been aspirational) → added `bbef41c` (hash→stdout, diag→stderr), proven by T6. Both caught by exec/review, not assumed.
|
||
- **Action**: keep.
|
||
- **Reference**: commits `58cb91d`..`df60df6`.
|
||
|
||
## EVAL-008 — Doc-sync coupled machinery (helper + include + 2 reorders + 3 inline)
|
||
|
||
- **Date**: 2026-06-27
|
||
- **What checked**: `lib/doc-commit.sh` + `lib/doc-commit.md` include + doc-syncer `PATCHED_FILES` output + 2 orchestrator reorders (ship-feature, init-project) + 3 inline wirings (feat/bugfix/hotfix).
|
||
- **Method**: 28/28 real-exec deterministic (`run-doc-commit.sh` T1a/b/c + T2–T7 — incl. inverse-exclusion REFUSE, MIXED refuse-all, argv space-safe T7), shellcheck clean, behavioral check doc (`run-doc-behavioral.md`, 2 scenarios), full external ref-sweep + per-ref verification.
|
||
- **Output**: 6 surgical commits `ae1f218` · `4a54a65` · `fb1f359` · `636b491` · `e81f629` · `1b01b95`. Caught + fixed a PRIOR-chantier latent ref bug (README:153, stale since e8eff7e's swap). Scope expanded mid-chantier (sweep found the inline-flow gap → 3 flows wired).
|
||
- **Anomalies**: (1) the deferred note ("reorder only") was WRONG → corrected in read-phase before any code ([[LRN-058]]). (2) init-project PARTIAL — [[BLK-010]]/[[BLK-011]] deferred = NEW work, surfaced not papered over. Both engraved in [[BDR-036]].
|
||
- **Action**: keep. BLK-010 (scaffold/unborn-HEAD) + BLK-011 (GSD ROADMAP post-FINISH) + MINOR-gate strengthening = separate chantiers.
|
||
|
||
## EVAL-009 — deploy skill subagent-driven build: multi-stage review + pressure-test net-positive
|
||
- **output**: `/deploy` skill (helper + SKILL.md + templates + bootstrap), built via subagent-driven-development (4 tasks; fresh implementer + per-task spec+quality review each).
|
||
- **method**: per-task review (sonnet; opus on the keystone) + writing-skills pressure-test (fresh agent on a `PENDING.json`+moved-HEAD fixture) + final whole-branch review (opus).
|
||
- **anomalies**: (1) the PLAN's code carried 3 latent bugs — missing `git add` for new files, SC2086 unquoted `$viol`, comment-before-shebang SC1128 — all caught by the implementer's TDD+shellcheck gate → plan-code is a DRAFT, the test gate is load-bearing. (2) the final whole-branch review caught 2 Important seam-bugs INVISIBLE to per-task reviews: target-repo `.claude/`-ignored silent no-op ([[LRN-066]]) + `NEXT.sh`-absence non-regeneration → holistic review earns its keep. (3) pressure-test confirmed the cold-resume discipline holds under temptation (the agent excluded the moved-HEAD `0034`). (4) a reviewer subagent bugged out once (user killed it) → re-dispatched clean (transient, not a finding).
|
||
- **action**: keep. Multi-stage adversarial review + a behavioral pressure-test caught classes of bug single-pass review misses — worth the cost on a keystone skill.
|
||
|
||
## EVAL-010 — prune-memory hardening (RED-7 fix + RED-8 accept + 34-row index backfill)
|
||
- **Date**: 2026-06-29
|
||
- **method**: read-first cartography (sub-agent, confirmed) → RED-7 closed by a DETERMINISTIC test ([[LRN-046]]) + STEP-2 example fictionalized → RED-8 re-reviewed, consciously accepted ([[LRN-047]]) → 34 missing Index rows composed + inserted in ID-sorted slots → STEP-4 verify zero MISSING/ORPHAN; deterministic suite all-green, shellcheck clean.
|
||
- **anomalies**: (1) RED-7 test FALSE-GREEN caught in real time — ugrep parsed `-9..` as an option → empty → green; fixed via /usr/bin/grep ([[LRN-074]]). The RED was WATCHED, not assumed. (2) RED-7 premise verified: LRN-014/016 ARE complementary → the old example modeled a WRONG merge, not just primed it. (3) backfill: 4/5 title-derived Applies-to (the awk-missed entries) missed a real future-app nuance on re-read → corrected before insert (without the 5-check, 4 Index rows would have diverged from source, engraved forever). (4) almost wrote a colliding EVAL-009 (deploy) — read the file first → EVAL-010. (5) pre-existing LRN-021 Index row out of ID-order → moved.
|
||
- **action**: keep. RED-7 GREEN (deterministic), RED-8 documented-accept, drift 34→0. [[LRN-073]] + [[LRN-074]] engraved — 2 pattern-families this session (fail-silent [[LRN-066]]/[[LRN-071]] + command-assumption [[LRN-074]]).
|
||
|
||
## EVAL-011 — /reconcile build (TDD): contaminated RED corrected, behavioral GREEN, dogfood on itself
|
||
- **Date**: 2026-06-30
|
||
- **output**: `lib/reconcile.sh` (engine) + `lib/tests/run-reconcile.sh` (20/20) + `lib/tests/fixtures/` (neutral) + `skills/reconcile/SKILL.md` (thin) + CLAUDE.md routing.
|
||
- **method**: 2-arm RED — GUIDED baselines (A/B, "use git + justify") SUCCEEDED = contaminated; UNGUIDED tempting baselines (a4872/a0f68) MIRRORED the TODO = real failure ([[LRN-075]]). RED-B = deterministic Index-ignore with TEETH (shim engine to read Index → reds). GREEN behavioral (a8404): same terrain + skill → verified via engine, stale reported done w/ SHAs, applied A/B/C gate, surfaced cross-ref as candidate. Dogfood: ran on the live repo, found its OWN chantier, marked S3 PARTIAL (routage absent per path oracle), not done.
|
||
- **anomalies**: (1) first baseline LEADING → corrected with an unguided control ([[LRN-075]]). (2) fixture-name false-signal — a0f68 read "pre-reconcile" from the dir name → re-froze fixtures neutral ([[LRN-077]]). (3) harness caught a real bug mid-build: BLK-004 status bleed from BLK-005's header ([[LRN-076]]). (4) META: marked `[x] routage DONE` BEFORE the CLAUDE.md edit succeeded (errored — Read-first) → created the exact declared-vs-real gap `/reconcile` traps, caught by the next verify. The tool's build produced the gap it detects.
|
||
- **action**: keep. RED watched red before green ([[LRN-074]] discipline), bug caught + fixed, behavioral loop closed.
|
||
|
||
## EVAL-012 — /release-candidate build (TDD): RED no-tag → GREEN 5/5, throwaway-repo replay
|
||
- **Date**: 2026-06-30
|
||
- **output**: `skills/release-candidate/SKILL.md` (thin orchestrator, 45 l) + `lib/tests/run-release-candidate.sh` (flow replay) + CLAUDE.md routing + CHANGELOG [Unreleased] entries (/reconcile + /release-candidate).
|
||
- **method**: read-first cartography (gitflow release wired: start L49 base=develop, finish L108-111 fan-out; grep-confirmed NO `git tag` → the gap). TDD on a throwaway repo: RED (`RC_TAG=0`) = start→prep→finish → 4 GREEN (fan-out / merge-back / branch-deleted / CHANGELOG) + 1 RED (tag v4.0.0 absent — gitflow never tags); GREEN (`RC_TAG=1`) = + `git tag -a` → 5/5, tag on main's merge commit. shellcheck clean (caught + fixed an SC2164 mid-build).
|
||
- **anomalies**: (1) versioning reasoning corrected by the user — number derives from change nature, not justification ([[LRN-078]]); caveman verified `Removed` not breaking from refs, not memory. (2) tag-in-skill consequence (direct-lib release wouldn't tag) made explicit + accepted, not left implicit. (3) layers kept distinct — this built+tested the skill; cutting the real v4.0.0 is a separate later act.
|
||
- **action**: keep. RED red for the right reason (gap = tag), GREEN closes it, teeth proven.
|
||
|
||
## EVAL-013 — /reconcile in REAL USAGE: unanticipated drift found + false-positive rejected off-fixture
|
||
- **Date**: 2026-06-30
|
||
- **output**: reconcile run on live claude-config repo (develop) → write-back `.claude/tasks/TODO.md` (commit `09200c5`, pushed): `/release-candidate` QUEUED→SHIPPED + 4 subtasks [ ]→[x]; 3 stale `[branch …]` headers→[DONE]. Engine `lib/reconcile.sh` orchestrated by hand (enumerate_ids + oracle_* probes + verdict + A/B/C gate). Distinct from [[EVAL-011]] (BUILD: fixture RED/GREEN + self-dogfood) — this = USAGE on fresh real drift.
|
||
- **method**: real run, no fixture. Per declared item, oracle vs git/fs: `oracle_path_present` (SKILL.md d3d6ced), `oracle_msg_committed`, `oracle_merge_done` (3 branches merged+deleted), tag v4.0.0 + version.txt. `blk_open` → 3 external (BLK-001/003/009, no drift). `deferrals` (marked) + `contradiction_candidates`. Measurable: 1 primary gap (/release-candidate QUEUED-but-done, oracle-proven) + 3 secondary (header-marker drift) found · 1 false positive rejected · 0 false gap asserted.
|
||
- **anomalies**: none wrong. 2 capabilities PROVEN that [[EVAL-011]] did NOT: (a) finds UNANTICIPATED gaps — the 3 `[branch X]` headers = a header-marker drift CLASS beyond checkbox drift, not designed-for, caught anyway (merge_done=YES + no local branch). Coverage wider than spec. (b) rejects FALSE POSITIVE on REAL data — `--help` candidate (BDR-001 title ⇄ TODO L134) surfaced as CANDIDATE not verdict; review → both WON'T-BUILD, aligned, not contradiction. Recursive coherence holds OFF-fixture. Design note: NO merge-time header-update hook — merge does merge, /reconcile = periodic catch (separation kept, finding 1).
|
||
- **action**: keep. Real-world value proven — known gap + 2 unknown + false-positive rejected, zero false assertion.
|