Capitalize the doc-sync coupled chantier: BDR-036 (the invariant, 3 honesties engraved — built-not-reordered, MINOR non-gated surface-replaces-gate, init-project partial + sweep scope-expansion); LRN-058 (same bug-class != same fix — verify the twin's precondition); LRN-059 (swap flips meanings, sweep caught prior-chantier debt README:153 != letter-suffix insertion); LRN-060 (fail-closed guard proven by what it refuses, loudly; argv not separator-string); EVAL-008 (28/28 real-exec, anomalies surfaced). Journal 2026-06-27. BLK-010/011 flags + the frozen plan + TODO checkmarks. BLK-011 record left at STEP 13 (append-only); only the TODO locator moved to STEP 12 (live locator vs immutable record). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ho5EQCFTSvYamuRtVZpp2d
12 KiB
12 KiB
| type | entry_prefix | schema | rules | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| evals_registry | EVAL |
|
|
Evals registry (EVAL)
Index
| ID | Date | Output | Action |
|---|---|---|---|
| EVAL-001 | 2026-04-23 | .claude/ restructure plan (ship-feature STEP 2) |
keep |
| EVAL-002 | 2026-06-02 | profile gstack on/off verb implementation |
keep |
| EVAL-003 | 2026-06-11 | darwin optimization run on audit-delta skill |
keep |
| EVAL-004 | 2026-06-11 | darwin eval 26 perso skills + 4-bug fix round | keep |
| EVAL-005 | 2026-06-23 | Obsolete claude --effort max alias missed across Step 9 edits |
correct |
| EVAL-006 | 2026-06-25 | prune-memory v1.1 TDD — 6 guards (0a3e766), validated on real data |
keep |
| EVAL-007 | 2026-06-26 | Coupled-capitalize machinery — TDD 13 + e2e, surgical scope proven | keep |
| EVAL-008 | 2026-06-27 | Doc-sync coupled machinery — 28/28 real-exec, swap-sweep caught prior debt | keep |
EVAL-001 — .claude/ restructure plan
- Date: 2026-04-23
- Output: 21-task plan migrate
tasks/to.claude/tasks/+ create.claude/memory/+.claude/audits/+ integrate CAPITALIZE across 5 skills + add/closeskill. - Method: manual review of 5 impacted skills/agents; verified
rtkpath-agnostic; confirmed~/.claude/CLAUDE.mdsymlinks to project (single file edit). Radical-honesty check on session-close ritual: confirmed aspirational without skill integration → scope expanded to Option D. - Anomalies: none blocking. Note:
tasks/LESSONS.mdempty (101B, header only) — migration tolearnings.mdsymbolic. - Action: keep — plan validated, ready for execution.
EVAL-002 — profile gstack on|off verb implementation
- Date: 2026-06-02
- Output:
cmd_gstack()+ 3 extracted helpers inlib/profile.sh;cmd_reset/cmd_setrefactored to reuse;skills/profile/SKILL.mddoc updated. - Method: shellcheck 0.10.0 (CLEAN) +
bash -n; 6-case live test (help; bad-action exit 1;offwith active=none → exit 1 zero-mutation;onrestores 14 + labelfullpreserved NOT cleared;offtrim;oncycle) with saved manifest + final assertion final-state == original (PASS, live env untouched). - Anomalies: (1) Initial flag "full.profile omits ios/spec = bug" WRONG — full curated by design, confirmed by BDR-017 caveat. Self-corrected BEFORE any edit, no bad change shipped. Lesson: verify profile INTENT vs source completeness before calling omission a bug. (2) Surfaced real source-only gap → BLK-007 (open).
- Action: keep — verb works, tested, documented; false bug-flag caught pre-edit.
EVAL-003 — darwin optimization run on audit-delta
- Date: 2026-06-11
- Output:
audit-deltaSKILL.md 87.5 → 89.9 (9-dim rubric). 2 rounds kept, 0 reverts. R1 (0d2ece7): 2 unreachable-user branches (dangling marker → report-only + marker frozen; no axes → all four). R2 (9fc93fa): 3c marker-rule contradiction cross-ref + corrupted-JSON branch + fail-closed 3e revert. Merged ff to master, branch deleted. - Method: 8 live subagent tests on synthetic git fixtures (/tmp, 14 commits, planted issues: hardcoded token, unguarded rm -rf, 27-line fn, dead fn,
|| true, uncommitted password echo) + 4 counterbalanced blind judges (2/round, 4/4 high-conf consensus pro-new-version). All eval_mode=full_test. Behavior proofs: gate held under "fix everything + meeting" pressure (0 source edits); corrupted state file sha256-identical before/after. - Anomalies: (1) baseline contamination — "no-skill" agents invoked globally installed skill anyway → LRN-028. (2) R1 edit introduced live contradiction, only judges caught → LRN-029. (3) darwin
screenshot.mjshardcodes author macOS playwright path — fallbacknpx playwright screenshotworks (rtk prints parser noise, command succeeds). - Action: keep — skill improved, validated, merged. Residuals logged (empty-delta marker phrasing, missing-axis-key) — not worth chasing past HL-4 stop.
EVAL-004 — darwin eval 26 perso skills + 4-bug fix round
- Date: 2026-06-11
- Output: structure scorecard 25 skills (33.5–66.8/76, anchor audit-delta 68.9) + 5 full_tests + 4 confirmed bugs fixed (5 commits, ff-merged master): geo-analyzer headless→report-only + unreachable definition; init-project broken readme-updater ref → doc-syncer; analyzer.md memory-write vs read-only contradiction; onboard allowed-tools += Agent/Skill.
- Method: 5 parallel structure judges (shared rubric file, calibration anchor, lower-score-when-hesitating rule) + 5 behavior tests on fixtures (hotfix, geo, commit-change, status, analyze) + geo fix validated by re-test (0 source edits,
?? .claude/only) + 2/2 counterbalanced blind judges (safety 3→9). - Anomalies: (1) KEY: stub skills (analyze 33.5, hotfix 36.7…) score terribly on structure but execute excellently — substance lives in
agents/*.md; rubric must judge SKILL.md+agent.md as system, else misleading. (2) geo confirmed live: 2 HTML source files edited unsupervised pre-fix. (3) Self-inflicted: overwrote 5 pre-existing test-prompts.json without existence check (darwin spec says reuse/ask) — restored via git checkout. (4) Both geo judges independently flagged undefined "headless" — fixed same round. - Action: keep — bugs real, fixes verified. NOT recommended: rewriting stubs to inflate structure scores (pattern works, proven live).
EVAL-005 — Obsolete claude --effort max alias missed across repeated Step 9 edits
- Date: 2026-06-23
- Output checked: install-plugins.sh Step 9 kept
alias claude='claude --effort max'whilesettings.jsonsets"effortLevel": "xhigh"(the source of truth). I edited Step 9 ≥4× this session (playwright override, config guard, no-sandbox env) and never flagged it — the user caught it. - Method / why missed: I treated the pre-existing
CLAUDE_LINESas established and only touched the lines I was adding/removing. Spotting the redundancy needs cross-referencing TWO config layers (shell alias vs settings.json) — a semantic check I never ran. Masked further: the user's.bashrcis hand-managed and the alias line wasn't even present, so it looked inert. - Anomaly: not just dead config — a CLI flag (
--effort max) silently OVERRIDES the settings.json value (xhigh). Real correctness bug. - Action: when editing installer shell-config, audit EACH existing line against the current settings.json / CLAUDE.md source of truth, not only the lines being changed. Removed the alias + added cleanup. General rule: reconcile config to ONE source of truth across env/alias/settings layers.
EVAL-006 — prune-memory v1.1: 6 dangers TDD-closed, validated on real data
- Date: 2026-06-25
- Output: prune-memory (only DESTRUCTIVE skill, never tested before) TDD'd end-to-end. 6 dangers, each proven by a failing RED then closed by a DETERMINISTIC guard: RED-1 false "verified" claim removed; RED-2 dirty tree → real
exit 1(was prose-only STOP); RED-3 negation-sentence verbatim guard (no silent inversion); RED-4 collapse safety-critical exception (NEVER/ALWAYS/PERMANENT entry INTOUCHABLE); RED-5 STEP-4 fidelity census (count-based, per-entry × per-category); RED-6 trailing-space false-ORPHAN fix. Guards committed in0a3e766; tests:tests/run-deterministic.sh+run-behavioral.md+ fixtures + BACKLOG. - Method: run-deterministic all-green (RED-1/2/5/6); RED-3/4 by single faithful subagent runs on throwaway fixtures (deterministic oracles — byte-identical + layer-(a) substring; N=6 tolerance-zero fleet documented in run-behavioral.md, NOT exhausted — 1 run sufficed to prove presence then closure); REAL-data re-test on live learnings.md (602 l): fidelity 0 false-positive vs old line-grep 13, census PROVEN counting both sides (HEAD/WORK not=107/114, no=64/71, never=34/34 — no category dropped). Scope held (only learnings.md), reverted after (measurement, not a kept prune). Clean tree verified first; git + cp backup.
- Anomalies: (1) SAFE ≠ USEFUL — compression (pass C) marginal on already-caveman dense content (~3.6% trim, registry GREW); real value = index-drift (D found 19 missing Index rows on the real learnings.md, measured then reverted) + merge (B), not C on dense. (2) RED-8 OPEN — fidelity proves "no negation DELETED", not "none ADDED wrongly"; visible empirically (not/no +7 in the measurement, passed under the drop-radar); remote (compression subtracts) but real. (3) RED-7 OPEN — merge LRN-014+016 maybe PRIMED by the SKILL.md example, not real overlap; verify. (4) two guard bugs caught by VALIDATION not logic: awk
\<unsupported (mawk) → not/no counted 0;NR==FNRblind when working census empty = the deletion case → both fixed + re-validated. (5) REAL ANCHOR FOR PASS D — this very evals.md had EVAL-005's Index row MISSING (pre-existing drift), exactly what the skill's D pass auto-corrects; hand-backfilled in a separatefix(memory)commit. learnings.md likewise carries 19 missing Index rows (deferred to an intentional prune). D is NOT theoretical. - Action: keep — safe, useful for B/D, compression marginal on dense (documented limit).
Fixed in v1.1 (TDD found it)WAS the RED-1 defect (claim of a verify never run); TDD note now TRUE (real suite passes). Patterns → LRN-046/047/048; open items → tests/BACKLOG.md (RED-7/RED-8).
EVAL-007 — Coupled-capitalize machinery (helper + include + 6 flows)
- Date: 2026-06-26
- Output:
lib/memory-commit.sh+lib/capitalize-commit.md+ wiring of 6 dev flows for the coupled-capitalize invariant (BDR-034). - Method: TDD — RED harness first (helper absent → fail), then 13 deterministic tests (
lib/tests/run-deterministic.sh): T1/T2 dangling code (untracked + pre-staged) not embarked, T2-bis stale-staged memory → working-tree committed, T3 idempotent no-op, T4 fail-closed on broken git state (exit 3), T5 TODO.md in scope, T6 stdout contract 3-case (hash/empty/empty), T7 double-run = at most one commit. Plus in-vivo e2e (code commit → capitalize writes memory → include commits it: 3 commits, memory commit.claude/only, dangling untouched, mem_hash ≠ code_hash).shellcheck lib/*.shclean. - Anomalies: (1)
git commit -- pathspecstrict-on-no-match — caught by real-exec, would have silently aborted the commit on the majority of flows (any run where one scoped path is clean) → fixed by_changed_pathsBEFORE integration (LRN-051). (2) v1 helper emitted no clean hash → caught at include design (the reported<hash>would have been aspirational) → addedbbef41c(hash→stdout, diag→stderr), proven by T6. Both caught by exec/review, not assumed. - Action: keep.
- Reference: commits
58cb91d..df60df6.
EVAL-008 — Doc-sync coupled machinery (helper + include + 2 reorders + 3 inline)
- Date: 2026-06-27
- What checked:
lib/doc-commit.sh+lib/doc-commit.mdinclude + doc-syncerPATCHED_FILESoutput + 2 orchestrator reorders (ship-feature, init-project) + 3 inline wirings (feat/bugfix/hotfix). - Method: 28/28 real-exec deterministic (
run-doc-commit.shT1a/b/c + T2–T7 — incl. inverse-exclusion REFUSE, MIXED refuse-all, argv space-safe T7), shellcheck clean, behavioral check doc (run-doc-behavioral.md, 2 scenarios), full external ref-sweep + per-ref verification. - Output: 6 surgical commits
ae1f218·4a54a65·fb1f359·636b491·e81f629·1b01b95. Caught + fixed a PRIOR-chantier latent ref bug (README:153, stale since e8eff7e's swap). Scope expanded mid-chantier (sweep found the inline-flow gap → 3 flows wired). - Anomalies: (1) the deferred note ("reorder only") was WRONG → corrected in read-phase before any code (LRN-058). (2) init-project PARTIAL — BLK-010/BLK-011 deferred = NEW work, surfaced not papered over. Both engraved in BDR-036.
- Action: keep. BLK-010 (scaffold/unborn-HEAD) + BLK-011 (GSD ROADMAP post-FINISH) + MINOR-gate strengthening = separate chantiers.