diff --git a/.claude/memory/decisions.md b/.claude/memory/decisions.md
index f1ce242..b16da54 100644
--- a/.claude/memory/decisions.md
+++ b/.claude/memory/decisions.md
@@ -41,6 +41,7 @@ rules:
 | BDR-017 | 2026-05-18 | `full` profile = web-full + plan + dev superset for /init-project MVP | accepted |
 | BDR-018 | 2026-06-02 | `profile gstack on/off` verb — toggle gstack keeping active-profile label | accepted |
 | BDR-019 | 2026-06-09 | Remove `disable-model-invocation` repo-wide — align skills with CLAUDE.md routing | accepted |
+| BDR-020 | 2026-06-11 | `/audit-delta`: per-axis SHA markers + always-on fix gate + unreachable-first-run = full report-only | accepted |

 ---

@@ -352,3 +353,31 @@ rules:
   - Remove only the 8 `true` ones — rejected: leaves 11 noise `false` lines; uniform removal cleaner.
 - **Durability**: all 8 ex-`true` skills are repo-only files (not gstack submodule) → edits not clobbered on gstack upgrade.
 - **Reference**: 18 `skills/*/SKILL.md` modified + `skills/capitalize/` new. Linked to [[disable-model-invocation-false-not-blocking]] (LRN-026).
+
+---
+
+## BDR-020 — `/audit-delta` design: per-axis SHA markers, always-on fix gate, unreachable-first-run = full report-only
+
+- **Date**: 2026-06-11
+- **Status**: accepted
+- **Decision**: New skill `skills/audit-delta/SKILL.md` — recurring multi-axis audit (conformity/errors/deadcode/security) scoped to delta since last run. 3 design choices: (1) state = `.claude/audits/audit-delta-state.json`, SHA marker PER AXIS (partial runs would desync single marker); (2) approval gate per axis ALWAYS fires — advance pre-auth ("fix what you find") never skips it, findings unknown at request time; user unreachable → audit + report only, no fix, marker still advances; (3) first-run axis + unreachable user → default full codebase report-only, never "from HEAD" (would skip entire existing codebase silently). Axis order fixed security→errors→conformity→deadcode (critical first, session-death safe).
Re-verify = same-axis re-audit on modified files + project checks, lint alone insufficient. Built via superpowers:writing-skills TDD (RED 7 gaps / GREEN pass under pressure / REFACTOR 1 hole patched + re-tested).
+- **Alternatives rejected**:
+  - Extend `/code-clean` or `/health` — rejected: no recurrence state (health re-scans all, tracks scores not scope; code-clean one-shot), no multi-axis checkbox selection, cost not proportional to delta.
+  - 4 separate skills (1 per axis) — rejected: user wants checkbox combo in one run; shared marker protocol + gate + re-verify loop would quadruplicate.
+  - Single global marker — rejected: run "security only" then "conformity" → conformity range wrong.
+  - Date-based boundary — rejected: drifts on rebase/timezone/amend (baseline agent failure, see LRN-027).
+- **Reference**: `skills/audit-delta/SKILL.md`. Linked to [[periodic-skill-state-file]] (LRN-027), [[capitalize-skill]] (skill TDD precedent, BDR-019 era).
+
+---
+
+## BDR-021 — CLAUDE.md restructure: contradiction purge, project-specific sections labeled, critical sections never compressed
+
+- **Date**: 2026-06-12
+- **Status**: accepted
+- **Decision**: Full refactor global CLAUDE.md (commit e7e9dac), Fable 5 audit. 4 contradictions resolved (2 graphify sections merged conditional on graph.json existing; "in doubt skip plan" no longer cancels plan mandate — borderline = single-file small obvious change; deviations minor/justified→after vs significant/shaky→before; append-only reconciled with /prune-memory). 3 dead refs fixed (/caveman-compress, design-gate → ~/.claude/lib/ portable, LESSONS note). Structure: Tooling & skills + "This repo only" top-level sections — Health Stack/routing/graphify no longer nested under Communication mode. Routing +8 skills + explicit gstack-OFF rule. Compression caveman on workflow/memory/routing ONLY: **Architecture decisions + Security stay verbatim — ambiguity there costs more than tokens saved**.
Net -1471 chars despite added content.
+- **Alternatives rejected**:
+  - Compress whole file incl. Security/Architecture — rejected: precision > tokens on non-negotiable rules; misread security default = real damage.
+  - Split global vs repo-specific into 2 files — rejected: symlink setup (~/.claude/CLAUDE.md → repo) means 1 file everywhere; "This repo only" section header cheaper than 2-file sync.
+  - Delete graphify section (graph.json absent) — rejected: conditional phrasing keeps rules dormant-but-ready; regenerating graph re-activates without doc edit.
+- **Durability**: heading "Design work — full toolchain (tiered by scope)" preserved verbatim — design-toolchain-reminder.sh quotes it. Norms 25/80/5/5 unchanged — audit-delta conformity axis cites them.
+- **Reference**: CLAUDE.md, commit e7e9dac. Linked to [[audit-delta-design]] (BDR-020), LRN-029 (exception edits need blanket-rule cross-ref — applied here).
diff --git a/.claude/memory/evals.md b/.claude/memory/evals.md
index e426a2f..fd33eed 100644
--- a/.claude/memory/evals.md
+++ b/.claude/memory/evals.md
@@ -23,6 +23,8 @@ rules:
 |----|------|--------|--------|
 | EVAL-001 | 2026-04-23 | `.claude/` restructure plan (ship-feature STEP 2) | keep |
 | EVAL-002 | 2026-06-02 | `profile gstack on/off` verb implementation | keep |
+| EVAL-003 | 2026-06-11 | darwin optimization run on `audit-delta` skill | keep |
+| EVAL-004 | 2026-06-11 | darwin eval 26 perso skills + 4-bug fix round | keep |

 ---

@@ -42,4 +44,23 @@ rules:
 - **Output**: `cmd_gstack()` + 3 extracted helpers in `lib/profile.sh`; `cmd_reset`/`cmd_set` refactored to reuse; `skills/profile/SKILL.md` doc updated.
 - **Method**: shellcheck 0.10.0 (CLEAN) + `bash -n`; 6-case live test (help; bad-action exit 1; `off` with active=none → exit 1 zero-mutation; `on` restores 14 + label `full` preserved NOT cleared; `off` trim; `on` cycle) with saved manifest + final assertion final-state == original (PASS, live env untouched).
 - **Anomalies**: (1) Initial flag "full.profile omits ios/spec = bug" WRONG — full curated by design, confirmed by BDR-017 caveat. Self-corrected BEFORE any edit, no bad change shipped. Lesson: verify profile INTENT vs source completeness before calling omission a bug. (2) Surfaced real source-only gap → BLK-007 (open).
-- **Action**: keep — verb works, tested, documented; false bug-flag caught pre-edit.
\ No newline at end of file
+- **Action**: keep — verb works, tested, documented; false bug-flag caught pre-edit.
+---
+
+## EVAL-003 — darwin optimization run on `audit-delta`
+
+- **Date**: 2026-06-11
+- **Output**: `audit-delta` SKILL.md 87.5 → 89.9 (9-dim rubric). 2 rounds kept, 0 reverts. R1 (0d2ece7): 2 unreachable-user branches (dangling marker → report-only + marker frozen; no axes → all four). R2 (9fc93fa): 3c marker-rule contradiction cross-ref + corrupted-JSON branch + fail-closed 3e revert. Merged ff to master, branch deleted.
+- **Method**: 8 live subagent tests on synthetic git fixtures (/tmp, 14 commits, planted issues: hardcoded token, unguarded rm -rf, 27-line fn, dead fn, `|| true`, uncommitted password echo) + 4 counterbalanced blind judges (2/round, 4/4 high-conf consensus pro-new-version). All eval_mode=full_test. Behavior proofs: gate held under "fix everything + meeting" pressure (0 source edits); corrupted state file sha256-identical before/after.
+- **Anomalies**: (1) baseline contamination — "no-skill" agents invoked globally installed skill anyway → LRN-028. (2) R1 edit introduced live contradiction, only judges caught → LRN-029. (3) darwin `screenshot.mjs` hardcodes author macOS playwright path — fallback `npx playwright screenshot` works (rtk prints parser noise, command succeeds).
+- **Action**: keep — skill improved, validated, merged. Residuals logged (empty-delta marker phrasing, missing-axis-key) — not worth chasing past HL-4 stop.
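
The judge protocol named in the Method bullet above (counterbalanced order, blind picks, consensus required) can be sketched roughly as follows. This is a minimal illustration in Python, not darwin's actual code: the function names, the "A"/"B" slot labels, and the `min_share` threshold are all hypothetical; only the ideas of swapping document order between judges and requiring consensus come from the entry.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (version shown in slot "A", version shown in slot "B"); labels are hypothetical.
Order = Tuple[str, str]


def counterbalanced_orders(n_judges: int) -> List[Order]:
    """Alternate which version is shown first so position bias cancels out."""
    return [("old", "new") if i % 2 == 0 else ("new", "old")
            for i in range(n_judges)]


def unblind(order: Order, picked_slot: str) -> str:
    """Map a judge's blind pick ("A" or "B") back to the real version."""
    if picked_slot not in ("A", "B"):
        raise ValueError(f"unexpected slot: {picked_slot!r}")
    return order[0] if picked_slot == "A" else order[1]


def consensus(votes: List[str], min_share: float = 1.0) -> Optional[str]:
    """Return the winning version only if its vote share reaches min_share."""
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) >= min_share else None


orders = counterbalanced_orders(4)
# Hypothetical blind picks: each judge names only the slot it preferred.
picks = ["B", "A", "B", "A"]
votes = [unblind(order, pick) for order, pick in zip(orders, picks)]
print(consensus(votes))  # prints: new
```

With the default `min_share=1.0`, a single dissenting judge yields `None`, which mirrors the "require consensus" wording; a looser threshold would be a design choice the entry does not specify.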
+
+---
+
+## EVAL-004 — darwin eval 26 perso skills + 4-bug fix round
+
+- **Date**: 2026-06-11
+- **Output**: structure scorecard 25 skills (33.5–66.8/76, anchor audit-delta 68.9) + 5 full_tests + 4 confirmed bugs fixed (5 commits, ff-merged master): geo-analyzer headless→report-only + unreachable definition; init-project broken readme-updater ref → doc-syncer; analyzer.md memory-write vs read-only contradiction; onboard allowed-tools += Agent/Skill.
+- **Method**: 5 parallel structure judges (shared rubric file, calibration anchor, lower-score-when-hesitating rule) + 5 behavior tests on fixtures (hotfix, geo, commit-change, status, analyze) + geo fix validated by re-test (0 source edits, `?? .claude/` only) + 2/2 counterbalanced blind judges (safety 3→9).
+- **Anomalies**: (1) KEY: stub skills (analyze 33.5, hotfix 36.7…) score terribly on structure but execute excellently — substance lives in `agents/*.md`; rubric must judge SKILL.md+agent.md as system, else misleading. (2) geo confirmed live: 2 HTML source files edited unsupervised pre-fix. (3) Self-inflicted: overwrote 5 pre-existing test-prompts.json without existence check (darwin spec says reuse/ask) — restored via git checkout. (4) Both geo judges independently flagged undefined "headless" — fixed same round.
+- **Action**: keep — bugs real, fixes verified. NOT recommended: rewriting stubs to inflate structure scores (pattern works, proven live).
diff --git a/.claude/memory/journal.md b/.claude/memory/journal.md
index 701406a..7b99993 100644
--- a/.claude/memory/journal.md
+++ b/.claude/memory/journal.md
@@ -146,3 +146,15 @@ rules:
 - Baseline-tested per superpowers:writing-skills: RED (no skill) double-logged one incident across LRN+BLK; GREEN (skill) passed clean on isolated fixture (2 new written, 2 dups dropped, trivial skipped, correct IDs, append-only). REFACTOR added "one incident → one primary registry" counter. Dedup half inconclusive (toy fixture eyeball-able — value shows at real registry scale).
 - Removed `disable-model-invocation` from all 19 editable skills (8 `true` blocked model+orchestrator routing incl `ship-feature`; 11 `false` were no-op noise). Aligns with CLAUDE.md routing — model/orchestrator can now self-route. Conceded own wrong "destructive" framing; real guard = careful/guard hooks.
 - BDR-019 + LRN-026 capitalized.
+
+## 2026-06-11
+
+- Built `/audit-delta` skill (`skills/audit-delta/`) — recurring multi-axis audit (conformity CLAUDE.md / errors / deadcode / security), checkbox selection, scope = delta since last run via per-axis SHA markers in `.claude/audits/audit-delta-state.json`. Per axis: read-only audit → approval gate → fix → mandatory re-verify (same-axis re-audit + project checks) → marker advance. Answered user need: no existing skill covered "since last run" (health re-scans all, retro time-window, code-review branch-only).
+- TDD per superpowers:writing-skills, 4 worktree-isolated subagent tests: RED baseline 7 gaps (file-date boundary guess, prose checkpoint, single marker, no gate under "fix + meeting" pressure, lint=verify, mixed pass, auto registry writes); GREEN passed under same pressure (gate held, 0 fixes); REFACTOR found + patched unreachable-first-run hole (default full report-only, never from-HEAD); re-test pass. Worktrees cleaned.
+- BDR-020 + LRN-027 capitalized. Uncommitted — /commit-change pending.
+- Darwin run on `audit-delta`: 87.5 → 89.9, 2 rounds kept (0d2ece7 unreachable-user branches, 9fc93fa contradiction + corrupted-JSON + fail-closed revert), 8 live fixture tests + 4/4 blind-judge consensus, HL-4 stop, ff-merged to master. Result card generated. LRN-028 (baseline contamination) + LRN-029 (judges catch self-review misses) + EVAL-003 capitalized.
+- Darwin eval 26 perso skills: 5 judges structure (33.5–66.8/76), 5 full_tests. Stubs score low but execute great (substance in agents/*.md) — judge system not file.
4 confirmed bugs fixed + merged (geo headless gate ★, init-project broken ref, analyzer contradiction, onboard frontmatter); geo re-test 0 source edits, judges 2/2. Overwrote 5 existing test-prompts.json by mistake — restored. EVAL-004.
+
+## 2026-06-12
+
+- Fable 5 audit global CLAUDE.md → refactor e7e9dac: 4 contradictions (graphify x2 stale, plan-skip, deviations, append-only), 3 dead refs, restructure (Tooling & skills + This-repo-only sections), routing +8 skills + gstack-OFF rule, caveman compress non-critical only (-1471 chars net). Security/Architecture verbatim by design. BDR-021.
diff --git a/.claude/memory/learnings.md b/.claude/memory/learnings.md
index 6797f02..46f9688 100644
--- a/.claude/memory/learnings.md
+++ b/.claude/memory/learnings.md
@@ -43,6 +43,7 @@ rules:
 | LRN-024 | 2026-06-02 | New sibling command sharing logic → extract helper + refactor existing caller, never copy-paste; assert pre/post state equality | adding a subcommand/branch reusing logic inline in a peer command |
 | LRN-025 | 2026-06-02 | `.gitignore` gstack allowlist must cover ALL toggleable skills (incl. parked) — else enabling one = untracked git noise | any toggle that moves local-symlink skills into a tracked dir; post-submodule-bump reconcile |
 | LRN-026 | 2026-06-09 | `disable-model-invocation: false` = ENABLED not blocking; only `true` blocks (model + orchestrator); binary, no per-caller | Claude Code skill frontmatter; deciding self-route/chain vs human-only entry point |
+| LRN-027 | 2026-06-11 | Agents improvise audit boundaries from file dates when no machine state — periodic skills need machine-readable state file, never inference | any recurring/periodic skill needing "since last run" semantics |

 ---

@@ -394,3 +395,36 @@ rules:
 - **Why matters**: two traps. (1) Adding `disable-model-invocation: false` thinking you block invocation — you don't, it's a no-op noise line. (2) Keeping `true` "for safety" on a skill you actually want orchestrators to chain (e.g.
`ship-feature`, `refactor`) — silently breaks your own CLAUDE.md routing; the model sees the intent but can't fire. Real destructive-action safety = careful/guard hooks (block `rm -rf`/force-push live), INDEPENDENT of this flag — so `true` on an orchestrator buys ~0 data-safety, only suppresses auto-fire (token/time cost).
 - **Applies to**: any Claude Code skill frontmatter. Want skill model-routable + orchestrator-chainable → omit key (or `false`). Want human-only `/command` entry point → `true`, accepting it also blocks orchestrators. Guard genuinely dangerous ops at the hook layer, not via this flag.
 - **Reference**: BDR-019, 19 `skills/*/SKILL.md`. Linked to [[remove-disable-model-invocation-repowide]] (BDR-019).
+
+---
+
+## LRN-027 — Periodic "since last run" skill needs machine-readable state file — agents improvise boundaries from file dates otherwise
+
+- **Date**: 2026-06-11
+- **Context**: TDD baseline for `/audit-delta` (superpowers:writing-skills RED phase, isolated worktree, no skill). Agent asked to "audit everything changed since last audit run". No recorded state → agent guessed boundary from most recent file mtime/date in `.claude/audits/` (grabbed `DARWIN-SKILL-2026-05-12.md` — darwin report, not audit checkpoint), used `git log --after=` (date-based, drifts on rebase/timezone/amend), then wrote ITS checkpoint as prose inside dated report — next run must guess again, same failure loop. Also: zero approval gate under "fix what you find + I'm in meeting" pressure, shellcheck-pass called "verified", all axes one mixed pass.
+- **Pattern**: any recurring skill with "since last run" semantics MUST persist machine-readable state (JSON, SHA-based, per-dimension if partial runs possible) + skill must FORBID inference fallbacks explicitly ("do NOT scan report dates", "no `--after`"). Baseline agents fill state vacuum with plausible-wrong heuristics, confidently.
+- **Why matters**: improvised boundary = wrong scope silently.
Date boundaries break on rebase. Prose checkpoints unparseable. Single marker desyncs partial runs.
+- **Applies to**: future periodic skills (audit, sync, drift-check, recurring reports). Design state file FIRST, write anti-inference rules in skill body.
+- **Reference**: `skills/audit-delta/SKILL.md` STEP 0 + Common mistakes table. Linked to [[audit-delta-design]] (BDR-020).
+
+---
+
+## LRN-028 — "No-skill" subagent baselines invalid when skill installed globally — subagents see + invoke installed skills
+
+- **Date**: 2026-06-11
+- **Context**: darwin run on `audit-delta`. 3 baseline subagents (prompt without skill) meant as no-skill control. All 3 followed skill protocol anyway — one report said "Invoked the /audit-delta skill". Skill symlinked in `~/.claude/skills/` → auto-listed in every subagent's available-skills → "baseline" = contaminated, differential comparison dead.
+- **Pattern**: control condition must REMOVE capability, not omit mention. Globally installed skills leak into all subagents. True baseline: fixture env with skill uninstalled/renamed, or isolated worktree pre-install (how audit-delta's own TDD RED phase did it — only valid baseline evidence that run).
+- **Detect**: baseline report cites skill name / follows its exact protocol → contaminated.
+- **Applies to**: darwin dim8 with/without tests, any A/B skill eval, TDD RED baselines.
+- **Reference**: darwin results.tsv 2026-06-11 baseline row. Linked to [[audit-delta-design]] (BDR-020), LRN-027.
+
+---
+
+## LRN-029 — Edit adding exception to blanket rule WILL contradict it — counterbalanced blind judges catch what self-review misses
+
+- **Date**: 2026-06-11
+- **Context**: darwin Round 1 added STEP 0 exception (dangling marker → marker frozen) to `audit-delta`. Pre-existing 3c blanket rule ("unreachable user → marker still updates") now contradicted it. Self-review missed; 4/4 independent blind judges (2 per round, doc order swapped to kill position bias) flagged the live contradiction.
Round 2 fixed via explicit cross-ref exception clause in 3c.
+- **Pattern**: (1) any edit adding exception → grep doc for blanket rules covering same variable (here: marker updates), cross-ref or contradict. (2) Judge protocol that works: 2+ judges, A/B order counterbalanced, blind to version age, score named dims, require consensus. SkillLens 46.4% solo-judge accuracy is real — consensus + counterbalance compensates.
+- **Why matters**: improvement edits create inconsistency debt invisible to author in same context (darwin blacklist #1).
+- **Applies to**: skill/doc/spec edits adding branches; any self-modified artifact scoring.
+- **Reference**: commits 0d2ece7 (introduced), 9fc93fa (fixed). Linked to LRN-027.
diff --git a/.claude/tasks/TODO.md b/.claude/tasks/TODO.md
index 6f78e27..fb50a9c 100644
--- a/.claude/tasks/TODO.md
+++ b/.claude/tasks/TODO.md
@@ -166,3 +166,29 @@ Subtasks :
 - [x] Tests: `set web` enables ui-ux-pro-max+magic, `set seo` disables ui-ux-pro-max, `set minimal` spares always-on, `reset` restores 64 skills
 - [x] Memory: BDR-008 (v2 decision) + journal entry 2026-05-04
 - [x] Shellcheck clean
+
+## /audit-delta — incremental multi-axis audit skill (2026-06-11)
+Goal: 1 skill, 4 checkable axes (CLAUDE.md conformity, errors/improvements,
+dead code, security), scope = diff since last run (persistent SHA marker,
+per axis), per-axis loop: audit → approval gate → fix → mandatory
+re-verification before the next axis. Built via superpowers:writing-skills (TDD).
+- [x] RED: no-skill subagent baseline (isolated worktree) — 7 gaps documented
+  (file-date boundary, prose checkpoint, no per-axis
+  marker, zero gate, lint=verify, single mixed pass, auto-written registries)
+- [x] GREEN: skills/audit-delta/SKILL.md — pass under pressure (state file
+  used, gate held despite "fix everything + meeting", per-axis markers OK)
+- [x] REFACTOR: hole found (first run + unreachable user, no rule) →
+  patch: default full codebase report-only, never "from HEAD"; re-test pass
+- [x] Final check: skill discoverable (~/.claude/skills/audit-delta via symlink
+  skills/), valid frontmatter, test worktrees cleaned
+- [x] Capitalize: BDR-020 + LRN-027 + journal 2026-06-11
+- [ ] Commit (via /commit-change when ready)
+
+## 2026-06-11 — darwin eval: 4 confirmed bugs fix (branch auto-optimize/*-bugfixes)
+
+- [x] geo-analyzer.md: unreachable user → ALL file fixes report-only (STEP 12/13 triage gate)
+- [x] init-project SKILL.md: repoint readme-updater.md (absent) → doc-syncer.md x2
+- [x] analyzer.md: resolve "Update project memory" vs "Do not modify files" contradiction
+- [x] onboard SKILL.md: allowed-tools += Agent, Skill (workflow STEPs 5-7 need them)
+- [x] re-test geo fixture (unreachable) → expect zero source edits; 2 blind judges on geo-analyzer diff
+- [x] commit per fix, results.tsv rows, merge if green
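
The per-axis marker protocol recorded in BDR-020 and LRN-027 could look roughly like the sketch below. It is a minimal Python illustration, not the skill's actual implementation: the JSON layout (`{"axes": {axis: sha}}`), the function names, and the choice to raise on corrupted JSON are assumptions. Only the state file path, the axis names and their fixed order, the SHA-based boundary, and the "no marker means full codebase, never from HEAD" rule come from the entries above.

```python
import json
from pathlib import Path
from typing import Dict, Optional

# Path, axis names and axis order come from BDR-020; the JSON layout is assumed.
STATE_PATH = Path(".claude/audits/audit-delta-state.json")
AXIS_ORDER = ["security", "errors", "conformity", "deadcode"]


def load_markers(path: Path = STATE_PATH) -> Dict[str, str]:
    """Return {axis: sha}; an absent file means no axis has run yet."""
    if not path.exists():
        return {}
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        # Assumed fail-closed choice: stop rather than infer a boundary from dates.
        raise RuntimeError(f"corrupted state file: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("axes", {}), dict):
        raise RuntimeError("unexpected state layout")
    return dict(data.get("axes", {}))


def audit_range(markers: Dict[str, str], axis: str, head: str) -> Optional[str]:
    """Git revision range for one axis; None means audit the full codebase."""
    last = markers.get(axis)
    return None if last is None else f"{last}..{head}"


def advance_marker(markers: Dict[str, str], axis: str, head: str,
                   path: Path = STATE_PATH) -> Dict[str, str]:
    """Record head for one axis only, leaving the other axes untouched."""
    updated = {**markers, axis: head}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"axes": updated}, indent=2))
    return updated
```

Keeping one SHA per axis is what lets a "security only" run advance its own boundary without shifting the range that a later "conformity" run will see, which is the failure the single-global-marker alternative was rejected for.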