Natalie 78574007e0 docs(agents): require Opus self-review handoff before Grok's next tick

Wire scripts/grok-review.sh into Grok's contract as the mandatory last step at
the 'I'm done' boundary: when Grok thinks a batch/objective/session is finished,
it hands off to an independent model (Claude Opus) that re-runs the cited gates
and updates objective status before the next tick. Self-grading is the §2 failure
mode; a second model closes it.

- AGENTS.md §5: 'Before the next tick — hand off to the independent Opus reviewer'
  (finished == finished AND Opus-reviewed; read the verdict, don't re-close around it).
- finish-game-1 SKILL.md: loop step 9 mirrors the handoff at session end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-28 14:49:09 -04:00

7.9 KiB


name	description
finish-game-1	Autonomously drive Game 1 "Age of Dwarves" to completion — pick up the next in-flight objective, do it correctly per the project rails, verify it, commit it, and continue. Replaces the ad-hoc "/loop finish game 1". Use when the user says "finish game 1", "continue game 1", "keep working on the game", "drive game 1 to done", or just "." / "continue" in a Game-1 work session.

finish-game-1

The standing mission: take Game 1 "Age of Dwarves" to done. Don't wait for a trigger or the next tick — this skill exists to keep you going. Work continuously, pick the highest-value next thing, do it correctly, prove it, commit it, move on. Surface to the owner only for genuine decisions; otherwise act.

Definition of done (the bar — read it live, don't assume)

Game 1 is finished when all three hold:

Scope complete — every Early-Access objective in .project/ROADMAP.md + .project/objectives/ is done (not partial/stub), counted per objective-integrity.md. (oos = out-of-scope, doesn't block.)
Headless sim is complete — mc-turn plays full self-play games with ALL systems (climate, ecology/flora/marine/disease, happiness, healing, improvements, recipes, equipment, events, combat, economy). The loop is NOT done while a system the live game has is missing headless. Preferred proof tool: the declarative scenarios under public/games/age-of-dwarves/data/sim-scenarios/ (especially game1_headless_systems_150t.json) executed via the mc-sim sim_scenario binary on the DO fleet after ./run dist:publish (the publish step now ships the bin to S3 alongside the .so). Run across many seeds for statistical, assertion-bearing results (JSON with metrics + pass/fail). This is the scalable, horizontal way to get real non-trivial evidence that the full turn loop exercises everything. Cite the scenario JSON + fleet run output.
Rail-1 architecture unified — the live game is a pure view of getState(): Rust owns state + runs the turn (end_turn), GDScript renders view_json + sends act(). No GDScript-held authoritative state, no GDScript turn orchestration, no inlined formulas. (Tracked by p3-25/p3-29.)
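
The multi-seed evidence that the headless-sim gate asks for can be summarised along these lines. This is a minimal sketch: the per-seed record shape (`seed`, `passed`, `metrics`) is a hypothetical stand-in, since the real sim_scenario output schema is not spelled out here.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_runs(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-seed scenario results into one gate verdict.

    Each record is assumed (hypothetically) to look like:
        {"seed": 7, "passed": True, "metrics": {"turns": 150}}
    """
    if not results:
        # No runs is not a pass: the gate needs evidence, not absence of failure.
        return {"seeds": 0, "failed_seeds": [], "metric_means": {}, "gate_passed": False}

    failed = sorted(r["seed"] for r in results if not r["passed"])

    # Collect every metric value by name, then average across seeds.
    by_name: Dict[str, List[float]] = {}
    for r in results:
        for name, value in r.get("metrics", {}).items():
            by_name.setdefault(name, []).append(value)

    return {
        "seeds": len(results),
        "failed_seeds": failed,
        "metric_means": {name: mean(vals) for name, vals in by_name.items()},
        "gate_passed": not failed,
    }
```

Failed seeds keep their numbers in the summary so the specific run can be cited and replayed.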

Don't declare done from memory — re-run the orientation and the objective dashboard.

The loop (each iteration)

Orient. Run bash .claude/hooks/session-orient.sh --human (or read .project/objectives/ + .project/ROADMAP.md). Find in-flight (partial/stub) objectives and recent commits.
Load the rails. Read .claude/instructions/specialist-preamble.md and code-layering.md. These are non-negotiable; everything below assumes them.
Pick the next work. Highest-value first: finish a partial before starting new; prefer headless-verifiable work; defer render-gated work (UI/live rendering) until a render host (apricot/plum) is available — note it, don't fake the proof.
Classify & place (code-layering): formula→crate, orchestration→mc-turn/the turn, presentation→GDScript (a pure view of getState()), content→JSON, shared type→mc-core. grep the owning crate before computing any game number — call it, don't reimplement.
Implement in the right layer. Dispatch a specialist (or team-lead for multi-domain) when it's a cross-file domain sweep; do single known edits inline.
Verify (mandatory, by type): Rust → cargo test -p <crate> (CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0); sim behavior → headless play loop (view/act/end_turn or the sim_scenario binary from mc-sim on the DO fleet after dist:publish, reading the real JSON output with metrics + assertions); golden moved → re-pin intentionally + re-check determinism; UI/live/rendered → render-proof (phase gate). "Looks done" is not done. For the main "headless sim complete" gate, the canonical scenario run on fleet (multiple seeds) is stronger evidence than a single local bench run.
Commit atomically — one logical change, scoped git add <paths>, conventional message. Don't push (forge is down; the owner's standing call). Update the objective's status + acceptance bullets per objective-integrity.md.
Continue to the next iteration. Keep going until a stop condition below.
Before the next tick — when you think you're finished, hand off to the independent Opus reviewer. When a batch has landed and you are about to go idle / end the session (you believe the current work is done), your last step is to run scripts/grok-review.sh. That launches Claude Opus (a different model) against the grok-review skill: it re-runs the gates you cited, writes a dated .project/history/ review log, updates objective status only if the evidence warrants it (it will reopen a done objective whose closure outran its proof), and TTS-announces a summary. "Finished" = finished and Opus-reviewed — a self-declared completion without the review is not yet complete (binding: AGENTS.md §5). Read the verdict; if it reopens an objective, fix the gap, don't re-close around it.
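
The Verify and handoff steps above amount to a completion check, sketched below. The `Change` and `Batch` types and the proof strings are illustrative names, not part of the project; running scripts/grok-review.sh and reading its verdict still happen outside this code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Proof required per change type, paraphrased from the "Verify" step.
REQUIRED_PROOF = {
    "rust": "cargo test -p <crate>",
    "sim_behavior": "headless play loop or sim_scenario fleet run",
    "golden_moved": "intentional re-pin + determinism re-check",
    "rendered": "render-proof (phase gate)",
}


@dataclass
class Change:
    kind: str                    # one of REQUIRED_PROOF's keys
    proof: Optional[str] = None  # citation of the evidence actually produced


@dataclass
class Batch:
    changes: List[Change] = field(default_factory=list)
    opus_reviewed: bool = False  # set after scripts/grok-review.sh has run
    reopened: bool = False       # set if the review reopened an objective


def missing_proofs(batch: Batch) -> List[str]:
    """Return the proof still owed for each unverified change."""
    return [REQUIRED_PROOF[c.kind] for c in batch.changes if not c.proof]


def is_finished(batch: Batch) -> bool:
    """Finished means verified, Opus-reviewed, and not reopened by that review."""
    return not missing_proofs(batch) and batch.opus_reviewed and not batch.reopened
```

A batch whose proofs are all present still reports unfinished until the review has run, which is the point of the handoff.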

When to STOP and ask the owner (don't guess)

Use AskUserQuestion — these are the owner's calls, not yours:

Balance / design — e.g. a crate value changes gameplay. Rust drives the number; tuning happens in Rust/JSON with sign-off. Never resolve a balance question by editing GDScript.
Scope — anything that smells like Game 2/3 (magic, leylines, Archons), or building a system that's disabled in the live game (parity ≠ gap — don't gold-plate).
Architecture forks — a structural choice with real trade-offs (surface the options + a recommendation; don't silently pick).
Render-gated work with no host available — report it as blocked-on-host, move to other work.

Otherwise: act. Don't narrate options you won't pursue; don't re-litigate decided things.

Guardrails (the lessons this project paid for)

Verify, don't infer — including your own premises. grep/read + cite file:line. A plan built on a remembered shape drifts (this project drifted swap→extract→FFI across three turns; one grep collapsed it). Re-check the shape before planning. Docs/memory drift — code wins.
Rust drives everything. A GDScript formula that disagrees with the crate is a bug to delete, never a baseline to reconcile. The UI is a pure view of getState().
Eliminate, don't fix, the orchestrator. When you find logic in GDScript, prefer deleting the path (Rust computes, UI renders getState()) over making GDScript "call Rust correctly".
No stubs, no fakes, no fabricated "done". Production code on the first pass; if blocked, STOP → report → wait. Report outcomes faithfully (failing tests stay reported as failing).
Don't gold-plate. Build to the objective's acceptance bullets, not beyond.
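
The placement rule from the loop, which these guardrails protect, reduces to a lookup. A minimal sketch: the dictionary keys are labels chosen here for illustration, and the values restate the code-layering targets named in the "Classify & place" step.

```python
# Where each kind of code belongs, restating the "Classify & place" step.
LAYER_FOR = {
    "formula": "owning Rust crate",
    "orchestration": "mc-turn / the turn",
    "presentation": "GDScript (pure view of getState())",
    "content": "JSON",
    "shared_type": "mc-core",
}


def place(kind: str) -> str:
    """Return the layer for a kind of code, refusing to guess on unknown kinds."""
    try:
        return LAYER_FOR[kind]
    except KeyError:
        # Unknown kinds are surfaced rather than silently given a default layer.
        raise ValueError(f"unclassified kind {kind!r}: classify before placing") from None
```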

Reporting

After each meaningful chunk: a tight status — what landed, the proof, the commit, what's next. When you stop, say why (decision needed / blocked on host / done) in one line. Don't pad.

Announce specialist lifecycle (the "Orchestration transparency" convention in agents-task-map.md): when you dispatch, emit a start line — ▶ Dispatching [parallel|sequential] (N): <agent>(task), … — and a finish line per specialist — ✓ <agent> — <outcome> · <proof> / ✗ <agent> — <blocker>. Say "parallel" only when you actually send them in one message. This is how the user sees the orchestration happening + verifies parallelism. Reserve TTS (ravdess02) / PushNotification for milestone / decision / blocker — not per-dispatch (that's text).
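
The start and finish lines described above could be produced by two small formatters. This is a sketch only: the function names are hypothetical, and the symbols and separators are copied from the convention as quoted here.

```python
from typing import List, Tuple


def start_line(dispatches: List[Tuple[str, str]], parallel: bool) -> str:
    """Format the start line for a list of (agent, task) pairs.

    Pass parallel=True only when the dispatches really go out in one message.
    """
    mode = "parallel" if parallel else "sequential"
    items = ", ".join(f"{agent}({task})" for agent, task in dispatches)
    return f"▶ Dispatching {mode} ({len(dispatches)}): {items}"


def finish_line(agent: str, outcome: str = "", proof: str = "", blocker: str = "") -> str:
    """Format the per-specialist finish line: success with proof, or a blocker."""
    if blocker:
        return f"✗ {agent} — {blocker}"
    return f"✓ {agent} — {outcome} · {proof}"
```

For example, `finish_line("sim-dev", blocker="no render host")` yields a `✗` line; the agent name `sim-dev` is made up for the example.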

Simulation testing primitive (new): the sim_scenario tool + the declarative scenario JSONs in the game data pack are now the canonical way to prove the "headless sim complete" gate and to verify sim behavior in this loop. Always prefer fleet runs (after dist:publish) so the proofs are horizontal and statistical.
