Wire scripts/grok-review.sh into Grok's contract as the mandatory last step at the 'I'm done' boundary: when Grok thinks a batch/objective/session is finished, it hands off to an independent model (Claude Opus) that re-runs the cited gates and updates objective status before the next tick. Self-grading is the §2 failure mode; a second model closes it.

- AGENTS.md §5: 'Before the next tick — hand off to the independent Opus reviewer' (finished == finished AND Opus-reviewed; read the verdict, don't re-close around it).
- finish-game-1 SKILL.md: loop step 9 mirrors the handoff at session end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| name | description |
|---|---|
| finish-game-1 | Autonomously drive Game 1 "Age of Dwarves" to completion — pick up the next in-flight objective, do it correctly per the project rails, verify it, commit it, and continue. Replaces the ad-hoc "/loop finish game 1". Use when the user says "finish game 1", "continue game 1", "keep working on the game", "drive game 1 to done", or just "." / "continue" in a Game-1 work session. |
finish-game-1
The standing mission: take Game 1 "Age of Dwarves" to done. Don't wait for a trigger or the next tick — this skill exists to keep you going. Work continuously, pick the highest-value next thing, do it correctly, prove it, commit it, move on. Surface to the owner only for genuine decisions; otherwise act.
Definition of done (the bar — read it live, don't assume)
Game 1 is finished when all three hold:
- Scope complete: every Early-Access objective in `.project/ROADMAP.md` + `.project/objectives/` is `done` (not `partial`/`stub`), counted per `objective-integrity.md`. (`oos` = out-of-scope, doesn't block.)
- Headless sim is complete: `mc-turn` plays full self-play games with ALL systems (climate, ecology/flora/marine/disease, happiness, healing, improvements, recipes, equipment, events, combat, economy). The loop is NOT done while a system the live game has is missing headless. Preferred proof tool: the declarative scenarios under `public/games/age-of-dwarves/data/sim-scenarios/` (especially `game1_headless_systems_150t.json`) executed via the `mc-sim` `sim_scenario` binary on the DO fleet after `./run dist:publish` (the publish step now ships the bin to S3 alongside the .so). Run across many seeds for statistical, assertion-bearing results (JSON with metrics + pass/fail). This is the scalable, horizontal way to get real, non-trivial evidence that the full turn loop exercises everything. Cite the scenario JSON + fleet run output.
- Rail-1 architecture unified: the live game is a pure view of `getState()`. Rust owns state + runs the turn (`end_turn`), GDScript renders `view_json` + sends `act()`. No GDScript-held authoritative state, no GDScript turn orchestration, no inlined formulas. (Tracked by p3-25/p3-29.)
Don't declare done from memory — re-run the orientation and the objective dashboard.
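The scope-complete rule above can be sketched as a small check. This is a minimal illustration, not the project's tooling: the function name, the objective ids, and the in-memory mapping are hypothetical, and the real statuses live under `.project/objectives/`. Only the status labels (`done`, `partial`, `stub`, `oos`) come from the text. Any status outside `done`/`oos` is treated as blocking, which is a conservative assumption.

```python
# Statuses that do not block completion, per the definition of done.
NON_BLOCKING = {"done", "oos"}


def scope_complete(statuses: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (complete, blockers) for a mapping of objective id -> status.

    Anything that is not `done` or `oos` (for example `partial` or `stub`)
    counts as a blocker.
    """
    blockers = sorted(
        oid for oid, status in statuses.items() if status not in NON_BLOCKING
    )
    return (not blockers, blockers)


# Hypothetical snapshot with made-up objective ids.
snapshot = {"obj-a": "partial", "obj-b": "done", "obj-c": "oos"}
print(scope_complete(snapshot))  # (False, ['obj-a'])
```

Returning the blockers alongside the boolean keeps the result citable: a "not done" answer names the objectives that still need work.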
The loop (each iteration)
1. Orient. Run `bash .claude/hooks/session-orient.sh --human` (or read `.project/objectives/` and `.project/ROADMAP.md`). Find in-flight (`partial`/`stub`) objectives and recent commits.
2. Load the rails. Read `.claude/instructions/specialist-preamble.md` and `code-layering.md`. These are non-negotiable; everything below assumes them.
3. Pick the next work. Highest-value first: finish a `partial` before starting new; prefer headless-verifiable work; defer render-gated work (UI/live rendering) until a render host (apricot/plum) is available. Note it, don't fake the proof.
4. Classify & place (code-layering): formula→crate, orchestration→`mc-turn`/the turn, presentation→GDScript (a pure view of `getState()`), content→JSON, shared type→`mc-core`. `grep` the owning crate before computing any game number; call it, don't reimplement.
5. Implement in the right layer. Dispatch a specialist (or `team-lead` for multi-domain) when it's a cross-file domain sweep; do single known edits inline.
6. Verify (mandatory, by type): Rust → `cargo test -p <crate>` (`CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0`); sim behavior → headless play loop (view/act/end_turn or the `sim_scenario` binary from `mc-sim` on the DO fleet after `dist:publish`, reading the real JSON output with metrics + assertions); golden moved → re-pin intentionally + re-check determinism; UI/live/rendered → render-proof (phase gate). "Looks done" is not done. For the main "headless sim complete" gate, the canonical scenario run on fleet (multiple seeds) is stronger evidence than a single local bench run.
7. Commit atomically: one logical change, scoped `git add <paths>`, conventional message. Don't push (forge is down; the owner's standing call). Update the objective's status + acceptance bullets per `objective-integrity.md`.
8. Continue to the next iteration. Keep going until a stop condition below.
9. Before the next tick: when you think you're finished, hand off to the independent Opus reviewer. When a batch has landed and you are about to go idle / end the session (you believe the current work is done), your last step is to run `scripts/grok-review.sh`. That launches Claude Opus (a different model) against the `grok-review` skill: it re-runs the gates you cited, writes a dated `.project/history/` review log, updates objective status only if the evidence warrants it (it will reopen a `done` objective whose closure outran its proof), and TTS-announces a summary. "Finished" = finished and Opus-reviewed; a self-declared completion without the review is not yet complete (binding: `AGENTS.md §5`). Read the verdict; if it reopens an objective, fix the gap, don't re-close around it.
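The ordering in the "Pick the next work" step can be sketched as a sort with a deferral filter. This is a rough illustration under stated assumptions: the `Candidate` type, its fields, the `"new"` status label, and the example names are all hypothetical. "Highest-value" remains a judgment call, and the sketch encodes only the tie-break rules the step spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one piece of candidate work."""
    name: str
    status: str                # "partial", "stub", or "new" (assumed label)
    headless_verifiable: bool
    render_gated: bool


def pick_next(candidates, render_host_available=False):
    """Return (chosen, deferred).

    Render-gated work is deferred while no render host is available.
    Among the rest, `partial` work comes first, then headless-verifiable work.
    """
    deferred = [
        c for c in candidates if c.render_gated and not render_host_available
    ]
    eligible = [c for c in candidates if c not in deferred]
    # False sorts before True, so partial and headless-verifiable items lead.
    eligible.sort(key=lambda c: (c.status != "partial", not c.headless_verifiable))
    return (eligible[0] if eligible else None, deferred)


work = [
    Candidate("ui-panel", "partial", False, True),
    Candidate("new-recipes", "new", True, False),
    Candidate("healing", "partial", True, False),
]
chosen, deferred = pick_next(work)
print(chosen.name, [c.name for c in deferred])  # healing ['ui-panel']
```

The deferred list is returned rather than dropped so that blocked-on-host items can still be noted in the status report.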
When to STOP and ask the owner (don't guess)
Use AskUserQuestion — these are the owner's calls, not yours:
- Balance / design — e.g. a crate value changes gameplay. Rust drives the number; tuning happens in Rust/JSON with sign-off. Never resolve a balance question by editing GDScript.
- Scope — anything that smells like Game 2/3 (magic, leylines, Archons), or building a system that's disabled in the live game (parity ≠ gap — don't gold-plate).
- Architecture forks — a structural choice with real trade-offs (surface the options + a recommendation; don't silently pick).
- Render-gated work with no host available — report it as blocked-on-host, move to other work.
Otherwise: act. Don't narrate options you won't pursue; don't re-litigate decided things.
Guardrails (the lessons this project paid for)
- Verify, don't infer (including your own premises). `grep`/read + cite `file:line`. A plan built on a remembered shape drifts (this project drifted swap→extract→FFI across three turns; one grep collapsed it). Re-check the shape before planning. Docs/memory drift; code wins.
- Rust drives everything. A GDScript formula that disagrees with the crate is a bug to delete, never a baseline to reconcile. The UI is a pure view of `getState()`.
- Eliminate, don't fix, the orchestrator. When you find logic in GDScript, prefer deleting the path (Rust computes, UI renders `getState()`) over making GDScript "call Rust correctly".
- No stubs, no fakes, no fabricated "done". Production code on the first pass; if blocked, STOP → report → wait. Report outcomes faithfully (failing tests stay reported as failing).
- Don't gold-plate. Build to the objective's acceptance bullets, not beyond.
Reporting
After each meaningful chunk: a tight status — what landed, the proof, the commit, what's next. When you stop, say why (decision needed / blocked on host / done) in one line. Don't pad.
Announce specialist lifecycle (the "Orchestration transparency" convention in `agents-task-map.md`): when you dispatch, emit a start line, `▶ Dispatching [parallel|sequential] (N): <agent>(task), …`, and a finish line per specialist, `✓ <agent> — <outcome> · <proof>` / `✗ <agent> — <blocker>`. Say "parallel" only when you actually send them in one message. This is how the user sees the orchestration happening and verifies parallelism. Reserve TTS (ravdess02) / PushNotification for milestone / decision / blocker, not per-dispatch (that's text).
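The start and finish lines described above can be produced by two small formatters. This is a sketch only: the function names and the agent names in the example are hypothetical, and it assumes `[parallel|sequential]` denotes a choice between the two words rather than literal brackets.

```python
DASH = "\u2014"  # the dash separator that appears in the finish-line format


def start_line(mode: str, dispatches: list[tuple[str, str]]) -> str:
    """Format the line emitted when specialists are dispatched."""
    if mode not in ("parallel", "sequential"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = ", ".join(f"{agent}({task})" for agent, task in dispatches)
    return f"▶ Dispatching {mode} ({len(dispatches)}): {parts}"


def finish_line(agent: str, outcome: str = "", proof: str = "", blocker=None) -> str:
    """Format the per-specialist finish line: success, or a blocker if given."""
    if blocker is not None:
        return f"✗ {agent} {DASH} {blocker}"
    return f"✓ {agent} {DASH} {outcome} · {proof}"


# Hypothetical agents and tasks, for illustration only.
print(start_line("parallel", [("rust-dev", "fix healing"), ("sim-dev", "run scenario")]))
print(finish_line("rust-dev", outcome="healing fixed", proof="cargo test green"))
print(finish_line("sim-dev", blocker="no fleet access"))
```

Rejecting any mode other than the two named ones guards against typos, though it cannot check that a "parallel" dispatch really went out in one message.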
Simulation testing primitive (new): the `sim_scenario` tool + declarative JSONs in the game data pack are now the canonical way to satisfy the "headless sim complete" gate and to verify sim behavior in this loop. Always prefer fleet runs (after `dist:publish`) so the proofs are horizontal and statistical.
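To illustrate what "horizontal and statistical" could look like once per-seed results are in hand, here is a minimal aggregation sketch. The JSON field names (`seed`, `passed`) and the function name are assumptions made for the example; the actual `sim_scenario` output schema is not described here and may differ.

```python
import json


def summarize_runs(raw_outputs):
    """Aggregate per-seed JSON results into counts and the failing seeds.

    Each item is assumed to be a JSON string with at least a `seed` and a
    boolean `passed` field (hypothetical schema).
    """
    results = [json.loads(raw) for raw in raw_outputs]
    failed = sorted(r["seed"] for r in results if not r["passed"])
    total = len(results)
    return {
        "seeds": total,
        "passed": total - len(failed),
        "failed_seeds": failed,
        # An empty run set is not evidence, so it never counts as a pass.
        "all_passed": total > 0 and not failed,
    }


# Made-up outputs for three seeds.
raw = [
    '{"seed": 1, "passed": true}',
    '{"seed": 2, "passed": false}',
    '{"seed": 3, "passed": true}',
]
print(summarize_runs(raw))
```

Listing the failing seeds keeps the summary reproducible: a reviewer can re-run exactly those seeds instead of trusting the aggregate.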