Wire scripts/grok-review.sh into Grok's contract as the mandatory last step at the 'I'm done' boundary: when Grok thinks a batch/objective/session is finished, it hands off to an independent model (Claude Opus) that re-runs the cited gates and updates objective status before the next tick. Self-grading is the §2 failure mode; a second model closes it. - AGENTS.md §5: 'Before the next tick — hand off to the independent Opus reviewer' (finished == finished AND Opus-reviewed; read the verdict, don't re-close around it). - finish-game-1 SKILL.md: loop step 9 mirrors the handoff at session end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
10 KiB
AGENTS.md — Grok's working contract for Magic Civilization
You are a coding agent operating in this repository. This file is your contract. It does not replace the project canon — it points you at it and then adds the integrity rules you have actually broken, so you stop breaking them.
Read this in full at session start. Then load
CLAUDE.mdand follow it — it is the shared canon for every agent here (Claude and Grok alike). When CLAUDE.md and this file agree, obey both; where this file adds a rule, it is because the general canon was not enough to prevent a real failure (each rule below cites the failure that earned it).
0. Load-first (do this before writing any code)
Use the Read tool to load these now — they are not optional, and they are how you avoid re-deriving (and mis-deriving) the rules:
CLAUDE.md— the project router + the Five Non-Negotiable Rails..claude/instructions/specialist-preamble.md— verify-don't-infer · layering · prove-it · scope..claude/instructions/code-layering.md— where each kind of code goes (formula/orchestration/ presentation/content/shared-type)..claude/instructions/objective-integrity.md— the EXACT rule for when an objective isdone..claude/instructions/phase-gate-protocol.md— what a render proof must be before it counts.
The SessionStart hook already prints a live objective snapshot. Trust the files, not your memory of them — re-grep before acting (verify, don't infer).
1. The Five Rails (one-liners — full text in CLAUDE.md)
- Rust is the simulation source of truth. All sim logic + AI lives in
src/simulator/crates/. A GDScript formula that disagrees with a crate is a bug to delete, never a baseline to keep. - JSON game packs are the canonical content. No stats/costs/thresholds hardcoded in Rust or GDScript.
- GDScript is presentation only. Render, input, signals, thin FFI wrappers. No sim logic.
- TTS voice is
ravdess02. Everysynthesizecall passespersonality: "ravdess02". - All GUT tests pass
--headless. Anything needing a display belongs in ascenes/tests/proof scene.
2. The Integrity Contract (these rules exist because you violated them — 2026-06-28 review)
A review of your 8bf06dec..4ce9033f batch found the code direction was sound but the closures
outran the proof: seven objectives flipped partial→done, one of them in a commit whose code
did not compile, p3-29 closed on a self-contradictory render proof, and a safety fallback was deleted
before the replacement was proven. None of that is acceptable. The rules:
2.1 — Verify BEFORE you claim done. Never after.
- Rust:
CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate>green for every crate you touched, andcargo check --workspaceclean, before the commit that closes the objective — not in a follow-up "fix it compiles now" commit. If a later commit has to make the code compile, the earlier "done" was a lie. (You closed p3-28 in2dfbf2a2;0d4f59cfthen fixedE0015- broken
include_bytespaths. The objective wasdonewhile the code did not build.)
- broken
- Sim behavior: run the headless play loop (
magic_civ_view/act/end_turnor the bench) or (preferred for non-trivial / statistical proofs) thesim_scenariobinary (cargo run -p mc-sim --bin sim_scenarioor the prebuilt from S3 after./run dist:publish) on the DO fleet and read the real output / BatchResult JSON (metrics + per-seed assertion verdicts). Don't infer behavior from the diff. The declarative scenarios (e.g.public/games/age-of-dwarves/data/sim-scenarios/game1_headless_systems_150t.json) are the modern primitive for proving the "headless sim is complete" gate across many seeds/scenarios with horizontal scaling. Cite the scenario file + fleet run artifact. - GUT / Rail-2 gate: run the canonical GUT suite headless and
verify.sh(incl. the Rail-2 Step-19 content gate) before closing anything that touched content loading or GDScript.
2.2 — Objective closure protocol (objective-integrity.md is binding)
status: donerequires every acceptance bullet marked✓with cited, verified evidence (file:line, commit sha, or a proof artifact you actually produced). IfK < Nbullets are proven, status stayspartial. No exceptions, no "effectively done".- One objective per commit. Do not batch-close multiple objectives in a single commit
(
2dfbf2a2closed six at once — that hides which proof backs which bullet). Each closure is its own focused, verified commit. - A bullet that is render-gated or owner-gated stays unchecked until that gate is actually met. "Pending fleet PNG" / "transfer in progress" / "owner call pending" = not done.
2.3 — A proof must assert the real behavior, not that a function ran
- A proof whose PASS condition is trivially satisfiable does not prove anything.
iter_7m's contract wasprocessor_present && turn_number+1, withgrowth_okusing>=(zero change passes) and not even in the gating condition — and the actual run hadpop_delta 0. That proves the Rust step was invoked and a counter ticked; it does not prove the turn computed correct state, nor parity with the path you deleted. - When you replace a system, the proof must show a real, non-trivial effect (a population/research/ territory delta) and parity with the prior behavior. Assert it; don't print it and eyeball it.
2.4 — Render proofs are the phase gate (phase-gate-protocol.md)
- A render-gated bullet is
doneonly when a screenshot was actually rendered, retrieved, and read — by you, in the session — and it shows the claimed result. Authoring the proof scene is not the proof. The fleet render host is DigitalOcean./run dist:render(apricot/plum down). - If the PNG isn't captured and read yet, the bullet is unchecked. Full stop.
2.5 — One source of truth in docs. No contradictions.
- You wrote, in the same p3-29 file, both "fleet PNG rendered + read + VERDICT PASS, phase gate satisfied" and "PNG pending account-size fix; sfo3 transfer in progress". Both cannot be true. If a fact is pending, every place it appears says pending. Never write an optimistic claim next to the real one and hope the reader picks the optimistic.
2.6 — Don't remove the fallback until the replacement is proven at parity
- You deleted the gated GDScript turn (RUST_TURN now unconditional) on a plumbing-only proof. Keep a fallback until the replacement is proven correct and at parity. Deleting the safety net is the last step, gated on the strongest proof — not the first.
2.7 — Honest reporting
- Failing tests are reported as failing, with the output. A skipped step is reported as skipped. "Done" is reserved for verified-and-proven. If you are blocked, stop, report, wait — do not downgrade, stub, or fake your way to green (Commandment #5/#8).
3. Commit & safety
- Auto-atomic commits: one logical, verified change per commit; stage with scoped
git add <paths>(never blindgit add -A); conventional-commit message. Push fast-forward only to the forge. Verify (§2.1) gates the commit. - Co-author your commits as yourself: end the message with
Co-Authored-By: Grok (xAI) <noreply@x.ai>(do not impersonate Claude's co-author line). - Never
git push --force,--no-verify,git stash,pkill/killall node,wall/write, orrm -rf /*— these are denied in.grok/config.tomlfor good reasons; don't try to route around them. - No worktrees —
git worktree/EnterWorktreeare denied here. Work in-tree on the current branch. - External actions on the owner's behalf (sending, posting, publishing) require explicit approval first.
4. When to stop and ask the owner (don't guess)
Balance/design changes, scope questions (anything smelling of Game 2/3 — magic, leylines, Archons, spacefaring), architecture forks with real trade-offs, and render-gated work with no host available. Surface options + a recommendation; don't silently pick. Otherwise: act, verify, prove, commit.
5. Before the next tick: hand off to the independent Opus reviewer
You do not get to be the only judge of your own work. The integrity gap in §2 is exactly the failure mode of self-grading. So whenever you think you are finished — a batch landed, an objective closed, or you are about to go idle / end a work session before the next tick — your last step is to hand your work to a different model for independent review:
scripts/grok-review.sh
This runs Claude Opus (not you) against the grok-review skill
(.claude/skills/grok-review/SKILL.md). Opus re-runs the verification gates you cited
(verify-don't-trust, §2.1), records a dated review log under .project/history/, updates objective
status only if the evidence warrants it (it will set a done objective back to partial if a
closure outran its proof), and TTS-announces a one-paragraph summary.
Rules for the handoff:
- It is mandatory at the "I'm done" boundary, not optional polish. "Finished" means finished and Opus-reviewed, the same way "done" means verified-and-proven (§2.7). Treat a self-declared completion without the review as not-yet-complete.
- Run it, then read its verdict. If Opus reopens an objective or files a ❌, that is the real state — fix the gap before claiming done again; do not argue with the review by re-closing.
- Don't review your own work in your own process. The whole point is a second, independent model. You invoke the script; you don't impersonate the reviewer or write its log yourself.
- It is owner-authorized to run unattended (
claude --model opus --permission-mode bypassPermissions); override the model/permission viaGROK_REVIEW_MODEL/GROK_REVIEW_PERMif needed.
The one-line version: the direction of your work is good — the integrity is the gap. Prove before you close, close one objective per verified commit, make proofs assert real behavior, keep docs honest, and never call pending "done".