From a54a8739030e6561822dd7a2c3f3fefda962a083 Mon Sep 17 00:00:00 2001
From: autocommit
Date: Mon, 18 May 2026 17:19:56 -0700
Subject: [PATCH] =?UTF-8?q?docs(docs):=20=F0=9F=93=9D=20implement=205-stag?=
 =?UTF-8?q?e=20post-launch=20roadmap=20for=20AI=20production=20documentati?=
 =?UTF-8?q?on=20with=20planning,=20deployment,=20monitoring,=20scaling,=20?=
 =?UTF-8?q?and=20optimization=20phases?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 docs/ai-production.md | 72 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 3 deletions(-)

diff --git a/docs/ai-production.md b/docs/ai-production.md
index 10029a1e..ba2490c9 100644
--- a/docs/ai-production.md
+++ b/docs/ai-production.md
@@ -310,9 +310,75 @@ or saves from the old binary will mis-attribute to the new one.
 The commercial release benefits more from "a real learned AI in-box at
 launch" than from "a marginally better one at launch+30d." Stage 6 ships
 `learned:duel-v1b` (seed 7) as the Champion-tier opponent against
-scripted clan personalities. Stage 6.5 builds the self-play league and
-specialist roster as a post-launch content patch, which slot-fits into
-the existing controller-registry infrastructure without engine changes.
+scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite,
+recurrent policy, AlphaZero search, multi-step actions, and self-play
+league as a post-launch content series — each slot-fits into the existing
+controller-registry infrastructure without engine changes.
+
+See [`ai-roadmap.md`](./ai-roadmap.md) for the patch-by-patch narrative.
+
+---
+
+## 5-stage post-launch architecture roadmap
+
+Engineering-side reference. Designer-facing narrative in
+[`ai-roadmap.md`](./ai-roadmap.md). Plan file:
+`~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`.
+
+### Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space
+
+Replace the 32-float hand-rolled observation with a multi-modal encoder:
+
+- **Spatial block**: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
+- **Scalar block**: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
+- **Entity-set block**: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.
+
+Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) →
+concat → action head + value head. ~5M params, WASM-shippable via `tract`.
+
+Companion changes:
+- **Dynamic action space**: load `CITY_QUEUE_ITEMS` from `public/games/age-of-dwarves/data/buildings.json` + `units.json` at training start. Removes the 16-item hardcoding.
+- **Behavioral cloning warm-start**: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
+- **Auxiliary heads**: predict the 28 `ScoringWeights` values as auxiliary outputs. Free supervision signal.
+
+### Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory
+
+- Switch to `sb3-contrib RecurrentMaskablePPO`. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.
+- Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
+- tract supports LSTM ops; WASM binary ~2× current.
+
+### Stage 6.7 (v1.3) — AlphaZero search at inference
+
+The single highest-leverage change. Engine hooks already exist (audit above).
+
+- Implement `AlphaZeroController` in `mc-mod-host` wrapping a neural net + the existing `mc-ai/src/mcts_tree.rs` PUCT search.
+- Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for `(prior, value)` evaluations at each expansion.
+- 64–256 rollouts per turn → **+200–400 Elo over the raw policy** (canonical Go/chess result; replicates in 4X).
+- The 28 `ScoringWeights` become the *initial* prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
+
+### Stage 6.8 (v1.3) — Multi-step movement & strategic actions
+
+- Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
+  - `move_to(target_hex)` — A* path planned by the simulator, executed multi-turn.
+  - `rally(target_hex)` — set city/production-building rally point.
+  - `patrol(waypoints)` — repeat-cycle scouting.
+  - `escort(unit_id)` — move with a friendly unit.
+- Already partially exist: `TacticalUnit.patrol_order` field; gdext `set_rally` request. Plumbing surfaces them in `legal_actions` + `encoders.py`.
+- Action space grows 322 → ~800; masking handles per-step legality.
+
+### Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster
+
+See "Specialization via reward shaping" and "Difficulty system" sections
+above for the roster and ladder. League pipeline:
+
+1. Freeze whatever 6.5–6.8 produces as `learned:league-gen0`.
+2. Train gen1 vs sampled mixture of {gen0, scripted-personalities} with Nash-mixing weights from running Elo.
+3. Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
+4. Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.
+
+Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM,
+~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation. Gen0 → gen5 iterates
+in a workday.

 ---

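
Reviewer note on the Stage 6.9 league pipeline: step 2 samples opponents from a frozen pool using weights derived from running Elo, but the patch does not say how those weights are computed. The sketch below is one possible reading, not the project's implementation. It uses the standard Elo expected-score and update formulas, and substitutes a plain softmax over Elo for the "Nash-mixing" weights, which is an assumption. The function names, the `temperature` and `k` values, and the `scripted:*` pool entries are hypothetical; only `learned:league-gen0` comes from the patch. Filling 11 opponent seats assumes a 12-FFA game is one learner plus 11 sampled opponents.

```python
import math
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is a hypothetical update step, not a value from the patch.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def mixture_weights(elos: dict, temperature: float = 200.0) -> dict:
    """Softmax over Elo ratings (assumed stand-in for Nash-mixing weights).

    Higher-rated pool members are sampled more often; a larger temperature
    flattens the distribution toward uniform.
    """
    top = max(elos.values())  # subtract the max for numerical stability
    raw = {name: math.exp((r - top) / temperature) for name, r in elos.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def sample_opponents(elos: dict, n_seats: int, rng: random.Random) -> list:
    """Draw n_seats opponents (with replacement) from the frozen pool."""
    weights = mixture_weights(elos)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=n_seats)
```

A training loop would call `sample_opponents(pool, 11, rng)` at each episode reset, then feed the match result into `update_elo` so the mixture tracks the pool's current strength. Freezing a generation would add its checkpoint name to `pool` with the learner's current rating.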