From a54a8739030e6561822dd7a2c3f3fefda962a083 Mon Sep 17 00:00:00 2001
From: autocommit <autocommit@ftw.codes>
Date: Mon, 18 May 2026 17:19:56 -0700
Subject: [PATCH] =?UTF-8?q?docs(docs):=20=F0=9F=93=9D=20implement=205-stag?=
 =?UTF-8?q?e=20post-launch=20roadmap=20for=20AI=20production=20documentati?=
 =?UTF-8?q?on=20with=20planning,=20deployment,=20monitoring,=20scaling,=20?=
 =?UTF-8?q?and=20optimization=20phases?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
---
 docs/ai-production.md | 72 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 3 deletions(-)

diff --git a/docs/ai-production.md b/docs/ai-production.md
index 10029a1e..ba2490c9 100644
--- a/docs/ai-production.md
+++ b/docs/ai-production.md
@@ -310,9 +310,75 @@ or saves from the old binary will mis-attribute to the new one.
 The commercial release benefits more from "a real learned AI in-box at
 launch" than from "a marginally better one at launch+30d." Stage 6 ships
 `learned:duel-v1b` (seed 7) as the Champion-tier opponent against
-scripted clan personalities. Stage 6.5 builds the self-play league and
-specialist roster as a post-launch content patch, which slot-fits into
-the existing controller-registry infrastructure without engine changes.
+scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite,
+recurrent policy, AlphaZero search, multi-step actions, and self-play
+league as a post-launch content series — each slot-fits into the existing
+controller-registry infrastructure without engine changes.
+
+See [`ai-roadmap.md`](./ai-roadmap.md) for the patch-by-patch narrative.
+
+---
+
+## 5-stage post-launch architecture roadmap
+
+Engineering-side reference. Designer-facing narrative in
+[`ai-roadmap.md`](./ai-roadmap.md). Plan file:
+`~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`.
+
+### Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space
+
+Replace the 32-float hand-rolled observation with a multi-modal encoder:
+
+- **Spatial block**: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
+- **Scalar block**: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
+- **Entity-set block**: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.
+
+Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) →
+concat → action head + value head. ~5M params, WASM-shippable via `tract`.
+
+Companion changes:
+- **Dynamic action space**: load `CITY_QUEUE_ITEMS` from `public/games/age-of-dwarves/data/buildings.json` + `units.json` at training start. Removes the 16-item hardcoding.
+- **Behavioral cloning warm-start**: record 1k games of each scripted personality, then pre-train on them with supervised learning. Cold-start to ~50% baseline policy in ~30 min.
+- **Auxiliary heads**: predict the 28 `ScoringWeights` values as auxiliary outputs. Free supervision signal.
+
+### Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory
+
+- Switch to `sb3-contrib RecurrentMaskablePPO`. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.
+- Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
+- `tract` supports LSTM ops; the WASM binary grows to ~2× its current size.
+
+### Stage 6.7 (v1.3) — AlphaZero search at inference
+
+The single highest-leverage change. Engine hooks already exist (audit above).
+
+- Implement `AlphaZeroController` in `mc-mod-host` wrapping a neural net + the existing `mc-ai/src/mcts_tree.rs` PUCT search.
+- Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for `(prior, value)` evaluations at each expansion.
+- 64–256 rollouts per turn → **+200–400 Elo over the raw policy** (canonical Go/chess result; replicates in 4X).
+- The 28 `ScoringWeights` become the *initial* prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
+
+### Stage 6.8 (v1.3) — Multi-step movement & strategic actions
+
+- Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
+  - `move_to(target_hex)` — A* path planned by the simulator, executed multi-turn.
+  - `rally(target_hex)` — set city/production-building rally point.
+  - `patrol(waypoints)` — repeat-cycle scouting.
+  - `escort(unit_id)` — move with a friendly unit.
+- Two of these already partially exist: the `TacticalUnit.patrol_order` field and the gdext `set_rally` request. The remaining plumbing surfaces them in `legal_actions` and `encoders.py`.
+- Action space grows 322 → ~800; masking handles per-step legality.
+
+### Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster
+
+See "Specialization via reward shaping" and "Difficulty system" sections
+above for the roster and ladder. League pipeline:
+
+1. Freeze whatever 6.5–6.8 produces as `learned:league-gen0`.
+2. Train gen1 against a sampled mixture of {gen0, scripted personalities}, with Nash-mixing weights derived from running Elo.
+3. Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
+4. Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.
+
+Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM,
+~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation, so gen0 → gen5 (five
+sequential generations) takes ≈ 17.5h of training.
 
 ---