From 76a2ea78dae9fcb13d568e6f700e8d6add53f088 Mon Sep 17 00:00:00 2001
From: autocommit <autocommit@ftw.codes>
Date: Mon, 18 May 2026 16:34:30 -0700
Subject: [PATCH] =?UTF-8?q?docs(docs):=20=F0=9F=93=9D=20Add=20deployment?=
 =?UTF-8?q?=20steps=20and=20monitoring=20guidelines=20for=20AI=20models=20?=
 =?UTF-8?q?in=20production?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
---
 docs/ai-production.md | 326 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 326 insertions(+)
 create mode 100644 docs/ai-production.md

diff --git a/docs/ai-production.md b/docs/ai-production.md
new file mode 100644
index 00000000..10029a1e
--- /dev/null
+++ b/docs/ai-production.md
@@ -0,0 +1,326 @@
+# AI Production Guide — Magic Civilization
+
+How the game ships AI, how learned policies are trained, how to add a new
+specialist, and how difficulty levels are constructed. This is the
+designer/engineering reference. The community-facing modder contract is
+`docs/modding/ai-controller.md`.
+
+---
+
+## TL;DR
+
+- Two controller families, both selectable per slot:
+  - **`scripted:*`** — MCTS + heuristic AI from the `mc-ai` crate.
+    Transparent, hand-tunable, fast. Anchors the named clan personalities.
+  - **`learned:*`** — neural policy trained with MaskablePPO. Strong,
+    opaque. Anchors high-difficulty tiers and tournament play.
+- Difficulty is **orthogonal to controller choice** — handicaps + policy
+  temperature stacked on top of either family.
+- Specialization (rush / turtle / tech / economy) is via **different reward
+  functions on the same architecture**, each a separate `best_model.zip`
+  shipped as its own controller.
+- Strong-AI ceiling is raised by **AlphaZero search at inference** + the
+  **12-FFA self-play league** (Stage 6.7 + 6.9, post-launch).
+
+---
+
+## Coverage matrix — what each AI actually knows
+
+This is the load-bearing diagnostic. The current `learned:duel-v1b` ships
+with a 32-float hand-rolled observation vector that throws away most of
+the engine's state. The scripted AI reads everything via `TacticalState` +
+28 `ScoringWeights`.
+
+| Concept | Scripted (`mc-ai` + 28 weights) | Learned (`encoders.py`) |
+|---|---|---|
+| Terrain biome / substrate / yields per tile | ✓ `TacticalTile` (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
+| Flora / fauna entities | ✗ (not on wire) | ✗ |
+| Production buildings | ✓ full data-driven catalog | ✗ **hardcoded 16-item list** (`CITY_QUEUE_ITEMS`) |
+| Research / tech tree | ✓ via gate prerequisites | ✗ only `science_per_turn` |
+| Strategic resource gating | ✓ `strategic_resources` | ✗ never reads stockpile |
+| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
+| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
+| Tiles worked per city | ✓ `tiles_worked` | ✗ |
+| **Multi-step pathfinding at decision time** | ✗ 1-action lookahead | ✗ 1-action lookahead |
+
+**The engine is not the bottleneck.** Apart from flora / fauna, `PlayerView`
+already exposes the state listed in the left column (`TileView` carries biome /
+substrate / river / improvement / visible / explored; `CityView.buildable[]`
+carries the full catalog; `ResearchView` carries the whole tech tree;
+per-opponent `DiplomacyView` is on the wire). The encoder is the bottleneck.
+
+This matrix drives the 5-stage roadmap in
+[`ai-roadmap.md`](./ai-roadmap.md).
+
+---
+
+## AlphaZero-readiness audit (2026-05-18)
+
+The codebase is already structured for an AlphaZero-grade learned AI; the
+hooks exist but nothing is plugged into them.
+
+| Hook | Location | Status |
+|---|---|---|
+| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
+| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
+| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
+| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
+| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |
+
+---
+
+## How a learned policy actually works
+
+**Not seed search.** The seed sets the RNG for weight initialization +
+environment rollout order. Different seeds produce different local optima
+of the **same** learning process; we run multiple seeds because PPO is
+high-variance.
+
+**Weight optimization via gradient descent.** Concretely:
+
+1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps
+   `observation (32 floats) → action distribution (322 logits)`. Weights
+   start random.
+2. **Rollout.** Policy samples action `a` from current state `s`;
+   environment returns reward `r` and next state `s'`. Collect ~512 such
+   transitions in a buffer.
+3. **Advantage.** A critic network predicts expected return per state.
+   Advantage `A(s, a) = actual_return − critic_prediction`. Positive
+   advantage = action was better than baseline; negative = worse.
+4. **PPO update.** Gradient-ascend the policy weights to make
+   positive-advantage actions more probable, negative ones less. The objective
+   clips the new/old probability ratio to [0.8, 1.2], so one update gains
+   nothing from moving a probability more than 20% (the "proximal" in PPO).
+5. **Repeat** for 250k–1M environment steps. Weights drift from random to
+   "actions that win games."
+
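+For reference, the clipping in step 4 matches the standard PPO clipped
+surrogate objective. This sketch assumes the 20% figure corresponds to a clip
+range of ε = 0.2; the symbols are the usual ones from the PPO formulation, not
+names taken from `train.py`.
+
+```latex
+r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
+\qquad
+L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,A_t,\;
+\operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\right],
+\quad \epsilon = 0.2
+```
+
+With a positive advantage the term stops growing once the ratio exceeds 1.2;
+with a negative advantage it stops changing once the ratio drops below 0.8.
+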
+Three parallel seeds = three independent fits. We ship the best by
+tournament win-rate; the others are discarded.
+
+**Action masking.** MaskablePPO replaces the logits of illegal actions with a
+large negative value before sampling, so they get effectively zero probability
+and the policy never proposes an illegal action. The mask comes from
+`encode_legal_actions()` in `tooling/rl_self_play/encoders.py`.
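+
+A minimal sketch of the masking idea in plain Python. The helper names and the
+`-1e8` fill value are illustrative assumptions, not code from `encoders.py` or
+from sb3-contrib, and the sketch assumes at least one action is legal.
+
+```python
+import math
+import random
+
+HUGE_NEG = -1e8  # assumed "large negative" fill value for illegal actions
+
+
+def masked_probs(logits, legal_mask):
+    """Softmax over logits after pushing illegal actions to HUGE_NEG."""
+    masked = [x if ok else HUGE_NEG for x, ok in zip(logits, legal_mask)]
+    peak = max(masked)  # subtract the max for numerical stability
+    exps = [math.exp(x - peak) for x in masked]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+def sample_action(logits, legal_mask, rng=random):
+    """Sample an action index; illegal actions carry zero weight."""
+    probs = masked_probs(logits, legal_mask)
+    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
+
+
+# Action 1 has the highest raw logit but is illegal, so it is never sampled.
+probs = masked_probs([2.0, 5.0, 1.0], [True, False, True])
+```
+
+Multiplying a logit by zero would not work here: a logit of 0 still receives
+probability mass after the softmax.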
+
+---
+
+## Controller families
+
+### `scripted:*`
+
+| Controller ID | Use |
+|---|---|
+| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
+| `scripted:warmonger` | Personality: war-weight 2.0, expansion-weight 1.5. |
+| `scripted:builder` | Personality: economy-weight 2.0, war-weight 0.5. |
+| `scripted:tinkersmith` | Personality: tech-weight 2.5, military-weight 1.0. |
+| `scripted:peaceful` | Personality: war-weight 0.3, diplomacy-weight 2.0. |
+| `scripted:opportunist` | Personality: dynamic re-weighting from situation. |
+
+Personalities are pure data in `public/games/age-of-dwarves/data/ai_personalities.json`.
+Adding a new one is a JSON edit; no Rust changes.
+
+### `learned:*`
+
+| Controller ID | Use |
+|---|---|
+| `learned:duel-v1b` | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
+| `learned:rush` *(Stage 6.5)* | Reward-shaped for early military pressure. |
+| `learned:turtle` *(6.5)* | Reward-shaped for defensive consolidation. |
+| `learned:tech` *(6.5)* | Reward-shaped for research throughput. |
+| `learned:economy` *(6.5)* | Reward-shaped for gold + city count. |
+| `learned:league-genN` *(6.5)* | Self-play league generations. |
+
+Each `learned:*` ships as its own `.wasm` mod (~400 KB after ONNX → tract
+compile). Native `.so/.dll/.dylib` variants ship signed for users opting
+into the faster path.
+
+---
+
+## Specialization via reward shaping
+
+Same network architecture (`encoders.py` + 2-layer MLP). Different reward
+function trained with the same `train.py` loop. Each variant produces a
+separate `best_model.zip` registered as a distinct controller.
+
+**Baseline reward** (current `magic_civ_env.py`):
+```
++1.0   on win  (game_over event, winner == me)
+-1.0   on loss (game_over event, winner != me)
++1e-2  per turn advance
++1e-3  per score_estimate delta
+-5e-4  per step (anti-stalling)
+```
+
+**Specialist overlays** (added on top of baseline):
+
+| Variant | Extra reward terms |
+|---|---|
+| `rush` | `+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80 |
+| `turtle` | `+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built |
+| `tech` | `+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked |
+| `economy` | `+1e-3` per gold-reserve delta; `+0.5` per city founded |
+
+Tuning rule: extras must sum to less than the terminal `±1.0` across a
+typical game, otherwise the policy learns the shaping signal instead of
+winning. Validate per-variant: `evaluate.py` must show win-rate ≥ baseline
+when the specialist is used, not just "the specialist's shaping signal is
+higher."
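+
+The tuning rule can be sanity-checked with simple arithmetic before a training
+run. The sketch below uses the `tech` overlay values from the table above; the
+event names and per-game event counts are hypothetical, chosen only to show
+one game that stays inside the budget and one that does not.
+
+```python
+TERMINAL_REWARD = 1.0  # magnitude of the baseline win/loss reward
+
+
+def shaping_total(overlay, event_counts):
+    """Sum overlay reward over one game: per-event reward x event count."""
+    return sum(overlay[name] * event_counts.get(name, 0) for name in overlay)
+
+
+def within_budget(overlay, event_counts):
+    """True if the overlay total stays below the terminal reward."""
+    return abs(shaping_total(overlay, event_counts)) < TERMINAL_REWARD
+
+
+# `tech` overlay values from the table; the key names are illustrative.
+tech_overlay = {"science_per_turn_delta": 5e-3, "tech_unlocked": 0.3}
+
+# Hypothetical per-game event counts, for illustration only.
+modest_game = {"science_per_turn_delta": 40, "tech_unlocked": 2}
+long_game = {"science_per_turn_delta": 40, "tech_unlocked": 10}
+
+print(within_budget(tech_overlay, modest_game))
+print(within_budget(tech_overlay, long_game))
+```
+
+A check like this only bounds the shaping signal; the win-rate comparison in
+`evaluate.py` remains the actual acceptance test.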
+
+Adding a new specialist:
+1. Add an entry to `magic_civ_env.py::RewardOverlay` enum + the shaping
+   logic in `step()`.
+2. Run `train.py --reward-overlay <name> --total-steps 250000 --seed 7`.
+3. Evaluate vs `scripted:default` and vs `learned:duel-v1b`.
+4. If win-rate ≥ 0.55 against both, ship as `learned:<name>`.
+
+---
+
+## Difficulty system
+
+Difficulty is **never** "a weaker neural net." Two orthogonal levers:
+
+### 1. Resource handicaps
+
+Per-difficulty multipliers in
+`public/games/age-of-dwarves/data/difficulty.json` (schema TBD):
+```json
+{
+  "id": "settler",
+  "human_resource_mul": 1.0,
+  "ai_resource_mul": 0.7,
+  "ai_unit_xp_bonus": 0
+}
+```
+
+Applied at city-yield + unit-creation time in `mc-economy`.
+
+### 2. Policy temperature
+
+For `learned:*` controllers, a `temperature: f32` field on the controller
+config divides the logits before sampling:
+```
+softmax(logits / T)
+```
+- `T = 1.0` — base policy.
+- `T > 1.0` — softer distribution, more random, easier.
+- `T < 1.0` — sharper, near-greedy, harder.
+- `T → 0` — argmax (deterministic).
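+
+A small sketch of how temperature reshapes the distribution, in plain Python.
+The function name and the example logits are assumptions for illustration;
+the real scaling would live in the controller or the wasm guest.
+
+```python
+import math
+
+
+def softmax_with_temperature(logits, temperature):
+    """Return softmax(logits / T); T must be strictly positive."""
+    if temperature <= 0:
+        raise ValueError("temperature must be positive; use argmax for T -> 0")
+    scaled = [x / temperature for x in logits]
+    peak = max(scaled)  # subtract the max for numerical stability
+    exps = [math.exp(x - peak) for x in scaled]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+logits = [2.0, 1.0, 0.0]  # made-up logits for three actions
+for t in (1.5, 1.0, 0.3):  # softer, base, sharper
+    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
+```
+
+Lower temperatures concentrate probability on the top-scoring action; higher
+ones spread it toward the alternatives.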
+
+Implementation: ~10 LOC in `WasmAiController::decide_turn` (apply scaling
+before the wasm guest samples, OR pass T through as a guest parameter and
+let the guest apply it). Stage 6.5 work.
+
+### Recommended Game 1 ladder
+
+| Difficulty | Controller | T | Handicap |
+|---|---|---|---|
+| Settler | `scripted:peaceful` | n/a | AI ×0.7 |
+| Chieftain | `scripted:default` | n/a | none |
+| Warlord | `scripted:*` rotating | n/a | none |
+| King | `learned:league-best` | 1.5 | none |
+| Champion | `learned:league-best` | 0.3 | AI ×1.3 |
+
+---
+
+## Training infrastructure
+
+### Hosts
+
+- **Edit host (mac):** authoring; never trains.
+- **Run host (apricot.lan):** 64-core Threadripper, 94 GB RAM, 2×3090.
+  All training runs here.
+- **Plum:** screenshot capture only; no training.
+
+### Layout
+
+```
+tooling/rl_self_play/
+├── train.py              # PPO loop, sb3-contrib MaskablePPO
+├── evaluate.py           # Hard win-rate measurement
+├── magic_civ_env.py      # Gymnasium wrapper + reward shaping
+├── encoders.py           # PlayerView ↔ obs/action tensors
+├── harness_client.py     # JSON-Lines subprocess to Godot headless
+├── models/<run-name>/    # best_model.zip per training run
+└── runs/<run-name>/      # tensorboard event files
+```
+
+`tooling/rl_self_play/models/` and `runs/` are gitignored (bulky; not
+artifacts of the source repo).
+
+### Single-game training (duel)
+
+```bash
+ssh apricot.lan
+cd ~/Code/@projects/@magic-civilization
+python -m tooling.rl_self_play.train \
+  --run-name duel-v1b \
+  --total-steps 250000 \
+  --num-envs 16 \
+  --seed 7 \
+  --device cuda:1
+```
+
+`--num-envs N` runs N parallel headless Godot subprocesses; Stable-Baselines3's
+`SubprocVecEnv` lock-steps them. Scaling is sub-linear because env-step is
+I/O-bound on JSON-Lines, not GPU-bound (the policy net is tiny). Past 16
+envs per training run, returns diminish.
+
+### Parallel seed runs
+
+Three independent seeds in parallel claim 3 × 16 = 48 worker subprocesses
+on apricot. Memory headroom: each Godot headless ~600 MB, so 48 × 0.6 ≈ 29 GB
+total, which fits inside 94 GB with margin for the OS + other services.
+
+### 12-FFA self-play league (Stage 6.5)
+
+```bash
+python -m tooling.rl_self_play.train \
+  --run-name league-gen1 \
+  --map-type 12ffa-huge \
+  --opponent-pool models/league/gen0/best_model.zip \
+  --total-steps 1000000 \
+  --num-envs 4 \
+  --seed 7 \
+  --device cuda:1
+```
+
+12-slot games are ~10× a duel in wall-clock per env, but the GPU is still not
+the bottleneck (the policy is a ~50k-param MLP). Verified on apricot
+2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at
+< 5% utilization. 1M steps ≈ 3.5h per league generation.
+
+---
+
+## Save format & forward compatibility
+
+Every save records the `controller_id` AND `controller_hash` per slot
+(SaveEnvelope v2). Loading a save with a controller the current install
+doesn't have yields a friendly error from
+`save_manager.gd::_validate_controllers_after_load`, not a crash mid-game.
+
+**Mod authors:** never reuse a `controller_id` across incompatible weight
+versions. Bump the version suffix (`learned:duel-v1c`, not `learned:duel-v1b`)
+or saves from the old binary will mis-attribute to the new one.
+
+---
+
+## Ship-then-improve
+
+The commercial release benefits more from "a real learned AI in-box at
+launch" than from "a marginally better one at launch+30d." Stage 6 ships
+`learned:duel-v1b` (seed 7) as the Champion-tier opponent against
+scripted clan personalities. Stage 6.5 builds the self-play league and
+specialist roster as a post-launch content patch, which slots into
+the existing controller-registry infrastructure without engine changes.
+
+---
+
+## Cross-references
+
+- Modder contract: `docs/modding/ai-controller.md`
+- ABI decisions: `docs/modding/abi-decisions.md`
+- Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`
+- AiController trait: `src/simulator/crates/mc-player-api/src/controllers.rs`
+- Reward shape: `tooling/rl_self_play/magic_civ_env.py`
+- Observation encoder: `tooling/rl_self_play/encoders.py`