From 76a2ea78dae9fcb13d568e6f700e8d6add53f088 Mon Sep 17 00:00:00 2001
From: autocommit
Date: Mon, 18 May 2026 16:34:30 -0700
Subject: [PATCH] =?UTF-8?q?docs(docs):=20=F0=9F=93=9D=20Add=20deployment?=
 =?UTF-8?q?=20steps=20and=20monitoring=20guidelines=20for=20AI=20models=20?=
 =?UTF-8?q?in=20production?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 docs/ai-production.md | 326 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 326 insertions(+)
 create mode 100644 docs/ai-production.md

diff --git a/docs/ai-production.md b/docs/ai-production.md
new file mode 100644
index 00000000..10029a1e
--- /dev/null
+++ b/docs/ai-production.md
@@ -0,0 +1,326 @@
+# AI Production Guide — Magic Civilization
+
+How the game ships AI, how learned policies are trained, how to add a new
+specialist, and how difficulty levels are constructed. This is the
+designer/engineering reference. The community-facing modder contract is
+`docs/modding/ai-controller.md`.
+
+---
+
+## TL;DR
+
+- Two controller families, both selectable per slot:
+  - **`scripted:*`** — MCTS + heuristic AI from the `mc-ai` crate.
+    Transparent, hand-tunable, fast. Anchors the named clan personalities.
+  - **`learned:*`** — neural policy trained with MaskablePPO. Strong,
+    opaque. Anchors high-difficulty tiers and tournament play.
+- Difficulty is **orthogonal to controller choice** — handicaps + policy
+  temperature stacked on top of either family.
+- Specialization (rush / turtle / tech / economy) is via **different reward
+  functions on the same architecture**, each a separate `best_model.zip`
+  shipped as its own controller.
+- Strong-AI ceiling is raised by **AlphaZero search at inference** + the
+  **12-FFA self-play league** (Stage 6.7 + 6.9, post-launch).
+
+---
+
+## Coverage matrix — what each AI actually knows
+
+This is the load-bearing diagnostic. The current `learned:duel-v1b` ships
+with a 32-float hand-rolled observation vector that throws away most of
+the engine's state. The scripted AI reads everything via `TacticalState` +
+28 `ScoringWeights`.
+
+| Concept | Scripted (`mc-ai` + 28 weights) | Learned (`encoders.py`) |
+|---|---|---|
+| Terrain biome / substrate / yields per tile | ✓ `TacticalTile` (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
+| Flora / fauna entities | ✗ (not on wire) | ✗ |
+| Production buildings | ✓ full data-driven catalog | ✗ **hardcoded 16-item list** (`CITY_QUEUE_ITEMS`) |
+| Research / tech tree | ✓ via gate prerequisites | ✗ only `science_per_turn` |
+| Strategic resource gating | ✓ `strategic_resources` | ✗ never reads stockpile |
+| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
+| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
+| Tiles worked per city | ✓ `tiles_worked` | ✗ |
+| **Multi-step pathfinding at decision time** | ✗ 1-action lookahead | ✗ 1-action lookahead |
+
+**The engine is not the bottleneck.** `PlayerView` already exposes every
+piece of state in the left column (`TileView` carries biome / substrate /
+river / improvement / visible / explored; `CityView.buildable[]` carries
+the full catalog; `ResearchView` carries the whole tech tree; per-opponent
+`DiplomacyView` is on the wire). The encoder is the bottleneck.
+
+This matrix drives the 5-stage roadmap in
+[`ai-roadmap.md`](./ai-roadmap.md).
+
+---
+
+## AlphaZero-readiness audit (2026-05-18)
+
+The codebase is already structured for an AlphaZero-grade learned AI; the
+hooks exist but nothing is plugged into them.
+
+| Hook | Location | Status |
+|---|---|---|
+| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
+| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
+| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
+| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
+| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |
+
+---
+
+## How a learned policy actually works
+
+**Not seed search.** The seed sets the RNG for weight initialization +
+environment rollout order. Different seeds produce different local optima
+of the **same** learning process; we run multiple seeds because PPO is
+high-variance.
+
+**Weight optimization via gradient descent.** Concretely:
+
+1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps
+   `observation (32 floats) → action distribution (322 logits)`. Weights
+   start random.
+2. **Rollout.** Policy samples action `a` from current state `s`;
+   environment returns reward `r` and next state `s'`. Collect ~512 such
+   transitions in a buffer.
+3. **Advantage.** A critic network predicts expected return per state.
+   Advantage `A(s, a) = actual_return − critic_prediction`. Positive
+   advantage = action was better than baseline; negative = worse.
+4. **PPO update.** Gradient-ascend the policy weights to make
+   positive-advantage actions more probable and negative-advantage ones
+   less, clipped so a single update can't move an action's probability by
+   more than about 20% (the "proximal" in PPO).
+5. **Repeat** for 250k–1M environment steps. Weights drift from random to
+   "actions that win games."
+
+Three parallel seeds = three independent fits. We ship the best by
+tournament win-rate; the others are discarded.
+
+**Action masking.** MaskablePPO applies a legal-action mask to the logits
+before sampling: masked-out logits are set to a very large negative value,
+so the policy can never propose an illegal action. The mask comes from
+`encode_legal_actions()` in `tooling/rl_self_play/encoders.py`.
+
+---
+
+## Controller families
+
+### `scripted:*`
+
+| Controller ID | Use |
+|---|---|
+| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
+| `scripted:warmonger` | Personality: war-weight 2.0, expansion-weight 1.5. |
+| `scripted:builder` | Personality: economy-weight 2.0, war-weight 0.5. |
+| `scripted:tinkersmith` | Personality: tech-weight 2.5, military-weight 1.0. |
+| `scripted:peaceful` | Personality: war-weight 0.3, diplomacy-weight 2.0. |
+| `scripted:opportunist` | Personality: dynamic re-weighting from situation. |
+
+Personalities are pure data in `public/games/age-of-dwarves/data/ai_personalities.json`.
+Adding a new one is a JSON edit; no Rust changes.
+
+### `learned:*`
+
+| Controller ID | Use |
+|---|---|
+| `learned:duel-v1b` | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
+| `learned:rush` *(Stage 6.5)* | Reward-shaped for early military pressure. |
+| `learned:turtle` *(6.5)* | Reward-shaped for defensive consolidation. |
+| `learned:tech` *(6.5)* | Reward-shaped for research throughput. |
+| `learned:economy` *(6.5)* | Reward-shaped for gold + city count. |
+| `learned:league-genN` *(6.5)* | Self-play league generations. |
+
+Each `learned:*` ships as its own `.wasm` mod (~400 KB after ONNX → tract
+compile). Native `.so/.dll/.dylib` variants ship signed for users opting
+into the faster path.
+
+---
+
+## Specialization via reward shaping
+
+Same network architecture (`encoders.py` + 2-layer MLP). Different reward
+function trained with the same `train.py` loop. Each variant produces a
+separate `best_model.zip` registered as a distinct controller.
+
+**Baseline reward** (current `magic_civ_env.py`):
+```
++1.0   on win  (game_over event, winner == me)
+-1.0   on loss (game_over event, winner != me)
++1e-2  per turn advance
++1e-3  per score_estimate delta
+-5e-4  per step (anti-stalling)
+```
+
+**Specialist overlays** (added on top of baseline):
+
+| Variant | Extra reward terms |
+|---|---|
+| `rush` | `+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80 |
+| `turtle` | `+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built |
+| `tech` | `+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked |
+| `economy` | `+1e-3` per gold-reserve delta; `+0.5` per city founded |
+
+Tuning rule: extras must sum to less than the terminal `±1.0` across a
+typical game, otherwise the policy learns the shaping signal instead of
+winning. Validate per-variant: `evaluate.py` must show win-rate ≥ baseline
+when the specialist is used, not just "the specialist's shaping signal is
+higher."
+
+Adding a new specialist:
+1. Add an entry to `magic_civ_env.py::RewardOverlay` enum + the shaping
+   logic in `step()`.
+2. Run `train.py --reward-overlay --total-steps 250000 --seed 7`.
+3. Evaluate vs `scripted:default` and vs `learned:duel-v1b`.
+4. If win-rate ≥ 0.55 against both, ship as `learned:`.
+
+---
+
+## Difficulty system
+
+Difficulty is **never** "a weaker neural net." Two orthogonal levers:
+
+### 1. Resource handicaps
+
+Per-difficulty multipliers in
+`public/games/age-of-dwarves/data/difficulty.json` (schema TBD):
+```json
+{
+  "id": "settler",
+  "human_resource_mul": 1.0,
+  "ai_resource_mul": 0.7,
+  "ai_unit_xp_bonus": 0
+}
+```
+
+Applied at city-yield + unit-creation time in `mc-economy`.
+
+### 2. Policy temperature
+
+For `learned:*` controllers, a `temperature: f32` field on the controller
+config divides the logits before sampling:
+```
+softmax(logits / T)
+```
+- `T = 1.0` — base policy.
+- `T > 1.0` — softer distribution, more random, easier.
+- `T < 1.0` — sharper, near-greedy, harder.
+- `T → 0` — argmax (deterministic).
+
+Implementation: ~10 LOC in `WasmAiController::decide_turn` (apply scaling
+before the wasm guest samples, OR pass T through as a guest parameter and
+let the guest apply it). Stage 6.5 work.
+
+### Recommended Game 1 ladder
+
+| Difficulty | Controller | T | Handicap |
+|---|---|---|---|
+| Settler | `scripted:peaceful` | n/a | AI ×0.7 |
+| Chieftain | `scripted:default` | n/a | none |
+| Warlord | `scripted:*` rotating | n/a | none |
+| King | `learned:league-best` | 1.5 | none |
+| Champion | `learned:league-best` | 0.3 | AI ×1.3 |
+
+---
+
+## Training infrastructure
+
+### Hosts
+
+- **Edit host (mac):** authoring; never trains.
+- **Run host (apricot.lan):** 64-core Threadripper, 94 GB RAM, 2×3090.
+  All training runs here.
+- **Plum:** screenshot capture only; no training.
+
+### Layout
+
+```
+tooling/rl_self_play/
+├── train.py           # PPO loop, sb3-contrib MaskablePPO
+├── evaluate.py        # Hard win-rate measurement
+├── magic_civ_env.py   # Gymnasium wrapper + reward shaping
+├── encoders.py        # PlayerView ↔ obs/action tensors
+├── harness_client.py  # JSON-Lines subprocess to Godot headless
+├── models//           # best_model.zip per training run
+└── runs//             # tensorboard event files
+```
+
+`tooling/rl_self_play/models/` and `runs/` are gitignored (bulky; not
+artifacts of the source repo).
+
+### Single-game training (duel)
+
+```bash
+ssh apricot.lan
+cd ~/Code/@projects/@magic-civilization
+python -m tooling.rl_self_play.train \
+  --run-name duel-v1b \
+  --total-steps 250000 \
+  --num-envs 16 \
+  --seed 7 \
+  --device cuda:1
+```
+
+`--num-envs N` runs N parallel headless Godot subprocesses, which
+Stable-Baselines3's `SubprocVecEnv` steps in lock-step. Scaling is
+sub-linear because env-step is I/O-bound on JSON-Lines, not GPU-bound (the
+policy net is tiny). Past 16 envs per training run, returns diminish.
+
+### Parallel seed runs
+
+Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses
+on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total
+— fits inside 94 GB with margin for the OS + other services.
+
+### 12-FFA self-play league (Stage 6.5)
+
+```bash
+python -m tooling.rl_self_play.train \
+  --run-name league-gen1 \
+  --map-type 12ffa-huge \
+  --opponent-pool models/league/gen0/best_model.zip \
+  --total-steps 1000000 \
+  --num-envs 4 \
+  --seed 7 \
+  --device cuda:1
+```
+
+12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the
+bottleneck (the policy is a ~50k-param MLP). Verified on apricot
+2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at
+< 5% utilization. 1M steps ≈ 3.5h per league generation.
+
+---
+
+## Save format & forward compatibility
+
+Every save records the `controller_id` AND `controller_hash` per slot
+(SaveEnvelope v2). Loading a save with a controller the current install
+doesn't have yields a friendly error from
+`save_manager.gd::_validate_controllers_after_load`, not a crash mid-game.
+
+**Mod authors:** never reuse a `controller_id` across incompatible weight
+versions. Bump the version suffix (`learned:duel-v1c`, not `learned:duel-v1b`)
+or saves from the old binary will mis-attribute to the new one.
+
+---
+
+## Ship-then-improve
+
+The commercial release benefits more from "a real learned AI in-box at
+launch" than from "a marginally better one at launch+30d." Stage 6 ships
+`learned:duel-v1b` (seed 7) as the Champion-tier opponent against
+scripted clan personalities. Stage 6.5 builds the self-play league and
+specialist roster as a post-launch content patch, which slot-fits into
+the existing controller-registry infrastructure without engine changes.
+
+---
+
+## Cross-references
+
+- Modder contract: `docs/modding/ai-controller.md`
+- ABI decisions: `docs/modding/abi-decisions.md`
+- Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`
+- AiController trait: `src/simulator/crates/mc-player-api/src/controllers.rs`
+- Reward shape: `tooling/rl_self_play/magic_civ_env.py`
+- Observation encoder: `tooling/rl_self_play/encoders.py`