docs(docs): 📝 Add deployment steps and monitoring guidelines for AI models in production

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-05-18 16:34:30 -07:00
parent d747186f12
commit 76a2ea78da

docs/ai-production.md Normal file

@ -0,0 +1,326 @@
# AI Production Guide — Magic Civilization
How the game ships AI, how learned policies are trained, how to add a new
specialist, and how difficulty levels are constructed. This is the
designer/engineering reference. The community-facing modder contract is
`docs/modding/ai-controller.md`.
---
## TL;DR
- Two controller families, both selectable per slot:
  - **`scripted:*`**: MCTS + heuristic AI from the `mc-ai` crate.
    Transparent, hand-tunable, fast. Anchors the named clan personalities.
  - **`learned:*`**: neural policy trained with MaskablePPO. Strong,
    opaque. Anchors high-difficulty tiers and tournament play.
- Difficulty is **orthogonal to controller choice** — handicaps + policy
temperature stacked on top of either family.
- Specialization (rush / turtle / tech / economy) is via **different reward
functions on the same architecture**, each a separate `best_model.zip`
shipped as its own controller.
- Strong-AI ceiling is raised by **AlphaZero search at inference** + the
**12-FFA self-play league** (Stage 6.7 + 6.9, post-launch).
---
## Coverage matrix — what each AI actually knows
This is the load-bearing diagnostic. The current `learned:duel-v1b` ships
with a 32-float hand-rolled observation vector that throws away most of
the engine's state. The scripted AI reads everything via `TacticalState` +
28 `ScoringWeights`.
| Concept | Scripted (`mc-ai` + 28 weights) | Learned (`encoders.py`) |
|---|---|---|
| Terrain biome / substrate / yields per tile | ✓ `TacticalTile` (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
| Flora / fauna entities | ✗ (not on wire) | ✗ |
| Production buildings | ✓ full data-driven catalog | ✗ **hardcoded 16-item list** (`CITY_QUEUE_ITEMS`) |
| Research / tech tree | ✓ via gate prerequisites | ✗ only `science_per_turn` |
| Strategic resource gating | ✓ `strategic_resources` | ✗ never reads stockpile |
| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
| Tiles worked per city | ✓ `tiles_worked` | ✗ |
| **Multi-step pathfinding at decision time** | ✗ 1-action lookahead | ✗ 1-action lookahead |
**The engine is not the bottleneck.** `PlayerView` already exposes every
piece of state in the left column (`TileView` carries biome / substrate /
river / improvement / visible / explored; `CityView.buildable[]` carries
the full catalog; `ResearchView` carries the whole tech tree; per-opponent
`DiplomacyView` is on the wire). The encoder is the bottleneck.
This matrix drives the 5-stage roadmap in
[`ai-roadmap.md`](./ai-roadmap.md).
---
## AlphaZero-readiness audit (2026-05-18)
The codebase is already structured for an AlphaZero-grade learned AI; the
hooks exist but nothing is plugged into them.
| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |
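
The PUCT hook in the first row balances a value estimate against a prior-weighted exploration bonus. The sketch below shows the selection rule in its common AlphaZero form. The `Child` class, the `c_puct` default, and the function names are illustrative assumptions, not the `mcts_tree.rs` API.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    """Per-action statistics at a tree node (hypothetical layout)."""
    prior: float            # P(s, a) from the policy head
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # sum of backed-up values

    def q(self) -> float:
        # Mean backed-up value; unvisited children default to 0.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Child, parent_visits: int, c_puct: float = 1.5) -> float:
    """Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))."""
    exploration = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q() + exploration


def select_action(children: dict[str, Child], c_puct: float = 1.5) -> str:
    """Pick the action with the highest PUCT score."""
    parent_visits = sum(c.visits for c in children.values())
    return max(children, key=lambda a: puct_score(children[a], parent_visits, c_puct))
```

A learned policy head would supply `prior`, and a value head would replace the random rollout that feeds `value_sum`.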
---
## How a learned policy actually works
**Not seed search.** The seed sets the RNG for weight initialization +
environment rollout order. Different seeds produce different local optima
of the **same** learning process; we run multiple seeds because PPO is
high-variance.
**Weight optimization via gradient descent.** Concretely:
1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps
`observation (32 floats) → action distribution (322 logits)`. Weights
start random.
2. **Rollout.** Policy samples action `a` from current state `s`;
environment returns reward `r` and next state `s'`. Collect ~512 such
transitions in a buffer.
3. **Advantage.** A critic network predicts expected return per state.
Advantage `A(s, a) = actual_return - critic_prediction`. Positive
advantage = action was better than baseline; negative = worse.
4. **PPO update.** Gradient-ascend the policy weights to make
positive-advantage actions more probable and negative-advantage ones
less probable. The new-to-old probability ratio is clipped to
[0.8, 1.2], so a single update gains nothing from moving an action's
probability by more than 20% (the "proximal" in PPO).
5. **Repeat** for 250k to 1M environment steps. Weights drift from random to
"actions that win games."
Three parallel seeds = three independent fits. We ship the best by
tournament win-rate; the others are discarded.
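
Steps 3 and 4 above can be sketched in a few lines of plain Python. This is a minimal illustration of the advantage and the clipped PPO objective for a single sample, assuming a clip range of 0.2. The function names are hypothetical, and the real update in sb3-contrib works on batched tensors with additional value and entropy terms.

```python
import math


def advantages(returns: list[float], value_preds: list[float]) -> list[float]:
    """A(s, a) = actual return minus the critic's prediction."""
    return [g - v for g, v in zip(returns, value_preds)]


def clipped_surrogate(new_logp: float, old_logp: float, adv: float,
                      clip_range: float = 0.2) -> float:
    """PPO's per-sample objective (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip_range, 1 + clip_range].
    return min(ratio * adv, clipped * adv)
```

For a positive advantage the objective stops growing once the ratio passes 1.2; for a negative advantage it stops improving once the ratio drops below 0.8.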
**Action masking.** MaskablePPO overwrites the logits of illegal actions
with a large negative value before sampling, so their probability is
effectively zero and the policy never proposes an illegal action. The
mask comes from `encode_legal_actions()` in
`tooling/rl_self_play/encoders.py`.
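
One common way to implement a legal-action mask is to swap each illegal logit for a very large negative number and then apply softmax as usual. The sketch below does this in plain Python. `HUGE_NEG`, `masked_probs`, and `sample_action` are illustrative names, and the real mask would come from `encode_legal_actions()`.

```python
import math
import random

HUGE_NEG = -1e8  # stand-in logit for illegal actions


def masked_probs(logits: list[float], legal: list[bool]) -> list[float]:
    """Softmax over logits with illegal actions suppressed."""
    if not any(legal):
        raise ValueError("at least one action must be legal")
    masked = [x if ok else HUGE_NEG for x, ok in zip(logits, legal)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits: list[float], legal: list[bool],
                  rng: random.Random) -> int:
    """Draw one action index from the masked distribution."""
    probs = masked_probs(logits, legal)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because the illegal entries underflow to exactly zero after the exponential, the sampler cannot return them even when their raw logit was the largest.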
---
## Controller families
### `scripted:*`
| Controller ID | Use |
|---|---|
| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
| `scripted:warmonger` | Personality: war-weight 2.0, expansion-weight 1.5. |
| `scripted:builder` | Personality: economy-weight 2.0, war-weight 0.5. |
| `scripted:tinkersmith` | Personality: tech-weight 2.5, military-weight 1.0. |
| `scripted:peaceful` | Personality: war-weight 0.3, diplomacy-weight 2.0. |
| `scripted:opportunist` | Personality: dynamic re-weighting from situation. |
Personalities are pure data in `public/games/age-of-dwarves/data/ai_personalities.json`.
Adding a new one is a JSON edit; no Rust changes.
### `learned:*`
| Controller ID | Use |
|---|---|
| `learned:duel-v1b` | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
| `learned:rush` *(Stage 6.5)* | Reward-shaped for early military pressure. |
| `learned:turtle` *(6.5)* | Reward-shaped for defensive consolidation. |
| `learned:tech` *(6.5)* | Reward-shaped for research throughput. |
| `learned:economy` *(6.5)* | Reward-shaped for gold + city count. |
| `learned:league-genN` *(6.5)* | Self-play league generations. |
Each `learned:*` ships as its own `.wasm` mod (~400 KB after ONNX → tract
compile). Native `.so/.dll/.dylib` variants ship signed for users opting
into the faster path.
---
## Specialization via reward shaping
Same network architecture (`encoders.py` + 2-layer MLP). Different reward
function trained with the same `train.py` loop. Each variant produces a
separate `best_model.zip` registered as a distinct controller.
**Baseline reward** (current `magic_civ_env.py`):
```
+1.0 on win (game_over event, winner == me)
-1.0 on loss (game_over event, winner != me)
+1e-2 per turn advance
+1e-3 per score_estimate delta
-5e-4 per step (anti-stalling)
```
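
As a rough sketch, the baseline terms could be combined per environment step as shown below. The function and parameter names are hypothetical, and the sketch assumes each "per ... delta" term means the coefficient multiplied by that delta; `magic_civ_env.py` is the authority on the actual logic.

```python
def baseline_reward(*, turn_delta: int, score_delta: float,
                    game_over: bool = False, won: bool = False) -> float:
    """Reward for one environment step under the baseline terms."""
    reward = -5e-4                 # per-step anti-stalling penalty
    reward += 1e-2 * turn_delta    # per turn advance
    reward += 1e-3 * score_delta   # per score_estimate delta
    if game_over:
        reward += 1.0 if won else -1.0  # terminal win/loss signal
    return reward
```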
**Specialist overlays** (added on top of baseline):
| Variant | Extra reward terms |
|---|---|
| `rush` | `+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80 |
| `turtle` | `+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built |
| `tech` | `+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked |
| `economy` | `+1e-3` per gold-reserve delta; `+0.5` per city founded |
Tuning rule: extras must sum to less than the terminal `±1.0` across a
typical game, otherwise the policy learns the shaping signal instead of
winning. Validate per-variant: `evaluate.py` must show win-rate ≥ baseline
when the specialist is used, not just "the specialist's shaping signal is
higher."
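
The tuning rule can be checked mechanically before a training run. The sketch below sums the absolute shaping weights over the expected event counts for one game and compares the total with the terminal reward. The weights are taken from the `tech` row of the table; the per-game counts are made-up placeholders that should be replaced with figures from logged games, and all names are hypothetical.

```python
TERMINAL_REWARD = 1.0


def shaping_total(overlay: dict[str, float],
                  typical_counts: dict[str, float]) -> float:
    """Sum |weight| * expected event count over one typical game."""
    return sum(abs(w) * typical_counts.get(name, 0.0)
               for name, w in overlay.items())


def within_budget(overlay: dict[str, float],
                  typical_counts: dict[str, float]) -> bool:
    """True when the shaping terms stay below the terminal reward."""
    return shaping_total(overlay, typical_counts) < TERMINAL_REWARD


# `tech` overlay weights from the table above.
tech_overlay = {"science_per_turn_delta": 5e-3, "tech_unlocked": 0.3}
# Placeholder counts, NOT measured data.
hypothetical_counts = {"science_per_turn_delta": 40.0, "tech_unlocked": 2.0}
```

With these placeholder counts the `tech` overlay totals 0.8 and stays under budget; a third unlocked tech would push it to 1.1, which is why the counts need to come from real games.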
Adding a new specialist:
1. Add an entry to `magic_civ_env.py::RewardOverlay` enum + the shaping
logic in `step()`.
2. Run `train.py --reward-overlay <name> --total-steps 250000 --seed 7`.
3. Evaluate vs `scripted:default` and vs `learned:duel-v1b`.
4. If win-rate ≥ 0.55 against both, ship as `learned:<name>`.
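
The ship gate in step 4 is simple enough to encode as a check on evaluation results. This is a sketch under the assumption that results are available as win and game counts per opponent; the function names and the result format are hypothetical, not the output format of `evaluate.py`.

```python
def win_rate(wins: int, games: int) -> float:
    """Fraction of games won."""
    if games <= 0:
        raise ValueError("need at least one game")
    return wins / games


def should_ship(results: dict[str, tuple[int, int]],
                threshold: float = 0.55) -> bool:
    """results maps opponent controller id -> (wins, games)."""
    required = ("scripted:default", "learned:duel-v1b")
    return all(
        opp in results and win_rate(*results[opp]) >= threshold
        for opp in required
    )
```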
---
## Difficulty system
Difficulty is **never** "a weaker neural net." Two orthogonal levers:
### 1. Resource handicaps
Per-difficulty multipliers in
`public/games/age-of-dwarves/data/difficulty.json` (schema TBD):
```json
{
"id": "settler",
"human_resource_mul": 1.0,
"ai_resource_mul": 0.7,
"ai_unit_xp_bonus": 0
}
```
Applied at city-yield + unit-creation time in `mc-economy`.
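
To illustrate how such an entry might be consumed, here is a small Python sketch that parses one difficulty record and scales a yield. The real application point is Rust code in `mc-economy` and the schema is still TBD, so the `Difficulty` class and `scaled_yield` helper are assumptions for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    id: str
    human_resource_mul: float
    ai_resource_mul: float
    ai_unit_xp_bonus: int


def load_difficulty(raw: str) -> Difficulty:
    """Parse one difficulty entry; unknown or missing keys raise TypeError."""
    return Difficulty(**json.loads(raw))


def scaled_yield(base_yield: float, difficulty: Difficulty, is_ai: bool) -> float:
    """Apply the per-side resource multiplier to a city yield."""
    mul = difficulty.ai_resource_mul if is_ai else difficulty.human_resource_mul
    return base_yield * mul
```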
### 2. Policy temperature
For `learned:*` controllers, a `temperature: f32` field on the controller
config divides the logits before sampling:
```
softmax(logits / T)
```
- `T = 1.0` — base policy.
- `T > 1.0` — softer distribution, more random, easier.
- `T < 1.0` — sharper, near-greedy, harder.
- `T → 0` — argmax (deterministic).
Implementation: ~10 LOC in `WasmAiController::decide_turn` (apply scaling
before the wasm guest samples, OR pass T through as a guest parameter and
let the guest apply it). Stage 6.5 work.
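
The effect of `T` is easy to see in a standalone sketch. The function below is an illustrative Python version of temperature scaling, not the `WasmAiController` code; it treats `T → 0` as a separate argmax path because dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """softmax(logits / T) for T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T -> 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For the same logits, the top action's probability rises as `T` falls below 1.0 and falls as `T` rises above 1.0, which is the lever the difficulty ladder relies on.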
### Recommended Game 1 ladder
| Difficulty | Controller | T | Handicap |
|---|---|---|---|
| Settler | `scripted:peaceful` | n/a | AI ×0.7 |
| Chieftain | `scripted:default` | n/a | none |
| Warlord | `scripted:*` rotating | n/a | none |
| King | `learned:league-best` | 1.5 | none |
| Champion | `learned:league-best` | 0.3 | AI ×1.3 |
---
## Training infrastructure
### Hosts
- **Edit host (mac):** authoring; never trains.
- **Run host (apricot.lan):** 64-core Threadripper, 94 GB RAM, 2×3090.
All training runs here.
- **Plum:** screenshot capture only; no training.
### Layout
```
tooling/rl_self_play/
├── train.py # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py # Hard win-rate measurement
├── magic_civ_env.py # Gymnasium wrapper + reward shaping
├── encoders.py # PlayerView ↔ obs/action tensors
├── harness_client.py # JSON-Lines subprocess to Godot headless
├── models/<run-name>/ # best_model.zip per training run
└── runs/<run-name>/ # tensorboard event files
```
`tooling/rl_self_play/models/` and `runs/` are gitignored (bulky; not
artifacts of the source repo).
### Single-game training (duel)
```bash
ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
--run-name duel-v1b \
--total-steps 250000 \
--num-envs 16 \
--seed 7 \
--device cuda:1
```
`--num-envs N` runs N parallel headless Godot subprocesses; sb3-contrib
SubprocVecEnv lock-steps them. Scaling is sub-linear because env-step is
I/O-bound on JSON-Lines, not GPU-bound (the policy net is tiny). Past 16
envs per training run, returns diminish.
### Parallel seed runs
Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses
on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total
— fits inside 94 GB with margin for the OS + other services.
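
The memory estimate above is plain arithmetic and can be re-run when the seed count or env count changes. The helper name is hypothetical; the inputs are the figures quoted in this section.

```python
def godot_worker_memory_gb(seeds: int, envs_per_seed: int, mb_per_env: float) -> float:
    """Total resident memory of all headless Godot workers, in GB."""
    return seeds * envs_per_seed * mb_per_env / 1000.0


HOST_RAM_GB = 94
total_gb = godot_worker_memory_gb(seeds=3, envs_per_seed=16, mb_per_env=600)
headroom_gb = HOST_RAM_GB - total_gb
```

This evaluates to 28.8 GB for 48 workers, consistent with the ~30 GB figure.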
### 12-FFA self-play league (Stage 6.5)
```bash
python -m tooling.rl_self_play.train \
--run-name league-gen1 \
--map-type 12ffa-huge \
--opponent-pool models/league/gen0/best_model.zip \
--total-steps 1000000 \
--num-envs 4 \
--seed 7 \
--device cuda:1
```
12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the
bottleneck (the policy is a ~50k-param MLP). Verified on apricot
2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at
< 5% utilization. 1M steps ≈ 3.5h per league generation.
---
## Save format & forward compatibility
Every save records the `controller_id` AND `controller_hash` per slot
(SaveEnvelope v2). Loading a save with a controller the current install
doesn't have yields a friendly error from
`save_manager.gd::_validate_controllers_after_load`, not a crash mid-game.
**Mod authors:** never reuse a `controller_id` across incompatible weight
versions. Bump the version suffix (`learned:duel-v1c`, not `learned:duel-v1b`);
otherwise saves made with the old weights are attributed to the new ones.
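
The load-time check can be pictured as a comparison between the per-slot records in the save and the controllers the install provides. The sketch below is a Python illustration, not the GDScript in `save_manager.gd`. The record layout is hypothetical, and treating a hash mismatch as an error is an assumption about the intended behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotController:
    controller_id: str
    controller_hash: str


def validate_controllers(saved: list[SlotController],
                         installed: dict[str, str]) -> list[str]:
    """Return readable problems; an empty list means the save can load.

    `installed` maps controller_id -> controller_hash for this install.
    """
    problems = []
    for slot, ctl in enumerate(saved):
        have = installed.get(ctl.controller_id)
        if have is None:
            problems.append(
                f"slot {slot}: controller '{ctl.controller_id}' is not installed")
        elif have != ctl.controller_hash:
            problems.append(
                f"slot {slot}: controller '{ctl.controller_id}' differs from the saved version")
    return problems
```

Collecting every problem before reporting lets the error message list all missing controllers at once instead of failing on the first slot.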
---
## Ship-then-improve
The commercial release benefits more from "a real learned AI in-box at
launch" than from "a marginally better one at launch+30d." Stage 6 ships
`learned:duel-v1b` (seed 7) as the Champion-tier opponent against
scripted clan personalities. Stage 6.5 builds the self-play league and
specialist roster as a post-launch content patch that slots into
the existing controller-registry infrastructure without engine changes.
---
## Cross-references
- Modder contract: `docs/modding/ai-controller.md`
- ABI decisions: `docs/modding/abi-decisions.md`
- Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`
- AiController trait: `src/simulator/crates/mc-player-api/src/controllers.rs`
- Reward shape: `tooling/rl_self_play/magic_civ_env.py`
- Observation encoder: `tooling/rl_self_play/encoders.py`