docs(docs): 📝 Add deployment steps and monitoring guidelines for AI models in production
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent d747186f12
commit 76a2ea78da
1 changed file with 326 additions and 0 deletions
docs/ai-production.md (normal file, 326 lines added)
@@ -0,0 +1,326 @@
# AI Production Guide — Magic Civilization
How the game ships AI, how learned policies are trained, how to add a new
specialist, and how difficulty levels are constructed. This is the
designer/engineering reference. The community-facing modder contract is
`docs/modding/ai-controller.md`.

---

## TL;DR

- Two controller families, both selectable per slot:
  - **`scripted:*`**: MCTS + heuristic AI from the `mc-ai` crate.
    Transparent, hand-tunable, fast. Anchors the named clan personalities.
  - **`learned:*`**: neural policy trained with MaskablePPO. Strong,
    opaque. Anchors high-difficulty tiers and tournament play.
- Difficulty is **orthogonal to controller choice**: handicaps and policy
  temperature are stacked on top of either family.
- Specialization (rush / turtle / tech / economy) comes from **different reward
  functions on the same architecture**, each a separate `best_model.zip`
  shipped as its own controller.
- The strong-AI ceiling is raised by **AlphaZero search at inference** plus the
  **12-FFA self-play league** (Stage 6.7 + 6.9, post-launch).

---
## Coverage matrix — what each AI actually knows

This is the load-bearing diagnostic. The current `learned:duel-v1b` ships
with a 32-float hand-rolled observation vector that throws away most of
the engine's state. The scripted AI reads everything via `TacticalState` +
28 `ScoringWeights`.

| Concept | Scripted (`mc-ai` + 28 weights) | Learned (`encoders.py`) |
|---|---|---|
| Terrain biome / substrate / yields per tile | ✓ `TacticalTile` (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
| Flora / fauna entities | ✗ (not on wire) | ✗ |
| Production buildings | ✓ full data-driven catalog | ✗ **hardcoded 16-item list** (`CITY_QUEUE_ITEMS`) |
| Research / tech tree | ✓ via gate prerequisites | ✗ only `science_per_turn` |
| Strategic resource gating | ✓ `strategic_resources` | ✗ never reads stockpile |
| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
| Tiles worked per city | ✓ `tiles_worked` | ✗ |
| **Multi-step pathfinding at decision time** | ✗ 1-action lookahead | ✗ 1-action lookahead |

**The engine is not the bottleneck.** `PlayerView` already exposes every
piece of state in the left column (`TileView` carries biome / substrate /
river / improvement / visible / explored; `CityView.buildable[]` carries
the full catalog; `ResearchView` carries the whole tech tree; per-opponent
`DiplomacyView` is on the wire). The encoder is the bottleneck.

This matrix drives the 5-stage roadmap in
[`ai-roadmap.md`](./ai-roadmap.md).

---
## AlphaZero-readiness audit (2026-05-18)

The codebase is already structured for an AlphaZero-grade learned AI; the
hooks exist but nothing is plugged into them.

| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |

---
## How a learned policy actually works

**Not seed search.** The seed sets the RNG for weight initialization and
environment rollout order. Different seeds produce different local optima
of the **same** learning process; we run multiple seeds because PPO is
high-variance.

**Weight optimization via gradient descent.** Concretely:

1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps
   `observation (32 floats) → action distribution (322 logits)`. Weights
   start random.
2. **Rollout.** Policy samples action `a` from current state `s`;
   environment returns reward `r` and next state `s'`. Collect ~512 such
   transitions in a buffer.
3. **Advantage.** A critic network predicts expected return per state.
   Advantage `A(s, a) = actual_return − critic_prediction`. Positive
   advantage = action was better than baseline; negative = worse.
4. **PPO update.** Gradient-ascend the policy weights to make
   positive-advantage actions more probable and negative ones less. The
   objective clips the ratio of new to old action probability to
   `[0.8, 1.2]`, so a single update gains nothing from moving a
   probability by more than 20% (the "proximal" in PPO).
5. **Repeat** for 250k–1M environment steps. Weights drift from random to
   "actions that win games."

Three parallel seeds = three independent fits. We ship the best by
tournament win-rate; the others are discarded.
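
Steps 3 and 4 can be sketched per sample as below. This is a minimal illustration under stated assumptions, not the code in `train.py`: the function names are hypothetical, the clip range of 0.2 is taken from the "20%" figure above, and a real PPO implementation averages this objective over a batch and usually smooths the advantage estimate.

```python
import math


def advantage(actual_return: float, critic_prediction: float) -> float:
    """Step 3: how much better the outcome was than the critic expected."""
    return actual_return - critic_prediction


def clipped_surrogate(new_logp: float, old_logp: float, adv: float,
                      clip_range: float = 0.2) -> float:
    """Step 4: PPO clipped objective for one sample (higher is better).

    Hypothetical helper; new_logp / old_logp are log-probabilities of the
    sampled action under the updated and the rollout-time policy.
    """
    ratio = math.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any gain from pushing the ratio past the clip.
    return min(ratio * adv, clipped * adv)
```

With a positive advantage, raising the action's probability by 50% is credited only as a 20% increase, so the gradient stops pushing once the ratio leaves the clip window.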
|
||||
|
||||
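**Action masking.** MaskablePPO applies a legal-action mask to the logits
before sampling: the logits of illegal actions are replaced with a very
large negative value, so their probability is effectively zero and the
policy can never propose an illegal action. (Multiplying a logit by zero
would not achieve this, because a logit of 0 still has non-zero
probability.) The mask comes from `encode_legal_actions()` in
`tooling/rl_self_play/encoders.py`.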
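
A minimal sketch of masked sampling, assuming the mask is a list of booleans with one entry per action. The helper names are hypothetical and this is not the sb3-contrib implementation; it uses `-inf` for illegal logits, which gives them exactly zero probability.

```python
import math
import random


def masked_probs(logits, legal_mask):
    """Softmax over logits with illegal actions forced to probability 0."""
    if not any(legal_mask):
        raise ValueError("at least one action must be legal")
    masked = [l if ok else -math.inf for l, ok in zip(logits, legal_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, legal_mask, rng):
    """Draw one action index; illegal indices can never be returned."""
    probs = masked_probs(logits, legal_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(7)
    print(sample_action([1.0, 2.0, 3.0], [True, False, True], rng))
```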
|
||||
|
||||
---

## Controller families

### `scripted:*`

| Controller ID | Use |
|---|---|
| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
| `scripted:warmonger` | Personality: war-weight 2.0, expansion-weight 1.5. |
| `scripted:builder` | Personality: economy-weight 2.0, war-weight 0.5. |
| `scripted:tinkersmith` | Personality: tech-weight 2.5, military-weight 1.0. |
| `scripted:peaceful` | Personality: war-weight 0.3, diplomacy-weight 2.0. |
| `scripted:opportunist` | Personality: dynamic re-weighting from situation. |

Personalities are pure data in `public/games/age-of-dwarves/data/ai_personalities.json`.
Adding a new one is a JSON edit; no Rust changes.

### `learned:*`

| Controller ID | Use |
|---|---|
| `learned:duel-v1b` | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
| `learned:rush` *(Stage 6.5)* | Reward-shaped for early military pressure. |
| `learned:turtle` *(6.5)* | Reward-shaped for defensive consolidation. |
| `learned:tech` *(6.5)* | Reward-shaped for research throughput. |
| `learned:economy` *(6.5)* | Reward-shaped for gold + city count. |
| `learned:league-genN` *(6.5)* | Self-play league generations. |

Each `learned:*` ships as its own `.wasm` mod (~400 KB after ONNX → tract
compile). Native `.so/.dll/.dylib` variants ship signed for users opting
into the faster path.

---
## Specialization via reward shaping

Same network architecture (`encoders.py` + 2-layer MLP). Different reward
function trained with the same `train.py` loop. Each variant produces a
separate `best_model.zip` registered as a distinct controller.

**Baseline reward** (current `magic_civ_env.py`):
```
+1.0   on win  (game_over event, winner == me)
-1.0   on loss (game_over event, winner != me)
+1e-2  per turn advance
+1e-3  per score_estimate delta
-5e-4  per step (anti-stalling)
```
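
Read as a per-step function, the baseline could look like the sketch below. The function name and arguments are hypothetical (the real logic lives in `magic_civ_env.py`), and "per score_estimate delta" is assumed to mean `1e-3` times the change in `score_estimate` since the previous step.

```python
def baseline_reward(game_over: bool, winner_is_me: bool,
                    turn_advanced: bool, score_delta: float) -> float:
    """Baseline shaping for one environment step (illustrative only)."""
    reward = -5e-4  # anti-stalling cost paid on every step
    if turn_advanced:
        reward += 1e-2
    reward += 1e-3 * score_delta  # assumed: scaled change in score_estimate
    if game_over:
        reward += 1.0 if winner_is_me else -1.0  # terminal win/loss signal
    return reward
```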

**Specialist overlays** (added on top of baseline):

| Variant | Extra reward terms |
|---|---|
| `rush` | `+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80 |
| `turtle` | `+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built |
| `tech` | `+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked |
| `economy` | `+1e-3` per gold-reserve delta; `+0.5` per city founded |

Tuning rule: extras must sum to less than the terminal `±1.0` across a
typical game, otherwise the policy learns the shaping signal instead of
winning. Validate per-variant: `evaluate.py` must show win-rate ≥ baseline
when the specialist is used, not just "the specialist's shaping signal is
higher."

Adding a new specialist:

1. Add an entry to `magic_civ_env.py::RewardOverlay` enum + the shaping
   logic in `step()`.
2. Run `train.py --reward-overlay <name> --total-steps 250000 --seed 7`.
3. Evaluate vs `scripted:default` and vs `learned:duel-v1b`.
4. If win-rate ≥ 0.55 against both, ship as `learned:<name>`.
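
Steps 1 and 4 might be sketched as follows. Only the `RewardOverlay` name comes from this guide; the enum members, the `deltas` dictionary keys, and both helper functions are assumptions made for illustration, using the `tech` and `economy` terms from the overlay table.

```python
from enum import Enum


class RewardOverlay(Enum):
    """Hypothetical members; the real enum lives in magic_civ_env.py."""
    NONE = "none"
    TECH = "tech"
    ECONOMY = "economy"


def overlay_reward(overlay: RewardOverlay, deltas: dict) -> float:
    """Extra shaping added on top of the baseline for one step."""
    if overlay is RewardOverlay.TECH:
        return (5e-3 * deltas.get("science_per_turn", 0.0)
                + 0.3 * deltas.get("techs_unlocked", 0))
    if overlay is RewardOverlay.ECONOMY:
        return (1e-3 * deltas.get("gold_reserve", 0.0)
                + 0.5 * deltas.get("cities_founded", 0))
    return 0.0


def should_ship(win_rate_vs_scripted: float, win_rate_vs_learned: float,
                threshold: float = 0.55) -> bool:
    """Step 4: the specialist must clear the bar against both opponents."""
    return min(win_rate_vs_scripted, win_rate_vs_learned) >= threshold
```

Keeping the overlay a pure function of per-step deltas makes it easy to sum the extras over a recorded game and check them against the tuning rule.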
|
||||
|
||||
---

## Difficulty system

Difficulty is **never** "a weaker neural net." Two orthogonal levers:

### 1. Resource handicaps

Per-difficulty multipliers in
`public/games/age-of-dwarves/data/difficulty.json` (schema TBD):
```json
{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}
```

Applied at city-yield + unit-creation time in `mc-economy`.
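
The arithmetic is a single multiply per yield. The sketch below only illustrates it in Python: the real code path is in the `mc-economy` crate, the schema above is still marked TBD, and `scaled_yield` is a hypothetical helper.

```python
import json

# The example entry from above, parsed as one difficulty record.
SETTLER = json.loads("""
{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}
""")


def scaled_yield(base_yield: float, difficulty: dict, is_ai: bool) -> float:
    """Apply the per-difficulty resource multiplier to a city yield."""
    key = "ai_resource_mul" if is_ai else "human_resource_mul"
    return base_yield * difficulty[key]
```

On Settler, an AI city with a base yield of 10 would collect 7, while a human city keeps the full 10.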

### 2. Policy temperature

For `learned:*` controllers, a `temperature: f32` field on the controller
config divides the logits before sampling:
```
softmax(logits / T)
```
- `T = 1.0`: base policy.
- `T > 1.0`: softer distribution, more random, easier.
- `T < 1.0`: sharper, near-greedy, harder.
- `T → 0`: argmax (deterministic).
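
A small sketch of temperature scaling, assuming plain Python lists of logits. The function name is hypothetical, and treating `T <= 0` as the argmax limit is a choice made here to avoid dividing by zero, not something the controller config specifies.

```python
import math


def temperature_probs(logits, temperature: float):
    """softmax(logits / T); T <= 0 is treated as the argmax limit."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    for t in (1.5, 1.0, 0.3, 0.0):
        print(t, [round(p, 3) for p in temperature_probs(logits, t)])
```

Running it shows the probability of the top action growing as `T` falls, which is the mechanism behind the harder tiers.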
|
||||
|
||||
Implementation: ~10 LOC in `WasmAiController::decide_turn` (apply scaling
before the wasm guest samples, OR pass T through as a guest parameter and
let the guest apply it). Stage 6.5 work.

### Recommended Game 1 ladder

| Difficulty | Controller | T | Handicap |
|---|---|---|---|
| Settler | `scripted:peaceful` | n/a | AI ×0.7 |
| Chieftain | `scripted:default` | n/a | none |
| Warlord | `scripted:*` rotating | n/a | none |
| King | `learned:league-best` | 1.5 | none |
| Champion | `learned:league-best` | 0.3 | AI ×1.3 |

---
## Training infrastructure

### Hosts

- **Edit host (mac):** authoring; never trains.
- **Run host (apricot.lan):** 64-core Threadripper, 94 GB RAM, 2×3090.
  All training runs here.
- **Plum:** screenshot capture only; no training.

### Layout

```
tooling/rl_self_play/
├── train.py           # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py        # Hard win-rate measurement
├── magic_civ_env.py   # Gymnasium wrapper + reward shaping
├── encoders.py        # PlayerView ↔ obs/action tensors
├── harness_client.py  # JSON-Lines subprocess to Godot headless
├── models/<run-name>/ # best_model.zip per training run
└── runs/<run-name>/   # tensorboard event files
```

`tooling/rl_self_play/models/` and `runs/` are gitignored (bulky; not
artifacts of the source repo).
### Single-game training (duel)

```bash
ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
    --run-name duel-v1b \
    --total-steps 250000 \
    --num-envs 16 \
    --seed 7 \
    --device cuda:1
```

`--num-envs N` runs N parallel headless Godot subprocesses;
Stable-Baselines3's `SubprocVecEnv` lock-steps them. Scaling is sub-linear
because env-step is I/O-bound on JSON-Lines, not GPU-bound (the policy net
is tiny). Past 16 envs per training run, returns diminish.

### Parallel seed runs

Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses
on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total,
which fits inside 94 GB with margin for the OS + other services.
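
The headroom estimate is easy to recompute when the seed count or env count changes. A minimal sketch, assuming the ~600 MB per headless Godot process quoted above (the function name is hypothetical):

```python
def godot_worker_footprint(seeds: int, envs_per_seed: int,
                           mb_per_env: float = 600.0):
    """Return (worker subprocess count, approximate RAM in GB)."""
    workers = seeds * envs_per_seed
    return workers, workers * mb_per_env / 1000.0


if __name__ == "__main__":
    workers, gb = godot_worker_footprint(seeds=3, envs_per_seed=16)
    print(f"{workers} workers, ~{gb:.1f} GB of 94 GB")
```

For three seeds at 16 envs each this gives 48 workers and about 28.8 GB, consistent with the rounded ~30 GB figure.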

### 12-FFA self-play league (Stage 6.5)

```bash
python -m tooling.rl_self_play.train \
    --run-name league-gen1 \
    --map-type 12ffa-huge \
    --opponent-pool models/league/gen0/best_model.zip \
    --total-steps 1000000 \
    --num-envs 4 \
    --seed 7 \
    --device cuda:1
```

12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the
bottleneck (the policy is a ~50k-param MLP). Verified on apricot
2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at
< 5% utilization. 1M steps ≈ 3.5h per league generation.

---
## Save format & forward compatibility

Every save records the `controller_id` AND `controller_hash` per slot
(SaveEnvelope v2). Loading a save with a controller the current install
doesn't have yields a friendly error from
`save_manager.gd::_validate_controllers_after_load`, not a crash mid-game.

**Mod authors:** never reuse a `controller_id` across incompatible weight
versions. Bump the version suffix (`learned:duel-v1c`, not `learned:duel-v1b`)
or saves from the old binary will be mis-attributed to the new one.
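
The load-time check can be pictured as below. This is a Python sketch of the idea only: the real check is the GDScript function named above, the slot dictionary layout is an assumption, and reporting a hash mismatch as an error is a guess at the intended behavior rather than something this guide states.

```python
def validate_controllers(saved_slots, installed):
    """Return human-readable errors for slots whose controller is unusable.

    saved_slots: list of dicts with "slot", "controller_id", "controller_hash"
                 (hypothetical layout of the per-slot save data).
    installed:   dict mapping controller_id -> hash for the current install.
    """
    errors = []
    for slot in saved_slots:
        cid = slot["controller_id"]
        if cid not in installed:
            errors.append(f"slot {slot['slot']}: controller '{cid}' is not installed")
        elif installed[cid] != slot["controller_hash"]:
            # Assumed policy: same id with different weights is rejected.
            errors.append(f"slot {slot['slot']}: controller '{cid}' has different weights")
    return errors
```

An empty list means the save can load; anything else is surfaced to the player before the game starts instead of failing mid-turn.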
|
||||
|
||||
---

## Ship-then-improve

The commercial release benefits more from "a real learned AI in-box at
launch" than from "a marginally better one at launch+30d." Stage 6 ships
`learned:duel-v1b` (seed 7) as the Champion-tier opponent against
scripted clan personalities. Stage 6.5 builds the self-play league and
specialist roster as a post-launch content patch, which slots into
the existing controller-registry infrastructure without engine changes.

---
## Cross-references

- Modder contract: `docs/modding/ai-controller.md`
- ABI decisions: `docs/modding/abi-decisions.md`
- Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`
- AiController trait: `src/simulator/crates/mc-player-api/src/controllers.rs`
- Reward shape: `tooling/rl_self_play/magic_civ_env.py`
- Observation encoder: `tooling/rl_self_play/encoders.py`