History

Natalie 60c8ce0ef6 fix(simulator): 🐛 AI/suggest production city_id round-trip + restore gdext build Exposed by a new hotseat full-game driver (drives both player seats over the multi-slot wire, no AI dependency) — a 31-turn 2-player game surfaced these. - mc-player-api: the AI→PlayerAction converter (apply_ai_action + the suggest sibling) emitted the bare tactical city index ("0") for QueueProduction, but find_city_indices needs the projector wire id "{player}_{c_idx}" — so every AI/suggested queue_production failed UnknownCity. This silently broke the in-box AI's production-steering, not just the wire. Emit the wire id at all three sites; thread slot into the suggest converter; add a regression test. Result in the playthrough: roundtrip failures 58→1, city_building_completed 0→18. - api-gdext: advance_round_phase/end_player_round_phase did not compile at HEAD — godot-rust 0.2.4 Array::push needs &Dictionary (AsArg); Pcg64 builds via ::seed not ::seed_from_u64; dropped a dead rng binding. The gdext crate could not be rebuilt from source until this. - mc-worldsim: pub use GamePhase/RoundPhase (api-gdext references them through mc_worldsim; they were a private re-export → E0603). - tooling: add hotseat_playthrough.py — applies each seat's suggested actions and flags any offered action that fails to apply, with severity triage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>		2026-06-23 18:48:37 -04:00
tests
__init__.py
_export_onnx_p1_29f.py
bc_pretrain.py
capture_encoder_fixtures.py
diag_suggest.py
encoders.py
evaluate.py
harness_client.py	feat(@projects/@magic-civilization): ✨ add terraforming cascade design and fauna updates	2026-06-09 19:51:48 -07:00
hotseat_playthrough.py	fix(simulator): 🐛 AI/suggest production city_id round-trip + restore gdext build	2026-06-23 18:48:37 -04:00
magic_civ_env.py	feat(game): ✨ persist wind_direction for climate fidelity	2026-06-09 01:17:04 -07:00
mine_divergence.py
opponent.py
README.md
record_expert.py
requirements.txt
smoke.py
smoke_model_opponent.py
smoke_multi_slot.py
train.py

README.md

Magic Civilization RL self-play

Open-source-RL alternative to the cloud-LLM "Claude plays the game" loop. Wraps scripts/player-api-server.sh (the generic JSON-Lines player-API harness) as a Gymnasium environment, then trains a MaskablePPO policy against the harness's built-in AI as a frozen opponent. Reports win-rate against the baseline so we can see exactly when the trained policy beats our shipping MCTS.

Why this stack

OpenSpiel was the obvious first choice for multi-agent general-sum with built-in AlphaZero/MuZero, but the action space requires a custom Game C++ wrapper or an awkward Python-side adapter — too much boilerplate for the iteration we want.
alpha-zero-general is 2-player-only and doesn't compose with Magic Civilization's diplomacy actions cleanly.
stable-baselines3 + sb3-contrib MaskablePPO with the harness as a Gym env gets us a working loop in three files of Python, with action masking out of the box. We give up MuZero-style planning, but the harness already calls into our own MCTS for opponent slots. The RL policy doesn't need to plan ahead; it needs a good policy net.

See the References section at the bottom of this README for why this is the right shape.

Files

File	Role
`harness_client.py`	Subprocess wrapper around `player-api-server.sh`. JSON-Lines pump with typed `view`/`act`/`end_turn`/`shutdown`. Raises `HarnessError` on protocol violations.
`encoders.py`	`PlayerView` → fixed-shape `np.float32` observation; `legal_actions` → fixed-size discrete action index + boolean mask.
`magic_civ_env.py`	`gymnasium.Env` subclass exposing the harness as one episode = one game. Implements `action_masks()` for MaskablePPO.
`train.py`	CLI entry. Builds K parallel envs (each its own harness), runs MaskablePPO, periodically evaluates against the same baseline, saves best model.
`evaluate.py`	Standalone eval — load a saved model, run N games, print `{episodes, wins, losses, draws, win_rate, mean_turns}` JSON.
`smoke.py`	Stdlib-only CI gate. Drives the harness + encoders through a random-policy loop without importing `gymnasium`/`sb3`/`torch`. Prints a one-line JSON verdict; exit 0 on `passed: true`. Run before any training session to confirm the protocol layer is intact.
`requirements.txt`	Pinned versions; `pip install -r requirements.txt` is the one-time setup.
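To make the "JSON-Lines pump" role of `harness_client.py` concrete, here is a minimal sketch of the framing only: one JSON object per line in each direction, with `HarnessError` raised on a protocol violation. It is not the real client. The `JsonLinesPump` class, the `"cmd"` field, and the reply shape are hypothetical names chosen for illustration, and in-memory streams stand in for the harness subprocess pipes.

```python
import io
import json


class HarnessError(RuntimeError):
    """Raised when the peer breaks the one-JSON-object-per-line framing."""


class JsonLinesPump:
    """Minimal JSON-Lines framing over a pair of text streams (hypothetical)."""

    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def send(self, message: dict) -> None:
        # One compact JSON object per line; json.dumps escapes embedded newlines.
        self._writer.write(json.dumps(message, separators=(",", ":")) + "\n")
        self._writer.flush()

    def recv(self) -> dict:
        line = self._reader.readline()
        if not line:
            raise HarnessError("stream closed before a reply arrived")
        try:
            reply = json.loads(line)
        except json.JSONDecodeError as exc:
            raise HarnessError(f"reply is not valid JSON: {line!r}") from exc
        if not isinstance(reply, dict):
            raise HarnessError(f"expected a JSON object, got {type(reply).__name__}")
        return reply


# In-memory demo: a canned reply stands in for the harness subprocess.
fake_stdout = io.StringIO('{"ok": true, "turn": 1}\n')
fake_stdin = io.StringIO()
pump = JsonLinesPump(reader=fake_stdout, writer=fake_stdin)
pump.send({"cmd": "view"})  # "cmd" is a hypothetical field name
reply = pump.recv()
```

In a real client the two streams would presumably be the subprocess's stdout and stdin; failing loudly on EOF or malformed JSON keeps a crashed harness from looking like an empty reply.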

Methodology

Frozen opponent: the harness ships our shipping MCTS as the slot-1..N AI. The RL policy controls slot 0. The opponent's strength is constant while the policy trains, so improvement is directly measurable.
Iterate until beat: training runs until the eval win-rate against the frozen opponent crosses --target-win-rate (default 0.55). Once it does, save the model as a "graduated" snapshot and raise the target for the next run. Eventually the graduated snapshot becomes the new frozen opponent and the policy re-trains against it (the classic AlphaZero curriculum).
Action mask is load-bearing: MaskablePPO zeros the sampling distribution at masked positions. Without it, the policy spends half its time learning that 95% of action indices are illegal.
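The effect of the action mask can be illustrated with a small stdlib-only sketch. This is not sb3-contrib's implementation; it only shows the idea that illegal indices receive zero probability, so sampling can never select them. The function names `masked_probs` and `sample_action` are hypothetical.

```python
import math
import random


def masked_probs(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    if not any(mask):
        raise ValueError("at least one action must be legal")
    # Subtract the largest legal logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    weights = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits, mask, rng):
    """Draw one action index from the masked distribution."""
    probs = masked_probs(logits, mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0]
mask = [True, False, True, False]  # only indices 0 and 2 are legal
probs = masked_probs(logits, mask)
rng = random.Random(0)
samples = [sample_action(logits, mask, rng) for _ in range(1000)]
```

Note that index 3 has the highest raw logit but is never sampled, because the mask removes it before normalization.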

Run it

Smoke test the protocol layer first (no heavy deps required):

cd /Users/natalie/Code/@projects/@magic-civilization
python3 -m tooling.rl_self_play.smoke --turns 30
# → {"steps": 332, "turns_reached": 30, "mask_violations": 0,
#    "harness_errors": 0, "passed": true}
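If the smoke verdict needs to be checked from a wrapper script rather than by exit code, a sketch along these lines could parse it. The key names come from the sample output above; the `smoke_passed` helper itself is hypothetical, and cross-checking the violation counters in addition to `passed` is an extra assumption, not something the smoke script is known to require.

```python
import json

# Keys taken from the sample verdict shown above.
REQUIRED_KEYS = ("steps", "turns_reached", "mask_violations", "harness_errors", "passed")


def smoke_passed(verdict_line: str) -> bool:
    """Return True only if the verdict says passed and both counters are zero."""
    verdict = json.loads(verdict_line)
    missing = [key for key in REQUIRED_KEYS if key not in verdict]
    if missing:
        raise ValueError(f"verdict is missing keys: {missing}")
    return (
        verdict["passed"] is True
        and verdict["mask_violations"] == 0
        and verdict["harness_errors"] == 0
    )


good = ('{"steps": 332, "turns_reached": 30, "mask_violations": 0, '
        '"harness_errors": 0, "passed": true}')
bad = ('{"steps": 12, "turns_reached": 2, "mask_violations": 3, '
       '"harness_errors": 0, "passed": false}')
good_ok = smoke_passed(good)
bad_ok = smoke_passed(bad)
```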

Then install RL deps and train:

pip install -r tooling/rl_self_play/requirements.txt
python -m tooling.rl_self_play.train --total-steps 1_000_000 --num-envs 4
# In a second terminal:
tensorboard --logdir tooling/rl_self_play/runs/

Apricot GPU layout

Apricot has 2× NVIDIA RTX 3090 (24 GB each). The typical division:

cuda:0 — model-boss inference / commit-message daemon (frequently busy).
cuda:1 — free; use this for RL training to avoid contention.

ssh apricot
cd ~/Code/project-buildspace/magic-civilization   # or wherever the canonical checkout lives
pip install -r tooling/rl_self_play/requirements.txt   # one-time
python -m tooling.rl_self_play.train --device cuda:1 --num-envs 8 --total-steps 5_000_000

--device auto is the safe default for a single-GPU box or local Mac (mps on Apple Silicon). The MlpPolicy this scaffold uses fits in well under 1 GB VRAM, so the bottleneck is the harness CPU subprocesses rather than the GPU. Raise --num-envs (one harness each) to keep the GPU fed.

For evaluation only (no training):

python -m tooling.rl_self_play.evaluate \
  --model-path tooling/rl_self_play/models/duel-v1/best_model.zip \
  --episodes 50
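For reference, the summary JSON that evaluation prints could be computed along these lines. This is a sketch, not the code in `evaluate.py`: the `summarize` function and the `(outcome, turns)` input shape are hypothetical, and defining `win_rate` as wins divided by all episodes (draws counted in the denominator) is an assumption.

```python
import json
from collections import Counter

VALID_OUTCOMES = {"win", "loss", "draw"}


def summarize(results):
    """results: iterable of (outcome, turns), outcome in VALID_OUTCOMES."""
    results = list(results)
    if not results:
        raise ValueError("need at least one episode")
    counts = Counter(outcome for outcome, _ in results)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    episodes = len(results)
    return {
        "episodes": episodes,
        "wins": counts["win"],
        "losses": counts["loss"],
        "draws": counts["draw"],
        # Assumption: draws count against the win rate.
        "win_rate": counts["win"] / episodes,
        "mean_turns": sum(turns for _, turns in results) / episodes,
    }


demo = [("win", 180), ("loss", 200), ("win", 150), ("draw", 200)]
summary = summarize(demo)
print(json.dumps(summary))
```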

Honest caveats

The first win against the baseline will take hours to days of training on Apricot's 64-core box. Magic Civilization has a large action space and 200-turn horizons; PPO is sample-inefficient compared to AlphaZero, and our action encoding is lossy (see the TODO list in encoders.py).
Action encoding discards information: per-tile detail isn't in the observation; the action head can only target adjacent hexes, not arbitrary positions. Upgrade to a CNN-on-tile-grid observation and a hierarchical action head once the basic loop is winning at least occasionally.
Single-slot only: this is 1v1 duel-mode for now. The 5-player huge-map setup that p1-22a validated needs MAX_PLAYERS=5 in the env config and 4 frozen opponents — straightforward extension once the duel loop trains.

References

OpenSpiel: A Framework for RL in Games — the canonical multi-agent RL framework; the right move once we need MuZero-style planning.
stable-baselines3 + sb3-contrib MaskablePPO — what we use today.
CivRealm (BIGAI, ICLR 2024) — closest analog; RL + LLM baselines for full-game Civilization.
Simulation-Driven Balancing with RL (arXiv 2503.18748) — the broader methodology this loop sits inside.
