Exposed by a new hotseat full-game driver (drives both player seats over the
multi-slot wire, no AI dependency) — a 31-turn 2-player game surfaced these.
- mc-player-api: the AI→PlayerAction converter (apply_ai_action + the suggest
sibling) emitted the bare tactical city index ("0") for QueueProduction, but
find_city_indices needs the projector wire id "{player}_{c_idx}" — so every
AI/suggested queue_production failed UnknownCity. This silently broke the
in-box AI's production-steering, not just the wire. Emit the wire id at all
three sites; thread slot into the suggest converter; add a regression test.
Result in the playthrough: roundtrip failures 58→1, city_building_completed 0→18.
- api-gdext: advance_round_phase/end_player_round_phase did not compile at HEAD —
godot-rust 0.2.4 Array::push needs &Dictionary (AsArg); Pcg64 builds via ::seed
not ::seed_from_u64; dropped a dead rng binding. The gdext crate could not be
rebuilt from source until this.
- mc-worldsim: pub use GamePhase/RoundPhase (api-gdext references them through
mc_worldsim; they were a private re-export → E0603).
- tooling: add hotseat_playthrough.py — applies each seat's suggested actions
and flags any offered action that fails to apply, with severity triage.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| Name |
|---|
| tests |
| __init__.py |
| _export_onnx_p1_29f.py |
| bc_pretrain.py |
| capture_encoder_fixtures.py |
| diag_suggest.py |
| encoders.py |
| evaluate.py |
| harness_client.py |
| hotseat_playthrough.py |
| magic_civ_env.py |
| mine_divergence.py |
| opponent.py |
| README.md |
| record_expert.py |
| requirements.txt |
| smoke.py |
| smoke_model_opponent.py |
| smoke_multi_slot.py |
| train.py |
Magic Civilization RL self-play
Open-source-RL alternative to the cloud-LLM "Claude plays the game"
loop. Wraps scripts/player-api-server.sh (the generic JSON-Lines
player-API harness) as a Gymnasium environment, then trains a
MaskablePPO policy against the harness's built-in AI as a frozen
opponent. Reports win-rate against the baseline so we can see
exactly when the trained policy beats our shipping MCTS.
Why this stack
- OpenSpiel was the obvious first choice for multi-agent general-sum with built-in AlphaZero/MuZero, but the action space requires a custom Game C++ wrapper or an awkward Python-side adapter: too much boilerplate for the iteration we want.
- alpha-zero-general is 2-player-only and doesn't compose with Magic Civilization's diplomacy actions cleanly.
- stable-baselines3 + sb3-contrib MaskablePPO with the harness as a Gym env gets us a working loop in three files of Python with action-masking out of the box. We give up MuZero-style planning, but the harness already calls into our own MCTS for opponent slots, so the RL policy doesn't need to plan ahead; it needs a good policy net.
See literature pointers at the bottom of this README for why this is the right shape.
Files
| File | Role |
|---|---|
| harness_client.py | Subprocess wrapper around player-api-server.sh. JSON-Lines pump with typed view/act/end_turn/shutdown. Raises HarnessError on protocol violations. |
| encoders.py | PlayerView → fixed-shape np.float32 observation; legal_actions → fixed-size discrete action index + boolean mask. |
| magic_civ_env.py | gymnasium.Env subclass exposing the harness as one episode = one game. Implements action_masks() for MaskablePPO. |
| train.py | CLI entry. Builds K parallel envs (each its own harness), runs MaskablePPO, periodically evaluates against the same baseline, saves best model. |
| evaluate.py | Standalone eval: load a saved model, run N games, print {episodes, wins, losses, draws, win_rate, mean_turns} JSON. |
| smoke.py | Stdlib-only CI gate. Drives the harness + encoders through a random-policy loop without importing gymnasium/sb3/torch. Prints a one-line JSON verdict; exit 0 on passed: true. Run before any training session to confirm the protocol layer is intact. |
| requirements.txt | Pinned versions; pip install -r requirements.txt is the one-time setup. |
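The JSON-Lines framing that harness_client.py is described as pumping can be illustrated with a minimal sketch. The field names ("op", "slot", "error") and the helper names encode_request and decode_reply are assumptions made for illustration; the real wire schema lives in the harness and is not reproduced here.

```python
import json


class HarnessError(RuntimeError):
    """Raised when a reply violates the (assumed) JSON-Lines protocol."""


def encode_request(op: str, **params) -> str:
    """Serialise one request as a single newline-terminated JSON line."""
    # The "op" field name is an assumption for this sketch.
    return json.dumps({"op": op, **params}, separators=(",", ":")) + "\n"


def decode_reply(line: str) -> dict:
    """Parse one reply line, rejecting anything that is not a JSON object."""
    try:
        reply = json.loads(line)
    except json.JSONDecodeError as exc:
        raise HarnessError(f"non-JSON reply: {line!r}") from exc
    if not isinstance(reply, dict):
        raise HarnessError(f"expected a JSON object, got {type(reply).__name__}")
    if "error" in reply:  # assumed error field
        raise HarnessError(str(reply["error"]))
    return reply


request = encode_request("end_turn", slot=0)
reply = decode_reply('{"ok": true, "turn": 7}')
```

Keeping one JSON object per line means the client can read replies with a plain readline() and treat anything unparseable as a protocol violation rather than trying to resynchronise.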
Methodology
- Frozen opponent: the harness ships our shipping MCTS as the slot-1..N AI. The RL policy controls slot 0. The opponent's strength is constant while the policy trains, so improvement is directly measurable.
- Iterate until beat: training runs until eval win-rate against the frozen opponent crosses --target-win-rate (default 0.55). Cross at 0.55+ → save as a "graduated" snapshot; raise the target for the next run; eventually use the graduated snapshot as the new frozen opponent and re-train against itself (classic AlphaZero curriculum).
- Action mask is load-bearing: MaskablePPO zeros the sampling distribution at masked positions. Without it, the policy spends half its time learning that 95% of action indices are illegal.
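The effect of the action mask on sampling can be sketched in plain Python. This is an illustration of the idea only: sb3-contrib applies the mask inside its own distribution class, and the function names masked_probs and sample_action below are hypothetical.

```python
import random


def masked_probs(probs, mask):
    """Zero the illegal entries and renormalise what remains."""
    kept = [p if legal else 0.0 for p, legal in zip(probs, mask)]
    total = sum(kept)
    if total <= 0.0:
        raise ValueError("mask leaves no legal action with positive probability")
    return [p / total for p in kept]


def sample_action(probs, mask, rng):
    """Draw one action index; masked indices can never be returned."""
    weights = masked_probs(probs, mask)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


probs = [0.4, 0.3, 0.2, 0.1]       # raw policy output over 4 action slots
mask = [False, True, False, True]  # only indices 1 and 3 are legal
dist = masked_probs(probs, mask)
action = sample_action(probs, mask, random.Random(0))
```

Because illegal indices carry zero weight, the policy never wastes samples on them, which is the point the bullet above makes about the mask being load-bearing.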
Run it
Smoke test the protocol layer first (no heavy deps required):
cd /Users/natalie/Code/@projects/@magic-civilization
python3 -m tooling.rl_self_play.smoke --turns 30
# → {"steps": 332, "turns_reached": 30, "mask_violations": 0,
# "harness_errors": 0, "passed": true}
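For scripting the gate, the one-line verdict can be checked programmatically. The sketch below assumes the verdict is the last non-empty line of stdout and uses only the keys shown in the sample output above; smoke_passed is a hypothetical helper, not part of the tooling.

```python
import json


def smoke_passed(stdout: str) -> bool:
    """Return True only if the last non-empty line is a clean passing verdict."""
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        return False
    try:
        verdict = json.loads(lines[-1])
    except json.JSONDecodeError:
        return False
    if not isinstance(verdict, dict):
        return False
    return (
        verdict.get("passed") is True
        and verdict.get("mask_violations") == 0
        and verdict.get("harness_errors") == 0
    )


sample = ('{"steps": 332, "turns_reached": 30, "mask_violations": 0, '
          '"harness_errors": 0, "passed": true}')
ok = smoke_passed(sample)
```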
Then install RL deps and train:
pip install -r tooling/rl_self_play/requirements.txt
python -m tooling.rl_self_play.train --total-steps 1_000_000 --num-envs 4
# In a second terminal:
tensorboard --logdir tooling/rl_self_play/runs/
Apricot GPU layout
Apricot has 2× NVIDIA RTX 3090 (24 GB each). The typical division:
- cuda:0 runs model-boss inference and the commit-message daemon (frequently busy).
- cuda:1 is free; use this for RL training to avoid contention.
ssh apricot
cd ~/Code/project-buildspace/magic-civilization # or wherever the canonical checkout lives
pip install -r tooling/rl_self_play/requirements.txt # one-time
python -m tooling.rl_self_play.train --device cuda:1 --num-envs 8 --total-steps 5_000_000
--device auto is the safe default for a single-GPU box or local Mac
(mps on Apple Silicon). The MlpPolicy this scaffold uses fits in
well under 1 GB VRAM, so the bottleneck is the harness CPU subprocesses
rather than the GPU. Raise --num-envs (one harness each) to keep
the GPU fed.
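How --device auto might resolve to a concrete device can be sketched as a pure function. The preference order (CUDA, then MPS, then CPU) is an assumption, and resolve_device is a hypothetical helper; how train.py actually handles the flag is not shown in this README. In real code the two booleans would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Map a --device value to a concrete device string (assumed policy)."""
    if requested != "auto":
        # An explicit choice such as "cuda:1" is passed through untouched.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


on_apricot = resolve_device("cuda:1", cuda_available=True, mps_available=False)
on_mac = resolve_device("auto", cuda_available=False, mps_available=True)
on_ci = resolve_device("auto", cuda_available=False, mps_available=False)
```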
For evaluation only (no training):
python -m tooling.rl_self_play.evaluate \
--model-path tooling/rl_self_play/models/duel-v1/best_model.zip \
--episodes 50
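The summary JSON that evaluate prints, and the graduation check against --target-win-rate, can be sketched as follows. The per-game input shape (outcome, turns), the choice to count draws in the win-rate denominator, and the helper names summarise and graduated are all assumptions for illustration.

```python
import json
from statistics import mean


def summarise(results):
    """results: list of (outcome, turns) with outcome in {"win", "loss", "draw"}."""
    episodes = len(results)
    wins = sum(1 for outcome, _ in results if outcome == "win")
    losses = sum(1 for outcome, _ in results if outcome == "loss")
    draws = sum(1 for outcome, _ in results if outcome == "draw")
    return {
        "episodes": episodes,
        "wins": wins,
        "losses": losses,
        "draws": draws,
        # Assumption: draws count as non-wins in the denominator.
        "win_rate": wins / episodes if episodes else 0.0,
        "mean_turns": mean(turns for _, turns in results) if episodes else 0.0,
    }


def graduated(summary, target_win_rate=0.55):
    """True once the eval win-rate reaches the target (default from the README)."""
    return summary["win_rate"] >= target_win_rate


games = [("win", 120), ("loss", 200), ("win", 150), ("draw", 200)]
summary = summarise(games)
print(json.dumps(summary))
```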
Honest caveats
- First successful win against the baseline will take hours to days of training on apricot's 64-core box. Magic Civilization has a large action space and 200-turn horizons; PPO is sample-inefficient compared to AlphaZero, and our action encoding is lossy (see TODO list in encoders.py).
- Action encoding discards information: per-tile detail isn't in the observation; the action head can only target adjacent hexes, not arbitrary positions. Upgrade to a CNN-on-tile-grid observation + a hierarchical action head once the basic loop is winning at least occasionally.
- Single-slot only: this is 1v1 duel-mode for now. The 5-player huge-map setup that p1-22a validated needs MAX_PLAYERS=5 in the env config and 4 frozen opponents, a straightforward extension once the duel loop trains.
References
- OpenSpiel: A Framework for RL in Games — the canonical multi-agent RL framework; the right move once we need MuZero-style planning.
- stable-baselines3 + sb3-contrib MaskablePPO — what we use today.
- CivRealm (BIGAI, ICLR 2024) — closest analog; RL + LLM baselines for full-game Civilization.
- Simulation-Driven Balancing with RL (arXiv 2503.18748) — the broader methodology this loop sits inside.