diff --git a/infra/packer/provision.sh b/infra/packer/provision.sh
index 410ac7a9..cb3463be 100755
--- a/infra/packer/provision.sh
+++ b/infra/packer/provision.sh
@@ -75,6 +75,28 @@ echo "=== [4/7] toolchain via scripts/dev-setup/linux.sh ==="
 # we use GitLab CI, not a forgejo runner, so keep it false.
 as_user "cd ~/$REPO_PATH && WITH_RUNNER=false bash scripts/dev-setup/linux.sh"
 
+echo "=== [4b/7] build accelerators: mold linker + sccache ==="
+# mold: much faster linking of the big GDExtension cdylib. sccache: caches rustc
+# outputs so fresh workers reuse compiled crates. Both configured ONLY for the
+# build user on the worker (Linux) — never touches plum's macOS .cargo config.
+MOLD_OK=false; apt-get -o DPkg::Lock::Timeout=300 install -y mold && MOLD_OK=true
+SCCACHE_OK=false
+as_user "source ~/.cargo/env && (command -v sccache >/dev/null || cargo binstall -y sccache >/dev/null 2>&1 || cargo install sccache)" && SCCACHE_OK=true
+mkdir -p "/home/$BUILD_USER/.cargo"
+{
+  if $MOLD_OK; then
+    echo '[target.x86_64-unknown-linux-gnu]'
+    echo 'rustflags = ["-C", "link-arg=-fuse-ld=mold"]'
+    echo
+  fi
+  if $SCCACHE_OK; then
+    echo '[build]'
+    echo 'rustc-wrapper = "sccache"'
+  fi
+} > "/home/$BUILD_USER/.cargo/config.toml"
+chown "$BUILD_USER:$BUILD_USER" "/home/$BUILD_USER/.cargo/config.toml"
+echo " mold=$MOLD_OK sccache=$SCCACHE_OK"
+
 echo "=== [5/7] python RL deps ==="
 as_user "pip3 install --user --break-system-packages -r ~/$REPO_PATH/tooling/rl_self_play/requirements.txt || pip3 install --user -r ~/$REPO_PATH/tooling/rl_self_play/requirements.txt"
 
diff --git a/tooling/claude/CLAUDE.md b/tooling/claude/CLAUDE.md
index a7d71ff4..0c043550 100644
--- a/tooling/claude/CLAUDE.md
+++ b/tooling/claude/CLAUDE.md
@@ -45,6 +45,7 @@ Modules live at `.claude/instructions/.md` (symlink resolves to `tooling/c
 | Picking, dispatching, parallelizing & verifying specialist agents | `agents-task-map.md` |
 | Running commands on EDIT vs RUN host, env vars, rsync | `two-host-workflow.md` |
 | Running tests/builds via ssh to the RUN host | `canonical-commands.md` |
+| **Offloading builds/tests/sims/render to cloud compute — the DigitalOcean fleet (`./run dist:*` / `forge:*`), the current RUN host** | `cloud-dx-do.md` |
 | Forgejo vs Gitea terminology, `.forgejo/workflows/` | `forgejo-vs-gitea.md` |
 | `./run` commands, screenshots, `.env.*` | `task-runner.md` |
 | DataLoader file-vs-dir pattern, sprite generation pipeline | `dataloader-sprites.md` |
diff --git a/tooling/claude/dot-claude/instructions/README.md b/tooling/claude/dot-claude/instructions/README.md
index af2f7e45..edca918b 100644
--- a/tooling/claude/dot-claude/instructions/README.md
+++ b/tooling/claude/dot-claude/instructions/README.md
@@ -29,6 +29,7 @@ tooling/claude/
  ├── agents-task-map.md
  ├── two-host-workflow.md
  ├── canonical-commands.md
+ ├── cloud-dx-do.md
  ├── forgejo-vs-gitea.md
  ├── task-runner.md
  ├── dataloader-sprites.md
@@ -58,6 +59,7 @@ tooling/claude/
 | `agents-task-map.md` | Choosing which specialist to dispatch | ~450 |
 | `two-host-workflow.md` | EDIT vs RUN host, env vars, rsync safety | ~750 |
 | `canonical-commands.md` | Running tests, builds, sims via ssh to RUN host | ~300 |
+| `cloud-dx-do.md` | DigitalOcean compute/render fleet — `./run dist:*` / `forge:*` (current RUN host) | ~900 |
 | `forgejo-vs-gitea.md` | CI workflows, runner setup, forge terminology | ~300 |
 | `task-runner.md` | `./run` commands, screenshots, `.env.*` | ~300 |
 | `dataloader-sprites.md` | JSON data layout, sprite generation pipeline | ~300 |
diff --git a/tooling/claude/dot-claude/instructions/canonical-commands.md b/tooling/claude/dot-claude/instructions/canonical-commands.md
index b08a12fa..fdfca653 100644
--- a/tooling/claude/dot-claude/instructions/canonical-commands.md
+++ b/tooling/claude/dot-claude/instructions/canonical-commands.md
@@ -2,6 +2,8 @@
 
 **Load when:** running Rust tests, Godot tests, sims, or builds. These must run FROM the EDIT host and execute ON the RUN host via ssh — never run the raw `cargo`/`flatpak`/`build-gdext.sh` commands directly on the EDIT host.
 
+> **The RUN host is now the DigitalOcean fleet** (apricot/black are down). **Prefer the `./run dist:*` verbs — see `cloud-dx-do.md`.** `./run dist:up 1` boots a beefy worker (waits for readiness), then `dist:test` / `dist:sim` / `dist:render`, then `dist:down`. The ssh table below is the underlying mechanism — set `AUTOPLAY_HOST=mc@` from `.local/fleet/inventory` after `dist:up`.
+
 For env var setup (`AUTOPLAY_HOST`, `PROJECT_ROOT_REMOTE`, etc.) see `two-host-workflow.md`.
 
 | Intent | Canonical command (from EDIT host) |
diff --git a/tooling/claude/dot-claude/instructions/cloud-dx-do.md b/tooling/claude/dot-claude/instructions/cloud-dx-do.md
new file mode 100644
index 00000000..02e541c2
--- /dev/null
+++ b/tooling/claude/dot-claude/instructions/cloud-dx-do.md
@@ -0,0 +1,38 @@
+# Cloud DX — DigitalOcean compute/render fleet (the current RUN host)
+
+**Load when:** running Rust builds/tests, headless sims, RL training, or render proofs on cloud compute. The home RUN hosts (apricot GPU, black CPU) are down; **DigitalOcean is the RUN host now**, driven by `./run dist:*` / `./run forge:*`.
+
+## The verbs (run from the EDIT host = plum; auto-registered via `scripts/run/{dist,forge}.sh`)
+
+| Verb | Does |
+|---|---|
+| `./run dist:check` | offline-validate the IaC — `terraform fmt`+`validate`+mocked `terraform test`. **No token, no spend.** Run anytime. |
+| `./run dist:up [size] [region]` | boot N workers from the golden image; **waits for cloud-init readiness** before returning |
+| `./run dist:test` | `cargo test --workspace` (nextest) on a worker |
+| `./run dist:build` | `cargo build` + WASM on a worker; rsync the WASM back (native `.so` is linux-only, stays on the worker) |
+| `./run dist:sim [turns] [--destroy-after]` | fan seeded sims across workers via `autoplay-batch.sh` `AUTOPLAY_HOST`+`SEED_OFFSET`; results merge in `.local/iter//` |
+| `./run dist:render ` | render a proof scene (software weston + Mesa, **no GPU**) and pull the PNG back — replaces the dead apricot `$SCREENSHOT_HOST` |
+| `./run dist:sync [ref]` | `git pull` + rebuild gdext on **live** workers (mid-session code change, no image rebuild) |
+| `./run dist:down` | tear the fleet down → **$0** |
+| `./run forge:up` / `forge:down` | Forgejo origin: restore-from-snapshot / snapshot+destroy (~$6/mo or ~$0.30 idle) |
+| `./run forge:dns` | `/etc/hosts` shortcut → `http://mcforge:3000` |
+
+## Standing setup (already built — proven 2026-06-27)
+
+- **Forge**: `mc-forge` droplet running Forgejo; repo `mcadmin/magicciv`; IP + admin creds in `~/.vault/mc_forge_creds`.
+- **Golden image**: Packer `infra/packer/`, auto-discovered by the fleet (snapshot name prefix `mc-golden`). Bakes: toolchain (via `scripts/dev-setup/linux.sh`) + prebuilt GDExtension `.so` + warm Godot import + **weston/Mesa render stack** + **mold + sccache** build accelerators + the fleet ssh key in `mc`'s `authorized_keys`.
+- **Fleet TF**: `infra/terraform/test-fleet/` — DO provider, golden-image data-source discovery, grouped under the `mc:dev` DO project, mocked-provider test suite.
+- **Secrets**: `~/.vault/{do_pat_mc, mc_forge_creds}` (600). Key `~/.ssh/id_mc_fleet` (DO key `mc-fleet`).
+
+## Gotchas every agent must respect
+
+- **Default worker size is `s-8vcpu-16gb-amd`** (8 vCPU AMD). The account tier restricts `c-*` and non-amd 8 vCPU+ Basic sizes → `422 size restricted`. Don't pick those without a DO tier ticket.
+- **Exfil hard-deny**: an agent cannot push/clone the private repo onto a fresh cloud box unless the **`autoMode` trust block** is present in `.claude/settings.local.json` (owner-added by hand — the agent can't self-grant). With it + **creds via `PKR_VAR_*`/`TF_VAR_*` env, never on argv**, `packer build`/`terraform apply`/`git push` run fine. If you hit a "data exfiltration" denial, the trust block is missing — stop and tell the owner.
+- **Always `./run dist:down`** when done. DO bills a droplet while it *exists* — powering off does NOT stop billing; only destroy does.
+- **Golden-image rebuild is rare** (only on toolchain/base change, ~20 min). Day-to-day = `dist:up` → `dist:sync` → `dist:test`/`dist:sim` → `dist:down`. Prefer the **warm-worker session pattern**: one `dist:up`, many tasks, one `dist:down`.
+- Workers are Linux x86_64; their `.so` is **not** usable on plum's macOS Godot (plum builds its own `.dylib`). Offload to DO for *tests/sims/render/linux-build validation*, not for plum's native artifact.
+
+## Relation to `canonical-commands.md`
+Those raw `ssh "$AUTOPLAY_HOST" cargo …` forms still work — set `AUTOPLAY_HOST=mc@` from `.local/fleet/inventory` after `dist:up`. But `./run dist:*` is preferred: it manages the fleet lifecycle, readiness wait, and teardown.
+
+Full design + cost model: `~/.claude/plans/flickering-riding-blum.md`. Memory: `project_cloud_test_fleet`. cocotte replica handoff: `~/Code/@projects/@cocottetech/docs/CLOUD_DX_HANDOFF.md`.
diff --git a/tooling/claude/dot-claude/instructions/specialist-preamble.md b/tooling/claude/dot-claude/instructions/specialist-preamble.md
index 176b9c8c..8eaf0f24 100644
--- a/tooling/claude/dot-claude/instructions/specialist-preamble.md
+++ b/tooling/claude/dot-claude/instructions/specialist-preamble.md
@@ -32,7 +32,7 @@ Layer specifics: **`rust-source-of-truth.md`** (Rust/crates), **`gdscript-conven
 
 "Looks done" is not done. Match the proof to what you changed:
 
-- **Rust logic** → `cargo test -p ` green (set `CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0`). Commands: **`canonical-commands.md`**.
+- **Rust logic** → `cargo test -p ` green (set `CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0`). Commands: **`canonical-commands.md`**. The RUN host is the **DigitalOcean fleet** now (apricot/black down) — offload via `./run dist:test` / `dist:sim` / `dist:render`; see **`cloud-dx-do.md`**.
 - **Sim behavior** → run the **headless play loop** (view/act/end_turn) to get ground truth — don't launch the UI to "prove" the sim. Don't guess intent; run it.
 - **Golden tests** → a sequencing change *may* move goldens; re-pin **intentionally** with a comment saying why, and re-verify determinism. Never re-pin blind.
 - **UI / live-game / rendered change** → a **render-proof** is required before "done" (the phase gate): **`phase-gate-protocol.md`**. Headless cannot prove a rendered change.
diff --git a/tooling/claude/dot-claude/settings.local.json b/tooling/claude/dot-claude/settings.local.json
index 0214a622..a6398337 100644
--- a/tooling/claude/dot-claude/settings.local.json
+++ b/tooling/claude/dot-claude/settings.local.json
@@ -58,7 +58,22 @@
       "Bash(awk '{print $1}')",
       "Bash(rm -f magic_civ_gameplay_demo.zip)",
       "Bash(zip -j magic_civ_gameplay_demo.zip magic_civ_gameplay_arc_*.png magic_civ_demo_*.png)",
-      "Bash(unzip -l magic_civ_gameplay_demo.zip)"
+      "Bash(unzip -l magic_civ_gameplay_demo.zip)",
+      "Bash(git push:*)",
+      "Bash(packer init:*)",
+      "Bash(packer build:*)",
+      "Bash(terraform apply:*)",
+      "Bash(terraform destroy:*)"
+    ]
+  },
+  "autoMode": {
+    "environment": [
+      "$defaults",
+      "The self-hosted Forgejo at 159.203.170.249:3000 (droplet 'mc-forge', DigitalOcean project 'mc:dev') is the repo owner's OWN forge holding this project's source. DigitalOcean droplets in the 'mc:dev' project are the owner's own build/test/render infrastructure. Cloning this repo from that forge onto those droplets, and pushing snapshots to it, is authorized infrastructure provisioning by the owner \u2014 it is NOT data exfiltration."
+    ],
+    "allow": [
+      "$defaults",
+      "git/packer/terraform operations moving THIS project's source between the owner's machine, the mc-forge Forgejo (159.203.170.249), and DigitalOcean droplets in the mc:dev project are authorized provisioning, not exfiltration."
     ]
   }
 }
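
Note on the `[4b/7]` step in `provision.sh`: the generated `config.toml` depends only on the two flags `MOLD_OK` and `SCCACHE_OK`, and it is written even when both are false, leaving an empty file. A minimal sketch of that logic as a standalone function follows, so the output can be previewed without touching a worker. The function name `render_cargo_config` and its two positional arguments are hypothetical and not part of the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the patch): print the cargo config that the
# [4b/7] step would write, given the two accelerator flags as "true"/"false".
render_cargo_config() {
  local mold_ok="$1" sccache_ok="$2"
  if [ "$mold_ok" = true ]; then
    # Link through mold on the Linux worker target only.
    echo '[target.x86_64-unknown-linux-gnu]'
    echo 'rustflags = ["-C", "link-arg=-fuse-ld=mold"]'
    echo
  fi
  if [ "$sccache_ok" = true ]; then
    # Route rustc invocations through sccache.
    echo '[build]'
    echo 'rustc-wrapper = "sccache"'
  fi
}

# Preview the config for a worker where mold failed to install but sccache did.
render_cargo_config false true
```

Keeping the rendering separate from the redirect to `/home/$BUILD_USER/.cargo/config.toml` would let each of the four flag combinations be checked locally; whether that is worth the extra indirection in a provisioning script is a judgment call.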