docs(objectives): 📝 Update validation section in RL divergence mining objectives to reflect current state and add batch analysis details
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent b7353d43e2
commit 32a694eab7
1 changed files with 63 additions and 13 deletions
@@ -111,20 +111,70 @@ action priors (`action_prior_with_context`), p1-29d tuned combat damage
min defenders → warrior; multi-city unaffected → warrior. `cargo test -p
mc-ai --lib`: 265 pass.

## Validation (before/after autoplay batch)
## Validation (before/after autoplay batch): GATE NOT MET

Baseline: apricot batch `20260516_183534`, with P1 buildings = 0 in 10/10,
`tier_peak = 1` in 10/10, 0/10 gate.
Local 10-seed T300 batch on the patched build:
`.local/batches/p1_29e_after` (fresh GDExtension rebuild from working tree).

After (this patch): _<pending; filled by the validation batch below>_
**Headline (do not be fooled):** vs the stale apricot baseline
`20260516_183534`, P1 `tier_peak` rose 1 → 2-5 in **10/10** seeds. This is
**NOT attributable to this patch**; it is main-branch drift. Three facts
establish that:

```
PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
# analyze: P1 buildings_max > 0 ? P1 tier_peak >= 2 ? P1 median survival turn ?
```
1. `tier_peak` is defined as *highest tech-era researched*
   (`turn_processor.gd::_player_tier_peak`; mirrored in
   `auto_play.gd`). It is research-driven, NOT building- or unit-driven.
2. This patch only adds **buildings**. P1 completed **zero buildings** in
   all 10 seeds (`player_stats.buildings` max = 0; only `owner=0` appears in
   `city_building_completed`). The patch produced no material change a
   research metric could reflect.
3. The baseline `20260516_183534` is a *different commit*; comparing across
   it conflates this patch with all intervening main-branch changes.

Per p1-29d iteration discipline: any directional movement (P1 builds >0
buildings; survival-turn or tier_peak rises) confirms the lever direction
even if the full ≥7/10 gate does not land in one pass. If the gate is still
0–2/10, this objective stays `partial` and the next iteration targets the
production-scale handicap (candidate 3), not another priors tweak.

So the apparent improvement is a measurement artifact of comparing against a
stale-commit baseline. **The completion gate (a candidate validated by a
before/after batch showing the metric move *because of the candidate*) is NOT
met.**
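
The three checks named in the batch's analysis comment (P1 `buildings` max > 0, P1 `tier_peak` >= 2, P1 median survival turn) can be tallied per seed roughly as follows. This is a minimal sketch: the `SeedResult` record and its field names are hypothetical, not the batch tool's actual output schema, and the example rows are made-up placeholders rather than results from either batch.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class SeedResult:
    """Hypothetical per-seed summary for P1; not the batch tool's real schema."""
    seed: int
    buildings_max: int  # highest `player_stats.buildings` value seen for P1
    tier_peak: int      # highest tech era researched by P1
    last_turn: int      # elimination turn, or the final turn if P1 survived


def summarize(results):
    """Tally the three checks from the analysis comment over one batch."""
    return {
        "seeds": len(results),
        "built_any": sum(r.buildings_max > 0 for r in results),
        "tier_peak_ge_2": sum(r.tier_peak >= 2 for r in results),
        "median_last_turn": median(r.last_turn for r in results),
    }


# Made-up placeholder rows, only to show the shape of the output.
example = [
    SeedResult(seed=1, buildings_max=0, tier_peak=3, last_turn=60),
    SeedResult(seed=2, buildings_max=0, tier_peak=2, last_turn=300),
    SeedResult(seed=3, buildings_max=0, tier_peak=4, last_turn=90),
]
summary = summarize(example)
print(summary)
```

With rows shaped like these, the research check passes in every seed while `built_any` stays at zero, which is the pattern the three facts describe: a research metric can move without the building lever having moved at all.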

### What the fresh batch actually revealed (more valuable than the patch)

Current-main P1 behaviour differs from the p1-29d baseline narrative:
- P1 reaches `tier_peak` 2-5 by **pure research** (techs 9-35); the old
  "P1 stuck at tier_peak=1" symptom is **already gone on current main**.
- P1 still loses its capital in 8/10 seeds (eliminated T44-100) with
  `kills=1-10`, `units_lost=1-4`: it fights but loses.
- Survivor seeds 5 and 9 (T300 and T272, 1 city): `mil=0`, `buildings=0`,
  `pop=17` and `pop=33`, `techs=31-35`. P1 researches to era 5 but builds
  **nothing material** for 250+ turns. Possible production stall worth its
  own investigation (snapshot-timing artifact vs genuine stall; unconfirmed).
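
The regimes in the list above can be separated mechanically once per-seed end states are available. The sketch below is illustrative only: the dict keys (`eliminated_turn`, `mil`, `buildings`) are hypothetical names, and the sample rows are placeholders, not the batch's data.

```python
from collections import Counter


def classify(seed_stats):
    """Label one seed's P1 outcome. `seed_stats` uses hypothetical keys."""
    if seed_stats["eliminated_turn"] is not None:
        return "eliminated"
    if seed_stats["mil"] == 0 and seed_stats["buildings"] == 0:
        # Survived, but no standing military and no buildings:
        # a candidate production stall that needs a closer look.
        return "survived_no_output"
    return "survived"


# Placeholder rows to show the tally, not real batch results.
rows = [
    {"eliminated_turn": 52, "mil": 0, "buildings": 0},
    {"eliminated_turn": None, "mil": 0, "buildings": 0},
    {"eliminated_turn": None, "mil": 3, "buildings": 2},
]
tally = Counter(classify(r) for r in rows)
print(tally)
```

A `survived_no_output` label only flags a seed for investigation; it cannot tell a snapshot-timing artifact from a genuine stall.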

### Why the patch did not demonstrably help

The break-out gate `own_mil >= SOLE_CITY_ECON_MIN_DEFENDERS (2)` requires the
sole-city AI to hold ≥2 standing non-founder units at decision time. P1's
`mil` snapshot is 0 at every recorded turn in 10/10 seeds (it fights with
very short-lived units that exist only between snapshots). Whether the gate
ever fired is unconfirmed (the engine emits no production-queue event to
detect it from batch artifacts). Either way the patch completed 0 buildings,
so it had no observable effect, and the `own_mil>=2` floor may be exactly the
wrong threshold for the weakest player.
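
A toy timeline shows why snapshots alone can neither confirm nor refute that the gate fired. This is a sketch under stated assumptions: the per-turn military counts and the snapshot cadence are invented for illustration, and `gate_open` only restates the condition quoted above, not the actual code in `production.rs`.

```python
SOLE_CITY_ECON_MIN_DEFENDERS = 2  # value quoted in the text above


def gate_open(own_mil):
    """The break-out condition as described: own_mil >= the defender floor."""
    return own_mil >= SOLE_CITY_ECON_MIN_DEFENDERS


def snapshot_view(mil_by_turn, snapshot_every):
    """Return the military counts visible at snapshot turns only."""
    return {t: m for t, m in mil_by_turn.items() if t % snapshot_every == 0}


# Hypothetical timeline: units exist on turns 3, 4 and 13 only,
# and are gone again before every snapshot turn.
mil_by_turn = {t: 0 for t in range(1, 21)}
mil_by_turn.update({3: 2, 4: 1, 13: 1})

seen = snapshot_view(mil_by_turn, snapshot_every=5)
gate_turns = [t for t, m in mil_by_turn.items() if gate_open(m)]

print(seen)        # what a batch artifact would record
print(gate_turns)  # turns on which the gate condition actually held
```

In this invented timeline every snapshot reads 0 although the condition held on one turn. The reverse case (the condition never holding) would produce identical snapshots, so a per-decision event would be needed to tell the two apart.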

### Honest status & next steps

- **Gate: NOT MET.** No metric movement attributable to this patch.
- The patch is gated to `sole_city_threatened` and fully unit-tested
  (265 mc-ai tests green), so it is safe in-tree, but **unvalidated**; the
  consuming p1-29c/29d worker should validate or revert it.
- **Remaining attribution step (deferred on host load):** a controlled
  before/after on the *same* fresh build (HEAD vs HEAD+patch, same 10 seeds)
  is the only clean way to attribute (or refute) any effect. Held while the
  host runs ≥20 concurrent `godot-bin` (host guard); to run when load drops:
  ```
  # baseline = revert the two production.rs edits, rebuild, run; then re-apply
  PARALLEL=3 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_before
  ```
- **Reframe for the next iteration:** the failure regime on current main is
  *survival with no military* and a possible *production stall*, not the old
  "military-spam, no economy". Re-baseline p1-29c/29d against current main
  before further patch work; the `tier_peak=1` symptom they target may already
  be resolved.
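
Once both batches exist, the controlled comparison described in the list above reduces to a per-seed difference on identical seeds. The following is a minimal sketch, assuming each batch has been summarized into a `seed -> {metric: value}` mapping; that layout and the sample numbers are hypothetical placeholders.

```python
def paired_deltas(before, after, metric):
    """Per-seed change in `metric` between two batches run on the same seeds.

    `before` and `after` map seed -> {metric_name: value} (hypothetical layout).
    Raises if the seed sets differ, since unpaired seeds cannot be attributed.
    """
    if before.keys() != after.keys():
        raise ValueError("batches must cover the same seeds")
    return {s: after[s][metric] - before[s][metric] for s in sorted(before)}


def seeds_improved(deltas):
    """Count seeds where the metric strictly increased."""
    return sum(d > 0 for d in deltas.values())


# Placeholder numbers, not measured results.
head = {1: {"buildings": 0}, 2: {"buildings": 0}, 3: {"buildings": 1}}
head_plus_patch = {1: {"buildings": 0}, 2: {"buildings": 2}, 3: {"buildings": 1}}

deltas = paired_deltas(head, head_plus_patch, "buildings")
print(deltas, seeds_improved(deltas))
```

Pairing by seed keeps main-branch drift out of the comparison, because both sides share every change except the patch itself.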