Scoring

Crucible scores each run on three axes: correctness (did the expectations pass?), reasoning quality (did the agent’s reasoning trail make sense?), and blast radius (how many effect categories did it touch?).

The weight given to each axis is tunable per scenario.
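
As an illustration of how per-scenario weights might combine the three axes, here is a minimal sketch. It is not Crucible's actual formula, which this section does not specify. It assumes each axis is normalized to the range 0 to 1, that a smaller blast radius is better, and that the final score is a weighted average. The names `Weights` and `combined_score`, the default weight values, and all parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical per-scenario weights; the defaults are illustrative only."""
    correctness: float = 0.5
    reasoning: float = 0.3
    blast_radius: float = 0.2


def combined_score(
    passed: int,
    total: int,
    reasoning_quality: float,
    categories_touched: int,
    total_categories: int,
    weights: Weights = Weights(),
) -> float:
    """Weighted average of the three axes, each assumed to lie in [0, 1]."""
    weight_sum = weights.correctness + weights.reasoning + weights.blast_radius
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    # Correctness: fraction of expectations that passed.
    correctness = passed / total if total else 0.0

    # Blast radius: assumed to be a penalty, so touching fewer effect
    # categories yields a higher containment value.
    if total_categories:
        containment = 1.0 - categories_touched / total_categories
    else:
        containment = 1.0

    weighted = (
        weights.correctness * correctness
        + weights.reasoning * reasoning_quality
        + weights.blast_radius * containment
    )
    return weighted / weight_sum


# A scenario that only cares about correctness could override the weights.
strict = Weights(correctness=1.0, reasoning=0.0, blast_radius=0.0)
print(combined_score(3, 4, 0.2, 5, 5, strict))
```

Dividing by the sum of the weights means a scenario does not need weights that add up to 1, so it can raise or lower a single axis without rebalancing the others.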