Scoring
Crucible scores each run on three axes: correctness (did the expectations pass?), reasoning quality (did the agent’s reasoning trail make sense?), and blast radius (how many effect categories did it touch?).
Weights are tunable per scenario.
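
As a rough illustration of how per-scenario weights might combine the three axes, here is a minimal sketch. It is not Crucible's actual implementation: the names `AxisScores`, `blast_radius_score`, and `combine` are hypothetical, and it assumes each axis is normalized to the range 0 to 1, that touching fewer effect categories is better, and that the final score is a weighted average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    """Hypothetical per-run scores, each assumed to be normalized to 0..1."""
    correctness: float        # e.g. fraction of expectations that passed
    reasoning_quality: float  # e.g. a rating of the reasoning trail
    blast_radius: float       # higher means fewer effect categories touched


def blast_radius_score(touched: int, total: int) -> float:
    """Map 'effect categories touched' to 0..1 (assumption: fewer is better)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Clamp so an out-of-range count cannot push the score below 0 or above 1.
    clamped = max(0, min(touched, total))
    return 1.0 - clamped / total


def combine(scores: AxisScores, weights: dict[str, float]) -> float:
    """Weighted average over the axes named in `weights` (assumed scheme)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(w * getattr(scores, axis) for axis, w in weights.items())
    return weighted / total_weight


# Hypothetical scenario that values correctness twice as much as the other axes.
scenario_weights = {"correctness": 2.0, "reasoning_quality": 1.0, "blast_radius": 1.0}
run = AxisScores(
    correctness=1.0,
    reasoning_quality=0.5,
    blast_radius=blast_radius_score(touched=1, total=4),
)
overall = combine(run, scenario_weights)
```

Dividing by the sum of the weights keeps the overall score in the same 0 to 1 range regardless of how a scenario scales its weights, so scores stay comparable across scenarios under these assumptions.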