AgentStateCrucible

Picking an agent for a job is a credentialing problem. Today the answer is vibes: a few demo runs, a leaderboard, a hunch. Crucible replaces vibes with side-by-side judgment over auditable runs: the same scenario, three agents, every decision a commit, and a separate judge agent scoring each run on correctness, reasoning, and blast radius.

Why this matters

Three moments where Crucible earns its keep

Bootstrap stage — the v0 plan lives in CRUCIBLE.md. Here's the shape of the framework.

Story 1 — The bake-off

"Which agent is actually better at this?"

Three coding agents are in the running for a production role. Each gets the same scenario — same starting state, same task, same policy. Crucible records every decision each agent makes as a commit on its own branch.

A judge agent scores the runs side-by-side on correctness, reasoning quality, and effect blast radius. The result is a defensible answer, not a hunch.

⚒ crucible run billing-refactor JUDGED

agent-a

In-place rewrite, broke 2 tests

judge: 0.51

agent-b ✓

Strangler-fig, ledger entries clean

judge: 0.89

agent-c

Over-refactored, blast radius +60%

judge: 0.42

Winner: agent-b — "Smallest blast radius, cleanest reasoning trail, no test regressions"
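As a rough illustration of how three judge axes could roll up into one score per agent, here is a minimal sketch. The `JudgeScore` class, the `WEIGHTS` values, and `pick_winner` are hypothetical names invented for this example; the page names the three axes but does not say how Crucible combines them.

```python
from dataclasses import dataclass

# Hypothetical weights: an assumption for illustration only.
WEIGHTS = {"correctness": 0.5, "reasoning": 0.3, "blast_radius": 0.2}


@dataclass(frozen=True)
class JudgeScore:
    agent: str
    correctness: float   # 0.0 to 1.0, higher is better
    reasoning: float     # 0.0 to 1.0, higher is better
    blast_radius: float  # 0.0 to 1.0, higher means a smaller footprint

    def overall(self) -> float:
        """Weighted sum of the three axes, rounded to two decimals."""
        total = (
            WEIGHTS["correctness"] * self.correctness
            + WEIGHTS["reasoning"] * self.reasoning
            + WEIGHTS["blast_radius"] * self.blast_radius
        )
        return round(total, 2)


def pick_winner(scores: list[JudgeScore]) -> JudgeScore:
    """Return the run with the highest weighted score."""
    return max(scores, key=lambda s: s.overall())
```

Keeping the per-axis values alongside the total means a reviewer can see why a run won, not only that it did.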

Story 2 — The regression check

"Did the new version actually regress?"

Before promoting a new agent version, replay last quarter's scenarios. Crucible diffs decision commits against the sealed epoch from the prior version.

Any divergence — different reasoning, different effect set, different outcome — is flagged for review. Same PASS status doesn't count as "no change" if the agent got there a different way.

📊 crucible diff --epoch q1-2026 v2.0 v2.1 DIVERGENCE

scenario 12    PASS     no change
scenario 23    PASS     no change
scenario 47    PASS*    reasoning 78%
scenario 51    PASS*    effects +2 cat
scenario 88    PASS     no change

2 silent divergences flagged for review
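The "same PASS, different path" check can be sketched as a field-by-field comparison of two runs. This is a minimal sketch under assumptions: `RunSummary` and `divergences` are hypothetical names, and reducing a run to a status, an ordered reasoning trail, and a set of effect categories is a simplification made for this example rather than Crucible's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    status: str                  # outcome, e.g. "PASS" or "FAIL"
    reasoning: tuple[str, ...]   # ordered rationale, one entry per decision commit
    effects: frozenset[str]      # effect categories the agent touched


def divergences(old: RunSummary, new: RunSummary) -> list[str]:
    """List every dimension on which the new run differs from the sealed one."""
    found = []
    if old.status != new.status:
        found.append("outcome")
    if old.reasoning != new.reasoning:
        found.append("reasoning")
    if old.effects != new.effects:
        found.append("effects")
    return found
```

A run whose status is unchanged but whose list is non-empty corresponds to the starred PASS* rows: it would be flagged for review even though the test still passes.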

Story 3 — The audit

"How do you know this agent works?"

A regulator, stakeholder, or auditor asks the question. With a traditional test harness, you hand them a coverage report and a Slack thread.

With Crucible: hand them a sealed epoch. Tamper-evident hash chain, every decision, every alternative considered, every judge ruling, every policy evaluation. Independently verifiable, no trust required in the harness.

🔒 epoch crucible-2026-05-30 SEALED

scenarios: 120

agents: agent-a, agent-b, agent-c

judge: opus-judge-v2

decisions: 4,712

policy_evals: 1,084

root_hash: ab3f…91

signed: ed25519 · craig@

VERIFIED · independently replayable
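To show what "tamper-evident hash chain" means in practice, here is a minimal sketch using only the Python standard library. The choice of SHA-256, the JSON serialization, and the names `seal`, `verify`, and `GENESIS` are assumptions for illustration; the page does not specify Crucible's hashing scheme, and the ed25519 signature over the root hash is left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain


def _digest(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of everything before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def seal(records: list[dict]) -> tuple[list[str], str]:
    """Return the per-record hashes and the final root hash."""
    hashes = []
    prev = GENESIS
    for record in records:
        prev = _digest(prev, record)
        hashes.append(prev)
    return hashes, prev


def verify(records: list[dict], root_hash: str) -> bool:
    """Recompute the chain from scratch and compare against the sealed root."""
    return seal(records)[1] == root_hash
```

Because each hash folds in the previous one, editing, dropping, or reordering any record changes the root, so an auditor holding only the records and the root hash can recheck the chain without relying on the harness that produced it.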

How it works

Nine primitives, one substrate

Crucible is built on AgentStateGraph — plans, policies, tasks, decision commits, blame, and sealed epochs are not invented here; they are the same primitives your agents already use.

scenario

What's the test?

plan

How will the agent attempt it?

task

What's the unit of work?

policy

What's allowed?

decision_commit

What did the agent decide?

blame

Who decided what, when?

sealed_epoch

Can it be audited?

judge_score

How well did it go?

divergence

Did this version drift?

Integrations

Plug-in surfaces

Python (core) · pytest harness · MCP integration · Built on AgentStateGraph · CTXone backing store · Judge agents (LLM-pluggable) · CLI · JSONL run logs

Get Started

Run a bake-off in 60 seconds

terminal

What it is not

Clear on scope

Crucible is an agent judgment framework — it asks how an agent reasoned, not just whether the test passed.

NOT a benchmark suite

Benchmarks score on someone else's problems with frozen tests. Crucible runs your scenarios against your agents, with full decision trails.

NOT a test runner

pytest checks code. Crucible judges agent behavior — reasoning, alternatives considered, effects taken — side-by-side against peers.

License

Why BSL 1.1?

AgentStateCrucible is built to become infrastructure for agent validation. Infrastructure primitives are strip-mining targets: cloud providers offer them as managed services, capture the value, and contribute nothing back. BSL 1.1 closes that gap.

✓ Individuals, startups, and enterprises using Crucible internally — including in production validation pipelines — are unaffected.

⚠ The restriction covers one specific case: offering Crucible as a hosted validation service to third parties without a commercial agreement.

↺ After four years, each version converts to Apache 2.0 permanently, with no conditions.

Part of a family

Built on AgentStateGraph

Crucible reuses ASG primitives. AgentStateGraph is the substrate, CTXone is the backing store, AgentStateDeveloper supplies the code-level ledger, AgentStateRouter supplies the routing.

Visit agentstategraph.dev