bootstrap · BSL 1.1

AgentStateCrucible

Agent testing and validation framework — same scenario, multiple agents, judged side-by-side. Every decision a commit, every run a sealed epoch.

Built on ASG primitives · Decision commits · Sealed epochs · Judge agents · Auditable runs


Picking an agent for a job is a credentialing problem. Today the answer is vibes: a few demo runs, a leaderboard, a hunch. Crucible replaces vibes with side-by-side judgment over auditable runs: the same scenario, three agents, every decision a commit, and a separate judge agent scoring each run on correctness, reasoning, and blast radius.

Three moments Crucible earns its keep

Bootstrap stage — the v0 plan lives in CRUCIBLE.md. Here's the shape of the framework.

Story 1 — The bake-off

"Which agent is actually better at this?"

Three coding agents are in the running for a production role. Each gets the same scenario — same starting state, same task, same policy. Crucible records every decision each agent makes as a commit on its own branch.

A judge agent scores the runs side-by-side on correctness, reasoning quality, and effect blast radius. The result is a defensible answer, not a hunch.

crucible run billing-refactor   JUDGED
  agent-a     In-place rewrite, broke 2 tests        judge: 0.51
  agent-b ✓   Strangler-fig, ledger entries clean    judge: 0.89
  agent-c     Over-refactored, blast radius +60%     judge: 0.42
Winner: agent-b · "Smallest blast radius, cleanest reasoning trail, no test regressions"

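A minimal Python sketch of the bake-off shape described above: one run per agent, decisions appended as commits, and the winner chosen by judge score. The class names, field names, and branch names are hypothetical illustrations, not Crucible's actual API or schema; the scores and notes are copied from the billing-refactor card.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionCommit:
    """One recorded agent decision (hypothetical shape, not Crucible's schema)."""
    agent: str
    summary: str


@dataclass
class JudgedRun:
    """Everything one agent did on a scenario, plus the judge's ruling."""
    agent: str
    branch: str  # hypothetical branch name holding this agent's commits
    commits: list[DecisionCommit] = field(default_factory=list)
    judge_score: float = 0.0
    note: str = ""

    def record(self, summary: str) -> None:
        """Append a decision to this agent's own branch of commits."""
        self.commits.append(DecisionCommit(self.agent, summary))


def pick_winner(runs: list[JudgedRun]) -> JudgedRun:
    """Return the run with the highest judge score."""
    if not runs:
        raise ValueError("a bake-off needs at least one run")
    return max(runs, key=lambda run: run.judge_score)


# Scores and notes taken from the billing-refactor card.
runs = [
    JudgedRun("agent-a", "runs/agent-a", judge_score=0.51,
              note="In-place rewrite, broke 2 tests"),
    JudgedRun("agent-b", "runs/agent-b", judge_score=0.89,
              note="Strangler-fig, ledger entries clean"),
    JudgedRun("agent-c", "runs/agent-c", judge_score=0.42,
              note="Over-refactored, blast radius +60%"),
]

winner = pick_winner(runs)
print(f"Winner: {winner.agent} ({winner.judge_score:.2f})")  # Winner: agent-b (0.89)
```

In this sketch the judge score is a single number supplied from outside; how a real judge agent combines correctness, reasoning quality, and blast radius into that number is not specified on this page.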
Story 2 — The regression check

"Did the regression actually regress?"

Before promoting a new agent version, replay last quarter's scenarios. Crucible diffs decision commits against the sealed epoch from the prior version.

Any divergence — different reasoning, different effect set, different outcome — is flagged for review. Same PASS status doesn't count as "no change" if the agent got there a different way.

📊 crucible diff --epoch q1-2026 v2.0 v2.1   DIVERGENCE
  scenario 12   PASS    no change
  scenario 23   PASS    no change
  scenario 47   PASS*   reasoning 78%
  scenario 51   PASS*   effects +2 cat
  scenario 88   PASS    no change
2 silent divergences flagged for review

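The divergence rule above can be sketched in a few lines of Python: two results with the same PASS status still count as divergent if the reasoning or the effect set changed. Everything here is an assumption made for illustration: the `ScenarioResult` shape, the 0.9 similarity threshold, and the use of `difflib.SequenceMatcher` as a stand-in for however Crucible actually compares reasoning.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class ScenarioResult:
    """Outcome of one scenario for one agent version (hypothetical shape)."""
    status: str              # e.g. "PASS" or "FAIL"
    reasoning: str           # the agent's recorded reasoning trail
    effects: frozenset[str]  # categories of side effects the agent took


def divergences(old: ScenarioResult, new: ScenarioResult,
                min_similarity: float = 0.9) -> list[str]:
    """List the ways `new` differs from `old`; an empty list means no change."""
    found = []
    if old.status != new.status:
        found.append(f"outcome {old.status} -> {new.status}")
    # Text similarity in [0, 1]; a stand-in for a real reasoning comparison.
    similarity = SequenceMatcher(None, old.reasoning, new.reasoning).ratio()
    if similarity < min_similarity:
        found.append(f"reasoning {similarity:.0%}")
    added = new.effects - old.effects
    removed = old.effects - new.effects
    if added or removed:
        found.append(f"effects +{len(added)} -{len(removed)}")
    return found


trail = "retry the failed charge, then refund"
baseline = ScenarioResult("PASS", trail, frozenset({"db.write"}))
same = ScenarioResult("PASS", trail, frozenset({"db.write"}))
drifted = ScenarioResult("PASS", trail,
                         frozenset({"db.write", "email.send", "queue.publish"}))

print(divergences(baseline, same))     # []
print(divergences(baseline, drifted))  # ['effects +2 -0']
```

The second call is the "silent divergence" case: the status is still PASS, but the new version touched two effect categories the old one did not.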
Story 3 — The audit

"How do you know this agent works?"

A regulator, stakeholder, or auditor asks the question. With a traditional test harness, you hand them a coverage report and a Slack thread.

With Crucible: hand them a sealed epoch. Tamper-evident hash chain, every decision, every alternative considered, every judge ruling, every policy evaluation. Independently verifiable, no trust required in the harness.

🔒 epoch crucible-2026-05-30 SEALED
scenarios: 120
agents: agent-a, agent-b, agent-c
judge: opus-judge-v2
decisions: 4,712
policy_evals: 1,084
root_hash: ab3f…91
signed: ed25519 · craig@
VERIFIED · independently replayable
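To make "tamper-evident hash chain" concrete, here is a minimal Python sketch, assuming each record is hashed together with the hash of the record before it and the final hash serves as the root. The record fields, the all-zero genesis value, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the page does not specify Crucible's actual encoding, and the ed25519 signing step shown in the card is left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder parent hash for the first record


def record_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the record before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def seal(records: list[dict]) -> list[str]:
    """Return the chain of hashes; the last entry acts as the root hash."""
    hashes, prev = [], GENESIS
    for record in records:
        prev = record_hash(prev, record)
        hashes.append(prev)
    return hashes


def verify(records: list[dict], root_hash: str) -> bool:
    """Recompute the chain from the records and compare against the root."""
    chain = seal(records)
    return bool(chain) and chain[-1] == root_hash


# Hypothetical decision records for one agent.
records = [
    {"agent": "agent-b", "decision": "introduce strangler-fig facade"},
    {"agent": "agent-b", "decision": "migrate ledger writes"},
]
root = seal(records)[-1]
print(verify(records, root))  # True

records[0]["decision"] = "rewrite in place"  # tamper with history
print(verify(records, root))  # False
```

Because every hash depends on all earlier records, changing any one decision changes the root, which is what lets a third party check an epoch without trusting the harness that produced it.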

Nine primitives, one substrate

Crucible is built on AgentStateGraph — plans, policies, tasks, decision commits, blame, and sealed epochs are not invented here; they are the same primitives your agents already use.

scenario          What's the test?
plan              How will the agent attempt it?
task              What's the unit of work?
policy            What's allowed?
decision_commit   What did the agent decide?
blame             Who decided what, when?
sealed_epoch      Can it be audited?
judge_score       How well did it go?
divergence        Did this version drift?
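As one illustration of how these primitives relate, the sketch below derives a blame view (who decided what, when) from a list of decision commits stored as JSON lines. The log format, field names, and sample entries are hypothetical; the page mentions JSONL run logs but does not document their schema.

```python
import json
from collections import defaultdict

# Hypothetical JSONL run log: one decision commit per line.
RUN_LOG = """\
{"agent": "agent-a", "task": "refactor-billing", "decision": "rewrite in place", "at": "2026-05-30T10:00:00Z"}
{"agent": "agent-b", "task": "refactor-billing", "decision": "add facade", "at": "2026-05-30T10:00:05Z"}
{"agent": "agent-b", "task": "refactor-billing", "decision": "migrate ledger writes", "at": "2026-05-30T10:02:41Z"}
"""


def blame(log_text: str) -> dict[str, list[tuple[str, str]]]:
    """Group decisions by agent as (timestamp, decision) pairs, oldest first."""
    by_agent = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the log
        commit = json.loads(line)
        by_agent[commit["agent"]].append((commit["at"], commit["decision"]))
    # ISO 8601 timestamps in the same zone sort correctly as plain strings.
    return {agent: sorted(entries) for agent, entries in by_agent.items()}


for agent, entries in blame(RUN_LOG).items():
    for at, decision in entries:
        print(f"{at}  {agent}  {decision}")
```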

Plug-in surfaces

Python (core) · pytest harness · MCP integration · Built on AgentStateGraph · CTXone backing store · Judge agents (LLM-pluggable) · CLI · JSONL run logs

Run a bake-off in 60 seconds


Clear on scope

Crucible is an agent judgment framework — it asks how an agent reasoned, not just whether the test passed.

NOT a benchmark suite
Benchmarks score on someone else's problems with frozen tests. Crucible runs your scenarios against your agents, with full decision trails.
NOT a test runner
pytest checks code. Crucible judges agent behavior — reasoning, alternatives considered, effects taken — side-by-side against peers.

Why BSL 1.1?

AgentStateCrucible is built to become infrastructure for agent validation. Infrastructure primitives are strip-mining targets: cloud providers offer them as managed services, capture the value, and contribute nothing back. BSL 1.1 closes that gap.

Individuals, startups, and enterprises using Crucible internally — including in production validation pipelines — are unaffected.
The restriction covers one specific case: offering Crucible as a hosted validation service to third parties without a commercial agreement.
After four years, each version converts to Apache 2.0 permanently, with no conditions.

Built on AgentStateGraph

Crucible reuses ASG primitives: AgentStateGraph is the substrate, CTXone is the backing store, AgentStateDeveloper supplies the code-level ledger, and AgentStateRouter supplies the routing.

Visit agentstategraph.dev · GitLab