Agent testing and validation framework — same scenario, multiple agents, judged side-by-side. Every decision a commit, every run a sealed epoch.
Built on ASG primitives · Decision commits · Sealed epochs · Judge agents · Auditable runs
Picking an agent for a job is a credentialing problem. Today the answer is vibes — a few demo runs, a leaderboard, a hunch. Crucible replaces vibes with side-by-side judgment over auditable runs: the same scenario, three agents, every decision a commit, a third judge agent scoring on correctness, reasoning, and blast radius.
Why this matters
Bootstrap stage — the v0 plan lives in CRUCIBLE.md. Here's the shape of the framework.
Three coding agents are in the running for a production role. Each gets the same scenario — same starting state, same task, same policy. Crucible records every decision each agent makes as a commit on its own branch.
A judge agent scores the runs side-by-side on correctness, reasoning quality, and effect blast radius. The result is a defensible answer, not a hunch.
Before promoting a new agent version, replay last quarter's scenarios. Crucible diffs decision commits against the sealed epoch from the prior version.
Any divergence — different reasoning, different effect set, different outcome — is flagged for review. Same PASS status doesn't count as "no change" if the agent got there a different way.
A regulator, stakeholder, or auditor asks the question. With a traditional test harness, you hand them a coverage report and a Slack thread.
With Crucible: hand them a sealed epoch. Tamper-evident hash chain, every decision, every alternative considered, every judge ruling, every policy evaluation. Independently verifiable, no trust required in the harness.
How it works
Crucible is built on AgentStateGraph — plans, policies, tasks, decision commits, blame, and sealed epochs are not invented here; they are the same primitives your agents already use.
Integrations
Get Started
What it is not
Crucible is an agent judgment framework — it asks how an agent reasoned, not just whether the test passed.
License
AgentStateCrucible is built to become infrastructure for agent validation. Infrastructure primitives are strip-mining targets: cloud providers offer them as managed services, capture the value, and contribute nothing back. BSL 1.1 closes that gap.
Part of a family
Crucible reuses ASG primitives. AgentStateGraph is the substrate, CTXone is the backing store, AgentStateDeveloper supplies the code-level ledger, AgentStateRouter supplies the routing.
Visit agentstategraph.dev GitLab