Introduction

AgentStateCrucible is an agent testing and validation framework built on AgentStateGraph (ASG) primitives: plans, policies, tasks, decision commits, blame, and sealed epochs.

Crucible runs the same scenario against multiple candidate agents, records every decision in an auditable graph, and has a separate judge agent score the runs side by side.

Picking an agent for a job is a credentialing problem. Today the answer is vibes — a few demo runs, a leaderboard scraped from someone else’s benchmark, a hunch. Crucible replaces vibes with side-by-side judgment over auditable runs:

  • Same scenario. Same starting state, same task, same policy. Every candidate agent gets the same inputs.
  • Every decision a commit. ASG decision commits capture intent, reasoning, confidence, alternatives, and authority. No “trust me, it worked.”
  • A third judge agent. An LLM judge, separate from the candidates, scores each run on correctness, reasoning quality, and the blast radius of its effects. The judge is pluggable: bring your own.
  • A sealed epoch. The entire run is bundled into a tamper-evident Merkle-rooted epoch. Hand it to an auditor and they can verify it without trusting the harness.

Related projects:

  • CTXone: the underlying state graph store, used as the backing store
  • AgentStateDeveloper: code-level ledger and effects
  • AgentStateRouter: agent routing
  • AgentStateGraph demo: the demo that gave rise to the validation concept
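
To make the "every decision a commit" and "sealed epoch" points above concrete, here is a minimal Python sketch of how a run's decision commits could be folded into a single Merkle root that an auditor recomputes independently. Everything named here is an assumption for illustration: `DecisionCommit`, `seal_epoch`, and `verify_epoch` are hypothetical, the fields simply mirror the bullet (intent, reasoning, confidence, alternatives, authority), and the hashing layout is one common choice rather than the actual ASG or Crucible format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DecisionCommit:
    """Hypothetical record; fields mirror the bullet list, not the real ASG schema."""
    intent: str
    reasoning: str
    confidence: float
    alternatives: tuple[str, ...]
    authority: str


def leaf_hash(commit: DecisionCommit) -> bytes:
    # Canonical JSON (sorted keys) so equal commits always hash identically.
    payload = json.dumps(asdict(commit), sort_keys=True).encode("utf-8")
    # The 0x00 prefix keeps leaf hashes distinct from interior-node hashes.
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        raise ValueError("an epoch needs at least one commit")
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        # Hash adjacent pairs; the 0x01 prefix marks an interior node.
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired node is carried up unchanged
        level = nxt
    return level[0]


def seal_epoch(commits: list[DecisionCommit]) -> str:
    """Return the hex Merkle root over the commits, in order."""
    return merkle_root([leaf_hash(c) for c in commits]).hex()


def verify_epoch(commits: list[DecisionCommit], expected_root: str) -> bool:
    """Recompute the root from the commits alone; no trust in the harness needed."""
    return seal_epoch(commits) == expected_root


if __name__ == "__main__":
    run = [
        DecisionCommit("read config", "task needs current settings", 0.9,
                       ("ask operator",), "policy:read-only"),
        DecisionCommit("patch config", "setting is out of range", 0.7,
                       ("leave as-is", "escalate"), "policy:write-config"),
    ]
    root = seal_epoch(run)
    print("epoch root:", root)
    # Changing any field of any commit changes the root.
    tampered = [replace(run[0], confidence=0.1), run[1]]
    print("tampered run verifies:", verify_epoch(tampered, root))
```

Because the root depends on every field of every commit and on their order, editing, dropping, or reordering a decision after the fact produces a different root. That is the property "tamper-evident" refers to in this sketch; a real sealed epoch would presumably also cover the scenario, policy, and judge output.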

Status: bootstrap. See CRUCIBLE.md in the repo for the v0 plan.