audit_trace_budget

Verify explicit reasoning steps before they become patches

What it does

audit_trace_budget verifies a structured trace of reasoning steps. Each step is a short claim with citations. The tool scores each step and flags the ones that are not supported by the cited evidence.

Use it when the output is a plan, decision log, root-cause analysis, or any reasoning you want to validate before code changes.

Inputs

  • steps (array): [{ claim: "...", cites: ["S0"] }]
  • spans (array): evidence spans [{ sid: "S0", text: "..." }]
  • verifier_model (string, default gpt-4o-mini)
  • default_target (float, default 0.95)
  • require_citations (bool, default false)
  • context_mode (string, default "cited"; accepts "all")
  • timeout_s (float, default 60.0)

Each step can optionally include confidence to override the default target for that specific claim.
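For instance, a steps payload that tightens the target for one claim might be shaped like this (field names follow the schema above; the claims and span texts are placeholders):

```python
# Hypothetical steps payload: the second step carries its own confidence,
# overriding default_target for that claim only.
steps = [
    {"claim": "The config loads JWT_ISSUER at startup", "cites": ["S0"]},
    {"claim": "Auth rejects tokens from other issuers", "cites": ["S0", "S1"],
     "confidence": 0.99},  # stricter per-step target
]

spans = [
    {"sid": "S0", "text": "..."},
    {"sid": "S1", "text": "..."},
]
```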

How it works

  1. Use explicit steps. You provide the claims and their citations, rather than letting the tool split sentences.
  2. Select context. If context_mode is "cited", only cited spans are considered for each step.
  3. Score each step. The tool computes whether each claim is supported by the evidence and reports a budget gap when it is not.
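The context-selection step (step 2) can be sketched as follows. This is an illustration of the described behavior, not the tool's internals; `select_context` is a hypothetical helper:

```python
def select_context(step, spans, context_mode="cited"):
    """Return the evidence spans visible to the verifier for one step.

    In "cited" mode, only spans whose sid appears in step["cites"] are used;
    in "all" mode, every span is used.
    """
    if context_mode == "all":
        return spans
    cited = set(step.get("cites", []))
    return [s for s in spans if s["sid"] in cited]

spans = [{"sid": "S0", "text": "issuer check"}, {"sid": "S1", "text": "unrelated"}]
step = {"claim": "Auth validates the issuer", "cites": ["S0"]}

print([s["sid"] for s in select_context(step, spans)])         # ['S0']
print([s["sid"] for s in select_context(step, spans, "all")])  # ['S0', 'S1']
```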

Verifier behavior

  • Uses strict textual entailment, not world knowledge
  • Only declarative assertions in the context can support a claim
  • Questions or instructions do not count as evidence

Outputs

The response includes:

  • flagged: whether any step failed verification
  • under_budget: mirrors flagged for this tool
  • summary: counts + verifier metadata
  • details: one entry per step with budgets and flags

Example response:
{
  "flagged": false,
  "under_budget": false,
  "summary": {
    "steps_scored": 4,
    "flagged_steps": 0,
    "units": "bits",
    "verifier_model": "gpt-4o-mini",
    "backend": "openai"
  },
  "details": [
    {
      "idx": 0,
      "claim": "...",
      "cites": ["S0"],
      "flagged": false,
      "required": { "min": 4.2, "max": 6.1, "units": "bits" },
      "observed": { "min": 5.0, "max": 7.2, "units": "bits" },
      "budget_gap": { "min": -1.0, "max": -0.2, "units": "bits" }
    }
  ]
}

How to read the report

  • required: evidence budget needed to hit the target
  • observed: evidence budget actually observed
  • budget_gap: positive means the step is under-supported; zero or negative means it has enough support
  • missing_citations: set when require_citations=true and a step has no cites
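These fields are easy to consume programmatically. A sketch that collects the flagged steps from a response, assuming a report dict shaped like the example output above:

```python
def flagged_steps(report):
    """Return (idx, claim, worst-case gap) for each step the verifier flagged."""
    return [
        (d["idx"], d["claim"], d["budget_gap"]["max"])
        for d in report.get("details", [])
        if d["flagged"]
    ]

# Illustrative report with one supported and one under-supported step.
report = {
    "flagged": True,
    "details": [
        {"idx": 0, "claim": "ok step", "flagged": False,
         "budget_gap": {"min": -1.0, "max": -0.2, "units": "bits"}},
        {"idx": 1, "claim": "weak step", "flagged": True,
         "budget_gap": {"min": 0.8, "max": 1.5, "units": "bits"}},
    ],
}
print(flagged_steps(report))  # [(1, 'weak step', 1.5)]
```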

Recommended settings

  • For merge gates: default_target=0.95 + require_citations=true
  • For exploratory planning: default_target=0.90
  • For strict evidence only: context_mode="cited"
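As a sketch, a merge gate built on these settings might look like the following. `gate_on_trace` is hypothetical, and `call_tool` stands in for however your client invokes audit_trace_budget:

```python
def gate_on_trace(call_tool, steps, spans):
    """Fail a merge when any reasoning step is unsupported or uncited."""
    report = call_tool(
        steps=steps,
        spans=spans,
        default_target=0.95,     # merge-gate strictness
        require_citations=True,  # uncited steps get flagged
        context_mode="cited",    # strict evidence only
    )
    return not report["flagged"]

# A stub standing in for the real tool call, for illustration only.
fake_tool = lambda **kwargs: {"flagged": False}
print(gate_on_trace(fake_tool, steps=[], spans=[]))  # True
```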

Operational requirements

  • OPENAI_API_KEY is required for authentication
  • BERRY_SERVICE_URL can override the default service endpoint
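A small pre-flight check for these variables, as a sketch (variable names are from this section; the default endpoint is not specified here, so an unset override simply returns None):

```python
import os

def preflight():
    """Verify required environment before calling the tool."""
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required for authentication")
    # Optional: BERRY_SERVICE_URL overrides the default service endpoint.
    return os.environ.get("BERRY_SERVICE_URL")  # None -> use the default
```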

Example call

audit_trace_budget(
  steps=[
    { claim: "Auth validates issuer via JWT_ISSUER", cites: ["S0"] }
  ],
  spans=[{ sid: "S0", text: "..." }],
  require_citations=true,
  context_mode="cited",
  default_target=0.95
)

When to use

  • Plans, decision traces, RCA reports, or reasoning steps
  • Any workflow where you want to verify reasoning before editing files

When not to use

  • Final prose answers (use detect_hallucination)
  • Outputs with no citations (unless you want them flagged)