About Berry

A verification-only MCP server

Berry is three things: verification-only, evidence-required, and minimal. It exposes exactly two verification tools. It doesn't generate code, fetch evidence, or write files. It checks whether claims are supported by evidence you provide.

This page explains why Berry exists, what it does, and what it doesn't do.

Why verification

Here's the core bet: you can't prompt LLMs into being reliable. You must enforce verification at the tool boundary.

AI coding assistants are confident. Too confident. They'll tell you exactly how your codebase works—even when they're wrong. They'll generate "obviously correct" fixes that introduce subtle bugs. They'll cite functions that don't exist and patterns your repo doesn't use.

The industry response has been more prompting. System prompts that say "be careful." Fine-tuning for calibration. Guardrails that filter outputs after the fact.

None of it works reliably. The model will still hallucinate. It will still invent API calls, cite nonexistent documentation, and confidently describe code paths that aren't there.

The only reliable solution is to require evidence. If a claim isn't backed by a span of text you can point to, it gets flagged. The model then either finds evidence, narrows the claim, or admits it doesn't know.

Two tools

Berry exposes two verification tools:

detect_hallucination
Takes an answer with citations like [S1] and checks whether each sentence is supported by the cited evidence. Returns a per-claim breakdown with confidence scores and flags. Use this for Q&A, documentation, and any task where the output is text with claims.
audit_trace_budget
Takes a structured trace of reasoning steps—each step is a claim plus citations—and verifies each step has sufficient evidence. Use this when you want to catch "almost right" reasoning before it becomes a confident patch. Good for refactoring, bug fixes, and migrations.
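The trace audit_trace_budget expects can be sketched as plain data: each step pairs a claim with the span ids it cites. The field names below are illustrative assumptions, not Berry's actual schema, but they show the shape and a cheap pre-check you can run before any verification call.

```python
# Hypothetical trace shape: each reasoning step is a claim plus the
# span ids backing it. Field names are illustrative, not Berry's schema.
trace = [
    {"claim": "The bug is in the retry loop", "citations": ["S1"]},
    {"claim": "Timeouts default to 30s", "citations": ["S2", "S3"]},
]

def uncited_steps(trace):
    """Indices of steps with no citations at all.

    These can be rejected locally, before the trace is ever sent to
    the verifier: a step that cites nothing has nothing to audit.
    """
    return [i for i, step in enumerate(trace) if not step["citations"]]
```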

Both tools require you to provide evidence spans. Berry doesn't fetch evidence—you do. This is deliberate: the evidence you provide is the evidence you trust.

Berry also ships evidence-collection tools: add_span, add_file_span, and distill_span, plus helpers to search and filter what you've gathered.
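For detect_hallucination, the input is an answer whose sentences carry [S#] citations, plus the spans those ids refer to. One local sanity check worth running before the call: every citation must point at a span you actually collected. The helper below is an illustrative sketch of that check, not part of Berry's API.

```python
import re

def missing_citations(answer: str, spans: dict[str, str]) -> list[str]:
    """Return citation ids used in `answer` that have no matching span.

    Catches the trivial failure mode locally: an answer citing [S3]
    when only S1 and S2 were ever collected.
    """
    cited = re.findall(r"\[(S\d+)\]", answer)
    return [c for c in cited if c not in spans]

spans = {
    "S1": "def retry(n=3): ...  # retries a request up to n times",
    "S2": "Default timeout is 30 seconds (see config).",
}
answer = "The client retries up to 3 times [S1]. The timeout defaults to 30s [S2]."
```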

How it works

The verification flow:

  1. Collect evidence as spans (code, docs, logs, API responses)
  2. The assistant produces an answer with [S#] citations
  3. Call detect_hallucination or audit_trace_budget
  4. Berry scores each claim against the cited evidence
  5. Flagged claims get revised or downgraded to assumptions
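Step 5 is mechanical once the verdicts come back: supported sentences stand, flagged ones are rewritten as explicit assumptions. A minimal sketch, assuming a verdict map keyed by sentence index (Berry's real output includes per-claim confidence scores and flags):

```python
def apply_verdicts(sentences: list[str], supported: dict[int, bool]) -> list[str]:
    """Downgrade flagged sentences to labeled assumptions.

    `supported` maps sentence index -> True if the cited evidence
    backs the claim. Anything unverified is marked, not silently kept.
    """
    return [
        s if supported.get(i, False) else f"Assumption (unverified): {s}"
        for i, s in enumerate(sentences)
    ]
```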

The key insight: citations must actually support the claim, not just exist. Berry uses an information-theoretic approach to measure whether the evidence provides enough support.

A claim that's "probably true anyway" (high prior) needs less evidence. A claim that's surprising or specific needs more. Berry measures this gap and flags claims where the evidence doesn't provide enough lift.
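The idea can be sketched in a few lines. Treat "lift" as how many bits the cited evidence moves the claim's probability, and scale the required lift with the claim's surprisal: a probably-true claim needs little, a surprising one needs a lot. The function names and the exact budget rule here are my assumptions for illustration, not Berry's actual formula.

```python
import math

def needs_flag(p_prior: float, p_posterior: float, budget: float = 0.5) -> bool:
    """Flag a claim whose evidence provides too little lift.

    lift = log2(p_posterior) - log2(p_prior): bits of support the
    evidence adds. The required lift scales with the claim's surprisal
    (-log2 of the prior), so surprising claims demand more evidence.
    Illustrative sketch only; not Berry's real scoring rule.
    """
    lift = math.log2(p_posterior) - math.log2(p_prior)
    required = budget * -math.log2(p_prior)  # more surprising => bigger budget
    return lift < required
```

With this rule, a high-prior claim with decent evidence passes, a surprising claim with weak evidence gets flagged, and the same surprising claim passes once the evidence is strong.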

This catches:

  • Confident vibes with weak support
  • Citation laundering (citations that don't actually support the claim)
  • "Almost right" summaries where one detail is invented

What Berry is not

Not retrieval. Berry doesn't search your codebase or fetch docs. Your agent does that. Berry checks whether the claims are supported by the evidence you collected.

Not magic. If you don't provide evidence, there's nothing to verify. Berry is a gate, not a generator.

Not a prompt. Berry is an MCP server. It runs locally. Your IDE (Cursor, Claude Code, Codex, Gemini) talks to it. The verification happens in the tool call, not in the system prompt.

Not a framework. Berry doesn't orchestrate agents or manage conversations. It's a single checkpoint: are the claims supported? That's it.

Honest positioning

I'm not going to claim Berry eliminates all hallucinations. It doesn't. The verification model itself is an LLM, which means it can make mistakes.

What Berry does well:

  • Forces the assistant to cite evidence for factual claims
  • Catches citation laundering
  • Catches confident vibes with weak support
  • Catches invented details buried in otherwise-correct answers
  • Makes "I don't know" the default when evidence is missing

What Berry doesn't do:

  • Guarantee correctness (verification can be wrong)
  • Replace human review (it's a filter, not a guarantee)
  • Work without evidence (no spans = nothing to verify)
  • Catch errors in the evidence itself (garbage in, garbage out)

The goal is to shift the failure mode. Without Berry, the assistant says "Yes, definitely" when it should say "I don't know." With Berry, unsupported claims get flagged, and the assistant has to either find evidence or admit uncertainty.

This is a meaningful improvement. It's not a silver bullet.

Architecture

Berry is a local MCP server written in Python. It runs on your machine, in your repo, scoped to your project. Requires Python 3.10+.

When you run berry init, it writes config files for your MCP client. The client spawns the Berry server when you open the project.
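The generated config follows the usual MCP client shape: an entry naming the command the client should spawn. The exact command, arguments, and key names below are assumptions for illustration; berry init writes the real values for your client.

```json
{
  "mcpServers": {
    "berry": {
      "command": "berry",
      "args": ["serve"]
    }
  }
}
```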

Verification calls go to the OpenAI API (or any OpenAI-compatible endpoint). Berry sends:

  • Your claims (the answer or trace being verified)
  • Your evidence spans (the text you provided)

Treat spans like any LLM input: don't paste secrets, keep them minimal, redact sensitive data.

Berry stores run state in ~/.berry/runs/. Each run has a problem statement, deliverable, and collected spans. Runs persist across sessions so you can resume verification work.

Global config lives at ~/.berry/config.json. Safety defaults: allow_write: false, allow_exec: false. These defaults are reserved for future tool expansions.

Roadmap

Berry is verification-first. The roadmap prioritizes making verification better before adding other capabilities.

Shipped

  • Evidence-based runs with problem/deliverable anchors
  • Span collection (add_span, add_file_span, distill_span)
  • Hallucination detection (detect_hallucination)
  • Trace budget auditing (audit_trace_budget)
  • Setup ergonomics (berry init, berry integrate)
  • Workflow playbooks (Search & Learn, Boilerplate, RCA Fix, etc.)
  • Client integration (Cursor, Claude Code, Codex, Gemini)

Next

  • Evidence authenticity (HMAC-signed spans so only server-minted spans are citeable)
  • Export tools (bundle runs for debugging and compliance)

Later

  • Safe command capture (allowlisted runners storing output as trusted spans)
  • Policy packs (per-repo guardrails enforced server-side)
  • CI mode (headless verifier blocking merges on under-evidenced PR claims)

Why "Berry"?

The verification engine is called Strawberry. Berry is the server that wraps it. The name stuck.

If your assistant can't point to evidence, Berry makes it say:

"I don't know" instead of "Yes, definitely."

That's the whole point.