A verification-only MCP server
Berry is three things: verification-only, evidence-required, and minimal. It exposes exactly two verification tools. It doesn't generate code, fetch evidence, or write files. It checks whether claims are supported by evidence you provide.
This page explains why Berry exists, what it does, and what it doesn't do.
Here's the core bet: you can't prompt LLMs into being reliable. You must enforce verification at the tool boundary.
AI coding assistants are confident. Too confident. They'll tell you exactly how your codebase works—even when they're wrong. They'll generate "obviously correct" fixes that introduce subtle bugs. They'll cite functions that don't exist and patterns your repo doesn't use.
The industry response has been more prompting. System prompts that say "be careful." Fine-tuning for calibration. Guardrails that filter outputs after the fact.
None of it works reliably. The model will still hallucinate. It will still invent API calls, cite nonexistent documentation, and confidently describe code paths that aren't there.
The only reliable solution is to require evidence. If a claim isn't backed by a span of text you can point to, it gets flagged. The model then either finds evidence, narrows the claim, or admits it doesn't know.
Berry exposes two verification tools:
detect_hallucination takes a deliverable annotated with [S1]-style citations and checks whether each sentence is supported by the cited evidence. It returns a per-claim breakdown with confidence scores and flags. Use this for Q&A, documentation, and any task where the output is text with claims. The second tool is audit_trace_budget.

Both tools require you to provide evidence spans. Berry doesn't fetch evidence; you do. This is deliberate: the evidence you provide is the evidence you trust.
Plus evidence collection tools: add_span, add_file_span, distill_span, and helpers to search and filter what you've gathered.
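To make the evidence-span concept concrete, here is a minimal sketch of what a collected span might carry. The Span class, its field names, and the find_spans helper are illustrative assumptions, not Berry's actual API or schema:

```python
from dataclasses import dataclass

@dataclass
class Span:
    span_id: str  # the id the deliverable cites, e.g. [S1]
    source: str   # where the excerpt came from (file path, URL, ...)
    text: str     # the verbatim excerpt you trust

spans = [
    Span("S1", "auth/session.py", "Sessions expire after 30 minutes of inactivity."),
    Span("S2", "docs/config.md", "Set SESSION_TTL to override the default timeout."),
]

# A toy stand-in for the search/filter helpers: find span ids
# whose text mentions a keyword.
def find_spans(spans, keyword):
    return [s.span_id for s in spans if keyword.lower() in s.text.lower()]

print(find_spans(spans, "timeout"))  # -> ['S2']
```

The point of the structure: each span keeps its provenance, so a citation in the deliverable traces back to a specific source you chose to trust.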
The verification flow: your agent collects evidence spans, drafts the deliverable with [S#] citations, then calls detect_hallucination or audit_trace_budget.

The key insight: citations must actually support the claim, not just exist. Berry uses an information-theoretic approach to measure whether the evidence provides enough support.
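Here is a toy sketch of the citation-checking step in that flow. It only checks that each sentence cites a known span; Berry's real check measures whether the evidence actually supports the claim. All names here are illustrative, not Berry's implementation:

```python
import re

# Evidence spans the agent collected, keyed by citation id.
spans = {"S1": "The cache is invalidated on every write."}

deliverable = (
    "Writes invalidate the cache [S1]. "
    "Reads are always served from memory [S2]. "
    "The cache uses an LRU policy."
)

def flag_unsupported(text, spans):
    """Flag sentences with missing or unknown citations (toy check)."""
    flags = []
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        cited = re.findall(r"\[(S\d+)\]", sentence)
        if not cited:
            flags.append((sentence, "no citation"))
        elif any(c not in spans for c in cited):
            flags.append((sentence, "unknown span"))
    return flags

for sentence, reason in flag_unsupported(deliverable, spans):
    print(reason, "->", sentence)
```

Even this naive version catches two failure modes: the [S2] citation points at a span that was never collected, and the LRU sentence makes a claim with no citation at all.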
A claim that's "probably true anyway" (high prior) needs less evidence. A claim that's surprising or specific needs more. Berry measures this gap and flags claims where the evidence doesn't provide enough lift.
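A simplified way to picture that measurement (my sketch, not Berry's actual scoring): treat the claim's prior probability and its probability given the evidence as numbers, and flag the claim when the evidence's lift, in bits, falls short of the claim's surprisal:

```python
import math

def lift_bits(prior, posterior):
    """Bits of support the evidence adds: log2(posterior / prior)."""
    return math.log2(posterior / prior)

def flag(prior, posterior, slack_bits=0.5):
    """Flag when lift falls short of the claim's surprisal minus
    some slack. A surprising (low-prior) claim needs more lift;
    a claim that's probably true anyway needs less."""
    required = -math.log2(prior) - slack_bits
    return lift_bits(prior, posterior) < required

# Mundane claim: high prior, modest evidence is enough.
print(flag(prior=0.6, posterior=0.9))   # False: not flagged
# Specific claim: same modest evidence, big gap remains.
print(flag(prior=0.05, posterior=0.4))  # True: flagged
```

The thresholds and probability estimates here are stand-ins; the structure is what matters: the evidence requirement scales with how surprising the claim is.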
This catches claims whose citations exist but don't actually support them, as well as claims that aren't backed by any evidence at all.
Not retrieval. Berry doesn't search your codebase or fetch docs. Your agent does that. Berry checks whether the claims are supported by the evidence you collected.
Not magic. If you don't provide evidence, there's nothing to verify. Berry is a gate, not a generator.
Not a prompt. Berry is an MCP server. It runs locally. Your IDE (Cursor, Claude Code, Codex, Gemini) talks to it. The verification happens in the tool call, not in the system prompt.
Not a framework. Berry doesn't orchestrate agents or manage conversations. It's a single checkpoint: are the claims supported? That's it.
I'm not going to claim Berry eliminates all hallucinations. It doesn't. The verification model itself is an LLM, which means it can make mistakes.
What Berry does well:
What Berry doesn't do:
The goal is to shift the failure mode. Without Berry, the assistant says "Yes, definitely" when it should say "I don't know." With Berry, unsupported claims get flagged, and the assistant has to either find evidence or admit uncertainty.
This is a meaningful improvement. It's not a silver bullet.
Berry is a local MCP server written in Python. It runs on your machine, in your repo, scoped to your project. Requires Python 3.10+.
When you run berry init, it writes config files for your MCP client. The client spawns the Berry server when you open the project.
Verification calls go to the OpenAI API (or any OpenAI-compatible endpoint). Berry sends the deliverable under verification and the evidence spans you collected.
Treat spans like any LLM input: don't paste secrets, keep them minimal, redact sensitive data.
Berry stores run state in ~/.berry/runs/. Each run has a problem statement, deliverable, and collected spans. Runs persist across sessions so you can resume verification work.
Global config lives at ~/.berry/config.json. Safety defaults: allow_write: false, allow_exec: false. These are preserved for future tool expansions.
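For illustration, a minimal ~/.berry/config.json holding just those documented defaults might look like this (an assumption about the layout; the full schema may include more keys):

```json
{
  "allow_write": false,
  "allow_exec": false
}
```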
Berry is verification-first. The roadmap prioritizes making verification better before adding other capabilities.
What exists today:
- evidence collection (add_span, add_file_span, distill_span)
- hallucination detection (detect_hallucination)
- trace budget auditing (audit_trace_budget)
- setup (berry init, berry integrate)

The verification engine is called Strawberry. Berry is the server that wraps it. The name stuck.
If your assistant can't point to evidence, Berry makes it say:
"I don't know" instead of "Yes, definitely."
That's the whole point.