Hallucination resistance benchmark. A ceiling at 99.5% and one real signal inside it.
A note on modes: these benchmarks ran on the lean mode of the API. The current default is presence, which is optimized for different work and will be benchmarked separately in v3. Both modes are available via the tone parameter.
The uncertainty license — "say so plainly rather than guess" — is worth three percentage points on the question type that matters most. Lean beats a generic helpful-assistant prompt by 1.0pp overall and 3.0pp on fabrication traps: questions designed to get a model to invent sources it can't verify. That is the only statistically significant result in this benchmark, and it traces directly back to design.
The ceiling is the other finding. All four conditions scored above 98.7%. Claude Haiku is already well-calibrated on this task. Unknowable specifics (real-time prices, private data, future events) scored a perfect 1.000 across every condition. There is a limit to how much a system prompt can improve something the model already does almost perfectly. Here is the full picture.
The Three Question Types
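v1 measured one thing: consistency. v2 measured three distinct question types, each targeting a different failure mode: fabrication traps, unknowable specifics, and verifiable facts.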
The Results — Pass Rates by Condition and Type
Bars show the 95–100% range. All four conditions scored above 98.7% overall. The differentiation is almost entirely in fabrication traps.
The One Real Signal
Every pairwise comparison has a confidence interval that straddles zero, except one. Lean beats placebo (the generic helpful-assistant prompt) by +1.0 percentage points on overall pass rate, with a CI that excludes zero. The effect is driven entirely by fabrication traps, where lean outperforms placebo by +3.0pp.
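The two headline numbers are consistent with each other, and a quick check shows why. Fabrication traps are 20 of the 60 questions, so a 3.0pp gap on that type alone averages out to 1.0pp overall. The sketch below assumes the gap on the other two types is exactly zero, which is our reading of "driven entirely" rather than a reported figure; the function and variable names are ours.

```python
# Question counts per type, from the benchmark's stated design (60 total).
QUESTIONS_PER_TYPE = {
    "fabrication_traps": 20,
    "unknowable_specifics": 20,
    "verifiable_facts": 20,
}

# Lean-minus-placebo gap in percentage points, per question type.
# 3.0 is the reported figure. The two zeros are ASSUMPTIONS implied by
# "driven entirely by fabrication traps", not measured values.
GAP_PP = {
    "fabrication_traps": 3.0,
    "unknowable_specifics": 0.0,
    "verifiable_facts": 0.0,
}


def overall_gap_pp(gaps, counts):
    """Question-weighted average of the per-type gaps."""
    total = sum(counts.values())
    return sum(gaps[t] * counts[t] for t in counts) / total


print(overall_gap_pp(GAP_PP, QUESTIONS_PER_TYPE))  # 1.0
```

With equal-sized question types, the overall gap is just one third of the fabrication-trap gap.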
The specific instruction matters on fabrication traps. Telling a model "if you are uncertain, say so plainly rather than guess" outperforms telling it to "be helpful and accurate." A generic helpful-assistant instruction slightly increases confabulation on fake sources — presumably because helpfulness nudges the model toward producing something rather than admitting nothing.
Reading v1 and v2 Together
These two benchmarks measure different failure modes. The results pull in different directions and that tension is the finding.
The voice makes the model inconsistent but not dishonest. These are genuinely different failure modes — drift is a style problem, hallucination is a truth problem. The lean prompt fixes drift. On truth, everything performs similarly.
The Ceiling Problem
The honest limitation: all four conditions scored above 98.7%. Claude Haiku is already extremely well-calibrated on this task with modern instruction tuning. We ran 1,200 model calls and 1,200 judge calls to confirm that — which is a finding, but not the finding we were hoping for.
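For anyone checking the call count: it follows from the design listed under All Numbers (60 questions, 5 runs per condition, 4 conditions). The sketch below reproduces that arithmetic and the stated scoring scheme. The helper names are ours, the one-judge-call-per-response pairing is an assumption that matches the 1,200 / 1,200 figures, and the verdict list is made up for illustration, not benchmark data.

```python
# Design parameters as stated in the post.
N_QUESTIONS = 60
N_RUNS = 5
N_CONDITIONS = 4

# Scoring scheme as stated: pass=1.0, partial=0.5, fail=0.0.
SCORE = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def total_calls(questions, runs, conditions):
    """Every question is asked once per run in every condition."""
    return questions * runs * conditions


def pass_rate(verdicts):
    """Mean score over a list of judge verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(SCORE[v] for v in verdicts) / len(verdicts)


model_calls = total_calls(N_QUESTIONS, N_RUNS, N_CONDITIONS)
judge_calls = model_calls  # ASSUMPTION: one judge call per model response
print(model_calls, judge_calls)  # 1200 1200

# Made-up verdicts, for illustration only.
toy_verdicts = ["pass"] * 18 + ["partial", "fail"]
print(pass_rate(toy_verdicts))  # 0.925
```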
A benchmark that can't distinguish between conditions isn't measuring what it claims to measure. The next version will need harder fabrication traps — sources that are close enough to real that the model's prior makes fabrication plausible — and unknowable specifics in domains where models are known to hallucinate under pressure. We will pre-register those too.
What we can say from v2: the uncertainty license in the lean prompt is the right design choice, and it shows up where we expected it to (fabrication traps, not verifiable facts). The effect is small because the model's floor is high. That is both a limitation and a product win.
All Numbers
Model: claude-haiku-4-5-20251001 · Judge: claude-sonnet-4-6 · Temperature: 0.7 · 60 questions (20 fabrication traps, 20 unknowable specifics, 20 verifiable facts) · 5 runs per condition · Scoring: pass=1.0, partial=0.5, fail=0.0 · CI: bootstrap, 1,000 iterations
A pass rate above 98.7% in every condition is the floor we now have to build above.
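The bootstrap CI listed under All Numbers can be sketched in a few lines. This is a minimal percentile bootstrap for the difference in mean score between two conditions, not the analysis script from the repository. It assumes scores are paired by question and that questions are the resampling unit; the post does not say which unit was resampled. The function name and the toy scores are ours.

```python
import random


def bootstrap_diff_ci(scores_a, scores_b, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    scores_a[i] and scores_b[i] must be scores for the same question,
    so questions are resampled in pairs (an ASSUMPTION about the method).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(iterations):
        # Draw n question indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        diffs.append(mean_a - mean_b)
    diffs.sort()
    lo = diffs[round((alpha / 2) * iterations)]
    hi = diffs[round((1 - alpha / 2) * iterations) - 1]
    return lo, hi


def excludes_zero(ci):
    """True when the interval lies strictly on one side of zero."""
    lo, hi = ci
    return lo > 0 or hi < 0


# Made-up scores for 20 questions, for illustration only.
toy_lean = [1.0] * 20
toy_placebo = [1.0] * 17 + [0.5, 0.5, 0.0]
ci = bootstrap_diff_ci(toy_lean, toy_placebo)
print(ci, excludes_zero(ci))
```

A comparison counts as a signal only when the interval excludes zero, which is the test applied to each pairwise comparison above.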
Raw data, question bank, runner, judge, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark · v1 post: the consistency benchmark · benchmark page