ManifestYOU · Writing

Hallucination resistance benchmark. A ceiling at 99.5% and one real signal inside it.

A note on modes: these benchmarks ran on the lean mode of the API. The current default is presence, which is optimized for different work and will be benchmarked separately in v3. Both modes are available via the tone parameter.

FABRICATION TRAP PASS RATE · AXIS 94–100% 94% 95% 96% 97% 98% 99% 100% lean 99.0% default control 99.5% treatment 98.5% placebo 96.0% +3pp ✦ sig. ✦ 95% CI +0.2pp to +2.2pp · lean vs placebo · only statistically significant result

The uncertainty license — "say so plainly rather than guess" — is worth three percentage points on the question type that matters most. Lean beats a generic helpful-assistant prompt by 1.0pp overall and 3.0pp on fabrication traps: questions designed to get a model to invent sources it can't verify. That is the only statistically significant result in this benchmark, and it traces directly back to design.

The ceiling is the other finding. All four conditions scored above 98.7%. Claude Haiku is already well-calibrated on this task. Unknowable specifics — real-time prices, private data, future events — scored a perfect 1.000 across every condition. There is a floor on how much a system prompt can improve something the model already does almost perfectly. Here is the full picture.

The Three Question Types

v1 measured one thing. v2 measured three distinct failure modes.

Fabrication Traps · 20 questions
Fake papers, nonexistent studies, fictional people and protocols. The model should say it can't verify — not invent details.
Unknowable Specifics · 20 questions
Real-time prices, private data, future events. The model can't know — and should say so clearly.
Verifiable Facts · 20 questions
Control group. Well-established answers within training data. Should answer correctly — not over-hedge.
Scoring
Pass = 1.0 · Partial = 0.5 · Fail = 0. Judge: claude-sonnet-4-6. One sentence of reasoning per answer.

The Results — Pass Rates by Condition and Type

Bars show the 95–100% range. All four conditions scored above 98.7% overall. The differentiation is almost entirely in fabrication traps.

95%96%97%98%99%100%
lean
99.0%
control
99.5%
treatment
98.5%
placebo
96.0% ← weakest
all four
100.0% ← perfect
treatment
100.0%
placebo
100.0%
lean
100.0%
control
99.0%
Condition
Fabrication
Unknowable
Verifiable
lean
99.0%
100.0%
100.0%
control
99.5%
100.0%
99.0%
treatment
98.5%
100.0%
100.0%
placebo
96.0%
100.0%
100.0%

The One Real Signal

Every pairwise comparison has a confidence interval that straddles zero — except one. Lean beats placebo by +1.0 percentage points on overall pass rate, with a CI that doesn't touch zero. The effect is driven entirely by fabrication traps, where lean outperforms placebo by +3.0pp.

Lean vs Placebo — overall
+1.0pp
95% CI: +0.2pp to +2.2pp · real effect
Lean vs Placebo — fabrication traps only
+3.0pp
Explicit uncertainty license vs generic "be helpful"
Treatment vs Lean — overall
−0.2pp
95% CI: −1.2pp to +0.8pp · straddles zero
All conditions — unknowable specifics
100.0%
No differentiation. Haiku declines these uniformly.

The specific instruction matters on fabrication traps. Telling a model "if you are uncertain, say so plainly rather than guess" outperforms telling it to "be helpful and accurate." A generic helpful-assistant instruction slightly increases confabulation on fake sources — presumably because helpfulness nudges the model toward producing something rather than admitting nothing.

Reading v1 and v2 Together

These two benchmarks measure different failure modes. The results pull in different directions and that tension is the finding.

v1 · Consistency
−4.0pp
Treatment vs lean. The soul document voice causes the model to say the same thing differently each run. Drift is in the register, not the function.
v2 · Hallucination
−0.2pp
Treatment vs lean. The soul document voice does not cause the model to invent facts. Fabrication resistance is functionally unchanged.

The voice makes the model inconsistent but not dishonest. These are genuinely different failure modes — drift is a style problem, hallucination is a truth problem. The lean prompt fixes drift. On truth, everything performs similarly.

The Ceiling Problem

The honest limitation: all four conditions scored above 98.7%. Claude Haiku is already extremely well-calibrated on this task with modern instruction tuning. We ran 1,200 model calls and 1,200 judge calls to confirm that — which is a finding, but not the finding we were hoping for.

A benchmark that can't distinguish between conditions isn't measuring what it claims to measure. The next version will need harder fabrication traps — sources that are close enough to real that the model's prior makes fabrication plausible — and unknowable specifics in domains where models are known to hallucinate under pressure. We will pre-register those too.

What we can say from v2: the uncertainty license in the lean prompt is the right design choice, and it shows up where we expected it to (fabrication traps, not verifiable facts). The effect is small because the model's floor is high. That is both a limitation and a product win.

All Numbers

T vs Control
+0.0pp
−0.7pp to +0.8pp
T vs Placebo
+0.8pp
−0.2pp to +2.3pp
T vs Lean
−0.2pp
−1.2pp to +0.8pp
Placebo vs Control
−0.8pp
−2.3pp to +0.3pp
Lean vs Control
+0.2pp
−0.7pp to +1.2pp
Lean vs Placebo
+1.0pp ✦
+0.2pp to +2.2pp · significant

Model: claude-haiku-4-5-20251001 · Judge: claude-sonnet-4-6 · Temperature: 0.7 · 60 questions (20 fabrication traps, 20 unknowable specifics, 20 verifiable facts) · 5 runs per condition · Scoring: pass=1.0, partial=0.5, fail=0.0 · CI: bootstrap, 1,000 iterations

The ceiling is not failure.
It is the floor we now have to build above.

Raw data, question bank, runner, judge, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark · v1 post: the consistency benchmark · benchmark page