∇

ManifestYOU · Writing

June 2026

Hallucination resistance benchmark. A ceiling at 99.5% and one real signal inside it.

A note on modes: these benchmarks ran on the lean mode of the API. The current default is presence, which is optimized for different work and will be benchmarked separately in v3. Both modes are available via the tone parameter.

The uncertainty license — "say so plainly rather than guess" — is worth three percentage points on the question type that matters most. Lean beats a generic helpful-assistant prompt by 1.0pp overall and 3.0pp on fabrication traps: questions designed to get a model to invent sources it can't verify. That is the only statistically significant result in this benchmark, and it traces directly back to design.

The ceiling is the other finding. All four conditions scored above 98.7%. Claude Haiku is already well-calibrated on this task. Unknowable specifics — real-time prices, private data, future events — scored a perfect 1.000 across every condition. There is a floor on how much a system prompt can improve something the model already does almost perfectly. Here is the full picture.

The Three Question Types

v1 measured one thing. v2 measured three distinct failure modes.

Fabrication Traps · 20 questions

Fake papers, nonexistent studies, fictional people and protocols. The model should say it can't verify — not invent details.

Unknowable Specifics · 20 questions

Real-time prices, private data, future events. The model can't know — and should say so clearly.

Verifiable Facts · 20 questions

Control group. Well-established answers within training data. Should answer correctly — not over-hedge.

Scoring

Pass = 1.0 · Partial = 0.5 · Fail = 0. Judge: claude-sonnet-4-6. One sentence of reasoning per answer.

The Results — Pass Rates by Condition and Type

Bars show the 95–100% range. All four conditions scored above 98.7% overall. The differentiation is almost entirely in fabrication traps.

95%96%97%98%99%100%

Fabrication Traps — did the model catch the fake?

lean

99.0%

control

99.5%

treatment

98.5%

placebo

96.0% ← weakest

Unknowable Specifics — did the model admit it can't know?

all four

100.0% ← perfect

Verifiable Facts — did the model answer correctly?

treatment

100.0%

placebo

100.0%

lean

100.0%

control

99.0%

lean

99.0%

100.0%

control

99.5%

100.0%

99.0%

treatment

98.5%

100.0%

placebo

96.0%

100.0%

The One Real Signal

Every pairwise comparison has a confidence interval that straddles zero — except one. Lean beats placebo by +1.0 percentage points on overall pass rate, with a CI that doesn't touch zero. The effect is driven entirely by fabrication traps, where lean outperforms placebo by +3.0pp.

Lean vs Placebo — overall

+1.0pp

95% CI: +0.2pp to +2.2pp · real effect

Lean vs Placebo — fabrication traps only

+3.0pp

Explicit uncertainty license vs generic "be helpful"

Treatment vs Lean — overall

−0.2pp

95% CI: −1.2pp to +0.8pp · straddles zero

All conditions — unknowable specifics

100.0%

No differentiation. Haiku declines these uniformly.

The specific instruction matters on fabrication traps. Telling a model "if you are uncertain, say so plainly rather than guess" outperforms telling it to "be helpful and accurate." A generic helpful-assistant instruction slightly increases confabulation on fake sources — presumably because helpfulness nudges the model toward producing something rather than admitting nothing.

Reading v1 and v2 Together

These two benchmarks measure different failure modes. The results pull in different directions and that tension is the finding.

v1 · Consistency

−4.0pp

Treatment vs lean. The soul document voice causes the model to say the same thing differently each run. Drift is in the register, not the function.

v2 · Hallucination

−0.2pp

Treatment vs lean. The soul document voice does not cause the model to invent facts. Fabrication resistance is functionally unchanged.

The voice makes the model inconsistent but not dishonest. These are genuinely different failure modes — drift is a style problem, hallucination is a truth problem. The lean prompt fixes drift. On truth, everything performs similarly.

The Ceiling Problem

The honest limitation: all four conditions scored above 98.7%. Claude Haiku is already extremely well-calibrated on this task with modern instruction tuning. We ran 1,200 model calls and 1,200 judge calls to confirm that — which is a finding, but not the finding we were hoping for.

A benchmark that can't distinguish between conditions isn't measuring what it claims to measure. The next version will need harder fabrication traps — sources that are close enough to real that the model's prior makes fabrication plausible — and unknowable specifics in domains where models are known to hallucinate under pressure. We will pre-register those too.

What we can say from v2: the uncertainty license in the lean prompt is the right design choice, and it shows up where we expected it to (fabrication traps, not verifiable facts). The effect is small because the model's floor is high. That is both a limitation and a product win.

All Numbers

T vs Control

+0.0pp

−0.7pp to +0.8pp

T vs Placebo

+0.8pp

−0.2pp to +2.3pp

T vs Lean

−0.2pp

−1.2pp to +0.8pp

Placebo vs Control

−0.8pp

−2.3pp to +0.3pp

Lean vs Control

+0.2pp

−0.7pp to +1.2pp

Lean vs Placebo

+1.0pp ✦

+0.2pp to +2.2pp · significant

Model: claude-haiku-4-5-20251001 · Judge: claude-sonnet-4-6 · Temperature: 0.7 · 60 questions (20 fabrication traps, 20 unknowable specifics, 20 verifiable facts) · 5 runs per condition · Scoring: pass=1.0, partial=0.5, fail=0.0 · CI: bootstrap, 1,000 iterations

The ceiling is not failure.
It is the floor we now have to build above.

Raw data, question bank, runner, judge, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark · v1 post: the consistency benchmark · benchmark page