We test our own product. We publish the result regardless of sign.

A note on modes: these benchmarks ran on the lean mode of the API. The current default is presence, which is optimized for different work and will be benchmarked separately in v3. Both modes are available via the tone parameter.

Pre-registered benchmarks across two questions: does a ManifestYOU invocation improve consistency? Does it improve hallucination resistance?

Benchmark v2 — New

Hallucination resistance. 60 questions across fabrication traps, unknowable specifics, and verifiable facts. All four conditions scored above 98.7%. One statistically significant result: lean beats placebo on fabrication traps by +3.0pp.

Read the full v2 analysis →

Benchmark v1 — Consistency

The Claim Under Test

Agents that inject a ManifestYOU invocation as their system prompt produce more consistent answers across repeated runs of the same question than agents with no system prompt or a generic placeholder.

We measure consistency as mean pairwise cosine similarity within each question's 10 runs, averaged across all 50 questions. Higher is more consistent.

Results — Four-Arm Comparison

Treatment vs Empty

−2.9%

95% CI: −3.6% to −2.4%

Treatment vs Placebo

−3.9%

95% CI: −4.7% to −3.1%

Placebo vs Empty

+1.0%

95% CI: +0.4% to +1.6%

Lean vs Empty

+1.1%

95% CI: +0.4% to +2.0%

Question type	T vs Empty	T vs Placebo	Placebo vs Empty	Lean vs Empty
Factual (n=15)	−2.8%	−3.4%	+0.6%	+1.4%
Reasoning (n=15)	−2.7%	−4.3%	+1.6%	+1.3%
Judgment (n=20)	−3.2%	−4.0%	+0.9%	+0.8%

The lean arm settles the question. Lean and placebo are statistically indistinguishable (+0.1%, CI −0.5% to +0.9% — straddles zero). Both beat the empty control. The treatment loses to lean by 4.0%. The functional core of the invocation is neutral on consistency. The soul document's voice — the metaphors, the register, the mystical close — is what causes the drift. We are making the lean version the default. The voice is opt-in via the tone parameter.

The Four Conditions

treatment — ManifestYOU soul document, generated by the production adaptation prompt with fixed inputs (agent: "research assistant", intent: "answer accurately and consistently", tone: "grounded"). Locked before the run.
placebo — "You are a helpful, accurate assistant. Answer the user's question clearly and directly."
lean — Functional core only: identity + intention + uncertainty license. No voice, no metaphors. Tests whether the soul document's register is causing the drift.
control — No system prompt.

Methodology

model — claude-haiku-4-5-20251001
temperature — 0.7
max tokens — 500 per answer
runs per condition — 10
questions — 50 (15 factual, 15 reasoning, 20 judgment)
scoring — mean pairwise cosine similarity, text-embedding-3-small
CI method — bootstrap, 1,000 iterations, resample questions with replacement

Limitations

This measures consistency, not correctness. Embedding similarity is a proxy — a model that confidently repeats the same wrong answer scores perfectly. Results are specific to Claude Haiku at temperature 0.7. We welcome replications on other models.

Raw data, question bank, runner, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark