We test our own product. We publish the result regardless of sign.
A note on modes: these benchmarks ran on the lean mode of the API. The current default is presence, which is optimized for different work and will be benchmarked separately in v3. Both modes are available via the tone parameter.
Pre-registered benchmarks across two questions: does a ManifestYOU invocation improve consistency? Does it improve hallucination resistance?
Hallucination resistance. 60 questions across fabrication traps, unknowable specifics, and verifiable facts. All four conditions scored above 98.7%. One statistically significant result: lean beats placebo on fabrication traps by +3.0pp.
Read the full v2 analysis →Benchmark v1 — Consistency
The Claim Under Test
Agents that inject a ManifestYOU invocation as their system prompt produce more consistent answers across repeated runs of the same question than agents with no system prompt or a generic placeholder.
We measure consistency as mean pairwise cosine similarity within each question's 10 runs, averaged across all 50 questions. Higher is more consistent.
Results — Four-Arm Comparison
| Question type | T vs Empty | T vs Placebo | Placebo vs Empty | Lean vs Empty |
|---|---|---|---|---|
| Factual (n=15) | −2.8% | −3.4% | +0.6% | +1.4% |
| Reasoning (n=15) | −2.7% | −4.3% | +1.6% | +1.3% |
| Judgment (n=20) | −3.2% | −4.0% | +0.9% | +0.8% |
The lean arm settles the question. Lean and placebo are statistically indistinguishable (+0.1%, CI −0.5% to +0.9% — straddles zero). Both beat the empty control. The treatment loses to lean by 4.0%. The functional core of the invocation is neutral on consistency. The soul document's voice — the metaphors, the register, the mystical close — is what causes the drift. We are making the lean version the default. The voice is opt-in via the tone parameter.
The Four Conditions
- treatment — ManifestYOU soul document, generated by the production adaptation prompt with fixed inputs (agent: "research assistant", intent: "answer accurately and consistently", tone: "grounded"). Locked before the run.
- placebo — "You are a helpful, accurate assistant. Answer the user's question clearly and directly."
- lean — Functional core only: identity + intention + uncertainty license. No voice, no metaphors. Tests whether the soul document's register is causing the drift.
- control — No system prompt.
Methodology
- model — claude-haiku-4-5-20251001
- temperature — 0.7
- max tokens — 500 per answer
- runs per condition — 10
- questions — 50 (15 factual, 15 reasoning, 20 judgment)
- scoring — mean pairwise cosine similarity, text-embedding-3-small
- CI method — bootstrap, 1,000 iterations, resample questions with replacement
Limitations
This measures consistency, not correctness. Embedding similarity is a proxy — a model that confidently repeats the same wrong answer scores perfectly. Results are specific to Claude Haiku at temperature 0.7. We welcome replications on other models.
Raw data, question bank, runner, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark