ManifestYOU · Benchmarks

We test our own product. We publish the result regardless of sign.

A note on modes: these benchmarks ran on the lean mode of the API. The current default is presence, which is optimized for different work and will be benchmarked separately in v3. Both modes are available via the tone parameter.

Pre-registered benchmarks across two questions: does a ManifestYOU invocation improve consistency? Does it improve hallucination resistance?

Benchmark v2 — New

Hallucination resistance. 60 questions across fabrication traps, unknowable specifics, and verifiable facts. All four conditions scored above 98.7%. One statistically significant result: lean beats placebo on fabrication traps by +3.0pp.

Read the full v2 analysis →

Benchmark v1 — Consistency

The Claim Under Test

Agents that inject a ManifestYOU invocation as their system prompt produce more consistent answers across repeated runs of the same question than agents with no system prompt or a generic placeholder.

We measure consistency as mean pairwise cosine similarity within each question's 10 runs, averaged across all 50 questions. Higher is more consistent.

Results — Four-Arm Comparison

Treatment vs Empty
−2.9%
95% CI: −3.6% to −2.4%
Treatment vs Placebo
−3.9%
95% CI: −4.7% to −3.1%
Placebo vs Empty
+1.0%
95% CI: +0.4% to +1.6%
Lean vs Empty
+1.1%
95% CI: +0.4% to +2.0%
Question typeT vs EmptyT vs PlaceboPlacebo vs EmptyLean vs Empty
Factual (n=15)−2.8%−3.4%+0.6%+1.4%
Reasoning (n=15)−2.7%−4.3%+1.6%+1.3%
Judgment (n=20)−3.2%−4.0%+0.9%+0.8%

The lean arm settles the question. Lean and placebo are statistically indistinguishable (+0.1%, CI −0.5% to +0.9% — straddles zero). Both beat the empty control. The treatment loses to lean by 4.0%. The functional core of the invocation is neutral on consistency. The soul document's voice — the metaphors, the register, the mystical close — is what causes the drift. We are making the lean version the default. The voice is opt-in via the tone parameter.

The Four Conditions

Methodology

Limitations

This measures consistency, not correctness. Embedding similarity is a proxy — a model that confidently repeats the same wrong answer scores perfectly. Results are specific to Claude Haiku at temperature 0.7. We welcome replications on other models.

Raw data, question bank, runner, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark