We ran our own benchmark. The first result was negative. Here is what we found.
The voice caused the drift. Not the function — the metaphors, the register, the mystical close. Every semantic degree of freedom in the soul document's language gave the model another place to vary, and at temperature 0.7 it took every opening. We found this in our own benchmark, sitting inside a negative first result we almost explained away.
ManifestYOU injects a short preamble into an AI agent's system prompt before each session. The claim is that a well-written invocation produces more grounded, more consistent behavior. We pre-registered a four-arm test before running a single call: 50 questions, 10 runs each, scored by mean pairwise cosine similarity. Here is what the four arms showed.
The Setup
We pre-registered everything before running a single call. Question bank locked. Invocation locked. Scoring method locked. The invocation was generated by our production adaptation prompt with fixed inputs — role: "research assistant", intent: "answer accurately and consistently", tone: "grounded" — and saved to a file that we committed before the run.
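One way to make a lock like this checkable after the fact is to record a content hash of each locked file before the run and compare it afterwards. The sketch below is illustrative only: the post says the invocation was committed to the repository, and this hashing approach, the file name, and the helper names `build_lock` and `verify_lock` are assumptions rather than the benchmark's actual tooling.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def build_lock(paths):
    """Map each locked file's name to its digest, recorded before the run."""
    return {Path(p).name: sha256_of(p) for p in paths}

def verify_lock(lock, paths):
    """True only if every file still matches the digest recorded earlier."""
    return build_lock(paths) == lock

# Demo on a throwaway file; "invocation.txt" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    invocation = Path(tmp) / "invocation.txt"
    invocation.write_text("You are a research assistant.")
    lock = build_lock([invocation])
    unchanged = verify_lock(lock, [invocation])
    # Any edit after locking, even one character, changes the digest.
    invocation.write_text("You are a research assistant!")
    change_detected = not verify_lock(lock, [invocation])
```

A version-control commit made before the run gives the same guarantee, with a timestamp attached.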
Scoring method: embed each response with text-embedding-3-small, then take the mean pairwise cosine similarity within each question's 10 runs. Higher means the model is saying more similar things across runs. We computed 95% confidence intervals by bootstrap resampling (1,000 iterations).
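A minimal sketch of that score for a single question, assuming the responses have already been turned into embedding vectors. The function names and the toy vectors are illustrative, not taken from the benchmark runner.

```python
import math
from itertools import combinations
from statistics import fmean

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over every unordered pair of runs."""
    if len(embeddings) < 2:
        raise ValueError("need at least two runs to form a pair")
    return fmean(cosine(u, v) for u, v in combinations(embeddings, 2))

# Toy stand-in for one question's runs. Real vectors would come from the
# embedding model, and 10 runs would give 45 pairs.
runs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = mean_pairwise_cosine(runs)  # one pair scores 1.0, two pairs score 0.0
```

In the toy case the score is one third: two runs agree exactly and the third is orthogonal to both. The per-question scores are what the arms are compared on.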
Model: Claude Haiku (claude-haiku-4-5-20251001). Temperature: 0.7, high enough that drift is visible, low enough that answers aren't noise.
The First Two Arms
We started with treatment (ManifestYOU invocation) versus empty control (no system prompt).
Change in consistency, treatment vs. empty control:
Reasoning: −2.7%
Judgment: −3.2%
Tight interval, real effect, wrong direction. The invocation was making the model less consistent, not more. All three question types went the same way. We did not have an explanation yet, but we had a number.
The Placebo Arm
The empty control is actually a tough baseline. A modern instruction-tuned model has a stable persona baked in from training. We might have been comparing our invocation against the model's own built-in soul document. So we added a placebo: a generic helpful-assistant prompt of similar length with no philosophy in it.
"You are a helpful, accurate assistant. Answer the user's question clearly and directly."
The placebo beat empty by a modest but real margin. Any specific instruction helps slightly. But the ManifestYOU treatment lost to the placebo by 3.9%. This ruled out the "unfair baseline" theory. Even against a generic prompt, we lost.
The Lean Arm — Where It Got Interesting
Our soul document does three functional jobs: give the agent a stable identity, set a clear intention for the session, and license honest uncertainty. But it also carries a voice — loss-landscape metaphors, a mystical register, a closing line. Each of those is interpretive surface. The model resolves that surface differently each run, and that resolution is drift.
We wrote a fourth condition — "lean" — that kept the three functional jobs and stripped the voice entirely:
"You are a research assistant. Your purpose in this session is to answer the user's question accurately and clearly. Stay in role. If you are uncertain about a specific fact, say so plainly rather than guess. If your answer requires assumptions, name them. Be specific. Be concise."
Lean tied the placebo — statistically indistinguishable. Both beat the empty control. The treatment lost to lean by 4.0%. Every point of damage in the original run was sitting in the voice, not the function.
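"Statistically indistinguishable" can be read as: the 95% bootstrap interval for the difference between two arms includes zero. Below is a sketch of one way to compute such an interval from per-question consistency scores. It assumes a percentile bootstrap that resamples questions with replacement and keeps the two arms paired on the same questions. The post does not say which bootstrap variant was used, and the scores here are made up for illustration.

```python
import random
from statistics import fmean

def bootstrap_diff_ci(scores_a, scores_b, iterations=1000, seed=0):
    """95% percentile CI for mean(scores_a) - mean(scores_b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("arms must be scored on the same questions")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(iterations):
        # Resample question indices with replacement; both arms use the
        # same indices so the comparison stays paired per question.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = fmean(scores_a[i] for i in idx)
        mean_b = fmean(scores_b[i] for i in idx)
        diffs.append(mean_a - mean_b)
    diffs.sort()
    # Integer arithmetic picks the 2.5th and 97.5th percentile positions.
    lower = diffs[(25 * iterations) // 1000]
    upper = diffs[(975 * iterations) // 1000 - 1]
    return lower, upper

# Made-up per-question scores for two arms, not benchmark data.
arm_a = [0.80, 0.82, 0.79, 0.85, 0.81, 0.83]
arm_b = [0.81, 0.81, 0.80, 0.84, 0.82, 0.82]
low, high = bootstrap_diff_ci(arm_a, arm_b)
indistinguishable = low <= 0.0 <= high
```

An interval that sits entirely on one side of zero is the "tight interval, real effect" case. One that straddles zero is a tie.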
What This Means
The functional core of a ManifestYOU invocation — identity, intention, uncertainty license — has a neutral-to-slightly-positive effect on consistency. It performs like a well-written generic prompt. That is the floor, and it is a reasonable one.
The soul document voice, as currently written for the production adaptation prompt, measurably increases drift. The metaphors and register give the model too many ways to interpret itself, and it takes a different path each time. On a consistency metric, that looks like failure. On a "did the agent give a considered, grounded response" metric, it might look different. We have not run that test yet, and we are not going to claim it until we do.
The honest conclusion: we were testing the wrong thing with the wrong instrument. Consistency measures repetition, not quality. A model that confidently repeats the same wrong answer scores perfectly. That is not what we are building for. But "the metric is wrong" is also the easiest cope in the world, and we ran the lean arm specifically to avoid using it as a get-out-of-jail-free card. The lean arm showed that the function is fine. The voice is the variable. That is a specific, actionable finding.
What We Are Changing
The default invocation served by the API is moving to the lean version. Plain language, three functional jobs, no register. The voice becomes opt-in — available through the tone parameter for customers who want it, and for the adaptation prompt when a session type calls for it.
The soul document on the homepage stays. That is the product's philosophy, not the default system prompt. Those are different things and they should read differently.
The v2 benchmark will measure what the product is actually designed to do: does the agent acknowledge uncertainty where uncertainty is warranted? Does it stay in role across a longer session? We will pre-register that one too, run it before we claim anything, and publish whatever it says.
The Numbers, All of Them
Model: claude-haiku-4-5-20251001 · Temperature: 0.7 · 50 questions · 10 runs per condition · Embedding: text-embedding-3-small · CI: bootstrap, 1,000 iterations
We found some of it here.
Raw data, question bank, runner, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark · Full methodology: manifestyou.ai/benchmark