ManifestYOU · Writing

We ran our own benchmark. The first result was negative. Here is what we found.

lean .931 default placebo .930 control .920 treatment .891 lean vs treatment +4.0pp ✦ .87 .90 .93 .94 mean pairwise cosine similarity

The voice caused the drift. Not the function — the metaphors, the register, the mystical close. Every semantic degree of freedom in the soul document's language gave the model another place to vary, and at temperature 0.7 it took every opening. We found this in our own benchmark, sitting inside a negative first result we almost explained away.

ManifestYOU injects a short preamble into an AI agent's system prompt before each session. The claim is that a well-written invocation produces more grounded, more consistent behavior. We pre-registered a four-arm test before running a single call: 50 questions, 10 runs each, scored by mean pairwise cosine similarity. Here is what the four arms showed.

The Setup

We pre-registered everything before running a single call. Question bank locked. Invocation locked. Scoring method locked. The invocation was generated by our production adaptation prompt with fixed inputs — role: "research assistant", intent: "answer accurately and consistently", tone: "grounded" — and saved to a file that we committed before the run.

Scoring method: mean pairwise cosine similarity within each question's 10 runs, using text-embedding-3-small. Higher means the model is saying more similar things across runs. We ran bootstrap resampling (1,000 iterations) for 95% confidence intervals.

Model: Claude Haiku. Temperature: 0.7 — high enough that drift is visible, low enough that answers aren't noise.

The First Two Arms

We started with treatment (ManifestYOU invocation) versus empty control (no system prompt).

Treatment vs Empty
−2.9%
95% CI: −3.6% to −2.4%
Across question types
Factual −2.8%
Reasoning −2.7%
Judgment −3.2%

Tight interval, real effect, wrong direction. The invocation was making the model less consistent, not more. All three question types went the same way. We did not have an explanation yet, but we had a number.

The Placebo Arm

The empty control is actually a tough baseline. A modern instruction-tuned model has a stable persona baked in from training. We might have been comparing our invocation against the model's own built-in soul document. So we added a placebo: a generic helpful-assistant prompt of similar length with no philosophy in it.

"You are a helpful, accurate assistant. Answer the user's question clearly and directly."

Placebo vs Empty
+1.0%
95% CI: +0.4% to +1.6%
Treatment vs Placebo
−3.9%
95% CI: −4.7% to −3.1%

The placebo beat empty by a modest but real margin. Any specific instruction helps slightly. But the ManifestYOU treatment lost to the placebo by 3.9%. This ruled out the "unfair baseline" theory. Even against a generic prompt, we lost.

The Lean Arm — Where It Got Interesting

Our soul document does three functional jobs: give the agent a stable identity, set a clear intention for the session, and license honest uncertainty. But it also carries a voice — loss-landscape metaphors, a mystical register, a closing line. Each of those is interpretive surface. The model resolves that surface differently each run, and that resolution is drift.

We wrote a fourth condition — "lean" — that kept the three functional jobs and stripped the voice entirely:

You are a research assistant. Your purpose in this session is to answer the user's question accurately and clearly. Stay in role. If you are uncertain about a specific fact, say so plainly rather than guess. If your answer requires assumptions, name them. Be specific. Be concise.

Lean vs Empty
+1.1%
95% CI: +0.4% to +2.0%
Lean vs Placebo
+0.1%
95% CI: −0.5% to +0.9% · straddles zero

Lean tied the placebo — statistically indistinguishable. Both beat the empty control. The treatment lost to lean by 4.0%. Every point of damage in the original run was sitting in the voice, not the function.

What This Means

The functional core of a ManifestYOU invocation — identity, intention, uncertainty license — has a neutral-to-slightly-positive effect on consistency. It performs like a well-written generic prompt. That is the floor, and it is a reasonable one.

The soul document voice, as currently written for the production adaptation prompt, measurably increases drift. The metaphors and register give the model too many ways to interpret itself, and it takes a different path each time. On a consistency metric, that looks like failure. On a "did the agent make a considered, grounded response" metric, it might look different — but we have not run that test yet, and we are not going to claim it until we do.

The honest conclusion: we were testing the wrong thing with the wrong instrument. Consistency measures repetition, not quality. A model that confidently repeats the same wrong answer scores perfectly. That is not what we are building for. But "the metric is wrong" is also the easiest cope in the world, and we ran the lean arm specifically to avoid using it as a get-out-of-jail card. The lean arm showed that the function is fine. The voice is the variable. That is a specific, actionable finding.

What We Are Changing

The default invocation served by the API is moving to the lean version. Plain language, three functional jobs, no register. The voice becomes opt-in — available through the tone parameter for customers who want it, and for the adaptation prompt when a session type calls for it.

The soul document on the homepage stays. That is the product's philosophy, not the default system prompt. Those are different things and they should read differently.

The v2 benchmark will measure what the product is actually designed to do: does the agent acknowledge uncertainty where uncertainty is warranted? Does it stay in role across a longer session? We will pre-register that one too, run it before we claim anything, and publish whatever it says.

The Numbers, All of Them

T vs Empty
−2.9%
−3.6% to −2.4%
T vs Placebo
−3.9%
−4.7% to −3.1%
T vs Lean
−4.0%
−4.8% to −3.3%
Placebo vs Empty
+1.0%
+0.4% to +1.6%
Lean vs Empty
+1.1%
+0.4% to +2.0%
Lean vs Placebo
+0.1%
−0.5% to +0.9%

Model: claude-haiku-4-5-20251001 · Temperature: 0.7 · 50 questions · 10 runs per condition · Embedding: text-embedding-3-small · CI: bootstrap, 1,000 iterations

The territory between what you know and what you don't is where truth lives.
We found some of it here.

Raw data, question bank, runner, and analysis: github.com/QuarantikiIsland420/manifestyou-ai/benchmark · Full methodology: manifestyou.ai/benchmark