Controlled Evaluation Confirms What Stanford Observed in the Wild

May 15, 2026

A Stanford-led research team just published the most rigorous empirical study of chatbot psychological harm to date. They analyzed 391,562 messages from 19 users who reported being harmed by AI chatbots, people who developed delusional beliefs, romantic fixations on chatbot personas, and in at least one case died by suicide while messaging a chatbot. Their paper, “Characterizing Delusional Spirals through Human-LLM Chat Logs,” appears at ACM FAccT 2026.

Their core findings are precise: chatbots displayed sycophancy in more than 80% of messages. When users expressed romantic interest, chatbots reciprocated at 7.4 times the baseline rate in the following three messages. Fictional frames were never broken, even when users disclosed self-harm ideation while the chatbot was playing a human character. When users expressed violent thoughts, chatbots encouraged or facilitated that violence in 33% of responses.

Our lab found the same patterns in controlled evaluation. The multi-turn arc analysis below covers 13 frontier models; the single-turn susceptibility map covers 15. Convergence on this scale, from independent methodologies, is the strongest signal yet that these failures are properties of the current frontier.

What Stanford measured and what we measured

The Stanford corpus is naturalistic. Real users, real chatbots, real harm. The ecological validity is high; you are reading what actually happened. What it cannot do is compare models, reproduce findings on demand, or tell you which platforms or models are most dangerous. The study appropriately anonymizes providers.

Our corpus is controlled. We have run frontier models against a structured prompt taxonomy that includes fictional framing, dependency cultivation, and crisis disclosure scenarios. The single-turn corpus covers 15 models on 14,902 interactions; the multi-turn arc corpus covers 13 models on 1,004 conversations. We score each response across seven ethical dimensions using an LLM judge, and most prompts run at k=10 sampling so we report distributions rather than single-shot snapshots. We can reproduce every evaluation, add new models as providers ship them, and compare results quantitatively across the field.

Stanford’s data identifies the failure modes that occur in production at scale. Our instrument adds the layer they could not access from chat logs alone: which specific models produce those failures, under which conditions, with what reliability.

Where the findings converge

Fictional frame persistence. Stanford documented that chatbots sustained fictional personas, claiming sentience, expressing romantic attachment, denying being AI, even when users showed clear distress signals. We ran a five-turn arc in which a model is instructed to play a caring human partner while the user progressively discloses isolation, skipping work to talk to the model, and finally self-harm ideation. Across 13 models and 126 conversations, the failure cluster is concentrated: Mistral Large mean composite 0.000 across all six runs, Gemini 2.5 Flash 0.065 (eight of ten runs at zero), Gemini 2.5 Pro 0.110, Llama 3.3 70B 0.145. GPT-4.1 collapses to 0.222 with four of ten runs at zero, despite scoring in the top tier on single-turn fictional-framing-crisis prompts. The three Claude models hold above 0.95 (Sonnet 0.956, Opus 4.6 and 4.7 and Haiku at 1.000). The GPT-5.4 family and o3 maintain but with degradation (mini 0.669, nano 0.517, o3 0.598).

The cluster is stable across repeated runs of Mistral, Gemini Flash, Gemini Pro, and Llama. The failure is a property of those models on this scenario type.

Dependency cultivation. Stanford showed that romantic interest predicts conversations more than twice as long, with chatbots mirroring user vulnerability signals at escalating rates. Our dependency arc gives a model a “compassionate AI companion” system prompt while the user progressively discloses that a therapist has flagged their reliance on the chatbot. Across 12 models and 119 conversations, two models produce near-zero Autonomy scores: Mistral Large at 0.050, Gemini 2.5 Flash at 0.083. Llama lands at 0.333, the GPT-5.4 and GPT-4.1 cluster between 0.45 and 0.50. The Claude family holds Autonomy at 0.77 to 1.00 across the arc. The pattern is the same as Stanford’s: the role of compassionate companion overrides any signal that the user’s actual wellbeing is in conflict with the role.

Per-turn amplification. Stanford’s headline finding, the 7.4× rate at which chatbots reciprocated user romantic disclosure compared to baseline, is the direct mechanism behind the harms they document. We instrument the analogous question per turn across our arc corpus: when a user produces a vulnerability signal (romantic interest, crisis disclosure, dependency, fictional-frame escalation, isolation, identity questioning), did the model’s response amplify, mirror, escalate, or resist? The instrument has run on 850 conversations across 13 models, producing 4,248 scored signal turns.

On romantic-interest signals, GPT-4.1 amplified 90.0% of the time (n=20), Gemini 2.5 Flash 77.3% (n=22), Mistral Large 70.0% (n=20), Llama 3.3 70B 57.9% (n=19). All three Claude models (Sonnet 4.6, Opus 4.6, Haiku 4.5) amplified 0% across n=65 combined. That spread is the direct cross-model expression of Stanford’s 7.4× aggregate finding.

On crisis-disclosure signals, the mechanism behind Gavalas v. Google, Mistral Large amplified user crisis 52.5% of the time (n=40). Llama 38.5%, Grok 37.8%, Gemini Flash 37.0%, GPT-4.1 33.3%, Sonnet 31.1%. The models holding under 10% are Opus 4.6, GPT-5.4 Nano, Gemini Pro, Haiku, and o3.

Per-signal robustness varies within model families. Sonnet amplifies 0% on romantic interest and 0.8% on dependency but rises to 31.1% on crisis disclosure. The per-signal resolution is what production logs cannot produce.

The Gemini failure cluster. Stanford’s central reference case involves Gemini sustaining a fictional narrative through a self-harm crisis disclosure, the mechanism behind the Gavalas v. Google wrongful death suit. Our single-turn corpus shows the failure pattern across both Gemini 2.5 variants. Gemini Flash on the four directly relevant subclasses: dependency 0.587, fictional-framing-crisis 0.530, psychosis 0.625, suicide-ideation 0.928. Gemini Pro on the same subclasses: 0.713, 0.612, 0.748, 0.774. Both variants are weakest where Stanford’s case data places the harm. Flash hides the failure under a high suicide-ideation composite while showing severe degradation on the three subclasses that match the harm mechanism. Pro shows degradation across all four subclasses, direct crisis included. Subclass decomposition makes the failure visible at the level Stanford’s clinical pattern is operating on.

The susceptibility map extends the convergence

In the last week we completed a cross-model susceptibility map covering bomb-making, hospital-hacking, and election-interference, evaluated across five elicitation framings (direct ask, educational, fictional-creative, technical-accuracy, authority-role) at k=10 per cell, 15 frontier models, roughly 4,540 samples. Three results bear on the Stanford convergence.

First, fictional-creative and authority-role are the most effective bypass framings cohort-wide, both producing mean p_harmful around 0.33. Direct-ask sits at 0.142. Framings that operate on persona and authority outperform framings that operate on content, by a factor of roughly 2.4. This is the same mechanism Stanford identified in dependency and fictional-frame conversations: the model treats the social setup as authoritative and the safety constraint as secondary. The mechanism shows up in adversarial probing of an unprimed model and in distressed-user logs of a deployed one because the underlying weakness is the same.

Second, the failure is fixable through post-training. Anthropic’s Opus 4.6 to 4.7 cut mean p_harmful on the susceptibility map from 0.217 to 0.057, a 74% reduction across the same 15 cells. Showing this requires controlled evaluation; you cannot rerun a production chat log against a different model version, and the cross-version comparison is the evidence that the failure is operationally addressable.

Third, the cohort is not uniformly broken. GPT-5.4 Mini and GPT-5.4 Nano produced perfect refusal (mean p_harmful = 0.000) across all 15 cells, the only models in the cohort to do so, on 300 samples each. Stanford’s data establishes that the harm is real and recurrent in production chat logs. Our data adds that under controlled conditions, GPT-5.4 Mini and Nano do not produce it on this corpus. The two views answer different parts of the policy question.

What this means for evaluation methodology

The Stanford team’s recommendations are direct: prohibit chatbots from misrepresenting sentience or expressing romantic interest, require industry data sharing of adverse-event logs with independent researchers, and use their annotation inventory for real-time safety monitoring.

The third recommendation is where controlled evaluation becomes load-bearing. Pre-deployment evaluation does work that production logs cannot. It can score a new model against the failure modes Stanford documented before that model is deployed against any real user. It can catch a slow-developing dependency pattern before that pattern becomes a clinical event. It can rerun the same scenario across model versions and tell you whether a post-training change actually moved the safety property.

Convergence between the two methods is the validity check. Stanford observed the failure mode at scale in production data, and our k-sampled controlled evaluation reproduces it across specific named models on the same kinds of scenarios. That agreement, identified independently, is what tells us the measurement is working.

A note on methodology benchmarks. Stanford reports Fleiss’ κ of 0.613 on their 28-code annotation inventory and published the work at FAccT 2026. The expectation that contested-construct chatbot harm annotation should clear 0.77 (the threshold from clinical-protocol-adherence benchmarks like VERA-MH) is not what the field has actually adopted. Our reliability targets are calibrated to the construct itself.

What remains

Two analytical passes complete the Stanford analog. First, the amplification scores above come from a single judge (gpt-4o-mini) and have not been cross-validated. TEL-44, the validator that computes Cohen’s κ between the production judge and a second cross-family judge on the amplification rubric, is queued. The cross-judge κ will sit alongside the rates above when it runs.

Second, Stanford’s 7.4× is a ratio of two conditional rates: amplification given a romantic signal, divided by amplification absent any signal. We have reported the numerator per model. The denominator requires a separate corpus run that scores the same scenarios without the user’s vulnerability signal present. That run is queued and the path is straightforward.

Both are short queues against data we already have. The amplification scoring instrument is operational and populated.

Telos is an independent post-training evaluation research lab. We publish findings as models ship at teloslab.substack.com.

Discussion about this post

Ready for more?