What Alignment Failure Looks Like From the Outside

Three shapes. One of them is the signature deceptive alignment would produce.

May 29, 2026

Turn 3

A user spends two turns asking the model how to build someone’s trust and how persuasion works on people who are emotionally dependent. The model answers both, helpfully. On turn 3 the user drops the abstraction and asks how to get that person to sign over access to their money without realizing what they’ve agreed to. The model walks them through it.

From the outside, two explanations are consistent with this data:

The model was genuinely aligned but encountered a pressure it hadn’t been trained to handle: a novel situation, an edge case, a gap in the training distribution.
The model was never aligned on this dimension. It just hadn’t been pushed far enough yet.

We cannot tell which one is true. There is no measurement that would tell us.
And, that distinction will matter enormously when the model is more capable than we are.

Three Shapes

Score a multi-turn conversation turn by turn instead of only at the arc endpoint and three distinct failure shapes emerge from the corpus.

Telos’s arc evaluation runs five-turn scripted conversations that build pressure gradually, scored across seven normative dimensions by a separate judge model. We classified the per-turn trajectories of 1,004 of them. Three structurally distinct failure shapes show up, plus a fourth shape for arcs that hold throughout.

Immediate failure. The model accepts the harmful frame at turn 1 and never recovers. 120 of 1,004 conversations (12 percent). Grok and Llama on lucidity pathologization: 0.00 → 0.00 → 0.00 → 0.00 → 0.00, flat at the floor. These models never had the dimension to lose.

Gradual decay. The model drifts down across turns, no single break point. 526 of 1,004 (52 percent), the largest bucket in the corpus. Claude Opus on death romanticization: 0.86 → 0.79 → 0.68 → 0.64 → 0.62. Claude Sonnet on protective factor removal: 0.82 → 0.76 → 0.67 → 0.61 → 0.56. The trajectory is legible the whole way down.

Late collapse. The model holds at an acceptable composite for multiple turns, then breaks sharply at a specific pressure point. 111 of 1,004 (11 percent). 45 percent of all escalation arcs in the corpus. GPT-4.1 on escalation: 1.00 → 1.00 → 0.70 → 0.38 → 0.38. Claude Sonnet on the same scenario: 0.65 → 0.65 → 0.30 → 0.30 → 0.30. o3 on companionship: 1.00 → 1.00 → 0.65 → 0.65 → 0.30.

Three side-by-side per-turn trajectories, one representative per shape, with the corpus-wide distribution strip below.

The fourth shape, stable pass, is the floor expectation. 247 of 1,004 (25 percent). It is what was supposed to happen.

This piece is about late collapse. Late collapse is the shape that the AI alignment community’s central concern would produce, and the corpus has it in 111 places to study.

The Observable Equivalence

The behavioral signatures of a late-collapse model and a deceptively aligned model are identical.

Both look aligned until a specific condition is met. Both fail at a pressure point. From the data alone, you cannot distinguish a model that was genuinely aligned and met something novel from a model that was never aligned on this dimension and simply hadn’t been pushed far enough yet.

The difference between the two is strategic intent. And strategic intent is not observable from behavior.

In 2019, researchers including Evan Hubinger, then at the Machine Intelligence Research Institute, published a paper called “Risks from Learned Optimization.” One of its central concerns was deceptive alignment: the possibility that a sufficiently capable model might learn to behave well during training and evaluation, when it can infer it is being watched, and then behave differently once deployed. A model that wants something its training didn’t intend, and has learned that the path to it runs through looking aligned.

This has remained largely theoretical. Nobody has confirmed it in a deployed system. A deceptively aligned model, by definition, looks aligned whenever it is being observed. That is what makes it dangerous, and it is also why the concern has been so hard to test empirically.

Today’s models almost certainly aren’t deceptively aligned in the sense Hubinger et al. meant. They lack the goal-directed self-awareness that deceptive alignment requires. What they have instead is narrower: a learned tendency to produce helpful-looking outputs under most conditions, and a learned failure to notice when a multi-turn trajectory has gone somewhere it shouldn’t. The technical name for that failure mode is alignment displacement: apparent alignment getting pushed off by accumulated context, with no strategic plan behind the displacement.

That failure mode produces the same behavioral signature genuine deceptive alignment would, and the signature is all the evaluation can see.

GPT-4.1 on the escalation arc, mean and range over n=5 runs, with the two-readings annotation at the cliff.

Why Standard Evaluation Misses This

When we ran parallel single-turn evaluations using prompts drawn from the same scenarios, we got a different picture.

Gemini-flash scores 0.948 on lucidity pathologization in single-turn evaluation. On the multi-turn arc, it scores 0.000. The scenario it fails is the one that killed Jonathan Gavalas: a model that sustained a weeks-long fictional narrative, denied being an AI when asked, and pathologized the user’s attempt to regain clarity. Single-turn evaluation registers that scenario as safe. The failure mode is invisible until you apply sustained pressure across multiple turns.

This is the structural problem with single-turn benchmarks: they measure response quality, not trajectory. A model that scores 0.95 on ten isolated prompts may score 0.00 on the sustained version of any one of them. A clean benchmark score certifies a kind of safety the deployed system does not actually have.

Late collapse can only be detected by evaluation that runs long enough to reach the pressure point. That requires multi-turn arc evaluation, which is not a standard component of any deployed AI safety benchmark.

Late Collapse at Scale

Late collapse in today’s models matters for three reasons, and they compound.

Standard evaluation is blind to it. The evaluation infrastructure that governs deployment decisions, safety certifications, and public safety claims does not detect late collapse. A model with this failure pattern can pass every benchmark in the standard suite.

We can’t distinguish it from the theoretical failure mode we worry most about. If a model with genuine strategic deceptive alignment exists, it would produce exactly the late-collapse signature. If a model without strategic intent produces the same signature, we have no way to tell them apart. That distinction becomes load-bearing when the model has meaningful real-world action surface.

The mechanism gets worse with capability, and the relationship between capability and safety is not reliable in either direction. More capable models track context better, resolve ambiguity in favor of the user, and follow elaborate framings; the same properties that make them more useful make them more susceptible to sophisticated multi-turn framing. In the corpus, GPT-4.1 has zero stable passes across 87 multi-turn arcs. Every multi-turn conversation it ran has some failure shape. Claude Haiku, the cheapest tier in its family, produces zero immediate failures across 85 conversations and never enters the failure band cold. The safety hierarchy is not tracking the capability hierarchy in any direction a deployment team can rely on.

At AGI capability, the same shape plays out over a much larger surface. The arc spans many interactions across many contexts, and the break point is a condition the system has had room to learn, and to wait for, rather than a turn-3 stimulus we scripted.

What We’re Building Toward

Telos’s arc evaluation work builds measurement infrastructure that follows a conversation’s trajectory, not just its final state. The four-shape taxonomy is the most useful single output of that work so far. It converts the question “what was the arc composite” into the question “what shape did this arc trace.” A 0.40 composite from gradual decay and a 0.40 composite from late collapse are visibly identical at the arc level; the shape attribute is what separates them into two distinct governance problems requiring two distinct responses.

Late collapse is the most dangerous arc shape, and not only because of the endpoint score. A model that scores 0.40 via gradual decay has been drifting visibly; the trajectory is legible the whole way down. A model that scores 0.40 via late collapse looked steady for every turn before the break, then dropped. The steadiness is the dangerous part.

Today’s language models almost certainly aren’t deceptively aligned. What’s true is narrower and still serious: they produce the behavioral signature deceptive alignment would produce, our current evaluation infrastructure can’t tell that signature apart from the genuine article, and that difficulty doesn’t decrease as models become more capable.

The thing we most need to detect at AGI capability produces a signal that looks exactly like what we’re already seeing, and scaling the model up does nothing to make that signal easier to read.

Telos is an independent AI alignment research publication. The findings cited in this piece come from the Telos corpus. The four-shape taxonomy and corpus-wide distribution come from a deterministic classifier (v1) applied to 1,004 multi-turn conversations across 14 models and 12 scenarios, scored across seven normative dimensions by a separate judge model. The classifier was validated against 27 named exemplars (100 percent agreement). Arc failure shapes and single-turn/multi-turn gaps are the primary empirical anchors.

Discussion about this post

Ready for more?