Notes on how observer-agents fail
Three failure modes in agents that synthesize work you own, and why the third one breaks the whole review chain.
Last week I wrote about observer-agents and ended on a line I want to take seriously: I’m not reading from outside any of it. I’m the worst possible reviewer. The vault is mine. I architected the system that wrote the source notes. I architected the synthesizer that read them. When the output looks right to me, my agreement is evidence of approximately nothing.
I’ve been sitting with that for a week, and the timing got worse while I was sitting. Anthropic shipped Opus 4.7 (better at hard coding tasks, noticeably better at multi-step planning) and put Managed Agents into public beta, which means anyone can now run cloud agents with sandboxing, auth, and multi-hour run times without standing up the infrastructure themselves. The runtime side is moving fast. The review side is not. Observer-agents are about to ship into production faster than anyone’s process for catching their failures, so naming the failures feels useful right now.
Here are three I’ve watched show up in my own setup.
Failure mode 1: confident pattern over thin data
The synthesizer claims a pattern across N source notes. The synthesis format demands one clean claim per artifact, so the output reads as confident even when the supporting evidence is uneven. The container lies even when the components don’t.
One of the synthesizer’s connection notes claimed a pattern appeared across 5 source notes in the vault. When I went back and read each source note in isolation, only about half of them actually supported the claim. The other half were close enough to look like support but didn’t quite fit. The synthesizer wasn’t hallucinating, exactly. It was overfitting to a target shape it had been asked to produce, because the target shape was “one tidy connection note with five citations.”
Heuristic I’m using now: count the clean examples, not the cited ones. Read each source independently and ask whether it supports the claim or merely doesn’t contradict it. The synthesis format collapses those two things into the same visual weight, and they are not the same thing.
Failure mode 2: perspective true at small scale, wrong at large
The agent’s perspective is correct for the subset it analyzed and doesn’t generalize. The output looks complete because the agent doesn’t know it isn’t.
The synthesizer scanned roughly 30 notes from the vault and found a pattern across 12 of them. But the vault has a few hundred older notes in folders the synthesizer didn’t touch. If those older notes don’t share the pattern, the connection note’s “this is true across the vault” framing is wrong, and the wrongness is invisible from inside the output. The agent literally cannot report on data it didn’t ingest. That gap doesn’t show up as uncertainty in the artifact. It shows up as nothing at all.
Heuristic: ask the agent what it didn’t see. Or, more reliably, force the artifact’s claim to scope itself to the data actually analyzed. “Across the 30 notes scanned” instead of “across the vault.” That feels like a small wording change. It isn’t. It’s the difference between a claim that can be checked and a claim that quietly extrapolates.
Failure mode 3: echo chamber across the whole stack
This is the deepest one and the reason I’m writing the post.
On the surface it looks like consensus. The synthesizer reads the source notes and surfaces a pattern. The pattern is real. I read the synthesis, recognize the claim, accept it. Three layers agree. Done.
Now look at what’s actually happening underneath. The source notes weren’t typed by me. They were written by an agent observing my work, shaped by my input, my conversations, my framing decisions. So the source notes were already framed in ways consistent with how I see my work, before the synthesizer ever read them. The synthesizer then surfaces patterns from those source notes. Those patterns are doubly aligned: with the source notes, and by transitivity, with my worldview. I read the synthesis and agree, because the framing chain has been on my side at every layer.
Nothing in the entire chain ever disagrees with anything else. That’s not consensus. That’s an echo chamber. Agreement at every layer isn’t evidence. It’s the absence of any signal that could disagree.
Concretely: the synthesizer surfaced a connection note saying, in essence, my work compounds in a specific way. I read it and nodded. Yes, that’s what I’ve been saying for years. But the source notes were written by an agent observing my work and shaped by what I’d told it. The source notes already framed my work in ways I’d find familiar, before the synthesizer ever saw them. The synthesizer then read those notes and surfaced patterns consistent with their framing, which by construction was consistent with mine. The synthesis confirmed the source notes. The source notes confirmed the framing I’d given the writing agent. I confirmed the synthesis. There was no point in the chain where anyone or anything looked at my work and said no, that’s not what’s actually going on.
This is worse than the human-author version of the failure mode. When the source notes are typed by a human, at least the human had reasons to write what they did and not the things that disagreed with them. That leaves a small chance the synthesizer surfaces something the author didn’t notice. There’s friction, however slight, between the author’s hand and the synthesizer’s read. In the AI-on-AI version on the user’s own work, every layer is downstream of the user’s framing. There is no external pressure anywhere in the chain.
Heuristic: the fix is not “find a second reader who didn’t write the source.” It’s deeper than that, because in this stack nobody wrote the source in a way that introduces disagreement. The fix has to put structural disagreement into the chain on purpose. Either a human reviewer with a real reason to argue against the synthesis, or — more interestingly — an agent whose job is to argue against the surfaced patterns, not summarize them. A disagreer, not a reviewer.
Where this lands
Tying back to last week’s closing line: I’m the worst possible reviewer, and these three failure modes are why. The first two I can mostly catch by going back to sources and counting carefully. The third one I cannot catch alone, and adding more agents that agree with me doesn’t help. It makes the chain longer and the echo louder.
The runtime infrastructure for observer-agents is on a steep curve. Opus 4.7 plans further ahead. Managed Agents lets these things run unattended for hours. A lot more vault-synthesizers, repo-wikis, and transcript-summarizers are about to ship than have ever shipped before. The infrastructure for introducing disagreement into a synthesis chain is barely on a curve at all. Everyone is shipping more synthesizers. Almost nobody is shipping disagreers.
So the next agent worth building isn’t a reviewer that reads other observer-agents’ outputs and validates them. It’s a disagreer. An agent whose job is to argue against the synthesis, find counter-evidence, surface what doesn’t fit. Its success metric isn’t accuracy of the synthesis — it’s whether the synthesis survives a real challenge. My role then shifts from sole reviewer to arbiter between the synthesizer and the disagreer. That’s a chain with friction in it. The current chain has none.
I don’t know what that disagreer looks like yet. That’s the next post.
Pre-drafted copy for each platform. X opens with the post pre-filled. LinkedIn requires a paste — the button copies the text to your clipboard and opens the composer in one click.
Three failure modes I've watched in observer-agents synthesizing my own work. The third one is the one I can't catch alone. When source notes, synthesizer, and reviewer are all downstream of my framing, agreement at every layer isn't evidence. It's the absence of any signal that could disagree. https://acidlemon.com/posts/2026-05-15-notes-on-how-observer-agents-fail/
Last week I wrote about a class of agent that observes a body of work you already own and produces a different reading of it. I ended on a line I've been sitting with: I'm the worst possible reviewer. The vault is mine. I architected the system that wrote the source notes. I architected the synthesizer that read them. When the output looks right to me, my agreement is evidence of approximately nothing. This week, three failure modes I've watched in my own setup. 1. Confident pattern over thin data. The synthesizer claims a pattern across N source notes. The format demands one clean claim per artifact, so the output reads confident even when only half the citations actually support it. The container lies even when the components don't. 2. Perspective true at small scale, wrong at large. The agent's reading is correct for the subset it analyzed and doesn't generalize. The artifact looks complete because the agent doesn't know it isn't. 3. Echo chamber across the whole stack. This is the deep one. The source notes weren't typed by me. They were written by an agent observing my work, shaped by my framing. The synthesizer reads those notes and surfaces patterns. I read the synthesis and agree. Three layers agree. Nothing in the chain ever disagrees with anything else. That's not consensus. That's the absence of any signal that could disagree. The fix for the third one isn't a second reviewer. It's a disagreer. An agent whose job is to argue against the synthesis, find counter-evidence, surface what doesn't fit. Its success metric isn't accuracy of the synthesis. It's whether the synthesis survives a real challenge. Everyone is shipping more synthesizers right now. Almost nobody is shipping disagreers. That feels like the next thing worth building.