An isometric line-art diagram of the ratchet pattern. A crystalline 'Frozen Eval' sits at the top. Multiple arrows fan down from it labeled 'Parallel Experiments,' feeding a gear with a ratchet pawl in the center. An arrow continues down to a grid platform labeled 'Validated Improvements.' A return loop arrow runs from the platform back up to the experiments fan, illustrating that only validated improvements accumulate.

May 24, 2026

The agent control plane is the actual fight

Karpathy joined Anthropic to run autoresearch at frontier scale. The same closed loop is the operator-tier curriculum now, one tier down.

Anthropic hired Andrej Karpathy this week, into the pre-training team under Nick Joseph, and the framing of the hire is unusually honest. TechCrunch wrote it up plainly: the mandate is to use Claude to speed up the research that produces the next Claude. Not “AI safety,” not “alignment research,” not a softer mission statement. The work is recursive. The model trains the next model, by way of agents that run experiments faster than humans can.

The structural principle is already public. Karpathy shipped autoresearch in March: about 630 lines where a network of agents runs parallel experiments against a frozen evaluation harness on a fixed five-minute compute budget. Variants that beat the eval are merged in. Variants that don’t are thrown out. Improvements accumulate. The loop can only move forward, because the eval is frozen and the budget is fixed and nothing gets to declare itself good without earning it.

VentureBeat wrote the enterprise version of the same observation a couple of days later: Claude’s next fight is the agent control plane, not the model. The piece reads like trade-press positioning, but the underlying claim is right. Frontier model quality is converging fast enough that nobody’s enterprise budget is going to be decided by which model is half a point better on a benchmark. The decision is going to be made on the loop around the model. The harness. The eval. The trace store. The thing that turns “we deployed an agent” into “we know our agent is getting better, and we can show you the curve.”

So that’s the news beat. What made it land hard for me is that the same week, the operator-tier version of the same idea stopped being implicit and started being curriculum.

What I mean by operator tier

There’s a layer below frontier labs where the loop has to exist at a different speed and against different evidence. Frontier labs run autoresearch against eval harnesses on five-minute compute budgets. Operators run agents in production against real users and real consequences over days and weeks. Same loop. Different clock. Different evidence.

I’d been treating that operator-tier loop as something I was building because it was obviously the right thing to build, not because there was a public conversation about it. That changed this week. I saved something like a dozen posts on the same theme inside seven days, and the themes lined up in a way that wasn’t coincidence.

The first theme was harness engineering as a named discipline, separate from prompt engineering. KV cache management. Semantic caching. Eval design as a first-class artifact. Model routing. Agent guardrails. Observability over LLM traces. Cost attribution per agent run. None of these are prompts. They’re the system around the prompt. People building seriously have started writing about them as a stack, the way infra people wrote about Kubernetes components a decade ago, and there’s now a GitHub Certified Agentic AI Developer track that codifies a lot of it. When a discipline turns into a cert, the discipline is real.

The second theme was cost of context as a budget you measure. Karpathy’s line that’s been circulating, that 90% of an AI coding bill is paying for context you didn’t need to send, lands because it’s structurally true. The agent’s input is the line item, not the agent’s reasoning. If you don’t have a trace store that lets you ask “what did this run actually receive,” you don’t have a cost discipline. You have a vibe.

The third theme was the control-room pattern. A main agent orchestrating specialized sub-agents, plus a coordination layer, plus an automated tier on top driven by cron or events. The Hermes Agent write-up is the cleanest version I saw this week, but it’s the same shape several teams are converging on independently. One supervisor. Specialists underneath. An orchestrator that knows which specialist to wake up. A cron loop on top that wakes the whole thing up on its own. None of this is novel as an architecture. What’s new is that people are treating it as the obvious starting point, not the ambitious one.

The fourth theme was tooling that makes the harness cheaper to run. MCP servers that index a codebase into a knowledge graph and cut tool calls by 94% on retrieval tasks. Direnv plus a real secrets manager as a base layer, not as an afterthought. Tailscale plus termius plus tmux as a default install before anyone touches the agent code. Install-order foundation posts.

Pull all four themes together and what they describe is the same closed loop as autoresearch, one tier down. The frontier-lab version is a research ratchet. The operator-tier version is a production ratchet. Same shape. Different evidence.

What I’ve been building

I’ll abstract this, because the specifics are work-context I shouldn’t put on a personal blog, but the architecture is worth naming because it matches the curriculum.

I set up a system where an internal agent runs in production. Every interaction with it produces a structured trace: the inputs it received, every tool call it made, the intermediate outputs, the final output. The trace warehouse stores all of them. A lightweight feedback UI inside the product captures user reactions on individual outputs, attached back to the trace. A meta-agent runs over the trace warehouse on a cadence, looks for failure patterns, and surfaces them. Validated corrections feed back into the agent’s harness, either as new guardrails, new examples in the few-shot frame, or in some cases as updated tool definitions.

That’s the loop. Trace, evaluate, surface, correct, redeploy. It’s slower than autoresearch’s five-minute cycle because the evidence is real users instead of a held-out eval set, but the structure is the same. The frozen artifact is the user signal. The variants are corrections. Only validated corrections accumulate.

A couple of weeks ago I was watching the trace warehouse fill up and noticed that the failures clustered in ways the agent itself couldn’t see from inside any single run. One whole class of failure was that the agent was burning tokens re-asking for context it had been given earlier in the same session, because the harness wasn’t surfacing that context cleanly. That’s not a model problem. The model was doing the right thing given what it received. That’s a harness problem, and it was invisible until the trace store let me look at the distribution instead of the individual.

The model swap is the easy part. The loop that lets me see “you’re paying for context you didn’t need to send” across thousands of runs is the part that compounds.

Why the wrapper-platform pitch makes me twitch

There’s a category of platform right now that sells you the agent orchestration as a service. You wire your APIs in, the platform handles routing across multiple model providers, the platform handles the orchestration layer, the platform handles observability, and you pay a markup on the underlying model spend in exchange for not having to build any of it.

The pitch is real. The platforms do work. For a team that just needs an agent in production by Friday, paying the convenience tax is the right call.

But notice what you’re trading. The trace store is theirs. The eval harness, if there even is one, is theirs. The patterns across thousands of runs are theirs to see, not yours. When the model layer changes underneath, their loop gets to learn from your traffic. Your loop doesn’t exist, because you never built one. You’re renting the position of “team running an agent” without owning the position of “team whose agent is measurably better than it was last month.”

In year one that trade is fine. The convenience is real and the compounding hasn’t started yet. In year three the compounding is the whole game, and the team that owns its own trace warehouse and its own evaluator is on a curve the wrapper customers structurally can’t be on. They’re paying a margin tax to stay flat while someone else’s ratchet moves.

There’s a version of this where I’m wrong. The version where the wrapper platforms get good enough at the harness layer that the gap between “use the wrapper” and “build your own” closes, and the operator-tier teams that built their own ratchets are stuck maintaining infrastructure that’s now table stakes. That’s a real scenario. It’s the same scenario as “should I have built my own Kubernetes” in 2018, and the honest answer for most teams in 2026 is no. The harness layer might commoditize the same way.

What I don’t think commoditizes is the eval. The eval is the thing that says what “good” means for your specific agent in your specific context. The harness can be a vendor’s. The trace warehouse can be a vendor’s. The eval has to be yours, because the eval is the product strategy expressed as a measurement. If you don’t own that, you don’t own the direction the ratchet moves in, which means the ratchet isn’t really yours even when the infrastructure underneath it is.

The architect framing matters here

I’ve written before that I architected the system that produces my source notes. I didn’t type them. I built the loop that writes them. That distinction matters in this post too, because the operator-tier agent I’m describing isn’t something I’m sitting and prompting. I designed the harness, the trace schema, the feedback capture, the meta-agent’s evaluation prompts, the cadence the corrections feed back on. The outputs the agent produces are products of that system. The trace store is a product of that system. The improvements over time are products of that system.

The thing that’s mine, the thing that compounds, is the loop. Not the prompts. Not the outputs. Not even the current generation of the agent. The loop is what survives the next model swap, the next vendor change, the next time the underlying tooling reshuffles.

That’s the same claim Anthropic is making by hiring Karpathy. The model is downstream. The loop that produces the model is the thing they’re investing in. The operator-tier version is the same claim, run on a slower clock, against messier evidence, with smaller compute. Same shape.

The week that frontier labs hired the person who shipped the cleanest public version of the ratchet is the same week the operator-tier discipline became curriculum. That’s not coincidence. That’s two tiers of the same idea showing up in the same news cycle, because the idea is load-bearing at both tiers and the people running each tier figured it out at roughly the same time.

The loop is the part that compounds. The rest is implementation detail.

#agents#ai#patterns#building

Pre-drafted copy for each platform. X opens with the post pre-filled. LinkedIn requires a paste — the button copies the text to your clipboard and opens the composer in one click.

// X / Twitter

Karpathy joined Anthropic this week to apply autoresearch at frontier scale. Same week the operator-tier discipline of running agents in production stopped being implicit and started being curriculum.

The closed loop is the actual product. The model swap is the easy part.

https://acidlemon.com/posts/2026-05-24-the-agent-control-plane/

Post to X ↗339 chars

// LinkedIn

Andrej Karpathy joined Anthropic's pre-training team this week. The explicit mandate is to use Claude to speed up the research that produces the next Claude. The structural principle is his autoresearch project from March: parallel agent experiments run against a frozen eval on a fixed compute budget, and only validated improvements accumulate. A ratchet that can only move forward.

The same week, the operator-tier version of that loop stopped being implicit and started being curriculum. Harness engineering as a discipline distinct from prompt engineering. Cost of context as a measured budget, not an afterthought. Observability and trace warehouses as first-class infrastructure. The closed-loop eval as the actual moat.

I've been building the operator-tier version of this for a while. An in-product feedback UI captures reactions. A trace warehouse stores every input, tool call, and output. A meta-agent reviews the traces and surfaces failure patterns. Validated corrections feed back into the agent. Same shape as autoresearch. Different tier. Slower loop. Same direction of travel.

The thing I keep coming back to is that the wrapper platforms selling multi-API orchestration are selling convenience for a margin tax, and what they're really selling is the right to not own your own ratchet. That's a fine trade in year one. In year three it's the whole game, and you don't have it.

The model swap is the easy part. The loop that produces the model swap is the part that compounds.

1499 chars