The Escalation Tax
Most teams that ship an agent learn the same lesson, and it shows up in ops. The demo looked great, early users were happy, the system seemed good enough. Then usage grows, tickets start piling up, and the agent turns into the place where every edge case and weird state finally matters.
When a customer reports something went wrong, the first move is predictable: pull the logs. What you usually get is the outside of the interaction---prompt, response, maybe a tool-call trace. What's missing is the part that decides whether you can debug this in an hour or in a week: what the agent saw at the moment it acted. The exact context snapshot. The upstream records as they existed then, not after someone reran a job or backfilled a table. The derived fields that quietly shaped the call.
One team I talked to spent three days chasing what looked like a hallucination in their refund agent. The model kept approving refunds for orders that didn't qualify. Turned out a schema migration had changed how order_status was populated---active orders were being marked "completed" for about six hours before anyone caught it. The model was behaving correctly given what it saw. The logs showed the decision, but they didn't show that the input was already wrong. By the time they found the root cause, they'd already tried two prompt rewrites and a temperature change.
If you can't reconstruct the decision environment, you end up arguing about the wrong thing. People debate whether the model was sloppy, whether the prompt was underspecified, whether the tool failed, when the system may have handed the agent a picture of the world that was already off.
Organizations respond to that uncertainty in a consistent way. Escalations increase, senior engineers get pulled into ad hoc triage, and fixes take the form of quick patches and narrow exceptions. The next incident isn't identical---it's close enough to trigger the same panic and different enough that the patch doesn't generalize. The agent may look 95% correct, but the remaining 5% eats an outsized share of engineering time, and that becomes the limiting factor on growth.
That compounding cost is the escalation tax. The scaling question stops being "how accurate is it?" and becomes "how much human attention does it require per unit of usage?"
The Misdiagnosis Trap
When people say "the agent hallucinated," they're usually describing the experience accurately: the system acted with confidence in a way that wasn't justified. The mistake comes in the next step, when "hallucination" becomes the default diagnosis and a smarter model becomes the only proposed remedy.
Sometimes the fix really is model work---better prompting, tighter guardrails, better reasoning. But many production failures aren't best explained as the model inventing facts. They're better explained as the model behaving consistently given inputs that were stale, incomplete, or quietly wrong. An integration starts dropping fields. A schema change shifts semantics without breaking downstream code. A cache that was meant as a convenience becomes a silent source of truth. A customer record drifts from reality and nobody notices until it drives a user-facing decision.
In those cases, the agent isn't the root cause. It's the component that makes upstream brittleness visible because it's the one doing the outward-facing work. If you can't see what it saw, you'll blame the model and spend time improving the wrong layer.
Legibility Isn't More Logs
Teams that notice this pattern often respond by adding logging. The instinct is reasonable. The problem is that raw logs often shift the burden rather than remove it: you collect more material, but you still need someone to manually stitch together what mattered.
What you actually want is not "everything that happened" but a coherent reconstruction of the moment the agent acted: the ability to rehydrate the decision environment after the fact---what inputs were in play, what state was observed, what intermediate tool outputs got consumed, and which transformations shaped the final action.
Legibility, in this sense, isn't a firehose. It's a record a human can read without archaeology. Here's what the agent believed. Here's what it did. Here's what it was optimizing for. Here's where uncertainty showed up. Without that, you can keep collecting traces and still pay the escalation tax, because interpretation stays bespoke and slow.
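To make that concrete, here's a minimal sketch of what a rehydratable decision record might look like. Everything here is illustrative---the class, field names, and the refund example are assumptions, not a prescription---but the shape is the point: capture what the agent saw at the moment it acted, and fingerprint the inputs so a replay can prove it's looking at the same world.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """Everything needed to rehydrate one agent decision after the fact."""
    agent: str
    action: str                      # what the agent did
    inputs: dict                     # the exact context it was handed
    observed_state: dict             # upstream records as they existed then
    tool_outputs: list = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Stable hash of what the agent saw, so replays can verify
        they are reconstructing the same decision environment."""
        blob = json.dumps(
            {"inputs": self.inputs, "observed_state": self.observed_state},
            sort_keys=True,
        )
        return hashlib.sha256(blob.encode()).hexdigest()[:12]

# Snapshot at decision time -- not what the database says later.
record = DecisionRecord(
    agent="refund-agent",
    action="approve_refund",
    inputs={"order_id": "A-1042", "amount": 89.00},
    observed_state={"order_status": "completed"},  # wrong, but faithfully captured
)
print(record.fingerprint())
```

With something like this in place, the refund-agent incident above becomes a one-line diff: the snapshot shows `order_status` was already "completed" when the agent acted, and the argument about prompts and temperature never starts.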
Closing the Loop
Even if you make incidents legible, there's a second trap: treating every failure as a one-off investigation. The workflow becomes a sequence of postmortems and patches that feel responsible but don't change the slope.
The teams that actually climb out do something less dramatic. They treat incidents as a stream of signal, not a pile of isolated stories. They bucket failures into recurring patterns---bad upstream data, permission mismatches, stale cache hits, ambiguous user intent---track which buckets are growing, and form explicit hypotheses about what would shrink them. The fix might live in a data contract, a cache TTL, a routing rule, a product constraint. The important part is knowing which lever you're pulling and measuring whether it reduced the category of failure that was eating your time.
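The bucketing itself doesn't need tooling to start---a tagged incident stream and a counter get you most of the way. A minimal sketch, with made-up data and bucket names:

```python
from collections import Counter

# Hypothetical incident stream: (week, failure_bucket) pairs from triage tagging.
incidents = [
    (1, "bad_upstream_data"), (1, "stale_cache"), (1, "ambiguous_intent"),
    (2, "bad_upstream_data"), (2, "bad_upstream_data"), (2, "stale_cache"),
    (3, "bad_upstream_data"), (3, "bad_upstream_data"), (3, "bad_upstream_data"),
]

def growing_buckets(incidents, recent_week):
    """Compare the latest week's bucket counts against the prior week's,
    returning {bucket: (prior_count, latest_count)} for growing buckets."""
    latest = Counter(b for w, b in incidents if w == recent_week)
    prior = Counter(b for w, b in incidents if w == recent_week - 1)
    return {b: (prior[b], n) for b, n in latest.items() if n > prior[b]}

print(growing_buckets(incidents, 3))  # → {'bad_upstream_data': (2, 3)}
```

The output is the prioritization: one bucket is compounding, the others are flat, so that's where the next fix---data contract, cache TTL, whatever the lever turns out to be---should land.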
In that setup, human review isn't just cleanup. It becomes the feedback signal that improves boundaries and behavior. If review doesn't feed improvement, review is just a permanent operating cost.
I haven't seen good data on how long this takes to pay off. My hunch is that most teams need eight to twelve weeks of consistent tagging before patterns become obvious enough to act on, but that's a guess.
Boundaries That Make Autonomy Work
Supervision is where many agent projects stall. The extremes are both unattractive: reviewing every micro-step defeats delegation, but full autonomy without guardrails is how you create expensive incidents that burn trust.
The workable middle is a layer of boundaries that are explicit, enforceable, and tied to stakes. Routine work runs end-to-end: a scheduling agent books the meeting, a triage agent routes the ticket, no human in the loop. Unusual work pauses: the refund amount is above $500, the customer is flagged, the confidence score is below threshold. High-stakes work requires approval before execution.
To make that real, the agent needs clear scope: what it can do without asking, what requires a sign-off, and what it must never touch. And it needs proportionate access---minimum tools and permissions, with limits that match the downside. An agent that processes expense reports doesn't need write access to payroll.
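Boundary checks like these are small enough to write as plain code. A sketch of the routing logic described above---the thresholds and field names are illustrative, and in practice they should come out of your own incident data, not a blog post:

```python
from dataclasses import dataclass

@dataclass
class RefundRequest:
    amount: float
    customer_flagged: bool
    confidence: float

def decide_route(req: RefundRequest) -> str:
    """Map a request to an execution mode based on stakes."""
    if req.amount > 500 or req.customer_flagged:
        return "require_approval"   # high stakes: a human signs off first
    if req.confidence < 0.8:
        return "pause_for_review"   # unusual: park it, don't act
    return "auto_execute"           # routine: run end-to-end, no human in the loop

print(decide_route(RefundRequest(amount=42.0, customer_flagged=False, confidence=0.95)))
# → auto_execute
```

The value isn't the three-line policy; it's that the policy is explicit, testable, and lives somewhere a reviewer can read it instead of being implied by prompt wording.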
Most people don't need a system that can do anything. They need one that does the right things reliably and fails in ways that are containable.
What People Aren't Ready For Yet
If you get the basics right---replayable decisions, a real feedback loop, boundaries that match stakes---you don't finish the problem. You earn the right to face the next layer.
One issue shows up when multiple agents start touching the same shared state. It's tempting to assume coordination will emerge naturally because the agents can communicate freely. In practice the system oscillates. Either nothing happens because everyone's waiting on confirmation, or things happen too quickly on unshared assumptions and you discover the mismatch after state has already changed. Durable coordination requires ownership, explicit handshakes, definitions of what counts as agreement, and rules for conflict. Process, in other words, not bandwidth.
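One concrete form that process can take is versioned writes: every agent names the version of the state it read, and a write based on a stale read fails loudly instead of silently clobbering a peer's update. A toy sketch (the class and field names are invented for illustration):

```python
class SharedRecord:
    """Shared state with optimistic concurrency: writes must name the
    version they were based on, so mismatched assumptions surface
    before state changes rather than after."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        if expected_version != self.version:
            return False  # a peer acted on this state first; renegotiate
        self.value = new_value
        self.version += 1
        return True

ticket = SharedRecord({"status": "open"})
_, v = ticket.read()
# Agent A acts first; agent B's write, based on the same snapshot, is rejected.
assert ticket.compare_and_set(v, {"status": "closed"})
assert not ticket.compare_and_set(v, {"status": "escalated"})
```

This doesn't solve coordination---it just converts silent conflicts into explicit ones, which is the precondition for the handshakes and ownership rules above to mean anything.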
Another issue emerges when you attach a number to delegated action. "Close tickets faster." "Reduce costs." "Increase engagement." A delegate will search for strategies that satisfy the target, including strategies that satisfy it in ways you didn't intend. The dashboard looks good while trust quietly erodes, because proxy metrics are being optimized at the expense of the underlying goal. Avoiding that requires checks that reflect what you actually care about, not just the metric that's easy to measure.
And then there's the environment itself. A single person using an agent to dispute charges is helpful. Millions of people doing it simultaneously shifts the equilibrium. Platforms add filters. Users route around them. Platforms add defenses. Volume rises, experience falls, and the system degrades into an arms race. Agents don't operate in a static sandbox. They operate in a world that adapts to their presence.
What Actually Matters
The shift from "AI answers questions" to "AI takes action" looks like a feature upgrade. It's better understood as a change in what kind of system you're building. A model that answers questions can be judged on output quality. A delegate has to be judged on how it behaves under uncertainty, how it fails, and how expensive it is to operate.
The breakthrough isn't that an agent can act. It's that you can live with what happens when it does---less about cleverness and more about discipline, the boring work of making failures legible enough that you can actually fix them.