For a few years, the story of enterprise AI was a story about access. Who had the licenses, who had the seats, who had unblocked the model at the firewall. That framing made sense when the question was whether people would use these tools at all. It no longer explains much. Access is mostly solved. The interesting gap now is what firms can actually hand off to AI systems—and whether they can do it without losing control of the work.

That shift is visible in the data. OpenAI's recently introduced [B2B Signals](https://openai.com/index/introducing-b2b-signals/), built on privacy-preserving, de-identified, aggregated enterprise usage, reports that "frontier firms" at the 95th percentile of AI usage now consume 3.5x as much intelligence per worker as typical firms, up from roughly 2x in April 2025. The lead is widening. But the more useful number is the one that explains why: message volume accounts for only about 36% of the frontier advantage. The rest comes from richer, more complex use—not from sending more prompts, but from doing harder things with each one.

More messages is not the lesson

It would be easy to read those figures as "the leaders just use more AI, so use more AI." That is the wrong takeaway, and the data argues against it directly. If volume explains only about 36% of the gap, then chasing volume alone closes roughly a third of it at most.
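To make that arithmetic concrete, here is a rough sketch of how far matching frontier message volume alone might take a typical firm. The 3.5x and 36% figures come from the text above. How the 36% share is computed is not stated there, so both readings below (a linear share of the excess over 1x, and a share on a log scale) are assumptions for illustration, not the method behind the published numbers.

```python
import math

FRONTIER_MULTIPLE = 3.5  # frontier usage per worker vs. a typical firm (from the text)
VOLUME_SHARE = 0.36      # share of the advantage attributed to message volume (from the text)


def additive_reading(multiple: float, share: float) -> float:
    """Assumption: the advantage is the excess over 1x, split linearly."""
    return 1.0 + share * (multiple - 1.0)


def multiplicative_reading(multiple: float, share: float) -> float:
    """Assumption: the advantage is a ratio, split on a log scale."""
    return math.exp(share * math.log(multiple))


volume_only_additive = additive_reading(FRONTIER_MULTIPLE, VOLUME_SHARE)
volume_only_multiplicative = multiplicative_reading(FRONTIER_MULTIPLE, VOLUME_SHARE)

print(f"additive reading:       {volume_only_additive:.2f}x")
print(f"multiplicative reading: {volume_only_multiplicative:.2f}x")
```

Under either reading, a firm that matched frontier volume and changed nothing else would land well short of 3.5x, which is the point of the paragraph above.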

The clearest signal of where the depth actually lives is in agentic tooling. According to the same OpenAI analysis, frontier firms send 16x as many Codex messages per worker as typical firms. That is not incremental chat usage. That is delegated execution—work that runs across code, documentation, data, and a company's internal context, often as multi-step tasks rather than single answers. The frontier firms are not winning because their people type more. They are winning because they have figured out how to give real work to a system and trust it to carry several steps of that work on its own.

Delegation is a control problem

The moment you delegate execution rather than ask for suggestions, the engineering question changes. A chatbot that drafts an email is low-stakes by construction. An agent that can edit a repository, call internal services, or reach the network is not—because the same capability that makes it useful makes it consequential.

OpenAI's account of [running Codex safely](https://openai.com/index/running-codex-safely/) at the company is a concrete picture of what that control surface looks like in practice. It describes operating agents with sandboxing, managed configuration, identity controls, command rules, and network access governed by policy—domains that are allowed, denied, or require approval—rather than open egress. The organizing principle is worth stating plainly: low-risk, everyday actions should be frictionless inside a bounded environment, while higher-risk actions should stop for review. Not every action carries equal weight, and treating them as if they do either grinds the agent to a halt or lets it act with more authority than anyone intended.
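As a minimal sketch of the allowed / denied / requires-approval idea, the snippet below models a three-way egress decision per destination host. It is not Codex's configuration format. The `EgressPolicy` class, the precedence order (deny, then approval, then allow), the closed-by-default fallback, and the example domains are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"


def _matches(host: str, domains: frozenset) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


@dataclass(frozen=True)
class EgressPolicy:
    """Hypothetical policy object; field names are illustrative."""
    allowed: frozenset = frozenset()
    denied: frozenset = frozenset()
    needs_approval: frozenset = frozenset()

    def decide(self, host: str) -> Decision:
        host = host.lower().rstrip(".")
        # Assumed precedence: an explicit deny always wins.
        if _matches(host, self.denied):
            return Decision.DENY
        # Higher-risk destinations stop for human review.
        if _matches(host, self.needs_approval):
            return Decision.REQUIRE_APPROVAL
        # Low-risk, everyday destinations pass without friction.
        if _matches(host, self.allowed):
            return Decision.ALLOW
        # Assumption: anything unlisted is closed by default.
        return Decision.DENY


policy = EgressPolicy(
    allowed=frozenset({"packages.example.com"}),
    denied=frozenset({"paste.example.net"}),
    needs_approval=frozenset({"partner-api.example.org"}),
)

for host in ("mirror.packages.example.com", "partner-api.example.org", "unknown.example.io"):
    print(host, "->", policy.decide(host).value)
```

The design choice worth noticing is the default: falling back to deny keeps the environment bounded, at the cost of someone having to maintain the allow list.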

One distinction from that post is especially useful for anyone instrumenting this work. Traditional security logs tell you what happened. Agent-native telemetry is built to help explain why an agent acted—what it was trying to do, what it saw, what it decided. When work is delegated, the "why" is exactly what you need to debug failures, calibrate trust, and decide what to delegate next.
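One way to picture the difference is a step record that carries the "why" fields next to the usual "what happened" field. The sketch below is an assumption about what such a record could contain. The `AgentStepRecord` name and its fields are hypothetical, not a schema from the OpenAI post.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentStepRecord:
    """Hypothetical per-step trace record for a delegated task."""
    run_id: str
    step: int
    goal: str         # what the agent was trying to do
    observation: str  # what it saw before acting
    decision: str     # what it decided, and on what basis
    action: str       # what happened (the part a traditional log would capture)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize to one JSON line, suitable for an append-only trace log."""
        return json.dumps(asdict(self), sort_keys=True)


record = AgentStepRecord(
    run_id="run-001",
    step=3,
    goal="make the failing unit test pass",
    observation="test_parse_date fails with ValueError on an empty string",
    decision="guard against empty input instead of changing the test",
    action="edited parse_date in utils.py",
)
print(record.to_json())
```

A traditional log would keep only the `action` line. The other three fields are what let a reviewer judge whether the agent acted for the right reason.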

The engineering loop is the bottleneck

If frontier firms are moving from pilots into production, the limiting factor is no longer model choice. It is the discipline around the agent. LangChain's [State of Agent Engineering](https://www.langchain.com/state-of-agent-engineering) report, fielded in late 2025, found that 57.3% of respondents already have agents in production, with another 30.4% actively building toward deployment. These are no longer teams running research demos.

The barriers they report are telling. Quality is the top production obstacle at 32%, with latency second at around 20%. These are the unglamorous concerns of systems that have to be reliable, not just impressive. And the maturity curve is uneven. Observability is now common: 89% report some form of it, and 62% have detailed tracing of agent steps and tool calls. Evaluation lags behind, with 52.4% running offline evaluations and only 37.3% running online ones. The report frames agent engineering as the iterative work of turning LLM-powered agents into reliable systems, and that framing matters here. Many teams can see what their agents did. Fewer can say, with rigor, whether it was good.
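For readers less familiar with the offline/online split, the sketch below shows the two shapes side by side: an offline run scores an agent against a fixed set of cases with known answers, while an online check grades a sample of live traces. Everything here is a simplifying assumption, including the exact-match scoring, the every-nth sampling, the stub agent, and the function names. Real evaluation suites typically use richer graders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    task: str
    expected: str


def offline_pass_rate(agent: Callable[[str], str], cases: list) -> float:
    """Score an agent against a fixed dataset before (or between) deployments."""
    if not cases:
        raise ValueError("offline evaluation needs at least one case")
    passed = sum(1 for case in cases if agent(case.task).strip() == case.expected)
    return passed / len(cases)


def online_pass_rate(
    traces: list, grader: Callable[[dict], bool], every_nth: int = 10
) -> Optional[float]:
    """Grade a deterministic sample of production traces; None if nothing was sampled."""
    sample = traces[::every_nth]
    if not sample:
        return None
    return sum(1 for trace in sample if grader(trace)) / len(sample)


def stub_agent(task: str) -> str:
    """Stand-in for a real agent call; gets one of the two cases wrong on purpose."""
    answers = {"2 + 2": "4", "capital of France": "Lyon"}
    return answers.get(task, "")


cases = [EvalCase("2 + 2", "4"), EvalCase("capital of France", "Paris")]
offline_score = offline_pass_rate(stub_agent, cases)

# Synthetic "production" traces: every 50th run records a tool error.
live_traces = [{"run": i, "tool_errors": 0 if i % 50 else 1} for i in range(100)]
online_score = online_pass_rate(live_traces, lambda t: t["tool_errors"] == 0)

print(f"offline pass rate: {offline_score:.2f}")
print(f"online pass rate:  {online_score:.2f}")
```

The two numbers answer different questions. The offline score says whether a change is safe to ship. The online score says whether what shipped is still behaving on real traffic.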

Oversight becomes intervention design

The human side of delegation is changing too, and not in the direction people often assume. Mature agent use is not hands-off. Anthropic's research on [measuring agent autonomy in practice](https://anthropic.com/research/measuring-agent-autonomy), drawn from millions of human-agent interactions across Claude Code and public API use, found that experienced users behave in a way that looks contradictory until you think about it. They grant more autonomy and interrupt more often. Auto-approval rises from roughly 20% of sessions among newer users to over 40% by around the 750-session mark, while interruptions climb from about 5% to about 9% of turns.
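Teams that want to track the same two signals internally could compute them from their own session logs. The sketch below assumes a simple log shape (one record per session, with a turn count and a count of interrupted turns). The `Session` type and the exact metric definitions are illustrative assumptions, not the methodology used in the research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical per-session summary pulled from an agent's logs."""
    auto_approve: bool      # was auto-approval enabled for this session?
    turns: int              # total agent turns in the session
    interrupted_turns: int  # turns where the human stepped in


def auto_approval_rate(sessions: list) -> float:
    """Share of sessions run with auto-approval enabled."""
    if not sessions:
        raise ValueError("need at least one session")
    return sum(1 for s in sessions if s.auto_approve) / len(sessions)


def interruption_rate(sessions: list) -> float:
    """Share of all turns, across sessions, that a human interrupted."""
    total_turns = sum(s.turns for s in sessions)
    if total_turns == 0:
        raise ValueError("need at least one turn")
    return sum(s.interrupted_turns for s in sessions) / total_turns


sessions = [
    Session(auto_approve=True, turns=30, interrupted_turns=4),
    Session(auto_approve=True, turns=20, interrupted_turns=2),
    Session(auto_approve=False, turns=25, interrupted_turns=2),
    Session(auto_approve=False, turns=15, interrupted_turns=1),
    Session(auto_approve=False, turns=10, interrupted_turns=0),
]

print(f"auto-approval: {auto_approval_rate(sessions):.0%} of sessions")
print(f"interruptions: {interruption_rate(sessions):.0%} of turns")
```

Note the different denominators: auto-approval is measured per session and interruption per turn, which is why both can rise at the same time.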

The point, as the research puts it, is that oversight does not mean approving every action. It means being positioned to intervene when it matters. Experienced operators stop rubber-stamping routine steps and instead watch for the moments that count. The same work also notes that agent-initiated clarification rises with task complexity—uncertainty recognition behaving as a safety property rather than a nuisance. Effective oversight, on this view, is post-deployment monitoring plus a well-designed point of intervention, not a permission prompt on every click.

A practical operating model

Put the four threads together and a working approach falls out—less a strategy than a sequence.

  • Start with bounded workflows. Pick work where low-risk actions can run freely and high-risk actions have explicit stops. The boundary is the product.
  • Build the control plane before you scale. Sandboxing, identity, network policy, approval policy, and telemetry are not afterthoughts; they are what make delegation safe enough to expand.
  • Instrument traces, then close the loop. Observability tells you what happened; pair it with offline and online evaluation so you can say whether it was good and improve it.
  • Measure delegated work, not seats. Replace "AI usage" dashboards with metrics that reflect depth: action completion, tool-call success, time-to-intervention, and rework.
  • Train for supervision, not approval. Teach teams to monitor, interrupt, and review traces rather than approve every step.
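
The depth metrics named in the list above can be sketched as a small aggregation over run records. The record shape, the field names, and the choice of a median for time-to-intervention are all assumptions made for illustration. Adapt them to whatever your tracing system actually emits.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical summary of one delegated run."""
    completed: bool                                 # did the run finish its task?
    tool_calls: int                                 # tool calls attempted
    tool_failures: int                              # tool calls that errored
    seconds_to_first_intervention: Optional[float]  # None if nobody intervened
    reworked: bool                                  # did a human redo the output?


def delegation_metrics(runs: list) -> dict:
    """Aggregate depth-of-delegation metrics over a batch of runs."""
    if not runs:
        raise ValueError("need at least one run")
    total_calls = sum(r.tool_calls for r in runs)
    total_failures = sum(r.tool_failures for r in runs)
    interventions = [
        r.seconds_to_first_intervention
        for r in runs
        if r.seconds_to_first_intervention is not None
    ]
    return {
        "action_completion": sum(1 for r in runs if r.completed) / len(runs),
        "tool_call_success": (
            (total_calls - total_failures) / total_calls if total_calls else None
        ),
        "median_time_to_intervention_s": median(interventions) if interventions else None,
        "rework_rate": sum(1 for r in runs if r.reworked) / len(runs),
    }


runs = [
    RunRecord(True, 8, 1, 30.0, False),
    RunRecord(True, 6, 1, None, False),
    RunRecord(True, 4, 0, 90.0, True),
    RunRecord(False, 2, 0, None, False),
]
metrics = delegation_metrics(runs)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

None of these numbers depend on how many seats a firm has licensed. They describe only what happened to the work that was handed off.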

The firms pulling ahead are not the ones with the most AI seats. They are the ones that know which work can be delegated, where it has to stop, who owns the result, and how to learn something from every run. Access got everyone to the starting line. What you can safely hand off is the race now.
