There is a wide gap between an assistant that can summarize a prior authorization policy and an agent that is supposed to actually move a prior authorization case through a working stack of clinical applications. The first is a reading-comprehension exercise. The second is a job — one where rules are dense, roles shift, parties multiply, and certain steps cannot be taken back. Most agent demos still live in the first world. The interesting question is what happens when we drop a model into the second.
A new benchmark called [χ-Bench](https://arxiv.org/html/2605.16679v1) attempts exactly that. It evaluates agents on end-to-end healthcare operations across three settings: provider prior authorization, payer utilization management, and registered-nurse care management. The tasks run inside χ-World, a simulator that includes 20 healthcare apps, 151 REST APIs, 87 MCP tools, 3 MCP servers, about 50 simulated patients, around 90 simulated workers, roughly 5,000 chart activities, and approximately 115,000 lines of Python. Each task arrives with a clinical case and a 1,279-document managed-care operations handbook attached as a retrievable skill. No real patient data is involved; the value is that the operational shape of the work — the policies, the role handoffs, the multi-party messaging — is preserved.
What χ-Bench is really stressing is a set of capabilities that most agent benchmarks underweight: policy density (large rule libraries the agent has to actually apply), multi-role composition (the same workflow touches providers, payers, nurses, and patients), and multilateral interaction (the agent talks with several parties whose interests do not align). These are exactly the properties of administrative healthcare work, and they are also the properties that make it economically painful.
The headline numbers are sobering. Across 30 agent harness and model configurations, the best agent resolved only 28.0% of tasks. No configuration cleared 20% on the strict pass^3 metric. When the agent was asked to run every task in a single continuous session, performance collapsed to 3.8%. And critically, χ-Bench reflects a real constraint of this work: many handoffs are terminal. Once a step is submitted or routed, it cannot be edited or re-run. The agent does not get to try again.
It is tempting to read those numbers as a familiar “LLMs aren’t there yet” story. That framing misses the more useful point. The failure mode is not generic reasoning weakness. It is that the agent has to hold a large, versioned rule set in mind, stay inside a particular role’s scope, coordinate with other parties whose state is changing, and refuse to take terminal actions when the evidence is thin. Those are operational skills, not chat skills, and they have to be designed into the system around the model — not assumed of the model itself.
The broader literature is converging on the same instinct. [Claw-Eval-Live](https://arxiv.org/html/2604.28139v2), a live agent benchmark for evolving real-world workflows, argues that final answers are simply not strong evidence that an agent did the right thing. Its grading is action-grounded: execution traces, service audit logs, post-run workspace state, and deterministic checks, with LLM judges reserved only for semantic dimensions deterministic checks cannot cover. The current release includes 105 tasks, 13 public models, 18 controlled services with sandboxed workspaces, 22 task families, 87 service-backed workflow tasks, and 18 workspace repair tasks. The best model passes 66.7% of tasks, and service-backed workflows are notably harder than self-contained workspace repair. The lesson generalizes well beyond that benchmark: if you cannot inspect what the agent did to the system, you do not actually know whether it succeeded.
A complementary lens comes from the interpretability side. [Beyond the Black Box: Interpretability of Agentic AI Tool Use](https://arxiv.org/html/2605.06890v1) proposes pre-action internal monitoring for tool decisions, using sparse autoencoders and linear probes to ask, before a call is executed, whether a tool is needed at all and how risky the next action is likely to be. The authors argue that tool decisions leave readable traces in model state before they ever touch the outside world, and they evaluate the approach on GPT-OSS 20B, Gemma 3 27B instruction-tuned, the NVIDIA Nemotron function-calling dataset, and zero-shot transfer to BFCL. The framework is explicitly positioned as visibility into the tool-decision boundary, not a replacement for external evaluation. For healthcare operations, that boundary is the interesting one: it sits right before the agent writes a chart entry, submits a determination, or routes a case onward.
Stack these three perspectives and a practical design pattern falls out. The first place to use agents in healthcare operations is not the terminal step; it is everything that leads up to it.
- Start with reversible work. Use agents to gather chart context, pull the right policy snippets from the handbook, draft member letters, summarize clinical notes for human reviewers, and prepare evidence bundles. Nothing in this layer commits the organization to a decision.
- Scope agents by role. A utilization management agent should not have the tool affordances of a provider agent or a care management nurse agent. χ-Bench’s multi-role design is a reminder that role boundaries are part of the workflow, not a UI detail. Permissioning should follow.
- Version your policy library. The handbook in χ-Bench is 1,279 documents, and that is a simulator. Real managed-care policy is bigger and constantly changing. Treat the policy corpus as a versioned artifact, log which version the agent retrieved against, and make that version part of the audit record.
- Require evidence bundles for terminal handoffs. Before any submission, routing, or escalation, force the agent to assemble the underlying citations, chart references, and policy clauses. A human reviewer should be able to read the bundle and reproduce the decision without re-doing the work.
- Log traces and state changes, not just answers. Borrow Claw-Eval-Live’s posture: capture execution traces, service-side audit logs, and post-action state. The agent’s narrative summary is not evidence; the trace is.
- Add pre-action review gates for high-risk or low-confidence calls. This is where pre-action monitoring earns its keep. When risk or uncertainty crosses a threshold on a step that is irreversible, the right behavior is to pause, surface the evidence, and require human sign-off — not to ship the action and hope.
None of this is glamorous. It is plumbing. But it is the plumbing that lets a 28%-on-its-best-day system do useful work today without compounding errors into irreversible administrative damage. The χ-Bench result is not a verdict that healthcare agents are unworkable; it is a verdict that ungated autonomy is unworkable. Those are very different claims.
The first durable wins in healthcare agentic AI will probably not look like the demos. They will look like controlled operations infrastructure: scoped agents producing well-cited drafts, traceable actions on reversible steps, versioned policy retrieval, and a tight human review surface at every terminal handoff. That is a less exciting story than “autonomous healthcare administration,” but it is the one that survives contact with the policy maze.
Sources
- https://arxiv.org/html/2605.16679v1
- https://arxiv.org/html/2604.28139v2
- https://arxiv.org/html/2605.06890v1
Build Agents That Prove Their Work
If you are wiring agent workflows into real operations, Alchemic can help design the checkpoints, traces, and validation gates that keep automation honest.
Get the Field Guide - $10 ->