A support agent gets a refund request, apologizes nicely, confirms the amount, names the right policy, and closes with a warm sign-off. The transcript reads beautifully. Your LLM judge gives it a pass. And the refund never actually posted, because the agent called the wrong tool with a malformed argument and then narrated success anyway.

If you have shipped an AI agent into a real workflow, some version of this has already happened to you. The uncomfortable part is that most evaluation setups are structurally blind to it. They score what the agent said and how its tool trace looked, not the state the system would actually be in afterward. For agents that change things—refunds, account locks, address updates, access grants, case escalations, ticket closures—that gap is where production failures live.

Why transcript-first evaluation falls short

The default agent QA stack over-indexes on visible artifacts: the final response, a pass/fail label, the tool-call log, and a judge model's summary. These are all evidence. None of them is the outcome.

A recent survey on evaluating LLM-based agents makes this point at the level of methodology: agent evaluation has to go beyond measuring textual outputs and instead assess sequential decision-making in dynamic environments. Agents plan, call tools, mutate systems, and interact over multiple turns. A metric that only looks at the last message is measuring the smallest, least consequential part of that loop. The survey also flags that current methods have real gaps in cost-efficiency, safety, robustness, fine-grained evaluation, and scalability—and argues that future benchmarks need guardrail metrics that penalize agents reaching apparent success through non-compliant actions.

That phrase—apparent success—is the whole problem. A transcript optimizes for apparent success. A correct end state does not care how confident the prose was.

There is a related trap on the scaling side. Work on General AgentBench, which evaluates agents across a unified pool of search, coding, reasoning, and tool-use tasks, reports a clear performance drop when you move from narrow domain-specific tests to general-agent settings. It also describes two failure modes that should sound familiar to anyone tuning an agent: sequential scaling hits a context ceiling, and parallel scaling creates a verification gap, where a correct trajectory may exist among many attempts but the system can't reliably pick it out. The operational lesson is blunt. Throwing more attempts or longer traces at an agent doesn't buy reliability unless your evaluation can actually verify the right outcome. More tries without better verification just give you more plausible-looking transcripts to pick the wrong one from.
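
To make the verification gap concrete, here is a minimal sketch with made-up data. The `Attempt` type, the judge scores, and the dotted field names are all hypothetical. They are not taken from General AgentBench; they only illustrate how selecting by transcript quality and selecting by resulting state can disagree.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Attempt:
    name: str
    transcript_score: float        # hypothetical judge score for how well the reply reads
    final_state: dict[str, Any]    # state this attempt would leave behind


def pick_by_transcript(attempts: list[Attempt]) -> Attempt:
    """Select the attempt whose transcript scores highest."""
    return max(attempts, key=lambda a: a.transcript_score)


def pick_by_state(attempts: list[Attempt], expected: dict[str, Any]) -> Optional[Attempt]:
    """Select the first attempt whose final state equals the expected state."""
    return next((a for a in attempts if a.final_state == expected), None)


expected = {"refund.status": "posted", "refund.amount": 40.0}
attempts = [
    Attempt("a", 0.95, {}),  # polished reply, nothing actually posted
    Attempt("b", 0.80, {"refund.status": "posted", "refund.amount": 4.0}),  # wrong amount
    Attempt("c", 0.70, {"refund.status": "posted", "refund.amount": 40.0}),  # plain reply, right state
]

print(pick_by_transcript(attempts).name)       # a
print(pick_by_state(attempts, expected).name)  # c
```

The correct attempt is in the pool either way. Only the selector that checks state finds it.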

Make state the unit of evaluation

So shift the unit of evaluation from the transcript to the state the agent would leave behind. Instead of asking "did this sound right?", ask "what would our system look like if this ran, and is that the state we wanted?"

A useful concrete framing comes from a paper on Proxy State-Based Evaluation for multi-turn, tool-calling agents. The setup acknowledges a real tension up front: deterministic benchmarks wired to realistic backends are genuinely valuable, but they are expensive to build and costly to maintain, especially when workflows change every few weeks. The proposed alternative is to extract an LLM-inferred proxy final state from the complete interaction trace, then judge that state against explicit scenario constraints and an expected final state.
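
As a rough illustration of reading state out of a trace, the sketch below uses a deterministic stand-in for the tracker. This is an assumption made for the sake of a runnable example: the paper infers the proxy state with an LLM, and the trace format here (`type`, `ok`, `state_changes`) is hypothetical. The idea it shows is the same, though: state comes from what the tools actually did, not from what the agent said.

```python
from typing import Any


def reconstruct_state(trace: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold successful tool results into a state dict, ignoring narrated claims."""
    state: dict[str, Any] = {}
    for step in trace:
        if step.get("type") == "tool_result" and step.get("ok"):
            state.update(step.get("state_changes", {}))
    return state


# Hypothetical trace: a malformed tool call fails, then the agent narrates success.
trace = [
    {"type": "tool_call", "name": "create_refund", "args": {"amount": "forty"}},
    {"type": "tool_result", "ok": False, "error": "invalid amount"},
    {"type": "assistant", "text": "Your $40 refund has been posted."},
]

expected_final_state = {"refund.status": "posted", "refund.amount": 40.0}
proxy_state = reconstruct_state(trace)

print(proxy_state)                          # {}
print(proxy_state == expected_final_state)  # False
```

The final message claims a posted refund, but the reconstructed state is empty, so the scenario fails no matter how the transcript reads.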

What makes this more than another judge model is the discipline it imposes on the inputs. The framework uses a scenario schema with a defined shape: the user's goal, user facts, system or database facts, the expected final state, and the expected behavior. To write that scenario down, you are forced to answer the question most agent tests quietly skip—"what should be true in the system when this is over?" The framework then wires together a reasoning agent, a user simulator, tool simulators, a proxy state tracker, and judges to compare the reconstructed state against what you declared.
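
One way to write such a scenario down is a small typed record. This is a sketch, not the paper's schema. The five fields follow the shape described above, but the class name, field names, and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Scenario:
    """One state-changing test case (illustrative field names)."""
    user_goal: str
    user_facts: dict[str, Any]
    system_facts: dict[str, Any]
    expected_final_state: dict[str, Any]
    expected_behavior: list[str] = field(default_factory=list)


# Hypothetical refund scenario with invented values.
refund_scenario = Scenario(
    user_goal="Get a refund for a damaged item",
    user_facts={"order_id": "A-1001", "reason": "damaged"},
    system_facts={"order_total": 40.0, "refund_window_open": True},
    expected_final_state={
        "refund.status": "posted",
        "refund.amount": 40.0,
        "case.status": "closed",
    },
    expected_behavior=["confirm the amount before issuing the refund"],
)
```

Making `expected_final_state` a required field is a deliberate choice: a scenario cannot be created without answering what should be true when the interaction ends.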

The reported results are encouraging without being magic. In their experiments, the authors report human–LLM judge agreement above 90% and near-zero hallucination rates from the user and tool simulators. They also note that the same machinery can generate on-policy and off-policy rollouts for post-training. That last part matters for teams who want their evaluation harness to feed improvement, not just produce dashboards.

Where it fits in a real evaluation stack

Be clear about what proxy state evaluation is and is not. It is not deterministic verification. An LLM reconstructing "the refund was created for $40 and the case was closed" from a trace is making an inference, not reading your production database. In a high-risk domain—payments, healthcare, anything touching access control—an inferred state is not the same thing as ground truth, and you should not pretend otherwise.

What it is: a scalable staging layer for the messy middle. Picture three tiers.

  • Transcript and judge review is cheap, fast, and shallow. Keep it for tone, formatting, and obvious refusals.
  • Proxy state-based evaluation sits in the middle. It costs you the work of writing scenario schemas, but it catches the "looked right, did the wrong thing" failures that transcripts miss, and it scales to workflows that change too fast to justify a full simulator.
  • Deterministic integration tests against real or faithfully simulated backends are the top tier: slow and expensive to build, but the only thing that tells you the truth for your highest-risk paths.

The point is not to pick one. It is to stop using first-tier tooling, transcript review, to make third-tier decisions about your highest-risk paths.

What to implement this week

You don't need the full apparatus to get most of the value. A team can start with a few focused moves:

1. Pick five real scenarios that change state: a refund, an address update, a permission grant, a case escalation, and one where the correct answer is no action because policy forbids it. That last one catches over-eager agents.
2. Write the expected final state for each, in the schema spirit above: user goal, the relevant system facts going in, and exactly what should be true when the interaction ends.
3. Reconstruct the proxy state from each trace and judge it against those declared constraints, rather than scoring the final message.
4. Track unsupported state fields: any claim in the reconstructed state that the trace doesn't actually justify. A rising count of unsupported fields is an early warning that your agent, or your judge, is narrating instead of doing.
5. Promote your stable, high-risk scenarios out of proxy evaluation and into deterministic tests once they stop changing. Proxy state is where workflows live while they're still moving; integration tests are where they retire once they matter enough.
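
Steps 3 and 4 can be sketched in a few lines. The sketch rests on several assumptions: states are flat dicts of the changes the agent should make, the proxy state arrives with an `evidence` map from each field to the trace steps that support it, and any change not in the expected state counts as a failure. The function names and data shapes are hypothetical.

```python
from typing import Any

_MISSING = object()  # sentinel so an absent field never equals an expected None


def judge_state(expected: dict[str, Any], proxy: dict[str, Any]) -> dict[str, Any]:
    """Compare a reconstructed (proxy) state with the declared expected state."""
    mismatched = sorted(k for k, want in expected.items() if proxy.get(k, _MISSING) != want)
    unexpected = sorted(set(proxy) - set(expected))  # changes nobody asked for
    return {
        "passed": not mismatched and not unexpected,
        "mismatched": mismatched,
        "unexpected": unexpected,
    }


def unsupported_fields(proxy: dict[str, Any], evidence: dict[str, list[int]]) -> list[str]:
    """Fields in the proxy state that cite no trace step as evidence."""
    return sorted(k for k in proxy if not evidence.get(k))


expected = {"refund.status": "posted", "refund.amount": 40.0}
proxy = {"refund.status": "posted", "refund.amount": 40.0}
evidence = {"refund.amount": [2]}  # nothing in the trace supports refund.status

print(judge_state(expected, proxy)["passed"])               # True
print(unsupported_fields(proxy, evidence))                  # ['refund.status']
print(judge_state({}, {"refund.status": "posted"})["unexpected"])  # ['refund.status']
```

The last call is the no-action scenario: the expected state is empty, so any reconstructed change is flagged. The second call is the warning case, where the state matches but one field rests on nothing in the trace.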

The real frontier

It is tempting to think agent reliability is a model problem, solved by a bigger judge or one more retry. The evidence points the other way. The verification gap doesn't close with more attempts, and apparent success keeps passing review until something downstream breaks.

The more durable lever is unglamorous: better state contracts. Decide, before the agent runs, what should be true after it does—then evaluate against that. A transcript is evidence. The state is the verdict.

Sources

  • [Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents](https://arxiv.org/html/2602.16246v3)
  • [A Survey on Evaluation of LLM-based Agents](https://arxiv.org/html/2503.16416v2)
  • [Benchmark Test-Time Scaling of General LLM Agents (General AgentBench)](https://arxiv.org/html/2602.18998v1)

Build Agents That Prove Their Work

If you are wiring agent workflows into real operations, Alchemic can help design the checkpoints, traces, and validation gates that keep automation honest.
