A new arXiv paper points at one of the most annoying failure modes in agent systems: the run looks plausible, the agent says it succeeded, and the actual workflow quietly skipped something important. For anyone building browser agents, coding agents, or UI automation, that is not a philosophical problem. It is the exact reason agent QA feels brittle.

“Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents,” by Reshabh K Sharma, Gaurav Mittal, and Yu Hu, proposes a practical way to check agent behavior without hand-writing every assertion. The idea is simple enough to steal: record a few passing runs, learn which states are essential, tolerate harmless variation, then validate future runs against the learned structure instead of trusting the agent’s self-report.

The paper was submitted to arXiv on May 4, 2026, and appeared in the May 7 cs.AI feed. It is not a sweeping benchmark across every agent domain. It is a controlled study around a VS Code extension workflow. That boundary matters. But the architecture is useful because it matches the shape of a real production problem: agents do not always take the same path, and exact replay is too fragile, but some milestones must happen if the task was actually completed.

The Problem With Agent Self-Reports

A lot of agent evaluation quietly depends on the agent telling you whether it succeeded. That works until it doesn’t. A computer-use agent can time out, misread a UI state, click through the wrong branch, or land on a screen that looks close enough to fool its own summary. The final answer may say “done” even when a required step never happened.

Traditional UI testing has its own failure modes. Exact sequence matching breaks when a loading screen appears in one run and not another. Pixel-perfect screenshot comparison breaks on small rendering differences. Manual assertions are expensive because someone has to specify every valid path through a nondeterministic workflow.

The paper’s core move is to stop treating one successful run as a script to replay. Instead, it treats several successful runs as examples of a behavior class. The question becomes: across these valid executions, which states are required, and which states are optional variation?

That framing is the useful part. Production agent testing needs to distinguish “the workflow took a different valid path” from “the workflow skipped the thing that made the task true.”

How The Method Works

The proposed system has three main phases.

First, it captures 2 to 10 known-good execution traces. In the paper’s UI-testing implementation, each trace is a sequence of screenshots plus actions such as clicks and keystrokes. The authors note that the state representation is domain-dependent, so a coding agent might use code snapshots, while another automation system might use API responses or sensor readings.
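To make the shape of that data concrete, here is a minimal sketch of what a captured trace might look like in the UI-testing setting. The class names and fields are illustrative assumptions, not taken from the paper, and the state snapshot could just as easily be a code diff or an API response.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One observed state plus the action that led out of it."""
    screenshot_path: str   # state snapshot; domain-dependent in general
    action: str            # e.g. "click(save)" or "type('invoice 42')"

@dataclass
class Trace:
    """A single known-good execution of the workflow."""
    run_id: str
    steps: list[Step] = field(default_factory=list)

# Hypothetical capture: a handful of passing runs of the same workflow,
# recorded by whatever harness drives the agent.
passing_traces = [
    Trace("run-01", [Step("run01/000.png", "open_app"),
                     Step("run01/001.png", "click(search)"),
                     Step("run01/002.png", "type('invoice 42')"),
                     Step("run01/003.png", "click(save)")]),
    # ... more passing traces
]
```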

Second, each trace is converted into a Prefix Tree Acceptor, a graph structure from automata learning where nodes represent observed states and edges represent transitions. Multiple passing traces are then merged into one graph. This is where the system handles nondeterminism: one trace might include a loading screen, another might skip it, and both can still converge on the same meaningful application state.
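A rough sketch of the merge step, building on the Trace and Step types above. The paper learns a Prefix Tree Acceptor and merges states with the tiered equivalence check described next; this simplified version collapses that into a single states_equivalent callback and grows one shared graph directly.

```python
import itertools

class Node:
    """A state in the merged behavior graph."""
    _ids = itertools.count()

    def __init__(self, state):
        self.id = next(Node._ids)
        self.state = state   # representative snapshot for this state
        self.edges = {}      # action -> successor Node

def build_merged_graph(traces, states_equivalent):
    """Fold every passing trace into one graph. A step reuses an existing
    successor whenever its state is equivalent to one already seen from
    this node; otherwise the graph grows a fresh branch."""
    root = Node(state=None)
    for trace in traces:
        node = root
        for step in trace.steps:
            nxt = next((child for child in node.edges.values()
                        if states_equivalent(child.state, step.screenshot_path)),
                       None)
            if nxt is None:
                nxt = Node(step.screenshot_path)
            node.edges.setdefault(step.action, nxt)
            node = nxt
    return root
```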

The merge step uses a tiered equivalence check. Fast visual metrics come first, including perceptual hash similarity, structural similarity, and pixel-change ratios. When the visual signal is ambiguous, the system can ask a multimodal language model whether two screenshots are semantically equivalent. Minor window decoration differences should not split the state graph. A different validation error or missing control should.
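A hedged sketch of what such a tiered check could look like in Python, using the imagehash and scikit-image libraries. The thresholds and the llm_judge callback are illustrative assumptions, not values from the paper.

```python
import numpy as np
from PIL import Image
import imagehash                                  # pip install imagehash
from skimage.metrics import structural_similarity

def states_equivalent(path_a, path_b,
                      phash_cutoff=6, ssim_cutoff=0.98, llm_judge=None):
    """Tiered check: cheap visual metrics first, LLM only when ambiguous."""
    if path_a is None or path_b is None:
        return path_a == path_b
    img_a, img_b = Image.open(path_a), Image.open(path_b)

    # Tier 1: perceptual hash distance, robust to tiny rendering noise.
    if imagehash.phash(img_a) - imagehash.phash(img_b) <= phash_cutoff:
        return True

    # Tier 2: structural similarity on grayscale, same-size images.
    a = np.asarray(img_a.convert("L").resize((256, 256)))
    b = np.asarray(img_b.convert("L").resize((256, 256)))
    if structural_similarity(a, b) >= ssim_cutoff:
        return True

    # Tier 3: ask a multimodal model whether the screens mean the same thing.
    if llm_judge is not None:
        return llm_judge(path_a, path_b)          # hypothetical callable
    return False
```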

Third, the system extracts a dominator tree from the merged graph. In graph terms, a state dominates another state if every path to the second state must pass through the first. In product terms, dominators are the “you cannot honestly be done unless this happened” milestones. Optional loading screens fall away. Required screens, actions, or results stay.
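Once the merged graph exists, the dominator computation itself is small. The sketch below leans on networkx's immediate_dominators and assumes the success terminal node is known, for example the final node of any passing trace; the paper builds a full dominator tree, while this version just walks one dominator chain and returns the milestones in execution order.

```python
import networkx as nx

def essential_states(merged_root, terminal_id):
    """Milestones every valid run must pass through: the dominators of the
    success state, ordered from the start of the workflow to the end."""
    g = nx.DiGraph()
    g.add_node(merged_root.id)
    stack, seen = [merged_root], {merged_root.id}
    state_of = {merged_root.id: merged_root.state}
    while stack:
        node = stack.pop()
        for child in node.edges.values():
            g.add_edge(node.id, child.id)
            state_of[child.id] = child.state
            if child.id not in seen:
                seen.add(child.id)
                stack.append(child)

    idom = nx.immediate_dominators(g, merged_root.id)

    # Walk the dominator chain upward from the success state, then reverse
    # it so milestones come out in execution order.
    chain, n = [], terminal_id
    while n != merged_root.id:
        chain.append(state_of[n])
        n = idom[n]
    return list(reversed(chain))
```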

When a new execution trace comes in, the system checks whether its states contain a topological subsequence matching the dominator tree. Extra states are allowed. Missing essential states are not. The output is not just a verdict; the paper describes explainable results including coverage percentage, matched states, missing states, and a validation explanation.
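Validation then reduces to a subsequence check plus an explainable report. The sketch below simplifies the paper's topological match against the dominator tree into an in-order scan over the milestones returned by essential_states above; the report fields are illustrative, loosely mirroring the outputs the paper describes.

```python
def validate_trace(trace, required_in_order, states_equivalent):
    """Check that the new trace hits every essential state, in order.
    Extra states are fine; missing ones fail the run."""
    observed = [step.screenshot_path for step in trace.steps]
    matched, missing, cursor = [], [], 0
    for milestone in required_in_order:
        hit = next((i for i in range(cursor, len(observed))
                    if states_equivalent(observed[i], milestone)), None)
        if hit is None:
            missing.append(milestone)
        else:
            matched.append(milestone)
            cursor = hit + 1
    coverage = len(matched) / max(len(required_in_order), 1)
    return {
        "passed": not missing,
        "coverage": round(coverage, 3),
        "matched_states": matched,
        "missing_states": missing,
    }
```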

Why This Is Better Than Replay

The practical advantage is that the method does not confuse variation with failure.

Imagine a browser agent opening an app, authenticating, searching for a record, editing a value, and saving. One run may show a spinner. Another may load instantly. One run may trigger a harmless confirmation modal that another run avoids. A replay test can become a mess of special cases, and a screenshot diff can fail for the wrong reasons.

A dominator-style validator asks a cleaner question: did the execution pass through the essential states in a valid order and end in the right kind of terminal state?

That is closer to how humans review an agent trajectory. We do not care whether the spinner appeared. We care whether the agent actually reached the record, made the edit, and saw the save complete. The paper turns that intuition into an explicit structural check.

What The Case Study Shows

The paper evaluates the approach with a controlled synthetic case study around a custom VS Code extension. The passing setup uses one virtual machine with the extension installed. The failing setup uses another virtual machine without the extension installed, simulating a product bug scenario.

The authors build the dominator tree from three passing traces, then evaluate the remaining 25 traces. Those traces include 14 failing traces and 11 passing traces. The failing group includes 3 agent issues and 11 product bugs. The paper also checks cases where the Computer Use Agent self-assessment is wrong, including a false success and a missed bug.

In that controlled setting, the dominator-tree validator reports 100% detection accuracy across the measured categories: false successes, agent issues, product bugs, and missed bugs. Against the agent self-assessment baseline, the paper reports 82.2% accuracy, 83.3% precision, 60.0% recall, and 69.8% F1 for the Computer Use Agent, versus 100% on all four metrics for the validator.

Those numbers are striking, but they should be read with the study boundary attached. This is not a broad proof that the technique will score perfectly in production. The authors are explicit that the benchmark is controlled, synthetic, and small in some categories. One false-success example and one missed-bug example are useful signals, not statistical bedrock.

The safer takeaway is still strong: an independent structural validator caught failures that the agent’s own self-report missed, using only three passing traces as the behavioral reference.

The Implementation Lesson

For builders, the direct lesson is to add an evidence layer between “agent says done” and “system accepts done.”

A lightweight version could look like this. Record a small set of passing traces for a critical workflow. Identify state snapshots or structured milestones that represent genuine progress. Let optional states vary. Validate future runs against the required milestones, and report exactly which milestone was missing when a run fails.
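As a concrete starting point, here is a minimal, hand-rolled version of that evidence layer. The milestone names and the run dictionary are hypothetical; the point is that acceptance depends on observed evidence, not on the agent's summary.

```python
# Hypothetical milestones for one critical workflow, named by hand.
# Each predicate inspects whatever evidence the run leaves behind
# (screenshots, logs, API responses) and says whether the milestone happened.
MILESTONES = [
    ("authenticated",  lambda run: "login_success" in run["events"]),
    ("record_opened",  lambda run: run.get("record_id") is not None),
    ("value_edited",   lambda run: run.get("diff") not in (None, "")),
    ("save_confirmed", lambda run: "save_ok" in run["events"]),
]

def accept_run(run):
    """Independent check between 'agent says done' and 'system accepts done'."""
    missing = [name for name, happened in MILESTONES if not happened(run)]
    return {"accepted": not missing, "missing_milestones": missing}
```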

You do not need to copy the whole paper to get value. A production system could start with manually named checkpoints, then evolve toward learned dominators once enough passing traces exist. For UI agents, screenshots and accessibility trees may be enough. For coding agents, checkpoints might be file diffs, test outputs, and command states. For API workflows, checkpoints might be response schemas, database state, or event logs.

The important shift is independent verification. The validator should not be the same agent that performed the task, and it should not accept fluent summaries as evidence.

Where The Paper Is Limited

The limitations are practical and worth respecting.

The current implementation is strongest for visual UI state. Backend services need different representations, such as API responses, database rows, logs, or event streams. Semantic equivalence checking also introduces an LLM dependency and cost, though the paper treats visual metrics as the first pass and reserves the model for ambiguous comparisons. The method requires passing traces, so it cannot learn purely from failures. It also does not model timing constraints, which limits it for performance-sensitive flows.

The case study itself is intentionally controlled. A custom VS Code extension is useful for measurement, but real agent environments bring messier failures: flaky services, partial permissions, authentication drift, hidden state, and ambiguous ground truth. The distinction between “agent issue” and “product bug” can also blur in production.

None of that kills the idea. It just keeps the claim honest. This is a pattern to adapt, not a magic QA layer to install blindly.

The Takeaway

The best line to draw from the paper is this: agent validation should be about observed essential behavior, not agent confidence.

As agents move deeper into browser work, software maintenance, and business operations, the failure mode will not always be dramatic. More often, the agent will do most of the right-looking things, skip one essential step, and produce a clean success message. That is exactly where structural validation earns its keep.

For teams building agent workflows now, the playbook is clear. Capture successful executions. Learn or define the milestones that must happen. Tolerate harmless variation. Reject runs that miss essential states. And never let “the agent said it finished” be the final test.

Build Agent QA Around Evidence

If your automations need trace validation, completion checks, or agent workflow hardening, Alchemic Technology builds practical systems that verify the work, not just the final message.


Sources

Sharma, Mittal, and Hu, “Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents,” arXiv:2605.03159v1

arXiv HTML version used for section-level claims