Most agent benchmarks tell you whether the run finished. They rarely tell you why it didn't. That gap is the subject of a new arXiv preprint, "Exploration and Exploitation Errors Are Measurable for Language Model Agents" (arXiv:2604.13151, submitted April 14, 2026) from Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, and Yong Jae Lee, with affiliations at the University of Wisconsin–Madison, KRAFTON, and Ludo Robotics. The authors argue, and show within the bounds of their setup, that exploration error is much more strongly associated with failure than exploitation error in this benchmark, and that it can be measured directly from action trajectories without knowing the policy inside the model.

If you build agents, this is a useful lens. It reframes the usual "did it pass?" question into "where is the behavior breaking?"

Paper status: this is an arXiv preprint that has not been peer reviewed; treat the findings as provisional until an independent publication confirms them. The benchmark is intentionally symbolic and semantics-free, so the results are best read as a clean behavioral probe, not a full real-world capability ranking.

What it does

The paper does two things.

First, it defines a policy-agnostic framework that quantifies exploration error and exploitation error from action trajectories alone. You don't need logits, chain-of-thought traces, or access to the model's decision process. You just need the sequence of actions the agent took in an environment where the ground-truth progress structure is known.

Second, it releases a controllable benchmark: partially observable 2D grid maps paired with unknown task DAGs (directed acyclic graphs of subtasks the agent has to discover and complete). The environment is symbolic and semantics-free by design. That's the point: it isolates the exploration/exploitation dynamic from whatever priors the model has about English words, cooking verbs, or office workflows.
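As a rough illustration of that task structure (the subtask names and dependency shape below are invented, not taken from the paper's benchmark), a hidden task DAG can be modeled as a prerequisite map that the agent only learns by acting:

```python
# Illustrative hidden task DAG: each subtask maps to the prerequisites
# that must be completed first. The agent never sees this structure;
# it has to discover it through interaction with the grid.
TASK_DAG = {
    "A": [],          # root subtask, no prerequisites
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],  # final subtask, requires both branches
}

def available(done: set) -> set:
    """Subtasks whose prerequisites are all complete and are not yet done."""
    return {t for t, prereqs in TASK_DAG.items()
            if t not in done and all(p in done for p in prereqs)}

done = set()
while (ready := available(done)):
    done |= {next(iter(ready))}   # complete one available subtask per "step"
assert done == set(TASK_DAG)      # every subtask reachable in topological order
```

The agent's job, in this framing, is to infer something like `TASK_DAG` from observations while also navigating the grid, which is exactly where exploration and exploitation can each go wrong.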

Code is released at github.com/jjj-madison/measurable-explore-exploit.

Why it matters

Pass/fail numbers flatten everything. An agent that fails because it gets stuck in loops near its starting cell looks identical, on a scorecard, to an agent that explores the whole map but can't execute the final step. Those are very different failure modes, and they call for very different fixes.

The paper's headline finding is that these two failure modes are not equally common in frontier models. Across 13 models tested, success rate is strongly associated with exploration error (R² = 0.947) and only weakly associated with exploitation error (R² = 0.006). In this benchmark, at this moment in model development, the thing separating strong agents from weak ones is whether they explore well, not whether they can cash in on what they've discovered.

That's a directional claim agent teams can act on. If you're choosing where to spend eval and tooling budget, "is this agent exploring its environment effectively?" appears to be the higher-leverage question.

How it works

The metric is built around what the authors call a stale score. On steps where the agent is making no measurable progress on the DAG, the score decomposes as S_t = c_t + e_t + n_t, where c_t captures new loops closed, e_t penalizes edges reused beyond benign backtracking, and n_t penalizes nodes revisited beyond benign repeat visits. The terms are designed so that ordinary, justifiable backtracking, the kind a competent explorer does to get unstuck, doesn't count against the agent. Pathological loops and redundant retreading do.
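The paper defines c_t, e_t, and n_t precisely; those exact definitions aren't reproduced here, but a hedged behavioral analogue of the e_t and n_t terms, with an invented per-item "benign visit" allowance standing in for the paper's backtracking exemption, might look like:

```python
from collections import Counter

def stale_penalties(trajectory, benign_edge_visits=2, benign_node_visits=2):
    """Rough analogue of the e_t/n_t terms: count edge reuses and node
    revisits beyond a benign-backtracking allowance. The allowances and
    the exact accounting here are illustrative, not the paper's formulas."""
    edge_counts, node_counts = Counter(), Counter()
    e_total = n_total = 0
    prev = None
    for node in trajectory:
        node_counts[node] += 1
        if node_counts[node] > benign_node_visits:
            n_total += 1                      # redundant retreading of a cell
        if prev is not None:
            edge = frozenset((prev, node))    # treat moves as undirected edges
            edge_counts[edge] += 1
            if edge_counts[edge] > benign_edge_visits:
                e_total += 1                  # pathological loop or edge reuse
        prev = node
    return e_total, n_total
```

On a clean path like `["a", "b", "c", "d"]` this returns `(0, 0)`, while an agent ping-ponging between two cells accumulates penalties on both terms once it exceeds the allowance.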

Exploration and exploitation errors are then normalized by how much of the trajectory actually required each mode. That normalization matters: a short path that happens to avoid loops isn't credited for exploring well if exploration wasn't needed. The numbers are behavioral summaries of the trajectory, not measurements of the policy itself.

The benchmark environment, partially observable 2D grids plus hidden task DAGs, is deliberately abstract. There are no cooking recipes or office semantics for the model to lean on. The agent has to navigate, observe, and plan purely from structural cues. That's what makes the metric clean, and it's also the principal caveat we'll come back to.

Key results

Thirteen frontier models were evaluated: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-5.4, GPT-5.4 mini, GPT-5.4 nano, Gemini 3.1 Pro, Gemini 3 Flash, Gemini 3.1 Flash Lite, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, and GPT-OSS-120B.

Stronger reasoning models consistently perform better, with the best reaching 100% success. Notably, Claude Opus 4.6 and Gemini 3.1 Pro both hit 100% success but do so with visibly different behavior: Claude tends to exploit known information more directly, while Gemini continues exploring unobserved cells for longer before committing. Same score, different strategy, which is exactly the kind of distinction a pass/fail eval would flatten.

The prompt ablation is instructive. In a GPT-4.1 ablation, the authors test base, exploration-focused, exploitation-focused, and balanced prompts. Exploration-focused prompts reduce exploration error and deliver the highest overall success rate in that ablation. Exploitation-focused prompts reduce exploitation error but don't move the top-line success number as much. Given the R² split above, this is consistent: exploration is where the headroom is.

The harness-engineering results are the most practically useful numbers in the paper. Explicit, lightweight changes to how the agent is scaffolded, not retraining and not a new model, produce large gains:

  • Gemini 3.1 Flash Lite: 51.9% → 88.9% success, exploration error 0.172 → 0.030, exploitation error 0.135 → 0.071, steps 94.3 → 68.0.
  • GPT-4.1: 63.0% → 92.6% success, exploration error 0.297 → 0.053, exploitation error 0.160 → 0.044, steps 92.5 → 66.1.

Both models improve on success rate, both error types, and step count at once. That's a strong signal that a meaningful chunk of "model quality" in agentic settings is actually "harness quality" in disguise.

There's also a smaller, more suggestive experiment on a cooking-task variant where semantic labels are restored. GPT-4.1's success rises from 15.0% to 45.0% with semantics, while Gemini 3.1 Flash Lite's success stays at 25.0%, but its exploitation error drops from 0.091 to 0.015 and its exploration error rises from 0.181 to 0.241. Read cautiously, this hints that different models absorb semantic priors differently: some convert them into better execution, others into more aggressive exploration. It's a single small setup, and the authors are appropriately restrained in what they claim from it.

Practitioner takeaways

  1. Instrument behavior, not just outcomes. Even a rough exploration/exploitation decomposition on your own eval traces will tell you more than aggregate success rate. The paper's metric is one concrete template.
  2. Prompt for the failure mode you actually have. Exploration-focused prompting measurably reduces exploration error in this setup. If your agent is getting stuck short of completion rather than fumbling the last step, prompting for exploration is a cheap first move.
  3. Harness engineering is leverage. Two different base models each gained roughly 30 points of success rate from scaffolding changes alone. Before swapping models, it's worth auditing what the agent can actually see, how it's prompted to act, and how its observations are summarized between steps.
  4. Don't assume equal scores mean equal behavior. Claude Opus 4.6 and Gemini 3.1 Pro both hit 100% here with different strategies. Picking between them for production should probably consider which failure mode you'd rather debug.
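Takeaway 1 can start very small. A sketch of what that instrumentation might look like on your own logged traces, assuming each trace is a list of `(state, made_progress)` records (the record shape is invented for illustration, and this is a crude proxy, not the paper's metric):

```python
def rough_decomposition(trace):
    """Split stale steps (no progress) into 'looping' vs 'other' by whether
    the state was already visited. A crude proxy for exploration trouble,
    not the paper's normalized error metric."""
    seen, looping, other_stale, total = set(), 0, 0, 0
    for state, made_progress in trace:
        total += 1
        if not made_progress:
            if state in seen:
                looping += 1        # stuck retreading known states
            else:
                other_stale += 1    # stale, but at least in new territory
        seen.add(state)
    return {"loop_rate": looping / total,
            "stale_rate": (looping + other_stale) / total}
```

A rising `loop_rate` across a run is the kind of early warning signal a pass/fail score hides entirely.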

Our take

This is an arXiv preprint, and it has not been peer reviewed unless or until an independent publication is confirmed. The benchmark is intentionally symbolic and semantics-free, which is exactly what lets the metric be clean, and exactly why the numbers shouldn't be read as a direct measure of end-to-end real-world utility. The normalized error metrics are behavioral summaries; they complement success rate rather than replacing it. The main experiments use three seeds, and the authors note a larger budget would likely produce cleaner trends.

With those caveats, the core idea is good, and the harness-engineering numbers are hard to argue with. Most agent teams are under-instrumented, and "did it finish?" is a weak signal to debug against. A policy-agnostic metric you can compute from trajectories you're already logging is the kind of thing that should migrate from a paper into internal eval dashboards fairly quickly. The finding that exploration appears to be the dominant bottleneck in this benchmark is the sort of directional claim that changes where a pragmatic team spends its next month, and that's the test a preprint like this has to pass to matter.

Reference: Park et al., "Exploration and Exploitation Errors Are Measurable for Language Model Agents," arXiv:2604.13151.

Building or auditing agent evals?

Our OpenClaw field guide covers agent instrumentation, harness design, and practical evaluation patterns for systems that need more than pass/fail metrics.

Get the Field Guide — $10 →