If you run an agent platform, you have probably noticed something uncomfortable. Two agents with nearly identical descriptions behave nothing alike. One returns a clean structured result on a finance lookup. The other hangs on rate limits, swallows the error, and emits a polite hallucination. Your retrieval layer, which scored them as 0.91 and 0.89 against the user query, has no idea anything is wrong.

The marketing copy is not the agent

AgentSearchBench is a new preprint and benchmark that puts numbers on this gap. It assembles 9,759 agents with 7,867 executable interfaces, 2,952 executable task queries, 259 task descriptions, and a corpus of 66,740 execution runs. The goal is straightforward and a little subversive. Stop grading agent retrievers on whether their top result reads like the request. Start grading them on whether the agent they picked actually completes the task.

For anyone shipping a routing layer in 2026, this is the more honest evaluation. It is also the harder one.

Why description-based agent search breaks

Tool retrieval, the well-studied cousin of this problem, works reasonably well with embedding similarity because tools are mostly atomic. A single function with a fixed signature either matches a request or it does not. Agents are not like that. The benchmark frames the difference clearly. Agent capability is compositional and execution-dependent. A documented capability like "summarizes financial filings" sits on top of model choice, prompt scaffolding, retrieval plumbing, retries, tool permissions, and whatever the author last tweaked at three in the morning.

Three failure modes follow from that.

The first is overclaim. Agent cards advertise capabilities the underlying implementation cannot reliably deliver. Embedding the description gets you a confident match to something that fails on contact.

The second is underclaim. An agent that quietly handles a niche capability never gets surfaced because nobody wrote the right keywords into its README.

The third is the compositional gap. A request can require chaining behaviors that no single description summarizes. The textual signal is partial by construction. Real relevance only resolves at execution time.

AgentSearchBench's response is to ground its labels in actual runs. Relevance is whether an agent successfully completes a task on its executable interface. Documentation alignment, the gap between what an agent says it does and what it actually does on shared inputs, becomes an auxiliary signal rather than the primary one. That single design decision is what separates this benchmark from yet another similarity-flavored leaderboard.
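
For concreteness, here is a minimal sketch of what execution-grounded labeling looks like in spirit. The names and fields are mine, not the benchmark's schema: an agent is marked relevant to a task only if a recorded run on its executable interface completed that task, and documentation alignment is tracked as a separate, auxiliary number.

```python
from dataclasses import dataclass

# Illustrative only: these names and fields are assumptions, not the
# benchmark's actual schema.

@dataclass
class ExecutionRun:
    agent_id: str
    task_id: str
    completed: bool   # did the run pass the task's success check?

def relevance_label(runs: list[ExecutionRun], agent_id: str, task_id: str) -> int:
    """Relevant iff at least one recorded run on the agent's executable
    interface completed the task. Description text plays no role here."""
    return int(any(r.completed for r in runs
                   if r.agent_id == agent_id and r.task_id == task_id))

def doc_alignment_rate(claims: dict[str, bool], observed: dict[str, bool]) -> float:
    """Auxiliary signal: fraction of shared inputs where the documented claim
    matched what the agent actually did."""
    shared = claims.keys() & observed.keys()
    return sum(claims[k] == observed[k] for k in shared) / len(shared) if shared else 1.0
```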

How the benchmark is built

The numbers worth keeping in mind:

  • 9,759 agents, with 7,867 of them exposing an interface you can actually call.
  • 2,952 task queries paired with 259 longer task descriptions.
  • 66,740 execution runs grounding the relevance labels.

That last number is the load-bearing one. Execution traces let the benchmark distinguish a plausible match from a working match. They also create the headroom to evaluate retrievers and rerankers under two different prompt regimes. The short task queries look like what real users type into a router. The longer task descriptions look like what an autonomous planner might emit when delegating a subtask, with more context, more constraints, and more compositional structure.

Those two regimes are not interchangeable. The benchmark shows the best retriever drops from 28.87 NDCG@20 on task queries to 17.21 on task descriptions. A safe read is that longer, richer requests are not easier for retrievers, even though they carry more semantic content. They expose more of the compositional surface where description similarity stops being a useful proxy.

The leaderboard tells on itself

The published leaderboard is more interesting than its headline rankings. ToolRet leads retrieval on both task queries and task descriptions, which is a respectable result for a model that was already strong on the tool-retrieval side. The fun starts in the reranking column.

RankGPT GPT-5.2 dominates task-description reranking on the standard ranking metrics. Its NDCG@5 of 64.66 and NDCG@20 of 84.69 leave the rest of the field well behind. Then you look at Completeness@5 and it reads 0.00. The reranker that wins by every conventional measure of ordering quality fails to put a single agent that can actually complete the task in its top five.

That is not a typo. That is the entire thesis of the benchmark in one row.

For contrast, Qwen Reranker 4B leads task-query Completeness@5 at 62.50. Different objective, different winner. If you are a platform team and you have been quietly using NDCG as a proxy for "did we route this well," the AgentSearchBench numbers should make you nervous.
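
If the split between those two metrics feels abstract, a toy computation makes it concrete. The linear-gain NDCG and the single-query Completeness definition below are my simplifications, not the benchmark's code, but they show how a ranking can score perfect NDCG against textual relevance grades while leaving zero working agents in the top five.

```python
import math

def dcg(gains, k):
    # Simplified linear-gain DCG; the exact gain formulation is an assumption.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def completeness_at_k(completed, k):
    # Assumed definition: does any agent in the top k actually complete the task?
    return float(any(completed[:k]))

# Hypothetical ranking: textual relevance grades say the top results are great...
text_grades = [3, 3, 2, 2, 1, 0, 0, 1, 0, 0]
# ...but execution says none of the top five can finish the task.
can_complete = [False, False, False, False, False, True, False, True, False, False]

print(ndcg_at_k(text_grades, 5))           # 1.0: ordering looks perfect on paper
print(completeness_at_k(can_complete, 5))  # 0.0: no working agent in the top five
```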

The execution-aware probing results push further on this. When the benchmark surfaces execution evidence into the reranking signal, BGE Reranker v2 moves from 57.93 to 58.16 NDCG@5. Tool-Rank 8B goes from 60.82 to 61.71. Qwen Reranker 4B goes from 60.96 to 61.91. The lifts are small in absolute terms, and the authors are careful not to overclaim. What matters is the direction. Execution-aware probing helps across architectures. The signal is real even when it is not yet large.
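
Mechanically, surfacing execution evidence to a reranker can be as simple as appending a short run summary to the candidate text it scores. The format below is a guess at one way to wire that, not the paper's prompt:

```python
def rerank_input(query: str, agent_card: str, recent_runs: list[dict]) -> str:
    """Build the text a reranker scores. The evidence block is an assumption
    about how execution-aware probing could be wired, not the paper's format."""
    if recent_runs:
        completed = sum(r["completed"] for r in recent_runs)
        last_error = next(
            (r["error"] for r in reversed(recent_runs) if not r["completed"]), "none"
        )
        evidence = (f"Recent probe runs: {completed}/{len(recent_runs)} completed. "
                    f"Last failure: {last_error}")
    else:
        evidence = "No execution evidence yet."
    return f"Query: {query}\nAgent card: {agent_card}\nExecution evidence: {evidence}"
```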

What this means for builders

Two practical implications fall out of the benchmark design.

The first is that NDCG-style rankings are not enough. If your routing evaluation only tracks ordering quality against textual labels, you can ship a model that wins your dashboard and silently routes to agents that cannot finish the work. Whatever offline metric you use needs a completion-grounded counterpart, even if you have to bootstrap it from a small set of executable probes.

The second is that documentation-performance alignment is its own asset. The benchmark treats it as auxiliary, but in production it is closer to a leading indicator. Agents whose actual behavior drifts from their cards are the ones that surprise routers in bad ways. A registry that tracks this gap, even crudely, gives a router something to discount.
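
One crude but workable version of that tracking: keep a rolling record, per agent, of whether documented claims matched probe outcomes, and flag drift when the rate slips. The class below is a hypothetical helper with arbitrary window and threshold, not anything from the benchmark.

```python
from collections import deque

class AlignmentTracker:
    """Rolling doc-vs-behavior alignment per agent. Hypothetical helper;
    the window size and drift threshold are arbitrary defaults."""

    def __init__(self, window: int = 50):
        self.history: dict[str, deque] = {}
        self.window = window

    def record_probe(self, agent_id: str, claimed: bool, succeeded: bool) -> None:
        # A probe "aligns" when the documented claim matches what actually happened.
        self.history.setdefault(agent_id, deque(maxlen=self.window)).append(claimed == succeeded)

    def alignment(self, agent_id: str) -> float:
        runs = self.history.get(agent_id)
        return sum(runs) / len(runs) if runs else 1.0  # no evidence yet: assume aligned

    def drifting(self, agent_id: str, threshold: float = 0.8) -> bool:
        return self.alignment(agent_id) < threshold
```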

A reasonable working pattern, consistent with what the benchmark suggests, looks something like this (a code sketch follows the list):

  • Keep the description embedding as a fast first-pass filter.
  • Run a reranker that has access to recent execution outcomes, not just text.
  • Maintain a per-agent record of completion rates on probe tasks similar to incoming queries.
  • Track the doc-versus-behavior gap and let it act as a soft penalty when an agent's textual claims have started to outpace its actual runs.
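
A compressed sketch of that pipeline is below. Every field name and weight is an assumption to be tuned against your own traffic, not a value from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AgentRecord:
    """Hypothetical registry entry; fields are assumptions, not a spec."""
    agent_id: str
    embedding: np.ndarray          # unit-normalized description embedding
    probe_completion_rate: float   # rolling completion rate on similar probe tasks
    doc_alignment: float           # doc-vs-behavior alignment in [0, 1]

def route(query_embedding: np.ndarray, registry: list[AgentRecord],
          k_candidates: int = 50, k_final: int = 5) -> list[AgentRecord]:
    # 1. Fast first-pass filter on description embeddings (dot product is cosine
    #    similarity here because vectors are assumed unit-normalized).
    sims = {a.agent_id: float(query_embedding @ a.embedding) for a in registry}
    candidates = sorted(registry, key=lambda a: sims[a.agent_id], reverse=True)[:k_candidates]

    # 2. Execution-aware rescoring: blend textual similarity with recent probe
    #    completions, and discount agents whose docs have drifted from behavior.
    #    The 0.5 / 0.5 / 0.3 weights are placeholders, not benchmark values.
    def score(a: AgentRecord) -> float:
        drift_penalty = 0.3 * (1.0 - a.doc_alignment)
        return 0.5 * sims[a.agent_id] + 0.5 * a.probe_completion_rate - drift_penalty

    return sorted(candidates, key=score, reverse=True)[:k_final]
```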

None of this requires throwing out your current retriever. It does require admitting that the retriever cannot be the whole story.

Limitations worth keeping in mind

This is a preprint and a benchmark. The execution runs ground the relevance labels in something more honest than human judgment of descriptions, but they are still bounded by the test environments the authors built. Agents that depend on flaky external services, on stateful sessions, or on long horizons will be measured imperfectly. The 259 task descriptions are a small set compared to the 2,952 task queries. The execution-aware probing lifts are modest, and the leaderboard will move as more rerankers are evaluated. Treat the absolute numbers as a snapshot, not a verdict.

The conceptual point survives all of these caveats. A retriever that never sees execution evidence is operating on a partial signal by design.

The practitioner takeaway

If you are building an agent registry today, the most useful thing AgentSearchBench gives you is permission to stop treating description similarity as the primary signal. The benchmark's most striking row, a top task-description reranker scoring zero on Completeness@5 while topping NDCG, is the kind of result that should change a roadmap rather than just a slide.

Bring execution evidence into the loop. Track the gap between what your agents say and what they do. Evaluate your routing on whether the chosen agent finished the work, not on whether its card matched the request. The infrastructure for this is genuinely more involved than a vector index, but the alternative is a registry that confidently misroutes at scale.

If you want a concrete walkthrough of the patterns we use to wire execution traces, completion probes, and doc-behavior drift into a working router, the OpenClaw guide is where Alchemic keeps that playbook current.

Build a Registry That Knows What Actually Worked

If you are wiring up an agent layer and feeling the limits of description-only search, the OpenClaw guide from Alchemic walks through execution-aware routing patterns, agent registries, and practical orchestration design. Read the guide and bring evidence into your routing loop.

Get the Field Guide — $10 →