When an AI agent fails at a real workflow, the usual reflex is to reach for a stronger model. Bigger context window. More reasoning. A longer system prompt. Another approval step.
Sometimes that helps. But the more durable fix is often less glamorous: make the environment less ambiguous.
Anthropic’s recent research post, [“Paving the way for agents in biology”](https://www.anthropic.com/research/agents-in-biology), is useful because it moves the agent conversation out of generic demos and into a domain where small retrieval mistakes matter. The case study is viral sequence data. The lesson applies much more broadly: agents do not just need intelligence. They need deterministic rails for retrieving, filtering, preserving, and replaying the facts they act on.
A scientific agent that has to click through a browser, infer hidden filtering rules, reconcile inconsistent metadata, and preserve accessions correctly is not merely doing “reasoning.” It is driving through infrastructure built for expert humans. In that setting, a smarter model may still take a wrong turn.
The biology lesson: the substrate is part of the system
The underlying paper, [“Deterministic Access to Global Viral Sequence Data Enables Robust Agentic Scientific Discovery”](https://arxiv.org/pdf/2606.06749), focuses on NCBI Virus-style viral sequence retrieval. This is not a toy problem. Viral genome resources support outbreak response, genomic surveillance, vaccine design, evolutionary analysis, and model-training data construction.
The problem is that many high-value retrieval workflows remain optimized for interactive use. A human virologist may know which web filters to apply, which identifiers matter, which segment names are ambiguous, and which records should be treated as partial or complete. An agent sees a task and a maze.
The authors introduced two pieces of infrastructure:
- VirBench, a benchmark of 120 manually curated viral retrieval queries with ground-truth sequence counts.
- gget virus, an open-source command-line and Python retrieval layer designed to make those queries deterministic and reproducible.
The reported results are the important part. Without the deterministic retrieval layer, autonomous agent accuracy varied widely across systems, from 16.9% for Claude Sonnet 4 to 91.3% for GPT-5.5. With gget virus, accuracy rose to at least 90.0% across all evaluated systems and as high as 99.7% for GPT-5.5.
That is not just a benchmark win. It is an architecture lesson.
Once the retrieval layer became deterministic, model choice mattered less. The agent’s job shifted from improvising around human-oriented infrastructure to correctly invoking a tool and preserving its output. Errors were not magically eliminated, but the ones that remained moved into a smaller, more observable class: did the agent call the right interface, with the right parameters, and avoid reinterpreting the result?
That is exactly where production teams want failures to live.
Creativity belongs above the retrieval layer
The wrong takeaway is “biology needs better agents.” The better takeaway is that every serious agent workflow has layers that should not be creative.
A model can be creative when it forms hypotheses, drafts explanations, generates plans, or proposes next experiments. It should not be creative about whether a query returned 4,812 records or 4,821. It should not improvise around genome build conventions. It should not silently mix record classes because two databases use similar labels. It should not reconstruct browser filtering behavior from memory when the task requires exact retrieval.
The same pattern shows up outside biology.
Enterprise agents are often asked to answer questions across CRMs, ticketing systems, data warehouses, policy documents, contract repositories, and operational dashboards. These systems also contain hidden assumptions: stale fields, duplicate records, inconsistent names, permission-dependent views, undocumented filters, and workflows that live in someone’s muscle memory.
If the agent has to infer all of that at runtime, the organization has created a reasoning problem where it should have solved an interface problem.
Retrieval rails are the boring layer that makes agentic work reliable. They include:
- versioned query semantics;
- typed inputs and typed outputs;
- source snapshots or database-state references;
- stable identifiers and provenance fields;
- exact-match filtering rules;
- raw output preservation;
- replayable execution logs;
- benchmark tasks with known-good answers.
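Several of these rails fit in a few dozen lines. The following is a minimal sketch, assuming a hypothetical in-memory snapshot rather than any real database. The names `QUERY_VERSION`, `Record`, `RetrievalResult`, and `retrieve` are illustrative and are not taken from gget virus or the paper.

```python
from dataclasses import dataclass
import hashlib
import json

QUERY_VERSION = "v1"  # hypothetical label for versioned query semantics


@dataclass(frozen=True)
class Record:
    accession: str  # stable identifier
    host: str
    complete: bool


@dataclass(frozen=True)
class RetrievalResult:
    query_version: str
    snapshot_id: str   # reference to the source state that was queried
    filters: dict      # the exact filters that were applied
    accessions: tuple  # raw output, preserved verbatim in a fixed order
    digest: str        # fingerprint that a later replay can be checked against


def retrieve(snapshot_id: str, records: list[Record],
             host: str, complete_only: bool) -> RetrievalResult:
    """Exact-match filtering: no fuzzy matching and no case folding."""
    hits = tuple(
        r.accession
        for r in sorted(records, key=lambda r: r.accession)
        if r.host == host and (r.complete or not complete_only)
    )
    filters = {"host": host, "complete_only": complete_only}
    payload = json.dumps([QUERY_VERSION, snapshot_id, filters, hits], sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return RetrievalResult(QUERY_VERSION, snapshot_id, filters, hits, digest)
```

Sorting before filtering means the same snapshot and filters always yield the same ordered accessions and the same digest, regardless of the order in which records arrive.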
None of these feel like science fiction. That is the point. Reliable agents are going to depend on a lot of infrastructure that looks more like database engineering, data governance, QA, and observability than prompt craft.
Longer autonomy raises the price of ambiguity
This matters more as agents work for longer stretches. In [“Measuring AI agent autonomy in practice”](https://www.anthropic.com/research/measuring-agent-autonomy), Anthropic reports that the longest-running Claude Code sessions grew substantially over a short period, while typical sessions stayed short. The same analysis argues that effective oversight will require post-deployment monitoring infrastructure and better human-agent interaction patterns.
There is a subtle but important shift here. When an agent performs one small step, manual approval can catch a lot. When an agent performs a long chain of retrieval, transformation, analysis, and action, approving every micro-step becomes impractical. Oversight has to move into the design of the environment.
That means the system should make it easy to answer questions like:
- What exact data did the agent retrieve?
- Which filters were applied?
- Which source state was used?
- Did the agent preserve the raw result or summarize it prematurely?
- Can we rerun the same query and get the same answer?
- Did the agent deviate from the approved retrieval path?
If those questions cannot be answered, the organization does not have an autonomous workflow. It has a transcript and a hope.
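One way to make those questions answerable is an audit pass over logged retrievals. The sketch below assumes each retrieval is logged as a plain dict and that the underlying retrieval function is deterministic for a given snapshot. `APPROVED_TOOLS`, `log_entry`, and `audit` are hypothetical names, not part of any particular framework.

```python
import hashlib
import json

APPROVED_TOOLS = {"retrieve_records"}  # hypothetical allow-list of retrieval paths


def fingerprint(rows: list) -> str:
    """Stable hash of a raw result."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()


def log_entry(tool: str, params: dict, snapshot_id: str, rows: list) -> dict:
    """Record what was retrieved, how, and from which source state."""
    return {
        "tool": tool,
        "params": params,
        "snapshot_id": snapshot_id,
        "rows": rows,
        "digest": fingerprint(rows),
    }


def audit(entry: dict, rerun) -> list[str]:
    """Return a list of findings; an empty list means the entry checks out."""
    findings = []
    if entry["tool"] not in APPROVED_TOOLS:
        findings.append("unapproved retrieval path")
    if fingerprint(entry["rows"]) != entry["digest"]:
        findings.append("stored rows do not match logged digest")
    replayed = rerun(entry["params"], entry["snapshot_id"])
    if fingerprint(replayed) != entry["digest"]:
        findings.append("replay returned a different result")
    return findings
```

Here `rerun` is whatever callable re-executes the query against the recorded snapshot. The audit never consults the model's summary, only the logged raw rows.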
What to build before swapping models
For teams building agents around scientific, healthcare, compliance, finance, or operational data, the practical checklist is straightforward.
First, identify every browser-only or expert-memory step in the workflow. If a human has to say “go to this dashboard, click this filter, exclude those records, then export the CSV,” that is a candidate for a deterministic interface.
Second, define the retrieval contract. Inputs should be explicit. Outputs should be typed. Edge cases should be named. If the result depends on database state, record the state or snapshot reference. If filtering rules are domain-specific, encode them rather than asking the model to rediscover them.
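A retrieval contract of this kind might look like the sketch below. It is an assumption-laden illustration: `RetrievalRequest`, `Completeness`, `AmbiguousSegmentError`, and the `SEGMENT_ALIASES` table are hypothetical, and the alias values are placeholders rather than a vetted domain mapping.

```python
from dataclasses import dataclass
from enum import Enum


class Completeness(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    ANY = "any"


# Hypothetical alias table: a domain rule encoded once, not rediscovered per run.
SEGMENT_ALIASES = {"HA": "4", "segment 4": "4", "4": "4"}


class AmbiguousSegmentError(ValueError):
    """Named edge case: the segment label is not in the encoded alias table."""


@dataclass(frozen=True)
class RetrievalRequest:
    taxon: str
    completeness: Completeness
    segment: str
    snapshot_id: str  # the database state this request must run against

    def __post_init__(self):
        if not self.taxon.strip():
            raise ValueError("taxon must be explicit")
        if not self.snapshot_id:
            raise ValueError("snapshot_id is required when results depend on database state")
        if self.segment not in SEGMENT_ALIASES:
            raise AmbiguousSegmentError(self.segment)

    @property
    def canonical_segment(self) -> str:
        return SEGMENT_ALIASES[self.segment]
```

Rejecting an unknown segment label at construction time turns a silent guess by the model into a visible, named failure.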
Third, keep the raw result attached to the agent’s later reasoning. The agent can summarize, explain, or transform, but the system should preserve the original IDs, accessions, rows, documents, or records. This is what makes review and replay possible.
Fourth, test retrieval separately from reasoning. VirBench is powerful because it isolates a concrete failure mode: can the system retrieve the right records? Enterprise teams can do the same with known tickets, contracts, claims, orders, patients, or policy clauses. Do not wait for an end-to-end business outcome to discover that the agent’s first data pull was wrong.
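A retrieval-only test harness can be very small. The sketch below scores a retrieval function on exact count match against known-good answers, loosely in the spirit of ground-truth sequence counts. The cases and the name `retrieval_accuracy` are hypothetical, and this is not VirBench's actual scoring code.

```python
from typing import Callable

# Hypothetical benchmark cases: (query parameters, known-good record count).
CASES = [
    ({"host": "human"}, 2),
    ({"host": "bat"}, 1),
    ({"host": "camel"}, 0),
]


def retrieval_accuracy(retrieve: Callable[[dict], list], cases) -> tuple[float, list]:
    """Score retrieval alone, on exact count match, with no reasoning step involved."""
    failures = []
    for params, expected in cases:
        got = len(retrieve(params))
        if got != expected:
            failures.append((params, expected, got))
    return 1 - len(failures) / len(cases), failures
```

The returned failure list names each query that missed, along with the expected and actual counts, so a wrong first data pull shows up before any downstream analysis runs.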
Finally, govern the rail as infrastructure. NIST’s [AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework) emphasizes trustworthiness across design, development, use, and evaluation. Deterministic retrieval is one way to make that concrete. It turns vague agent reliability goals into inspectable controls.
The question to ask first
The most useful question is not “which model is smart enough to do this?”
It is: “is the environment precise enough for any model to do this reliably?”
If the answer is no, a better model may only hide the problem for longer. It may produce more fluent explanations of a wrong dataset. It may navigate the maze more confidently. It may fail less often, but still fail in ways that are hard to reproduce.
The biology work is a reminder that agent progress is not only a model curve. It is an infrastructure curve. When the retrieval layer becomes deterministic, the agent’s intelligence can be used where intelligence belongs: planning, interpretation, synthesis, and judgment.
For scientific and enterprise teams, that is the path from impressive demos to dependable workflows. Build the rails first. Then let the agent run.
Sources
- https://www.anthropic.com/research/agents-in-biology
- https://arxiv.org/pdf/2606.06749
- https://www.anthropic.com/research/measuring-agent-autonomy
- https://www.nist.gov/itl/ai-risk-management-framework