The easiest way to make an AI automation roadmap look sophisticated is to add more agents. A planner agent. A researcher agent. A critic agent. A tool agent. A manager agent to coordinate the agents. Before long, the diagram looks credible enough to sell and complicated enough that nobody can tell whether it works.

That is exactly the problem.

For enterprise AI, agent count is not a strategy. It is an operating expense. Every additional agent adds context, prompts, latency, tool calls, coordination rules, failure modes, and logs someone has to understand later. Sometimes that expense is worth it. Often it is not. The question is no longer “Can we build a multi-agent system?” The useful question is: what measured lift does each extra agent produce after the baseline, tools, contracts, logging, and accounting are held constant?

A new paper, [“Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows”](https://arxiv.org/abs/2606.05670), is useful because it moves the conversation away from agent theater and toward measurement. The authors introduce BenchAgent, an evaluation substrate for comparing single-agent, fixed multi-agent, evolving multi-agent, and runtime-generated workflows under normalized conditions. In plain English: if you want to know whether more agents helped, first make sure the competing systems are using the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging.

That sounds obvious. It is not how many agent comparisons are made.

When one workflow has better tools, a looser answer format, a different evaluator, or invisible accounting, “multi-agent improvement” can really mean “different experimental setup.” BenchAgent’s central idea, called workflow lift, asks what changes when a single-agent workflow is replaced by a multi-agent workflow while the surrounding substrate stays matched. Under those substrate-internal conditions, the paper reports that at most one of six tested multi-agent systems beats the matched single-agent anchor on benchmark-balanced average accuracy. That case, EvoAgent, is reported as a small +1.44 point lift within one-run Wilson uncertainty guidance. The other five trail the single-agent anchor by 2.56 to 11.29 points and often sit on worse accuracy-cost tradeoffs.
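To make workflow lift concrete, here is a minimal sketch of the arithmetic. It assumes that "benchmark-balanced average accuracy" means an unweighted mean of per-benchmark accuracies, and it uses a standard Wilson score interval as the uncertainty check. The function names, the benchmark names, and every number in the example are hypothetical; this is not the paper's code or data.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def balanced_accuracy(results: dict[str, tuple[int, int]]) -> float:
    """Mean of per-benchmark accuracies, so every benchmark counts equally.

    `results` maps a benchmark name to (correct, total).
    """
    return sum(correct / total for correct, total in results.values()) / len(results)


def workflow_lift(single: dict[str, tuple[int, int]], multi: dict[str, tuple[int, int]]) -> float:
    """Lift in percentage points of the multi-agent run over the single-agent anchor."""
    return 100 * (balanced_accuracy(multi) - balanced_accuracy(single))


# Hypothetical results from one run of each workflow on the same two benchmarks.
single_agent = {"bench_a": (60, 100), "bench_b": (30, 50)}
multi_agent = {"bench_a": (63, 100), "bench_b": (30, 50)}

lift = workflow_lift(single_agent, multi_agent)
low, high = wilson_interval(60, 100)
print(f"lift: {lift:+.2f} points")
print(f"single-agent bench_a accuracy interval: {low:.3f} to {high:.3f}")
```

In this made-up example the multi-agent accuracy on `bench_a` (0.63) sits well inside the single-agent interval, which is the situation where a small positive lift should not be read as a win.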

That does not prove multi-agent systems are bad. It shows they are not automatically good.

The same paper also reports a separate protocol-aligned external GAIA snapshot where a Claude-Code-style runtime workflow performs very strongly. That result matters because it points to the more nuanced lesson: fixed role choreography is not the same as task-adaptive delegation. A workflow that creates the right temporary specialists, scopes their tools, records their trajectories, and can recover from local failures is very different from a static group chat with impressive role names.

This is where enterprise teams need a new artifact: the multi-agent ROI ledger.

Before promoting a multi-agent workflow into production, the team should be able to show seven things.

First, the single-agent baseline. Not a strawman. Not a weak prompt nobody would ship. A competent baseline with the same model family, same tools, same answer contract, and same scoring method.

Second, the quality lift. What improved: task success, accuracy, review quality, recall, citation fidelity, resolution time, or human acceptance rate? If the improvement disappears across task mixes, the architecture is probably overfit to a demo.

Third, the token and dollar cost. Anthropic’s writeup on [building a multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) is refreshingly direct about this. It describes a research architecture where a lead agent coordinates subagents and a citation agent, and it reports strong internal gains for complex research work. But it also says agents typically use about 4× more tokens than chat interactions, while multi-agent systems use about 15× more tokens than chats. That is not a footnote. That is the invoice.
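A rough sketch of what those multipliers do to a per-task bill follows. The 4× and 15× figures come from the Anthropic writeup quoted above; the baseline token count and the price per million tokens are placeholder assumptions, and real usage will vary by task.

```python
# Approximate token multipliers relative to a chat interaction (from the Anthropic post).
AGENT_MULTIPLIER = 4
MULTI_AGENT_MULTIPLIER = 15


def estimated_cost(chat_tokens: int, price_per_million_tokens: float, multiplier: float = 1) -> float:
    """Estimated dollar cost of one task, scaled from a chat-sized token count."""
    return chat_tokens * multiplier * price_per_million_tokens / 1_000_000


# Hypothetical inputs: a 2,000-token chat interaction at $10 per million tokens.
chat_tokens = 2_000
price = 10.0

chat_cost = estimated_cost(chat_tokens, price)
agent_cost = estimated_cost(chat_tokens, price, AGENT_MULTIPLIER)
multi_cost = estimated_cost(chat_tokens, price, MULTI_AGENT_MULTIPLIER)

print(f"chat: ${chat_cost:.2f}  agent: ${agent_cost:.2f}  multi-agent: ${multi_cost:.2f}")
print(f"extra cost per task for going multi-agent: ${multi_cost - agent_cost:.2f}")
```

The per-task difference looks small until it is multiplied by monthly volume, which is why the ledger should carry both the per-task and the aggregate figure.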

Fourth, latency. Parallel agents can reduce wall-clock time when the work is genuinely breadth-first. They can also make a simple task slower by turning one decision into five handoffs and three reconciliations. Latency is not only a user-experience issue; it affects retry behavior, queueing, support load, and whether humans trust the system enough to use it.

Fifth, the tool-call footprint. More agents often means more external actions: more searches, more API reads, more writes, more database queries, more browser steps, more opportunities to hit rate limits or touch the wrong system. If the workflow cannot show which agent called which tool, under what scope, and with what result, the team does not have an architecture. It has a fog machine.
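One way to keep that record is a small ledger that checks each call against a per-agent allowlist before logging it. This is a minimal sketch, not a reference to any particular framework: `ToolCall`, `ToolLedger`, and the agent and tool names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    agent: str   # which agent made the call
    tool: str    # which tool it called
    scope: str   # the scope it ran under, e.g. "read-only"
    result: str  # short summary of what came back


@dataclass
class ToolLedger:
    allowed: dict[str, set[str]]  # agent name -> tools that agent may call
    calls: list[ToolCall] = field(default_factory=list)

    def record(self, agent: str, tool: str, scope: str, result: str) -> ToolCall:
        """Log a tool call, refusing tools outside the agent's allowlist."""
        if tool not in self.allowed.get(agent, set()):
            raise PermissionError(f"{agent} is not allowed to call {tool}")
        call = ToolCall(agent, tool, scope, result)
        self.calls.append(call)
        return call

    def footprint(self) -> dict[str, int]:
        """Number of logged tool calls per agent."""
        counts: dict[str, int] = {}
        for call in self.calls:
            counts[call.agent] = counts.get(call.agent, 0) + 1
        return counts


ledger = ToolLedger(allowed={"researcher": {"web_search"}, "writer": set()})
ledger.record("researcher", "web_search", "read-only", "3 results")
ledger.record("researcher", "web_search", "read-only", "0 results")
print(ledger.footprint())
```

In a real system the ledger would also capture timestamps and denied attempts, but even this much answers the question of which agent called which tool, under what scope, and with what result.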

Sixth, traceability. The answer contract, trajectory logs, stage summaries, and termination reasons should be first-class outputs. A production agent system needs to explain not only what it answered, but how it got there. This aligns with the broader 2026 agent-stack framing in O’Reilly’s [“The AI Agents Stack”](https://www.oreilly.com/radar/the-ai-agents-stack-2026-edition/): production agents are not just models. They depend on protocols, tools, memory, frameworks, evaluation, observability, guardrails, and safety.

Seventh, failure recovery. This is where some multi-agent designs can earn their keep. If a subagent explores a bad path, can the coordinator isolate that failure and recover? Or does a bad early handoff poison the rest of the workflow? The BenchAgent paper’s distinction between fixed multi-agent systems and runtime-generated workflows is important here. The value is not “more personalities.” The value is controlled decomposition with local failure boundaries.
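A local failure boundary can be as simple as the coordinator running each subtask in its own guarded call and recording failures instead of letting them propagate. The sketch below is illustrative only: `run_isolated` and the subtask names are hypothetical, and a production coordinator would add retries, budgets, and timeouts.

```python
from typing import Callable


def run_isolated(subtasks: dict[str, Callable[[], str]]) -> tuple[dict[str, str], dict[str, str]]:
    """Run each subtask independently so one failure cannot poison the others.

    Returns (results, failures), each keyed by subtask name.
    """
    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    for name, task in subtasks.items():
        try:
            results[name] = task()
        except Exception as exc:  # the boundary: record the failure, keep going
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def search_filings() -> str:
    return "2 relevant filings"


def search_news() -> str:
    raise TimeoutError("news source did not respond")


results, failures = run_isolated({"filings": search_filings, "news": search_news})
print("completed:", results)
print("failed:", failures)
```

The coordinator can then decide whether to retry the failed path, substitute another source, or report the gap, which is the recovery behavior the ledger should ask a multi-agent design to demonstrate.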

So when should a multi-agent system survive the ledger?

The strongest candidates are high-value, breadth-first, evidence-heavy tasks. Research across many sources. Competitive intelligence. Long-horizon investigations. Compliance discovery. Incident analysis. Work where independent exploration can happen in parallel, where the information does not fit cleanly into one context window, and where synthesis benefits from separate retrieval paths. Anthropic’s research-system post makes this case clearly: multi-agent systems are most economically viable when task value is high, work can be parallelized, information exceeds one context window, and many tools or sources are involved.

The weakest candidates are routine linear automations. If the task has a stable sequence of steps, use a workflow. If it needs one API call and a response template, use one model call. If every agent must constantly share the same context to avoid drifting, the coordination overhead may erase the benefit. Anthropic’s [“Building Effective AI Agents”](https://www.anthropic.com/research/building-effective-agents) gives the right default posture: start with the simplest solution possible and increase complexity only when needed. It also warns that frameworks can hide prompts and model responses, making debugging harder when the system fails.

A practical rollout pattern is simple.

Start with a strong single-agent or scripted workflow baseline. Add one delegation boundary at a time. Give each delegated worker a narrow objective, a scoped tool set, an output contract, and a budget. Log token use, wall-clock time, tool calls, intermediate outputs, and final quality. Compare against the baseline on the same tasks. Keep the extra agent only if it creates durable lift that is worth its cost.
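The keep-or-drop decision in that pattern can be written down as an explicit rule. The sketch below is one possible shape for it, with hypothetical names and placeholder thresholds (`min_lift`, `max_token_ratio`) that each team would set for itself. It treats "durable" as "the lift holds on every task mix", which is an assumption rather than a standard definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    quality: float       # score in [0, 1] on the shared task set
    tokens: int          # total tokens consumed
    wall_clock_s: float  # end-to-end latency in seconds
    tool_calls: int      # number of external tool calls


def keep_extra_agent(
    baseline: RunMetrics,
    candidate: RunMetrics,
    min_lift: float = 0.02,
    max_token_ratio: float = 3.0,
) -> bool:
    """Keep the added agent only if quality rises enough and cost stays bounded."""
    lift = candidate.quality - baseline.quality
    token_ratio = candidate.tokens / baseline.tokens
    return lift >= min_lift and token_ratio <= max_token_ratio


def durable(task_mixes: list[tuple[RunMetrics, RunMetrics]]) -> bool:
    """True only if the candidate passes on every (baseline, candidate) task mix."""
    return all(keep_extra_agent(base, cand) for base, cand in task_mixes)


# Hypothetical measurements on two task mixes, same tasks for both workflows.
mix_a = (RunMetrics(0.70, 10_000, 30.0, 4), RunMetrics(0.78, 25_000, 40.0, 9))
mix_b = (RunMetrics(0.65, 12_000, 35.0, 5), RunMetrics(0.66, 30_000, 50.0, 11))

print("keep on mix A:", keep_extra_agent(*mix_a))
print("durable across mixes:", durable([mix_a, mix_b]))
```

Latency and tool-call counts are carried in the record so they can be added to the rule later. The point is that the decision is made from logged numbers on matched tasks, not from the architecture diagram.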

That sounds less glamorous than announcing a swarm of autonomous specialists. It is also how real systems survive contact with budgets, audits, outages, and users.

The winners in enterprise AI will not be the teams with the most agents on the architecture diagram. They will be the teams that can answer harder questions: what is each agent allowed to do, what does it cost, what failure does it isolate, and why does it exist at all?
