The easiest mistake in agent design is assuming that once agents can talk to each other, they can coordinate with each other.

That is the seductive story behind a lot of multi-agent demos. One agent plans, another researches, another writes, another approves, and a final one executes. The diagram looks like a team. The transcript looks busy. The demo feels more sophisticated than a single assistant with a long tool list.

But enterprise work is not just conversation. It is a chain of responsibilities across roles, permissions, policies, systems of record, and irreversible state changes. If those responsibilities are not explicit, a multi-agent system can combine the worst parts of a meeting and an automation script: everyone appears involved, but no one clearly owns the outcome.

The next practical question for enterprise AI is not simply “Can my agents communicate?” It is “Does my agent system have an org chart?”

Protocols are arriving before governance

The infrastructure side is moving quickly. Google’s Agent2Agent protocol, or A2A, is designed to let agents communicate, collaborate, securely exchange information, and coordinate actions across enterprise platforms, applications, vendors, and frameworks. Google explicitly frames A2A as complementary to Anthropic’s Model Context Protocol: MCP gives agents access to tools and context, while A2A gives agents a way to interact with other agents.

Microsoft has also backed A2A, describing interoperability as “no longer optional” as organizations scale agent systems. In Microsoft’s framing, A2A can support structured communication: exchanging goals, managing state, invoking actions, and returning results securely and observably across clouds, platforms, frameworks, vendors, and organizational boundaries.

That matters. Enterprises will not run on one model, one vendor, one agent framework, or one clean application boundary. If agents are going to participate in real business processes, they need shared ways to discover capabilities, delegate work, track progress, and pass results.

But a protocol is not an operating model. Connectivity tells agents how to exchange messages. It does not decide who is allowed to approve a refund, who can modify a patient-facing record, who owns the final answer, or what evidence is required before a workflow is closed.

Without those rules, agent-to-agent communication becomes a group chat with APIs.

EntCollabBench points at the real failure modes

A useful research anchor here is EntCollabBench, a May 2026 benchmark paper on role-specialized multi-agent collaboration in enterprise workflows. The paper argues that many existing enterprise benchmarks still evaluate a single agent with broad tool access, while many multi-agent benchmarks miss realistic enterprise constraints: role specialization, access control, stateful business systems, and policy-based approvals.

EntCollabBench simulates a permission-isolated organization with 11 role-specialized agents across six departments. It includes a Workflow subset, where agents collaboratively modify enterprise system states, and an Approval subset, where agents make policy-grounded decisions. Importantly, the benchmark does not rely only on whether the final response sounds good. It scores runs using execution traces, database state verification, and deterministic policy adjudication.

That design choice is the point. In a real organization, “the answer looked reasonable” is not enough. The record, authorization, policy result, and workflow closure all have to be checked.

The paper reports that representative LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment. Those are not cosmetic issues. They are the core mechanics of enterprise work.

A multi-agent system fails when the wrong agent receives the task. It fails when the right agent receives the task without the necessary context. It fails when a downstream action is taken with an ambiguous customer ID, invoice number, account state, or policy condition. It fails when three agents each complete their local step but no one verifies the global state. It fails when an “approval” agent recommends a decision but never clearly commits to one.

Those are org-chart failures, not just model failures.

The agent org chart

An agent org chart is not a vanity diagram. It is a control surface. It should define how work moves through the system and where accountability lives.

Start with a role charter for every agent. The charter should say what the agent owns, what it does not own, which tools it can use, which data it can see, and when it must escalate instead of improvising. If an agent’s role cannot be described in a few concrete sentences, it is probably not a role. It is a prompt fragment.
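To make the charter idea concrete, here is a minimal sketch of a charter as a data record. The class name, field names, and the example support role are all hypothetical, chosen only to mirror the items listed above; a real system would likely load charters from configuration rather than hard-code them.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleCharter:
    """Hypothetical charter record; every field name here is illustrative."""
    name: str
    owns: frozenset[str]            # outcomes this agent is accountable for
    does_not_own: frozenset[str]    # outcomes it must pass to another role
    tools: frozenset[str]           # tools it is allowed to call
    data_scopes: frozenset[str]     # data it is allowed to read

    def route(self, task: str) -> str:
        """Decide what the agent should do with a task instead of improvising."""
        if task in self.owns:
            return "handle"
        if task in self.does_not_own:
            return "hand_off"
        # Anything the charter does not mention is escalated, not guessed at.
        return "escalate"


# Assumed example role, not taken from any real deployment.
support = RoleCharter(
    name="support",
    owns=frozenset({"answer_billing_question"}),
    does_not_own=frozenset({"issue_refund"}),
    tools=frozenset({"read_ticket", "update_ticket"}),
    data_scopes=frozenset({"tickets"}),
)

print(support.route("answer_billing_question"))
print(support.route("issue_refund"))
print(support.route("change_contract_terms"))
```

The useful property is the default: a task that appears in neither list escalates rather than falling through to the model's best guess.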

Next, define the permission envelope. Enterprise agents should not all share the same tool belt. A finance agent, support agent, engineering agent, and compliance agent should have different capabilities because the organization itself has different authorities. Permission boundaries are not friction; they are how the system preserves intent.
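One way to picture a permission envelope is as an allowlist checked outside the model, before any tool runs. The sketch below assumes a hypothetical `ENVELOPES` table and a hypothetical `call_tool` wrapper; the role and tool names are invented for illustration.

```python
class PermissionDenied(Exception):
    """Raised when a role tries to call a tool outside its envelope."""


# Hypothetical envelopes: each role gets only the tools its authority covers.
ENVELOPES = {
    "finance": {"read_invoice", "issue_refund"},
    "support": {"read_ticket", "update_ticket", "read_invoice"},
}


def call_tool(role, tool, handler, *args, **kwargs):
    """Run handler only if the tool is inside the role's envelope."""
    allowed = ENVELOPES.get(role, set())  # unknown roles get no tools at all
    if tool not in allowed:
        raise PermissionDenied(f"{role} may not call {tool}")
    return handler(*args, **kwargs)


def issue_refund(amount):
    # Stand-in for a real side effect against a system of record.
    return {"refunded": amount}


result = call_tool("finance", "issue_refund", issue_refund, 25)
```

Because the check lives in the wrapper rather than in the prompt, a support agent that is talked into attempting a refund is stopped by the envelope instead of by its own judgment.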

Then make handoffs typed. A handoff should not be “Can you look into this?” It should have required inputs, expected outputs, allowed assumptions, and failure states. If a scheduling agent hands work to a procurement agent, the receiving agent should know which fields are authoritative, which are provisional, and what must be confirmed before an order is placed.
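A typed handoff can be sketched as a small record plus a validator that the receiving agent runs before acting. Everything named here is an assumption for illustration: the `Handoff` class, the status values, and the procurement fields (`vendor_id`, `quantity`, `delivery_date`).

```python
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum


class HandoffStatus(Enum):
    ACCEPTED = "accepted"
    MISSING_INPUT = "missing_input"            # a required field is absent
    NEEDS_CONFIRMATION = "needs_confirmation"  # a field is present but only provisional


@dataclass
class Handoff:
    sender: str
    receiver: str
    authoritative: dict = field(default_factory=dict)  # confirmed against a system of record
    provisional: dict = field(default_factory=dict)    # sender's best guess

# Hypothetical contract for a scheduling -> procurement handoff.
REQUIRED = ("vendor_id", "quantity", "delivery_date")
MUST_BE_AUTHORITATIVE = ("vendor_id", "quantity")


def validate(handoff: Handoff) -> tuple[HandoffStatus, list[str]]:
    """Return a status and the list of fields that caused it."""
    known = {**handoff.provisional, **handoff.authoritative}
    missing = [f for f in REQUIRED if f not in known]
    if missing:
        return HandoffStatus.MISSING_INPUT, missing
    unconfirmed = [f for f in MUST_BE_AUTHORITATIVE if f not in handoff.authoritative]
    if unconfirmed:
        return HandoffStatus.NEEDS_CONFIRMATION, unconfirmed
    return HandoffStatus.ACCEPTED, []


handoff = Handoff(
    sender="scheduling",
    receiver="procurement",
    authoritative={"vendor_id": "V-204", "delivery_date": "2026-07-01"},
    provisional={"quantity": 40},
)
status, fields = validate(handoff)
```

In this example the quantity is only provisional, so the validator returns a confirmation request instead of letting the order go through.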

Approval needs its own owner. Many agent systems blur recommendation and decision. That is dangerous. The system should distinguish the agent that gathers evidence, the agent that interprets policy, the actor that commits the decision, and the log that records why. In regulated or high-trust workflows, that actor may still be a human. The important part is that the decision right is explicit.
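The split between recommending and deciding can be sketched with two separate record types and a single commit function that checks decision rights. The names below (`Recommendation`, `Decision`, `DECISION_RIGHTS`, `commit`, and the refund example) are hypothetical, and the in-memory list stands in for whatever audit store a real system would use.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Recommendation:
    case_id: str
    proposed: str               # what the policy agent suggests
    evidence: tuple[str, ...]   # references gathered by the evidence agent
    policy_ref: str             # which policy clause was interpreted


@dataclass(frozen=True)
class Decision:
    case_id: str
    outcome: str
    decided_by: str
    rationale: str
    decided_at: str


# Hypothetical decision rights: who may commit which kind of decision.
DECISION_RIGHTS = {"refund_over_limit": {"finance_manager"}}
AUDIT_LOG: list[Decision] = []  # stand-in for a durable audit store


def commit(rec: Recommendation, decision_type: str, actor: str, outcome: str) -> Decision:
    """Turn a recommendation into a decision, but only for an authorized actor."""
    if actor not in DECISION_RIGHTS.get(decision_type, set()):
        raise PermissionError(f"{actor} cannot commit {decision_type}")
    if outcome not in ("approve", "deny"):
        raise ValueError("a decision must be a determinate approve or deny")
    decision = Decision(
        case_id=rec.case_id,
        outcome=outcome,
        decided_by=actor,
        rationale=f"{rec.policy_ref}; evidence: {', '.join(rec.evidence)}",
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
    AUDIT_LOG.append(decision)  # the log records who decided and why
    return decision


rec = Recommendation("C-17", "approve", ("invoice-8841", "ticket-302"), "refund-policy-4.2")
decision = commit(rec, "refund_over_limit", "finance_manager", "approve")
```

The actor in `DECISION_RIGHTS` can be a human reviewer or an agent. What matters in the sketch is that a recommendation never becomes a decision without passing through `commit`.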

Finally, define closure tests. A workflow is not complete because the last agent wrote a summary. It is complete when the target state is verified. The ticket status changed. The database row matches the intended update. The generated document exists. The policy adjudication returned a determinate result. The audit trail has the required evidence.
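A closure test of this kind might look like the sketch below, which checks the target state directly rather than trusting the last agent's summary. It uses an in-memory SQLite database as a stand-in for a system of record; the table names, columns, and ticket ID are assumptions made for the example.

```python
import sqlite3


def verify_closure(conn, ticket_id, expected_status):
    """Check the target state itself; return one boolean per closure condition."""
    row = conn.execute(
        "SELECT status FROM tickets WHERE id = ?", (ticket_id,)
    ).fetchone()
    audit_count = conn.execute(
        "SELECT COUNT(*) FROM audit_log WHERE ticket_id = ?", (ticket_id,)
    ).fetchone()[0]
    return {
        "ticket_exists": row is not None,
        "status_matches": row is not None and row[0] == expected_status,
        "audit_evidence_present": audit_count > 0,
    }


# Hypothetical system of record, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE audit_log (ticket_id TEXT, note TEXT)")

# The agents updated the ticket and wrote a summary, but nobody logged evidence.
conn.execute("INSERT INTO tickets VALUES ('T-1', 'resolved')")

result = verify_closure(conn, "T-1", "resolved")
workflow_closed = all(result.values())
```

Here the ticket status is correct, yet the workflow is still not closed because the audit trail is empty. A transcript review would likely have missed that.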

That is where agent evaluation has to move: from transcript appreciation to state verification.

Do not split agents just to look modern

There is also a simpler warning: not every complex task needs a multi-agent system.

LangChain’s multi-agent documentation is refreshingly direct on this point. It notes that a single agent with the right tools and prompt can often handle tasks that people are tempted to split across agents. Multi-agent patterns become useful when they solve a real engineering problem: context management, distributed development, parallel execution, too many tools for one agent to choose reliably, specialized knowledge, or sequential constraints.

That is a good deployment test. If adding another agent does not reduce context load, clarify ownership, improve permissions, speed parallel work, or enforce a necessary sequence, it may only be adding surface area.

A single accountable agent with a small tool set is often better than five vague agents passing partial summaries around. The goal is not to simulate an organization. The goal is to encode the parts of an organization that make the work safer and more reliable.

What to build first

Start with a single-agent baseline. Measure where it fails. Does it pick the wrong tool because the tool list is too broad? Does it forget domain context? Does it need another department’s authority? Does part of the work run independently enough to parallelize? Does a policy decision require separation from execution?

Only split at those seams.

When you do split, write the handoff contract before you write the prompt. Decide what the upstream agent must provide, what the downstream agent may assume, and what evidence is required to close the loop. Treat A2A, MCP, LangGraph, or any other framework as an implementation detail beneath that contract.

Then test the system the way the business experiences it. Do not only review the chat trace. Check the state change. Check the approval. Check the missing-parameter behavior. Check the escalation path. Check whether the audit log would make sense to someone who was not in the conversation.

Multi-agent systems can be powerful because they compose specialized capabilities. They can also be fragile because accountability diffuses at every handoff.

The companies that make agent systems work will not be the ones with the most elaborate agent swarm diagrams. They will be the ones that turn those diagrams into operating models: clear roles, limited permissions, typed handoffs, explicit decisions, and verifiable closure.

Agents need protocols. They also need management structure.

Sources

  • EntCollabBench: https://arxiv.org/abs/2605.08761
  • Google Agent2Agent announcement: https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
  • Microsoft on A2A for multi-agent apps: https://www.microsoft.com/en-us/microsoft-cloud/blog/2025/05/07/empowering-multi-agent-apps-with-the-open-agent2agent-a2a-protocol/
  • LangChain multi-agent documentation: https://docs.langchain.com/oss/javascript/langchain/multi-agent