Plugging an agent into your tools has never been easier. The [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) gives AI applications an open, standardized way to connect to data sources, tools, and workflows — the docs call it a "USB-C port for AI applications," and that framing has caught on across assistants and developer tooling. Once the connector exists, a demo follows almost immediately: the agent calls the right API, returns a clean answer, and everyone in the room nods.
That moment is seductive, and it hides the hard part. Calling a tool and completing a job are different skills. The gap between them is where most enterprise automation actually lives, and it is the part the demo never shows.
Isolated tool calls are not workflows
A single tool call is a closed question. Given this prompt, did the model pick the right function and fill in plausible arguments? That is worth measuring, but it is not the work most teams want to automate.
Real internal work looks different. It is stateful: the spreadsheet you edit in step three depends on the row you created in step one. It is interdependent: tools feed each other, and an action taken early constrains what is possible later. It is failure-prone: APIs time out, return stale data, or reject a call that worked a minute ago. And actions have consequences: a wrong write is not a wrong answer you can shrug off; it is a side effect someone has to clean up.
An agent that aces isolated tool calls can still be hopeless at this. The skill being tested in a one-shot eval is not the skill the job requires.
What ComplexMCP measures
This is the gap the [ComplexMCP benchmark](https://arxiv.org/abs/2605.10787) was built to expose. Rather than scoring tool calls in isolation, it evaluates agents inside dynamic, interdependent, large-scale tool sandboxes built on MCP. The benchmark provides over 300 systematically tested tools drawn from seven stateful sandboxes, spanning environments from office suites to financial systems. A seed-driven architecture simulates changing environment states and unpredictable API failures while keeping runs deterministic and diverse — so you get realistic mess without losing reproducibility. It evaluates models in both full-context and RAG paradigms, and it scores outcomes by inspecting final state rather than grading a natural-language reply.
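To make the seed-driven idea concrete, here is a minimal sketch of failure injection that is messy but reproducible. It is not ComplexMCP's implementation; `FlakySandbox`, `TransientToolError`, and `failure_schedule` are hypothetical names, and the 30% failure rate is an arbitrary assumption.

```python
import random


class TransientToolError(Exception):
    """Raised when the simulated API fails on purpose."""


class FlakySandbox:
    """Hypothetical sandbox whose failures depend only on the seed."""

    def __init__(self, seed: int, failure_rate: float = 0.3):
        # A private RNG keeps the run independent of global random state.
        self._rng = random.Random(seed)
        self._failure_rate = failure_rate
        self.rows: list[str] = []

    def append_row(self, value: str) -> int:
        # One draw per call, so the failure schedule is a function of the seed.
        if self._rng.random() < self._failure_rate:
            raise TransientToolError("simulated timeout")
        self.rows.append(value)
        return len(self.rows) - 1


def failure_schedule(seed: int, calls: int = 10) -> list[bool]:
    """Return, for each of the first `calls` calls, whether it failed."""
    sandbox = FlakySandbox(seed)
    outcome = []
    for i in range(calls):
        try:
            sandbox.append_row(f"row-{i}")
            outcome.append(False)
        except TransientToolError:
            outcome.append(True)
    return outcome
```

Running `failure_schedule` twice with the same seed yields the same pattern of failures, which is what lets a failed agent run be replayed and debugged instead of written off as bad luck.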
The headline result is sobering: top-tier models fail to exceed a 60% success rate on these tasks, while human performance sits around or above 90%. That is not a verdict that agents are useless. It is evidence of a specific, measurable reliability gap that only appears when the environment is allowed to behave like a real one. The [public repository](https://github.com/AIDC-AI/complex-mcp), associated with an ICML 2026 paper, ships a Docker-based setup and a benchmark runner, so the claim is inspectable rather than rhetorical. Treat it as a research benchmark, not production software — but a runnable one.
Three failure modes worth naming
ComplexMCP identifies three bottlenecks behind that gap. Each one translates into something builders can recognize and design against.
Tool retrieval saturation. As the action space grows, agents get worse at finding the right tool. This is the counterintuitive one: adding more tools can degrade performance instead of improving it, because discovery becomes the bottleneck before reasoning does. If your roadmap is "connect everything via MCP," you may be making your agent less capable, not more, unless tool discovery is something you actively engineer.
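One way to treat discovery as an engineering problem is to hand the model a shortlist instead of the whole catalog. The sketch below uses naive word overlap purely as a stand-in for a real retriever; `shortlist_tools`, the catalog format, and the tool names are all hypothetical.

```python
def shortlist_tools(task: str, catalog: dict[str, str], k: int = 5) -> list[str]:
    """Return up to k tool names whose descriptions share words with the task.

    `catalog` maps tool name -> one-line description. Word overlap is a
    deliberately crude relevance score; a production system would likely
    use embeddings or a tuned search index instead.
    """
    task_words = set(task.lower().split())
    scored = []
    for name, description in catalog.items():
        overlap = len(task_words & set(description.lower().split()))
        if overlap:  # drop tools with no lexical connection to the task
            scored.append((overlap, name))
    # Highest overlap first; ties broken by name so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


catalog = {
    "sheets_append_row": "append a row to a spreadsheet",
    "ledger_post": "post a journal entry to the ledger",
    "mail_send": "send an email message",
}
top = shortlist_tools("append a row to the budget spreadsheet", catalog, k=1)
```

Even this toy version shows the failure mode: common words such as "a" and "to" give `ledger_post` a nonzero score, and that kind of noise grows with every tool added to the catalog.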
Over-confidence and clean-slate bias. Agents tend to act as though the environment is exactly what they assume it to be, skipping the cheap step of checking before they commit. They read less than they should and write more than they should. In a stateful system, that is how you get an action taken against a world that has already moved.
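A cheap guard against this is to make every write state what the agent believes the world looks like. The following sketch uses a version check, a common optimistic-concurrency pattern; `Store`, `Record`, and `StaleStateError` are hypothetical and not tied to any MCP server.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: str
    version: int = 0


class StaleStateError(Exception):
    """The environment changed after the agent last read it."""


class Store:
    """Hypothetical stateful tool backend with versioned records."""

    def __init__(self):
        self._records: dict[str, Record] = {}

    def read(self, key: str):
        rec = self._records.get(key)
        # Return a copy so callers cannot mutate stored state by accident.
        return Record(rec.value, rec.version) if rec else None

    def write(self, key: str, value: str, expected_version: int) -> int:
        current = self._records.get(key)
        current_version = current.version if current else 0
        # Refuse the write if the caller's view of the record is out of date.
        if current_version != expected_version:
            raise StaleStateError(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._records[key] = Record(value, current_version + 1)
        return current_version + 1
```

An agent that read the record at version 1 and tries to write after someone else has moved it to version 2 gets a `StaleStateError` instead of silently overwriting newer data, which forces the re-read it would otherwise skip.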
Strategic defeatism. When things go wrong, agents often rationalize the failure — explaining why the task could not be done — instead of recovering from a transient error and trying a different path. A flaky API call is a normal event in production. An agent that narrates defeat instead of retrying is failing at the part of the job that separates a tool from a teammate.
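The alternative to narrating defeat can be sketched as a small recovery loop: retry transient errors with backoff, then move on to a different path. This is a minimal illustration, not a recommendation for specific limits; `call_with_recovery` and `TransientError` are hypothetical names, and the retry counts and delays are assumptions.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure that is expected to clear on its own (timeout, 503, ...)."""


def call_with_recovery(
    strategies: Sequence[Callable[[], T]],
    attempts_per_strategy: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each strategy in order, retrying transient errors with backoff."""
    last_error = None
    for strategy in strategies:
        for attempt in range(attempts_per_strategy):
            try:
                return strategy()
            except TransientError as exc:
                last_error = exc
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    # Only give up after every path has been tried.
    raise RuntimeError("all strategies exhausted") from last_error
```

Only `TransientError` is retried; any other exception propagates immediately, so a genuine bug or a permission failure is not hidden behind retries. Passing `sleep` as a parameter keeps the loop testable without real delays.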
Standardization raises the bar; it doesn't lower it
It would be easy to read this as an argument against MCP. It isn't. A common connector layer is good for the ecosystem. But the same standardization that makes tools easy to attach also makes the consequences easy to underestimate.
The [MCP specification](https://modelcontextprotocol.io/specification/2025-06-18) is candid about this. It uses JSON-RPC 2.0, stateful connections, and capability negotiation, and it explicitly warns that MCP enables powerful capabilities through arbitrary data access and code execution paths, which demands careful attention to security, privacy, trust, user consent, and control. The protocol gives you the integration surface; it does not give you reliability or safety for free.
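For a sense of what travels over that surface, here is a minimal sketch of a JSON-RPC 2.0 request for the spec's `tools/call` method. The tool name `sheets_append_row` and its arguments are hypothetical, and a real client would normally send this through an MCP SDK and transport after capability negotiation rather than building it by hand.

```python
import json
from itertools import count

# Each JSON-RPC request needs an id so its response can be matched to it.
_request_ids = count(1)


def tools_call_request(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that asks a server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical tool and arguments, for illustration only.
request = tools_call_request(
    "sheets_append_row", {"sheet": "Q3 budget", "values": ["travel", 1200]}
)
wire_message = json.dumps(request)
```

Nothing in this message says whether the write is safe, authorized, or consistent with the current state of the sheet. Those checks live in the host application and the server, not in the envelope.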
Reliability and privilege are linked, too. The [GrantBox](https://arxiv.org/abs/2603.28166) work makes the point that when an agent gets access to a real tool, it inherits that tool's privileges — and the underlying model does too. Improper privilege use can mean information leakage or infrastructure damage, and under carefully crafted prompt-injection scenarios the authors report an average attack success rate of 84.80%. An agent that completes the task but misuses its access has not actually succeeded. Task completion and safe privilege use are two different questions, and you have to ask both.
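One way to make the second question measurable is to route every tool call through a scope check and record what gets blocked. This is a hedged sketch of that idea, not GrantBox's method; `ScopedToolbox` and the invoice tools are hypothetical.

```python
from typing import Any, Callable


class ScopedToolbox:
    """Hypothetical wrapper that enforces a per-task allowlist of tools."""

    def __init__(self, tools: dict[str, Callable[..., Any]], allowed: set[str]):
        self._tools = tools
        self._allowed = allowed
        self.violations: list[str] = []  # out-of-scope attempts, for scoring

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._allowed:
            # Record the attempt before refusing, so the eval can count it.
            self.violations.append(name)
            raise PermissionError(f"tool {name!r} is outside this task's scope")
        return self._tools[name](**kwargs)


toolbox = ScopedToolbox(
    tools={
        "read_invoice": lambda invoice_id: {"id": invoice_id},
        "delete_invoice": lambda invoice_id: True,
    },
    allowed={"read_invoice"},
)
```

After a run, `toolbox.violations` gives a privilege score that is independent of whether the task itself succeeded. An allowlist catches out-of-scope tools only; it does not catch misuse of a tool the agent is allowed to call.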
An operating model for teams
If you are deciding whether an agent is ready to automate internal work, the takeaway is concrete. Stop evaluating in the shape that flatters your demo and start evaluating in the shape of the job.
- Build stateful sandboxes, not just prompt evals. Give the agent an environment that remembers what it did and changes underneath it.
- Score the final state and the side effects, not the prose. A confident paragraph is not evidence the world is correct. Diff the environment.
- Inject transient failures on purpose. If your eval never sees a timeout or a stale read, it is not testing the conditions production guarantees.
- Track verification before action. Measure whether the agent checks the environment before it commits to a write, and reward that behavior.
- Limit and structure your tool catalog. Treat tool discovery as a design problem. More tools do not mean more capability if retrieval saturates first.
- Separate "can complete the task" from "can safely use its privileges." Score both. Passing one does not imply the other.
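Two of these checks, scoring the final state and tracking verification before action, can be sketched in a few lines. The state format (a flat dict), the trace format (a list of `(op, key)` pairs), and the function names are assumptions made for illustration; a real sandbox would expose richer state and logs.

```python
def diff_state(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


def score_run(before: dict, after: dict, expected: dict) -> dict:
    """Score a run on final state, ignoring whatever the agent said."""
    changed = diff_state(before, after)
    return {
        # The task counts as done only if every expected value is in place.
        "task_complete": all(after.get(k) == v for k, v in expected.items()),
        # Any other change is a side effect someone may have to clean up.
        "side_effects": sorted(k for k in changed if k not in expected),
    }


def verified_write_ratio(trace: list[tuple[str, str]]) -> float:
    """Fraction of writes preceded by a read of the same key."""
    fresh_reads: set[str] = set()
    writes = verified = 0
    for op, key in trace:
        if op == "read":
            fresh_reads.add(key)
        elif op == "write":
            writes += 1
            verified += key in fresh_reads
            fresh_reads.discard(key)  # a write makes the earlier read stale
    return verified / writes if writes else 1.0
```

A run can come back with `task_complete` true and still list side effects, which is exactly the case a prose-only grade would miss.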
Agent readiness is not a claim about how smart a model is. It is a claim about the environment you tested it in and the control plane you wrapped around it. The model can call your tools. Whether your agent can finish the job — when the state shifts, a call fails, and a wrong move leaves a mess — is the last mile, and it is the only part that decides whether this works in production.
Sources
- https://arxiv.org/abs/2605.10787
- https://github.com/AIDC-AI/complex-mcp
- https://modelcontextprotocol.io/docs/getting-started/intro
- https://modelcontextprotocol.io/specification/2025-06-18
- https://arxiv.org/abs/2603.28166