If you've ever tried to get multiple AI agents to collaborate on a codebase, you already know the punchline: they break each other's work. One agent renames a function while another is still calling the old name. Both agents produce correct code in isolation. Together? A dumpster fire.
A new paper from Carnegie Mellon — "Effective Strategies for Asynchronous Software Engineering Agents" by Jiayi Geng and Graham Neubig — tackles this head-on. Their framework, CAID (Centralized Asynchronous Isolated Delegation), improves multi-agent accuracy by up to 26.3% on paper reproduction tasks and up to 14.7% on Python library development, without changing the underlying model. The secret ingredient? The same collaboration tools human developers have used for decades: git worktree, git merge, and dependency graphs.
Let's break it down.
The Problem: Why Multi-Agent Coding Is Harder Than It Looks
Single-agent systems for software engineering have gotten remarkably capable. Tools like OpenHands can resolve GitHub issues, implement features, and even build small applications. But long-horizon tasks — implementing an entire library from scratch, reproducing a research paper's experiments — remain brutal for a single agent. They take forever, and the agent's reasoning tends to degrade over extended trajectories.
The obvious answer is parallelism: split the work, let multiple agents tackle different parts simultaneously. But here's where it gets ugly.
When multiple agents share a workspace, their concurrent edits create silent conflicts. Not merge conflicts that git catches — worse. One agent modifies a data structure that another agent's code assumes is unchanged. Both agents' code passes in isolation. The integrated result throws runtime errors that neither agent anticipated.
Previous multi-agent frameworks tried to manage this through dialogue, role-based protocols, or instruction-level constraints ("hey, don't touch this file"). The CMU team's key insight: linguistic coordination isn't enough. You need physical isolation.
How CAID Works: Three Primitives from Human Engineering
CAID maps directly from practices that human software teams take for granted:
1. Centralized Task Delegation via Dependency Graphs
A central manager agent analyzes the repository structure and builds a directed dependency graph. Nodes represent units of work; edges encode dependencies. A task becomes eligible for delegation only when all its upstream dependencies are complete and merged.
The manager prioritizes tasks that enable earlier test execution or sit closer to the upstream end of the dependency chain. Files with strong or circular dependencies get grouped and assigned to the same engineer to minimize cross-agent coordination.
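Eligibility-gating on a dependency graph is simple to sketch. The graph below is invented for illustration (these file names and the `eligible` helper are not from the paper); it just shows how completing and merging one task unlocks its downstream dependents:

```python
# Hypothetical dependency graph: each task maps to the upstream tasks
# it depends on. File names are invented for illustration.
deps = {
    "core/tensor.py": [],
    "core/ops.py": ["core/tensor.py"],
    "nn/module.py": ["core/tensor.py", "core/ops.py"],
    "tests/test_ops.py": ["core/ops.py"],
}

def eligible(done: set) -> list:
    """Tasks whose upstream dependencies are all complete and merged."""
    return [task for task, upstream in deps.items()
            if task not in done and all(u in done for u in upstream)]

print(eligible(set()))               # only the root task has no upstreams
print(eligible({"core/tensor.py"}))  # merging it unlocks its dependents
```

In a real manager this walk would run after every successful merge, re-checking which nodes just became delegatable.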
2. Isolated Workspaces via git worktree
Each engineer agent operates in its own git worktree — a fully isolated, versioned copy of the repository branching off main. Parallel edits are physically separated. No agent can accidentally overwrite another's intermediate state.
This is the critical differentiator. The paper shows that "soft isolation" (just telling agents to work on different files via prompt instructions) actually performs worse than a single agent on open-ended tasks like PaperBench. Physical isolation via git worktree is what makes the difference.
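Setting this up is a few git commands per agent. A minimal sketch of what a manager might run (paths, branch names, and the helper are illustrative, not the paper's implementation):

```python
import os
import subprocess
import tempfile

def run(*cmd, cwd):
    subprocess.run(cmd, cwd=cwd, check=True, capture_output=True)

# Build a throwaway repo, then give each engineer agent its own worktree
# branching off main. Agent names and layout are illustrative.
base = tempfile.mkdtemp()
repo = os.path.join(base, "repo")
os.makedirs(repo)
run("git", "init", "-b", "main", cwd=repo)
run("git", "-c", "user.email=mgr@example.com", "-c", "user.name=manager",
    "commit", "--allow-empty", "-m", "init", cwd=repo)

for agent in ("engineer-1", "engineer-2"):
    # Each worktree is a physically separate checkout on its own branch,
    # so parallel edits can never clobber another agent's files.
    run("git", "worktree", "add", "-b", agent,
        os.path.join(base, agent), "main", cwd=repo)
```

Each directory under `base` is now a fully functional checkout sharing one object store, which is exactly the isolation-with-cheap-merging property CAID relies on.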
3. Structured Integration via git merge + Tests
When an engineer finishes its task, it submits a git commit. The manager attempts a git merge into main. If there's a merge conflict, the engineer who produced the conflicting commit resolves it — pulling the latest main, fixing conflicts locally, and resubmitting.
Every integration is test-gated. Failed tests must be resolved before the commit lands on main. This means the main branch is always in a known-good state, and integration failures surface immediately with concrete test signals tied to specific commits.
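The manager's integration loop can be sketched as merge-then-test-or-roll-back. This is an assumption-laden illustration, not the paper's code: `test_cmd` stands in for the project's real test suite, and the demo repo at the bottom exists only to exercise the happy path:

```python
import subprocess
import tempfile
from pathlib import Path

def git(repo, *args):
    return subprocess.run(
        ["git", "-c", "user.email=mgr@example.com", "-c", "user.name=manager",
         *args], cwd=repo, capture_output=True, text=True)

def integrate(repo, branch, test_cmd):
    """Merge an engineer's branch into main only if the test gate passes."""
    if git(repo, "merge", "--no-ff", "-m", f"merge {branch}", branch).returncode != 0:
        git(repo, "merge", "--abort")           # conflict: engineer resolves, resubmits
        return False
    if subprocess.run(test_cmd, cwd=repo).returncode != 0:
        git(repo, "reset", "--hard", "HEAD~1")  # failed tests: roll the merge back
        return False
    return True                                 # main stays in a known-good state

# Tiny demo: one engineer branch with a clean change.
repo = tempfile.mkdtemp()
git(repo, "init", "-b", "main")
Path(repo, "a.txt").write_text("v1\n")
git(repo, "add", "."); git(repo, "commit", "-m", "init")
git(repo, "checkout", "-b", "engineer-1")
Path(repo, "a.txt").write_text("v2\n")
git(repo, "commit", "-am", "feature")
git(repo, "checkout", "main")
ok = integrate(repo, "engineer-1", ["true"])    # ["true"] stands in for a passing suite
print(ok)
```

The key property: every path out of `integrate` leaves main either advanced past a green test run or untouched.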
Communication between manager and engineers uses structured JSON rather than free-form dialogue. Task assignments include file paths, target functions, and dependency information — all machine-parsable and programmatically validatable.
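As a sketch of why machine-parsable assignments matter: a malformed message can be rejected before an engineer ever acts on it. The field names below are invented for illustration; the paper's exact schema may differ:

```python
import json

# A hypothetical task-assignment message from manager to engineer.
assignment = json.loads("""
{
  "task_id": "impl-ops",
  "files": ["core/ops.py"],
  "target_functions": ["add", "mul"],
  "depends_on": ["impl-tensor"],
  "branch": "engineer-1"
}
""")

REQUIRED = {"task_id": str, "files": list, "target_functions": list,
            "depends_on": list, "branch": str}

def validate(msg):
    """Programmatic validation rejects malformed assignments outright,
    a guarantee free-form dialogue can't give."""
    return isinstance(msg, dict) and all(
        isinstance(msg.get(field), kind) for field, kind in REQUIRED.items())

print(validate(assignment))                # True
print(validate({"task_id": "impl-ops"}))   # False: missing required fields
```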
The Results: Numbers That Actually Matter
The team evaluated CAID on two demanding benchmarks using the OpenHands agent SDK (v1.11.0) with three models: Claude 4.5 Sonnet, MiniMax 2.5, and GLM 4.7.
PaperBench (Reproducing Research Papers)
| Model | Single Agent | CAID (2 Engineers) | Improvement |
|---|---|---|---|
| Claude 4.5 Sonnet | 57.2% | 63.3% | +6.1 |
| MiniMax 2.5 | 10.4% | 36.7% | +26.3 |
| GLM 4.7 | 38.0% | 45.4% | +7.4 |
The MiniMax result is wild — CAID more than tripled its score. Even for the strongest model (Claude), the improvement is meaningful.
Commit0-Lite (Building Python Libraries from Scratch)
| Model | Single Agent | CAID (4 Engineers) | Improvement |
|---|---|---|---|
| Claude 4.5 Sonnet | 53.1% | 59.1% | +6.0 |
| MiniMax 2.5 | 42.3% | 57.0% | +14.7 |
| GLM 4.7 | 42.8% | 46.5% | +3.7 |
Key Findings That Will Change How You Think About Multi-Agent Systems
1. Doubling a single agent's iterations doesn't help. Going from 100 to 200 iterations yields marginal (sometimes negative) improvements. The bottleneck isn't compute time — it's architectural.
2. "Try single first, then multi" is a trap. The sequential strategy (run single-agent, then fall back to CAID) costs nearly double the runtime and money but barely improves over running CAID from the start. If you're building long-horizon multi-agent workflows, commit to coordination upfront.
3. Soft isolation fails on hard tasks. On Commit0 (where file structures are explicit), prompt-based isolation helps a little (53.1% → 56.1%). On PaperBench (where structure must be inferred), it actually hurts performance (57.2% → 55.5%). Only physical git worktree isolation consistently improves results.
4. More agents isn't always better. On Commit0, 4 engineers is optimal — 8 engineers actually degrades performance. On PaperBench, 2 engineers hits the sweet spot. The bottleneck shifts to the manager's coordination capacity.
5. Delegation quality is make-or-break. The paper shows two CAID runs on the same minitorch repository producing wildly different results (8.7% vs. 34.3%) based entirely on which files the manager chose to prioritize. Identifying high-impact dependencies early is critical.
What This Means for Practitioners
If you're building multi-agent coding systems (and if you're reading this blog, you probably are or want to), here's what to take away:
Use version control as your coordination layer, not chat. The paper demonstrates that SWE primitives — git worktree, git commit, git merge, dependency graphs, test suites — map directly onto the coordination mechanisms that multi-agent systems need. Don't reinvent this wheel with prompt engineering.
Isolate first, communicate second. The overwhelming finding is that physical workspace isolation matters more than clever communication protocols. Set up isolated environments (branches, worktrees, containers) before worrying about how agents talk to each other.
Match parallelism to task structure. Don't throw 8 agents at every problem. Analyze the dependency structure and choose a parallelism level that your manager agent can actually coordinate. For most tasks, 2-4 concurrent workers is the sweet spot.
Invest in your manager agent. The central manager's ability to decompose tasks, identify dependencies, and prioritize high-impact work is the single biggest lever for system performance. A mediocre manager with great engineers underperforms a great manager with mediocre engineers.
Design for merge, not for avoiding conflict. Conflicts will happen. The question is whether your system handles them gracefully (explicit merge + test gates) or silently (shared workspace corruption). Always choose the explicit path.
Our Take
This paper is refreshingly practical. Instead of proposing yet another novel communication protocol or role-playing framework, Geng and Neubig looked at what human developers have been doing for decades and said: "Why aren't we doing this?"
The answer — that existing SWE primitives are the right coordination substrate for multi-agent systems — feels obvious in hindsight. But the empirical evidence is strong: branch-and-merge coordination, grounded in git worktree and test-gated merges, consistently outperforms both single-agent systems and prompt-based multi-agent coordination.
For the OpenClaw and agent community, this validates an architectural direction many of us have been gravitating toward: treat your agents like developers on a team, not like threads sharing memory. Give them isolated workspaces, structured handoffs, and executable verification. The tools already exist.
What's also notable is the cost reality. CAID does cost more in API spend than single-agent runs — the coordination overhead is real. But the performance gains are substantial, and the paper makes a compelling case that you should commit to multi-agent coordination upfront rather than using it as a fallback. For production workloads where correctness matters more than token efficiency, that's a trade worth making.
The limitations are honest: cost optimization remains an open problem, the manager's delegation relies on prompt heuristics rather than learned policies, and generalization beyond coding tasks is unproven. But as a foundation for building better multi-agent engineering systems, CAID sets a clear and actionable standard.
Paper: Jiayi Geng, Graham Neubig. "Effective Strategies for Asynchronous Software Engineering Agents." Carnegie Mellon University, 2026. arXiv:2603.21489
Code: github.com/JiayiGeng/CAID