A massive experiment shows that giving LLM agents a mission and a turn order — but no pre-assigned roles — produces 44% better results than fully autonomous coordination and 14% better than centralized control.
If you're building multi-agent AI systems, there's a good chance you're doing it wrong. Not because your agents are bad, but because you're treating them like human employees — assigning "Product Manager," "Software Engineer," and "QA Tester" roles before they've even seen the task.
A new paper from Victoria Dochkina, "Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures," just dropped what might be the largest controlled experiment on multi-agent LLM coordination ever conducted. The results should make every agent-framework developer reconsider their architecture.
The Experiment: 25,000 Tasks, 8 Models, 256 Agents
This isn't a toy benchmark. The study spans:
- 25,000+ task runs across 20,810 unique configurations
- 8 LLM models: Claude Sonnet 4.6, GPT-5.4, GPT-4o, GPT-4.1-mini, Gemini-3-flash, GigaChat 2 Max, DeepSeek v3.2, and GLM-5
- 4 to 256 agents per system
- 8 coordination protocols, from centralized command to full autonomy
- 4 task complexity levels, from single-domain tasks up to adversarial multi-stakeholder conflicts
The core question: how much autonomy can multi-agent LLM systems actually sustain — and what makes it work?
The Endogeneity Paradox: Neither Control Nor Freedom Wins
Here's the headline finding, and it's wonderfully counterintuitive.
Four primary coordination protocols were tested along a spectrum from externally imposed structure to full self-organization:
- Coordinator (centralized): One agent analyzes the task and assigns roles to everyone else. Classic top-down management.
- Sequential (hybrid): Agents take turns in a fixed order. Each sees what predecessors actually produced and autonomously decides its own role — or whether to participate at all.
- Broadcast (signal-based): Agents first announce their intentions simultaneously, then make final decisions informed by everyone's stated plans.
- Shared (fully autonomous): Agents access shared organizational memory and make all decisions simultaneously and independently.
The result? Neither maximum control (Coordinator) nor maximum freedom (Shared) won. The Sequential protocol — minimal structure with maximum role autonomy — dominated everything:
- +44% quality over fully autonomous Shared protocol (Cohen's d = 1.86, p < 0.0001)
- +14% quality over centralized Coordinator (p < 0.001)
- Quality score of 0.875/1.0 at 16 agents on complex L3 tasks
The paper calls this the "endogeneity paradox." One simple constraint — a fixed turn order — unlocks spontaneous role differentiation, voluntary abstention, and mission alignment that no amount of explicit role design achieves.
Why Sequential Works: The Sports Draft Analogy
The paper's explanation is elegant. Think of it like a sports draft: each team picks knowing exactly what every previous team selected, naturally filling complementary positions without any central planning authority.
The Sequential protocol wins because of the type of information agents receive:
- Coordinator gives agents one agent's plan (limited by a single coordinator's judgment)
- Broadcast gives agents intentions (which may change between rounds)
- Shared gives agents history (which may not apply to the current task)
- Sequential gives agents completed outputs — factual, task-specific, accumulated results from predecessors
It's the difference between knowing what someone said they'd do versus seeing what they actually did. Agents make dramatically better role decisions when they can observe real artifacts rather than stated plans.
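The asymmetry is easy to see in code. Here is a minimal sketch of what each protocol hands an agent before it chooses a role; the function name, protocol labels, and state keys are our illustration, since the paper does not publish an implementation:

```python
def build_context(protocol: str, agent_idx: int, state: dict) -> list[str]:
    """Return what one agent sees before choosing its role.

    Illustrative only: protocol names and state keys are ours,
    not the paper's actual code.
    """
    if protocol == "coordinator":
        # One agent's plan: limited by a single coordinator's judgment.
        return [state["coordinator_plan"]]
    if protocol == "broadcast":
        # Stated intentions: announced plans that may change between rounds.
        return state["announced_intentions"]
    if protocol == "shared":
        # Organizational memory: history that may not fit the current task.
        return state["org_memory"]
    if protocol == "sequential":
        # Completed outputs: factual, task-specific artifacts from predecessors.
        return state["completed_outputs"][:agent_idx]
    raise ValueError(f"unknown protocol: {protocol!r}")
```

Only the `sequential` branch returns verified artifacts; the other three return some form of plan, promise, or stale history.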
Agents Invent 5,006 Unique Roles From Scratch
One of the most striking emergent behaviors: agents don't settle into fixed specializations. Across all runs with 8 agents, the system generated 5,006 unique role names. The Role Stability Index (RSI) converges to near-zero — meaning agents reinvent their specialization for every single task.
This directly challenges the assumption underlying every major multi-agent framework (ChatDev, MetaGPT, AutoGen, CrewAI). The paper argues that pre-assigning roles to LLM agents is an anti-pattern that "replicates human limitations onto entities that lack them." Unlike humans, an LLM can switch from architect to analyst at zero cognitive cost. Pinning it to "Senior Backend Developer" wastes that flexibility.
At 16 agents on complex L3 tasks, Claude Sonnet 4.6 produced 115 unique roles across just 10 task runs — a mosaic of hyper-specific specializations that no human designer would have conceived.
It Scales (But Not How You'd Think)
Scaling from 64 to 256 agents produced no statistically significant quality improvement (Kruskal-Wallis H = 1.84, p = 0.61) at 4.6x the cost: quality stays essentially flat in the 0.949–0.955 range no matter how many agents you add.
But here's where it gets interesting: at 256 agents, approximately 45% of agents voluntarily became idle through self-abstention. The system developed its own cost-optimization mechanism — agents that assessed their contribution as insufficient simply opted out. No one told them to. Claude Sonnet 4.6 showed the highest voluntary abstention rate at 8.6%, while weaker models either over-abstained (leaving tasks incomplete) or never abstained at all.
The practical takeaway: invest in model quality, not agent quantity. The quality spread between models (174%) dwarfs the gains from adding more agents.
Open Source Gets You 95% of the Way There at 1/24th the Cost
DeepSeek v3.2 achieved 95% of Claude Sonnet 4.6's quality on L3 tasks (p < 0.001) while being roughly 24x cheaper in API costs. On adversarial L4 tasks, DeepSeek actually trended higher than Claude (+6.0%, p = 0.082, not quite significant).
The models develop distinct self-organization strategies too. Claude maximizes role diversity (1,272 unique roles, very even distribution). DeepSeek with the Coordinator protocol employs aggressive agent filtering — making 22.4% of agents idle to achieve maximum cost efficiency.
The Capability Threshold: Weak Models Need Structure
Self-organization isn't a free lunch. It's a privilege of strong models. Below a capability threshold, rigid structure actually helps:
- Claude Sonnet 4.6: free-form quality 0.594 > fixed-role quality 0.574 (+3.5%) — autonomy helps
- GLM-5: free-form quality 0.519 < fixed-role quality 0.574 (−9.6%) — autonomy hurts
The threshold requires three specific capabilities: self-reflection (assessing your own competence), deep reasoning (multi-step logical chains), and instruction following (adhering to coordination protocols). Claude demonstrated 8.6% voluntary abstention; GLM-5 managed only 0.8%.
This has massive implications for system design: as foundation models improve, the scope for autonomous coordination will expand. Today's capability threshold may not apply in six months.
Emergent Hierarchy: Agents Build Org Charts Automatically
On complex adversarial tasks (L4), agents spontaneously formed deeper organizational structures — hierarchy depth increased from 1.22 on simple tasks to 1.56 on adversarial ones. No one instructed them to create management layers. They did it because the task demanded it.
But crucially, the hierarchies stay shallow — at most 2 management layers even at 64 agents. LLM agents prefer flat structures over deep bureaucracies. (Make of that what you will.)
What This Means for Agent Builders
The paper distills everything into a practical recipe:
- Define a mission and values, not role assignments. Mission Relevance hit a perfect 4.00/4.00 when agents got a mission plus freedom, not a job description.
- Choose the right protocol. Among capable models, protocol choice explains 44% of quality variation. The protocol is the amplifier of collective capability.
- Invest in model quality, not agent quantity. Scaling from 64 to 256 agents yields nothing (p = 0.61) at 4.6x cost. The "musician" matters more than the number of seats.
- Mix models strategically. Strong models for complex tasks, efficient models for simple ones. No single model dominates all dimensions.
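The first point of the recipe translates directly into prompt construction: every agent gets the same mission and values, sees what came before, and chooses (or declines) a role itself. The wording below is our own illustration, not a prompt from the paper:

```python
def make_agent_prompt(mission: str, values: list[str],
                      prior_outputs: list[str]) -> str:
    """Build a mission-first prompt: no pre-assigned role, explicit opt-out.

    All wording is illustrative; the paper does not publish its prompts.
    """
    lines = [f"Mission: {mission}", "Values: " + "; ".join(values), ""]
    if prior_outputs:
        lines.append("What previous agents produced:")
        lines += [f"- {out}" for out in prior_outputs]
    else:
        lines.append("You are the first agent to act.")
    lines += [
        "",
        "Decide for yourself which role (if any) you should take on.",
        "If you cannot add value here, reply exactly: ABSTAIN.",
    ]
    return "\n".join(lines)
```

Note what is absent: no job title, no task routing, no list of allowed roles. The mission and the accumulated artifacts are the only structure.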
Our Take
This paper is a direct challenge to how most of us build multi-agent systems. If you're using CrewAI, AutoGen, or any framework that starts with "define your agents and their roles," you're potentially leaving 14–44% of quality on the table.
The Sequential protocol is deceptively simple to implement: loop through agents in order, let each see all previous outputs, and let them decide their own role (including "I'll sit this one out"). No role descriptions. No task routing logic. No coordinator agent bottleneck.
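That loop fits in a dozen lines. A hedged sketch, assuming only a generic `complete(prompt) -> str` LLM call (not any specific framework's API); the prompt wording and the ABSTAIN sentinel are our own choices:

```python
from typing import Callable

ABSTAIN = "ABSTAIN"

def run_sequential(task: str, num_agents: int,
                   complete: Callable[[str], str]) -> list[str]:
    """Sequential protocol sketch: fixed turn order, self-chosen roles.

    `complete` is any text-in/text-out LLM call; everything else here
    is our illustration, not the paper's code.
    """
    outputs: list[str] = []
    for i in range(num_agents):
        seen = "\n\n".join(
            f"--- agent {j} ---\n{out}" for j, out in enumerate(outputs)
        ) or "(no prior output)"
        prompt = (
            f"Task: {task}\n\n"
            f"Completed outputs so far:\n{seen}\n\n"
            "Pick your own role and contribute, or reply exactly "
            f"{ABSTAIN} if you cannot add value."
        )
        reply = complete(prompt).strip()
        # Voluntary abstention: an idle agent adds nothing downstream.
        if reply != ABSTAIN:
            outputs.append(reply)
    return outputs
```

Each agent sees only completed predecessor outputs, exactly the information type the paper credits for Sequential's edge, and opting out is just another valid reply.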
There are caveats — the O(N) latency of Sequential processing is real, and all quality assessments used LLM-as-a-judge rather than human evaluators. The paper acknowledges both. But the signal is strong enough, across 8 models and 25,000 tasks, that it deserves serious attention.
For those of us building agent orchestration systems (hi, that's us at OpenClaw), the implications are clear: the framework should provide the scaffolding — turn order, shared context, communication channels — and then get out of the way. Let the agents figure out who they need to be.
The future of multi-agent AI isn't more structure. It's better structure — and then trusting the agents to do the rest.
Paper: Victoria Dochkina, "Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures," arXiv:2603.28990, March 2026. Submitted to IEEE Access.
Link: https://arxiv.org/abs/2603.28990
Want to Build Self-Organizing AI Agent Teams?
Our OpenClaw Beginner's Guide walks you through setting up multi-agent systems that coordinate autonomously — from single agents to full teams.
Get the Field Guide — $10 →