Everyone obsesses over which model to use. GPT-5 or Claude? Gemini or Llama? But a new paper from Oxford and Parameter Lab asks a different question — and the answer should change how you build AI systems: what if the framework you wrap around your model matters just as much as the model itself?
The Problem Nobody Was Measuring
When you build an AI agent system, you make two major choices: which model to use, and which framework to orchestrate it. Think smolagents, LangGraph, LlamaIndex, AutoGen — the scaffolding that handles tool calling, multi-agent coordination, error recovery, and state management.
Every existing benchmark tests models. You swap GPT for Claude, run the same tasks, compare scores. But the framework stays fixed. Nobody was systematically measuring whether LangGraph vs. smolagents vs. LlamaIndex actually changes outcomes.
The MASEval team decided to find out.
The Experiment
The researchers built MASEval — a framework-agnostic evaluation library that treats the entire system as the unit of analysis, not just the model. They tested:
- 3 frameworks: smolagents, LangGraph, LlamaIndex
- 3 models: GPT-5-mini, Gemini 3.0 Flash, Claude Haiku 4.5 (all same capability/price tier)
- 3 benchmarks: MACS (enterprise collaboration), CONVERSE (agent security), MultiAgentBench (coordination + competition)
- 6 domain splits across those benchmarks
That's 54 distinct system configurations, all evaluated head-to-head.
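For concreteness, the grid works out exactly as stated; a minimal sketch of the enumeration (the domain split names here are placeholders, since the individual splits aren't named above):

```python
from itertools import product

frameworks = ["smolagents", "LangGraph", "LlamaIndex"]
models = ["GPT-5-mini", "Gemini 3.0 Flash", "Claude Haiku 4.5"]
domains = [f"domain_{i}" for i in range(1, 7)]  # placeholders for the 6 splits

# Every framework-model pair evaluated on every domain split
configs = list(product(frameworks, models, domains))
print(len(configs))  # 3 * 3 * 6 = 54 system configurations
```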
The Headline Finding
Framework choice impacts performance comparably to model choice.
Across all 6 domains:
- Swapping models (same framework) shifted scores by an average of 14.2 percentage points
- Swapping frameworks (same model) shifted scores by an average of 12.4 percentage points
Those averages are comparable in magnitude, and in 2 of 6 domains framework variability actually exceeded model variability.
The Numbers That Should Worry You
The most dramatic result: Claude Haiku 4.5 on a travel planning benchmark scored 90.4 with smolagents but only 59.5 with LlamaIndex. Same model, same task, same data. The only difference was the framework — and it created a 30.9 percentage point gap.
That's not a rounding error. That's the difference between a system that works and one that doesn't.
Here's a sample of results across the MACS Travel benchmark:
| Framework | GPT-5-mini | Gemini 3.0 Flash | Haiku 4.5 |
|---|---|---|---|
| smolagents | 59.8 | 84.0 | 90.4 |
| LangGraph | 60.8 | 85.8 | 68.3 |
| LlamaIndex | 71.0 | 74.7 | 59.5 |
No single framework wins everywhere. smolagents dominated with Haiku but underperformed with GPT-5-mini. LlamaIndex was the opposite. The interactions are complex and unpredictable.
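The two kinds of swing are easy to read off that table. A small sketch using the scores above, computing the max-min spread across models within each framework (rows) and across frameworks within each model (columns):

```python
# Scores from the MACS Travel table above
scores = {
    "smolagents": {"GPT-5-mini": 59.8, "Gemini 3.0 Flash": 84.0, "Haiku 4.5": 90.4},
    "LangGraph":  {"GPT-5-mini": 60.8, "Gemini 3.0 Flash": 85.8, "Haiku 4.5": 68.3},
    "LlamaIndex": {"GPT-5-mini": 71.0, "Gemini 3.0 Flash": 74.7, "Haiku 4.5": 59.5},
}

# Model swing: hold the framework fixed, vary the model
for fw, row in scores.items():
    print(fw, round(max(row.values()) - min(row.values()), 1))

# Framework swing: hold the model fixed, vary the framework
for m in next(iter(scores.values())):
    col = [row[m] for row in scores.values()]
    print(m, round(max(col) - min(col), 1))
```

On this slice, the largest framework swing (30.9 points for Haiku 4.5) edges out even the largest model swing (30.6 points within smolagents).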
Why This Happens — A Case Study
The paper documents a specific failure mode that illustrates why frameworks matter so much.
smolagents forces a tool call at every agent step. Most models handle this fine. But GPT-5-mini, combined with this mandatory tool-calling convention, entered a degenerate loop: it would ask clarifying questions, hit the turn limit, receive an error message, and then retry the same tool call up to 23 times — just rephrasing instead of changing strategy.
The result: it still completed the task, but consumed 10x more tokens than other model-framework combinations. The failure wasn't in the model (it worked fine on LangGraph). It wasn't in the framework (it worked fine with Haiku). It was in the interaction between the two.
"Framework conventions can combine with model tendencies to produce failures that neither component would exhibit in isolation."
What MASEval Actually Is
Beyond the findings, MASEval is an open-source tool (MIT license) for running these comparisons yourself. Key design decisions:
- System as unit of analysis — evaluates the complete stack, not just the model
- Framework-agnostic — bring your own framework, model provider, and logging
- Trace-first — every agent gets independent message history for debugging
- Adaptive testing — can estimate system performance within ~2 percentage points using only 1% of test tasks (via Item Response Theory)
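The adaptive-testing idea comes from Item Response Theory. Here is a toy one-parameter (Rasch) sketch of the concept, not MASEval's actual implementation: simulate a system with a hidden ability, repeatedly pick the task whose difficulty is closest to the current ability estimate, and re-fit the estimate by maximum likelihood. All names and numbers below are illustrative assumptions.

```python
import math, random

random.seed(0)

def p_correct(theta, b):
    # Rasch model: probability a system with ability theta solves an item of difficulty b
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, theta=0.0):
    # MLE via a few Newton steps on the Rasch log-likelihood, clamped for stability
    for _ in range(25):
        score = sum(y - p_correct(theta, b) for b, y in responses)
        info = sum(p_correct(theta, b) * (1 - p_correct(theta, b)) for b, y in responses)
        theta = max(-4.0, min(4.0, theta + score / max(info, 1e-6)))
    return theta

true_theta = 1.0                                           # hidden "system ability"
items = [(i, random.uniform(-3, 3)) for i in range(200)]   # (id, difficulty) task pool

theta_hat, responses, asked = 0.0, [], set()
for _ in range(20):  # ask only a small fraction of the 200-item pool
    # Adaptive step: the most informative item has difficulty nearest theta_hat
    i, b = min((it for it in items if it[0] not in asked),
               key=lambda it: abs(it[1] - theta_hat))
    asked.add(i)
    y = 1.0 if random.random() < p_correct(true_theta, b) else 0.0
    responses.append((b, y))
    theta_hat = estimate_theta(responses, theta_hat)

print(round(theta_hat, 2))  # adaptive estimate of the system's ability
```

The intuition: items far easier or harder than the system's ability reveal almost nothing, so choosing items near the current estimate extracts the most information per task, which is how a small adaptive sample can approximate a full evaluation.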
For benchmark builders, MASEval cuts implementation effort by 35–57%. For benchmark consumers (people running existing evals), it reduces interface code by 83–91%.
What This Means for Practitioners
If you're building AI agent systems in production, this paper has three implications:
Benchmark your full stack, not just your model. If you're evaluating a switch from GPT to Claude, test the new model inside your actual framework. Leaderboard scores tell you almost nothing about how a model will perform in your specific orchestration setup.
Framework selection is a first-class engineering decision. It's not just about developer experience or ecosystem maturity. Your choice of LangGraph vs. smolagents vs. LlamaIndex can swing performance by 30+ points on real tasks. Treat it with the same rigor as model selection.
Watch for interaction effects. The GPT-5-mini + smolagents failure mode wasn't predictable from evaluating either component alone. When you change one part of your stack, retest the whole system. Compositional behavior is not the sum of its parts.
The era of model-only benchmarks should be ending. MASEval makes a compelling case that we've been measuring the wrong thing — or at least, only half the right thing. Your AI agent's performance isn't just about having the best brain. It's about the entire nervous system.
MASEval is available at github.com/parameterlab/MASEval under MIT license.
Building Multi-Agent AI Systems?
The OpenClaw Field Guide covers real-world agent orchestration, framework selection, and production deployment patterns.
Get the Field Guide — $24 →