There's a narrative in the AI agent space that goes something like this: if you give a language model the ability to interact with its environment step by step — observing results, adjusting course, retrying when things go wrong — it should perform dramatically better than one-shot generation. After all, that's exactly how coding agents like Codex and Claude Code work. They run code, see errors, fix them, and iterate to success. Surely the same principle transfers to other domains?
A new paper from the Austrian Institute of Technology, "Agentic LLM Planning via Step-Wise PDDL Simulation", puts this assumption to the test in one of AI's most studied planning domains. The results should make anyone building agentic systems sit up and think carefully about what kind of feedback actually matters.
The Experiment: Blocks, Plans, and a 180-Second Clock
The researchers built PyPDDLEngine, an open-source PDDL (Planning Domain Definition Language) simulation engine that exposes seven operations as tool calls through a Model Context Protocol (MCP) interface. Instead of asking an LLM to generate a complete action plan upfront, the engine lets the model execute one action at a time, observe the resulting state, and decide what to do next — including resetting to start over.
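To make the step-wise setup concrete, here is a minimal sketch of that act-observe loop on a toy Blocksworld state. All names here (the state encoding, `clear`, `apply_action`) are illustrative stand-ins, not PyPDDLEngine's actual API:

```python
# A Blocksworld state: which block each block sits on (None = on the table).
state = {"A": "B", "B": None, "C": None}  # A is on B; B and C are on the table

def clear(state, block):
    """A block is clear if nothing sits on it."""
    return block not in state.values()

def apply_action(state, block, dest):
    """Move `block` onto `dest` (or the table if dest is None).
    Returns (new_state, applicable) -- note the feedback only says
    whether the move was legal, not whether it was a good idea."""
    if not clear(state, block) or (dest is not None and not clear(state, dest)):
        return state, False
    new_state = dict(state)
    new_state[block] = dest
    return new_state, True

# One step of the agentic loop: act, observe the new state, decide again.
state, ok = apply_action(state, "A", None)  # unstack A onto the table
```

The engine's reset operation would simply restore the initial state dictionary, letting the model abandon a bad trajectory and start over.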
They tested three approaches on 102 Blocksworld instances from the International Planning Competition, all under a uniform 180-second budget:
| Approach | Success Rate | How It Works |
|---|---|---|
| Fast Downward (classical planner) | 85.3% | Systematic symbolic search — the gold standard |
| Agentic LLM | 66.7% | LLM picks one action at a time via PyPDDLEngine, observes state |
| Direct LLM | 63.7% | LLM generates complete plan in one shot, retry on failure |
The agentic approach — the one with full step-by-step environmental interaction — beats the direct approach by exactly 3 percentage points, at a cost of 5.7x more tokens per solved instance.
Three percent. That's the gain from giving a language model eyes, hands, and the ability to restart.
The Numbers Tell a Surprising Story
When you dig into the data, the story gets even more interesting. Here's how the three approaches break down across difficulty levels:
- Easy instances (0-20 blocks): Both LLM approaches perform similarly. The agentic advantage is essentially zero here.
- Mid-range (20-60 blocks): The agentic approach tracks slightly above direct, but both decline steadily while Fast Downward maintains 100% success through block 70.
- Hard instances (80-90 blocks): The advantage actually inverts — the agentic approach succeeds on only 20% of instances while the direct approach hits 50%.
The token cost difference is significant: the direct approach averages 28,488 tokens per run versus 169,864 for the agentic approach — nearly 6x. Normalized per solved instance, it's 44,705 vs. 254,796 tokens.
Key Finding: The three additional instances the agentic approach solves over direct cost approximately 14.4 million additional tokens in total. That's roughly $4-8 at current API pricing for three extra solved puzzles.
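The arithmetic behind those figures is easy to check from the numbers quoted above (the solved counts of 68 agentic and 65 direct follow from the 66.7% and 63.7% rates over 102 instances):

```python
# Token-cost arithmetic from the figures in the text above.
direct_avg, agentic_avg = 28_488, 169_864      # average tokens per run
n = 102                                        # Blocksworld instances

extra_total = (agentic_avg - direct_avg) * n   # extra tokens for the agentic run
per_solved_agentic = agentic_avg * n / 68      # tokens per solved instance
per_solved_direct = direct_avg * n / 65        # ≈ 44.7k tokens per solved instance
```

`extra_total` comes to about 14.4 million tokens, bought back by just three additional solved instances.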
The Plan Quality Paradox
Here's where it gets strange. On the 49 instances that all three approaches solved (the "co-solved set"), both LLM approaches produced shorter plans than the classical planner's optimized output.
Fast Downward's seq-sat-lama-2011 configuration actively iterates to shorten plans within the time budget. It's specifically designed to improve plan quality. Yet both the direct LLM and agentic LLM beat it on plan length across most difficulty levels.
The researchers' explanation is uncomfortable but compelling: the LLMs aren't planning — they're remembering.
Blocksworld is one of the most extensively studied domains in AI planning literature. It appears in textbooks, papers, and tutorials going back decades. The LLMs have almost certainly seen optimal or near-optimal Blocksworld solutions during training. When they succeed, they're recalling patterns from training data. When they fail, no amount of step-by-step feedback helps them recover — because the model never had a genuine planning algorithm to begin with.
"When action names are syntactically relabelled, success rates collapse to near zero — pointing to approximate retrieval from training data rather than genuine reasoning."
This is consistent with prior work by Valmeekam et al. showing that LLM planning performance collapses when you simply rename the actions to unfamiliar terms. The "planning" was pattern matching all along.
Why Coding Agents Work and Planning Agents Don't
This is the paper's most valuable insight, and it has direct implications for anyone building AI agent systems.
Coding agents — the ones achieving impressive results on real-world programming benchmarks — benefit from a specific type of feedback: externally grounded signals. A failing test case, a compiler error, a runtime exception. These come from the environment itself. The model doesn't have to judge its own work. An external system says "this is wrong, and here's exactly how."
PDDL step-by-step feedback is fundamentally different. When the LLM executes an action in the simulation and observes the new state, all it learns is that the action was applicable. The feedback says "yes, you can do that." It doesn't say whether doing it was a good idea. It doesn't indicate distance from the goal. It doesn't flag unproductive trajectories.
The model is left to evaluate its own progress — and it's bad at that.
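The two feedback regimes can be put side by side in code. This is a sketch, not either system's real interface; the coding-side example uses Python's built-in `compile` as a stand-in for any external checker:

```python
def coding_agent_feedback(code: str) -> dict:
    """Externally grounded: an outside system evaluates the work and
    reports pass/fail with a concrete error -- no self-judgment needed."""
    try:
        compile(code, "<agent>", "exec")
        return {"ok": True}
    except SyntaxError as e:
        return {"ok": False, "error": str(e)}   # directional and actionable

def pddl_step_feedback(applicable: bool, new_state: dict) -> dict:
    """What the simulator returns: legality plus the raw state.
    Whether the move actually helped is left to the model to decide."""
    return {"applicable": applicable, "state": new_state}
```

The first function tells the agent exactly what is wrong; the second only tells it that nothing illegal happened.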
The Feedback Quality Principle: Agentic gains scale with the quality and directionality of environmental feedback. Self-assessed progress is not external verification. This is why coding agents leap ahead while planning agents barely inch forward.
The paper points to Reflexion-style work demonstrating that agents guided by test-runner feedback achieve large performance gains through verbal reinforcement — without any weight updates. The key ingredient isn't the agent loop. It's the signal quality.
The Early Exit Problem
The agentic approach introduces a failure mode that doesn't exist in any other configuration: early exit. On 6 instances, the model decides the problem is unsolvable and stops before the time budget expires.
On 4 of those 6 instances, the direct approach (which just keeps retrying) eventually finds a valid plan. The agentic model's self-assessment of unsolvability was factually wrong in the majority of cases.
This echoes a broader finding from Stechly et al.: asking LLMs to critique their own unexecuted plans doesn't improve performance. The gains from iterative prompting come from repeated sampling under an external verifier, not from the critique itself. Self-correction without external verification is unreliable.
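The pattern Stechly et al. credit for iterative-prompting gains — repeated sampling gated by an external verifier — fits in a few lines. Here `propose` and `verify` are placeholders for an LLM call and an external plan validator:

```python
def sample_until_verified(propose, verify, budget=10):
    """Draw candidates until one passes EXTERNAL verification.
    The candidate generator never judges its own output."""
    for _ in range(budget):
        candidate = propose()
        if verify(candidate):   # external check, not self-assessment
            return candidate
    return None                 # budget exhausted, report failure honestly
```

Note what is absent: there is no "does this look right to you?" step, and no early exit based on the generator's own opinion of solvability.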
What This Means for Agent Builders
If you're building agentic AI systems, this paper gives you a concrete design principle:
1. Audit Your Feedback Loops
Not all tool-use feedback is created equal. Ask yourself: when my agent takes an action and observes the result, is the feedback externally grounded (produced by the environment independent of the model's judgment) or self-assessed (the model interpreting its own output)?
- High-quality feedback: Test results, compiler errors, API response codes, user behavior metrics, database query results
- Low-quality feedback: State observations the model must interpret, progress assessments the model generates about itself, "did that look right?" reflections
2. Don't Assume the Coding Agent Pattern Transfers
The success of coding agents has created a general expectation that agentic loops improve everything. They don't. The magic ingredient in coding agents isn't the loop — it's the compiler and test suite providing unambiguous, externally grounded feedback. Domains without that kind of signal won't see the same gains.
3. Invest in Better Signals, Not More Iterations
The paper suggests a concrete next step: augmenting PyPDDLEngine with goal-distance heuristics in per-step feedback. Instead of just "action applied successfully," tell the model "you are now 12 steps from the goal" or "this action moved you further from the goal." That's the kind of externally grounded progress signal that could actually help.
For your own systems, the equivalent question is: what objective metric can you inject into the feedback loop that the model doesn't have to generate itself?
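As one illustration of injecting such a signal, here is a toy goal-distance heuristic for the Blocksworld-style state encoding used earlier: count the blocks whose support differs from the goal configuration. The heuristic and function names are our own simple stand-ins, not the paper's proposal:

```python
def misplaced(state: dict, goal: dict) -> int:
    """Number of blocks not resting on their goal support."""
    return sum(1 for b in goal if state.get(b) != goal[b])

def step_feedback(applicable: bool, state: dict, goal: dict) -> dict:
    """Per-step feedback enriched with an objective progress signal
    the model does not have to generate itself."""
    fb = {"applicable": applicable}
    if applicable:
        fb["blocks_out_of_place"] = misplaced(state, goal)
    return fb
```

Even this crude count turns "yes, you can do that" into "yes, and you are now closer to (or further from) the goal" — the directional signal the paper argues is missing.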
4. Recognize Memorization Masquerading as Capability
When your agent handles familiar task patterns effortlessly but falls apart on novel variations, that's the memorization signature. The LLMs in this study produced near-optimal plans on Blocksworld — a domain saturated in their training data — yet couldn't recover when problems exceeded their training distribution.
Test your agents on unfamiliar variations of their target tasks. If performance craters, your agent is doing retrieval, not reasoning.
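A cheap version of the relabelling probe from the quoted finding: rewrite the familiar Blocksworld action names in the PDDL domain text to meaningless tokens before testing. This regex-based sketch is simplified (a real probe would parse the PDDL and relabel predicates too):

```python
import re

# Illustrative mapping from familiar Blocksworld action names to opaque tokens.
RELABEL = {
    "pick-up": "op-alpha",
    "put-down": "op-beta",
    "stack": "op-gamma",
    "unstack": "op-delta",
}

def relabel_domain(pddl_text: str) -> str:
    """Replace well-known action names with semantics-free tokens."""
    for old, new in RELABEL.items():
        pddl_text = re.sub(rf"\b{re.escape(old)}\b", new, pddl_text)
    return pddl_text
```

If an agent's success rate collapses on the relabelled domain — as the paper reports — it was retrieving memorized Blocksworld patterns, not planning.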
The Bottom Line
Current LLM planning agents function as what the researchers call "adaptive navigators of familiar problem spaces rather than general-purpose planners." They work brilliantly on problems they've seen before and fail on problems they haven't. Step-by-step interaction doesn't fundamentally change this — it just costs more tokens.
The path forward isn't more agent loops. It's better feedback signals. Externally grounded, objective, progress-indicating signals that don't depend on the model's self-assessment. That's the difference between a coding agent that improves with each iteration and a planning agent that spins its wheels.
Paper Details
- Title: Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation
- Authors: Kai Göbel, Pierrick Lorang, Patrik Zips, Tobias Glück (AIT Austrian Institute of Technology)
- Published: March 6, 2026
- arXiv: 2603.06064
- Code: PyPDDLEngine on GitHub