Long-horizon agents have a memory problem that no one likes to name out loud. They solve a sub-task in episode forty-seven, then solve the same sub-task again, from scratch, in episode forty-eight. The model technically learned something. The system retained nothing. A new preprint, Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks, submitted to arXiv on 2026-04-22, takes that complaint seriously. It proposes COSPLAY, a co-evolution framework that splits the agent into two cooperating roles: one that decides what to do now, and one that mines reusable skills from prior rollouts and keeps them in a maintainable bank.
The framing matters more than the benchmark numbers, and we will get to why. But first, the basics.
What the paper actually proposes
COSPLAY is a two-agent co-evolution architecture built on top of a Qwen3-8B base model. One agent is the decision agent: it interacts with the environment, plans actions, and produces rollouts. The second agent is the skill-bank agent: it observes those rollouts and extracts reusable procedural skills, then maintains them as entries in a skill bank with explicit contracts describing when each skill applies and what it expects.
The mining pipeline works on unlabeled rollouts. That detail is important. The skill-bank agent does not require curated demonstrations or a hand-built skill library. It looks at what the decision agent already did, identifies recurring useful patterns, and lifts them into named, contracted skills the decision agent can invoke in later episodes. The two agents then co-evolve: better skills make the decision agent stronger; stronger rollouts produce better candidate skills.
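To make the shape concrete, here is a minimal sketch of what a contracted skill entry and a mining pass over unlabeled rollouts could look like. Everything here is our assumption, not the paper's code: the `Skill` schema, the frequency-based mining heuristic, and the field names are all hypothetical stand-ins for what COSPLAY does with a second LLM agent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Skill:
    """A mined skill with an explicit contract (hypothetical schema)."""
    name: str
    applies_when: str           # natural-language precondition
    expected_inputs: list[str]  # what the skill needs from the state
    steps: list[str]            # the procedural body, as action templates


def mine_skills(rollouts: list[list[str]], min_support: int = 2) -> list[Skill]:
    """Lift recurring action subsequences from unlabeled rollouts into
    named, contracted skills. Frequency counting is a crude stand-in for
    the LLM-based extraction the paper describes."""
    counts: Counter = Counter()
    for actions in rollouts:
        for n in (2, 3):  # short windows only, to keep candidates readable
            for i in range(len(actions) - n + 1):
                counts[tuple(actions[i:i + n])] += 1

    skills = []
    for seq, support in counts.items():
        if support >= min_support:
            skills.append(Skill(
                name="skill_" + "_".join(seq),
                applies_when=f"pattern seen {support}x in prior rollouts",
                expected_inputs=["board_state"],
                steps=list(seq),
            ))
    return skills
```

The point of the sketch is the input type: plain action traces the agent already produced, with no labels and no human-authored library.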
The authors evaluate on six game environments: 2048, Tetris, Candy Crush, Super Mario Bros, Avalon, and Diplomacy. The first four are single-player. The last two are multi-player social and strategic games.
What the benchmark numbers show
On the single-player game benchmarks, the abstract reports an average reward improvement of more than 25.1% over four frontier LLM baselines. That is the headline number, and it is the one that will get screenshotted. It is worth holding in context.
Read this carefully. The 25.1% average improvement is on single-player game benchmarks against four frontier LLM baselines, with COSPLAY itself built on a Qwen3-8B base. It is not a claim that an 8B model now beats frontier models on general reasoning. It is a claim that the COSPLAY architecture, on top of Qwen3-8B, beats those baselines on this specific game suite.
The multi-player results are framed differently in the project page material. There, COSPLAY is reported as competitive on Avalon and Diplomacy, not category-leading. That is an honest framing from the authors and worth preserving when the result gets passed around. Strategy and social-deduction games stress different skills than a Tetris board, and the architecture's advantages compress when the environment is genuinely adversarial and partially observable.
One additional reported result is useful for sanity-checking. On MMLU-Pro and Math-500, the project page reports small drops versus the Qwen3-8B base. In other words, the co-evolution training does not appear to crater general reasoning. It also does not improve it. The takeaway is that COSPLAY is a procedural-skill story, not a general-reasoning story, and the authors do not pretend otherwise.
The portable idea is the architecture, not the games
If you build long-horizon agents for anything that is not a game, the temptation is to skim past this paper. That would be a mistake, but for a non-obvious reason. The interesting transferable artifact here is not the reward number on Tetris. It is the shape of the skill-bank pipeline.
Three design choices stand out:
- Skills are extracted, not authored. The skill-bank agent mines candidate skills from unlabeled rollouts the decision agent already produced. This is the part most production agent stacks get wrong: they require a human to sit down and write a tool, a prompt, or a sub-policy. COSPLAY treats the rollout log as the raw material.
- Skills carry explicit contracts. A skill is not a free-floating snippet of behavior. It has preconditions, expected inputs, and a description of when it applies. That is what makes it usable later without supervision.
- The two roles co-evolve rather than alternate. Acting and skill-extraction are not phases. They are continuous, cooperating processes. The decision agent benefits from the skill bank; the skill-bank agent benefits from richer rollouts.
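The three pressures compose into one loop. The sketch below is our reading of that loop, not the paper's training code: the agent callables, the bank representation, and the interleaving schedule are all assumptions, and the real system wraps this in RL training rather than plain function calls.

```python
def co_evolve(decision_agent, skill_bank_agent, env, episodes, bank=None):
    """Interleave acting and skill extraction as continuous, cooperating
    processes rather than alternating phases (hypothetical signatures).

    decision_agent(env, bank)      -> one rollout trace
    skill_bank_agent(rollouts, bank) -> updated skill bank
    """
    bank = [] if bank is None else bank
    rollouts = []
    for _ in range(episodes):
        # Decision agent acts with the current skill bank in context.
        trace = decision_agent(env, bank)
        rollouts.append(trace)
        # Skill-bank agent refines the bank from the growing rollout log,
        # so the very next episode already benefits.
        bank = skill_bank_agent(rollouts, bank)
    return bank, rollouts
```

Note that the bank update happens inside the episode loop, not after a batch of episodes: that is the "co-evolve rather than alternate" choice in code form.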
None of this is a brand-new idea in the abstract sense. Procedural memory, hierarchical RL, and tool synthesis have all explored neighboring ground. What COSPLAY contributes is a clean, working instantiation of all three pressures at once on a base model that is not frontier-scale. That is the part builders should pay attention to.
What practitioners can borrow right now
This is where we have to be careful. The paper evaluates on games. Translating game-bench gains into enterprise-agent gains is not a proven result, and we are not going to pretend it is. The points below are inference, not benchmark-backed claims.
That said, the skill-bank shape itself is small enough to copy. If you run an agent loop in production today, you almost certainly already have the raw material:
- You have rollouts. Every completed task trace is a candidate skill source. Most teams throw these away after analytics. Stop doing that.
- You can mine them. A second agent, run offline, can scan recent successful traces and propose named skills with explicit input/output contracts. You do not need a custom RL stack to try this.
- You can gate adoption. Mined skills should not silently enter the live agent. They should be reviewed, contract-checked, and version-pinned before the decision agent is allowed to call them.
- You can measure regression. COSPLAY's MMLU-Pro and Math-500 deltas are a good reminder. Every time you add to the skill bank, re-run a small held-out task suite to confirm you have not damaged general behavior.
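The adoption gate from the last two bullets is small enough to write down. This is a sketch of our suggested pilot, not anything from the paper: the candidate schema, the review flag, and the pass-rate threshold are all assumed.

```python
def gate_skill(candidate: dict, reviewer_approved: bool,
               regression_suite: list, agent_with_skill) -> bool:
    """Decide whether a mined skill may enter the live agent.
    All names and the 0.95 threshold are hypothetical choices."""
    # 1. Human review is mandatory; no silent adoption.
    if not reviewer_approved:
        return False
    # 2. Contract check: the skill must declare when it applies
    #    and what inputs it expects.
    if not candidate.get("applies_when") or not candidate.get("inputs"):
        return False
    # 3. Regression: run a small held-out task suite with the skill
    #    enabled and require a minimum pass rate before adoption.
    passed = sum(1 for task in regression_suite if agent_with_skill(task))
    return passed / len(regression_suite) >= 0.95
```

Version-pinning the approved skill and logging the regression score alongside it would close the loop, but even this bare gate prevents the worst failure mode: a mined skill quietly degrading the agent in production.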
The honest version of the practitioner takeaway is this: the COSPLAY paper does not prove that skill banks work for your domain, but it provides the cleanest publicly described template we have seen for one, and the template is cheap enough to pilot.
What the results do not show
A few caveats, kept visible on purpose:
- This is a preprint. It has not been peer reviewed at the time of writing.
- Evaluation is game-centric. There is no enterprise, coding, or open-web agent benchmark in the reported set.
- The training stack is heavyweight. Co-evolving two agents is not a cheap fine-tune; teams without RL infrastructure will need to adapt the pattern, not the pipeline.
- Multi-player results are competitive, not dominant. Adversarial and partially observable settings still erode the architecture's advantages.
- Generalization beyond games is, at this point, inferred. The abstract and project page do not claim transfer to non-game agent tasks, and neither do we.
Verdict
Worth reading, worth piloting in narrow form, not worth overclaiming. COSPLAY's reported 25.1% average reward improvement on single-player games over four frontier LLM baselines is real but bounded: it sits on a Qwen3-8B base, it is game-centric, and multi-player gains are competitive rather than category-leading. The portable contribution is the skill-bank architecture itself: mine reusable skills from unlabeled rollouts, give them explicit contracts, and let the decision agent and the skill-extractor co-evolve. That pattern is cheap enough to prototype outside an RL lab, and it directly attacks the most common failure mode of production agents: solving the same sub-task forever, episode after episode, with nothing to show for it.
Want Better Agent Systems Without the Usual Chaos?
Our OpenClaw Beginner's Guide shows how to structure tools, memory, and automation so your agents stay useful under real load.
Get the Field Guide — $20 →