Here's something that should bother anyone building agents on top of reasoning models: the retrieval-augmented generation pipeline you've carefully tuned for your application might, in some settings, be making your reasoning model worse.
That's one of the more striking findings from “Procedural Knowledge at Scale Improves Reasoning”, an April 2026 preprint from Meta FAIR and UCLA. The paper introduces a system called Reasoning Memory that harvests millions of problem-solving recipes from existing reasoning traces and retrieves them mid-thought. The reported gains are real within the paper's setup. But the more interesting result is what it suggests about a mismatch between how we currently do retrieval and what some reasoning models may need.
Caveat: this is a preprint, not peer-reviewed work. The results have not been independently replicated. Treat it as a strong signal worth understanding, not settled science.
The Procedural Knowledge Problem
When a reasoning model like DeepSeek-R1-Distill-Llama-8B, OpenThinker3-7B, or Qwen3-32B works through a hard math problem, it generates a long chain of thought. Along the way, it discovers strategies: how to decompose a combinatorics problem, when to switch from algebraic manipulation to geometric reasoning, how to verify an intermediate result. These strategies are genuine procedural knowledge, the kind of meta-cognitive skill that experienced problem-solvers carry from one problem to the next.
Then the model finishes, returns the answer, and throws all of that away.
Every new problem starts from zero. The model re-derives strategies it has effectively already discovered. This is the core waste the paper identifies: reasoning models generate enormous amounts of procedural knowledge during inference, and none of it persists. Test-time compute scaling, the approach of simply letting models think longer, doesn't fix this. Longer thinking just means more time spent rediscovering strategies from scratch.
Why Document RAG Can Make Things Worse
If your instinct is “just use RAG to give the model useful context,” you're not wrong in spirit. But the paper's results suggest you might be wrong about what kind of context to retrieve.
Standard document-level RAG retrieves chunks of factual text that seem relevant to the input query. For a reasoning model mid-thought, this creates a misalignment. The model isn't stuck because it lacks facts. It's stuck because it hasn't found the right approach. Feeding it a Wikipedia paragraph or a textbook excerpt when it needs a problem-solving strategy is like handing a lost hiker an encyclopedia entry about forests when what they need is a compass bearing.
The paper tests document RAG as a baseline and finds that standard document-level retrieval can degrade performance relative to no retrieval in several tested settings, though not universally. This is a genuinely important result for practitioners. If you're building retrieval pipelines for agent systems that use reasoning models, you should be asking whether your retrieval is aligned with what the model actually needs at each point in its reasoning chain.
How Reasoning Memory Works
The system has three phases, and the design choices in each one matter.
Building the recipe database
The authors start with the Nemotron Post-Training Dataset, a corpus of reasoning trajectories covering math, science, and coding. For each trajectory, an LLM pipeline extracts the self-contained sub-questions the model was implicitly solving, then abstracts each one into a concise subroutine: a reusable recipe for how to approach that type of sub-problem. The result is 32 million sub-question/subroutine pairs. The average sub-question is about 19 tokens long. The average subroutine runs about 208 tokens. Each original trajectory yields roughly 10–11 sub-questions.
This decomposition is doing important work. It's not storing “here's how to solve Problem #4,271.” It's storing “when you encounter a sub-problem that looks like X, here's a procedural approach that has worked.” The grain size is the sub-question, not the full problem, which is what makes retrieval useful mid-thought.
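The extraction step can be sketched in a few lines. Everything here is illustrative: `Recipe`, `extract_recipes`, and `toy_extractor` are hypothetical names, and the toy extractor stands in for the paper's actual LLM pipeline, which does the decomposition and abstraction. The point is the shape of the data, not the extraction logic.

```python
from dataclasses import dataclass

@dataclass
class Recipe:
    sub_question: str  # ~19 tokens on average in the paper
    subroutine: str    # ~208 tokens on average
    source_id: str     # trajectory the pair was extracted from

def extract_recipes(trajectory_id, trajectory_text, extractor):
    """Decompose one reasoning trajectory into sub-question/subroutine
    pairs. `extractor` stands in for the paper's LLM extraction
    pipeline; here it is any callable returning (q, s) tuples."""
    return [Recipe(q, s, trajectory_id) for q, s in extractor(trajectory_text)]

def toy_extractor(text):
    # Hand-written stand-in for the LLM step, for illustration only.
    return [("How to count lattice paths on a grid?",
             "Model each path as a sequence of moves and count with C(m+n, n).")]

recipes = extract_recipes("traj-001", "...full chain of thought...", toy_extractor)
```

Note that the stored key is the abstracted sub-question, not the original problem statement, which is what lets a later, different problem hit the same recipe.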
In-thought active retrieval
At inference time, the system intercepts the model's thinking process. Using a technique the authors call “thought hijacking,” they prompt the model to verbalize the sub-question it's currently working on. That verbalization becomes a retrieval query against the 32M-pair datastore, using a specialized retriever called ReasonIR-8B. The top matching subroutines are injected directly into the model's thinking stream as hints, and the model continues reasoning with that procedural guidance in context.
This is retrieval happening inside the chain of thought, not before it. The model isn't getting a static context window stuffed with potentially relevant information. It's getting targeted procedural advice at the moment it's trying to solve a specific sub-problem.
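The control flow can be sketched as follows. This is a hypothetical reconstruction under stated assumptions, not the authors' implementation: `ToyModel`, `think_until_subquestion`, `continue_with_hint`, and `toy_retriever` are all invented stand-ins for the paper's prompt-based thought hijacking and the ReasonIR-8B retriever.

```python
def reason_with_memory(problem, model, retriever, k=3, max_steps=4):
    """Control-flow sketch of in-thought retrieval: elicit the current
    sub-question, use it as a retrieval query, inject the retrieved
    subroutines back into the thinking stream, and continue."""
    thought = f"<think> Problem: {problem}"
    for _ in range(max_steps):
        # 1. Hijack the thought: prompt the model to verbalize the
        #    sub-question it is currently working on.
        sub_question, done = model.think_until_subquestion(thought)
        if done:
            break
        # 2. Query the recipe datastore with that verbalization.
        hints = retriever(sub_question, k=k)
        # 3. Inject the subroutines as hints and keep reasoning.
        thought += f"\n[sub-question] {sub_question}\n[hints] " + " | ".join(hints)
        thought = model.continue_with_hint(thought)
    return thought

class ToyModel:
    """Deterministic stand-in for a reasoning model: surfaces one
    sub-question, then reports that it is done."""
    def __init__(self):
        self.calls = 0
    def think_until_subquestion(self, thought):
        self.calls += 1
        return "How do I factor this quartic?", self.calls > 1
    def continue_with_hint(self, thought):
        return thought + "\n...continued reasoning with the hint..."

def toy_retriever(query, k=3):
    # Stand-in for ReasonIR-8B over the 32M-pair datastore.
    return ["Try substituting y = x^2 to reduce to a quadratic."][:k]

trace = reason_with_memory("Solve x^4 - 5x^2 + 4 = 0", ToyModel(), toy_retriever)
```

The key structural property is that retrieval is triggered per sub-question inside the loop, not once per problem before generation begins.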
Scaling through diversity
When the system has a compute budget for multiple samples, it allocates runs across different retrieved subroutines rather than concentrating all samples on a single strategy. This diversity-first approach is one of the paper's more practical insights. Exploring three different problem-solving strategies with ten samples each outperforms running thirty samples on a single strategy. The system scores trajectories by thinking length, where shorter traces correlate with higher confidence, and selects among them.
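The allocation and selection logic above can be sketched like this. It is a minimal illustration, not the paper's code: `run_fn` is a hypothetical hook for one retrieval-conditioned reasoning run, and the toy run function fakes trace lengths deterministically.

```python
def diversity_first_sample(problem, strategies, run_fn, budget=30):
    """Spread the sample budget across distinct retrieved strategies
    rather than concentrating it on one, then select via the length
    heuristic: shorter thinking traces are treated as more confident.
    `run_fn(problem, strategy)` returns (answer, trace_length)."""
    per_strategy = max(1, budget // len(strategies))
    candidates = []
    for strategy in strategies:
        for _ in range(per_strategy):
            answer, trace_length = run_fn(problem, strategy)
            candidates.append((trace_length, answer, strategy))
    # Pick the candidate with the shortest thinking trace.
    trace_length, answer, strategy = min(candidates)
    return answer, strategy

def toy_run(problem, strategy):
    # Deterministic stand-in: each strategy yields a fixed trace length.
    lengths = {"substitution": 120, "brute force": 900, "coordinate geometry": 400}
    return f"answer via {strategy}", lengths[strategy]

answer, strategy = diversity_first_sample(
    "toy problem", ["substitution", "brute force", "coordinate geometry"], toy_run
)
```

With a budget of 30 and three strategies, this runs ten samples per strategy, which is exactly the three-strategies-of-ten allocation the paper found to beat thirty samples of one.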
The Numbers, With Context
The headline results are strong, but they need qualification.
The maximum accuracy gain over no retrieval is 19.2%. The maximum gain over the strongest compute-matched baseline is 7.9%. Both of these are peak figures, not averages. They represent the best case across benchmarks and models.
At a sample budget of 30, the averaged gains across models are more grounded: roughly +12 percentage points on math benchmarks, +5 on science, and +9 on coding. Reasoning Memory beats length scaling in 31 out of 36 benchmark/model combinations, which is a strong consistency signal even if the magnitude varies.
The benchmarks are exclusively hard reasoning tasks: AIME 2024 and 2025, MATH500, GPQA-Diamond, and two LiveCodeBench slices. The models tested are DeepSeek-R1-Distill-Llama-8B, OpenThinker3-7B, and Qwen3-32B. These are capable open-weight reasoning models, but they do not establish that the same gains will hold at larger frontier scales or on closed models.
What This Paper Does Not Prove
It's worth being explicit about the boundaries of these results.
- This hasn't been tested on non-reasoning models. The entire approach is designed for models that produce extended chains of thought. Whether procedural retrieval helps standard instruction-following models is an open question.
- The benchmarks are narrow. Hard math, competition science, and coding problems are important but don't represent the full range of reasoning tasks people care about. Open-ended analysis, multi-step planning with ambiguous constraints, or reasoning over messy real-world data are all untested.
- Compute overhead is not fully characterized. Building and querying a 32-million-pair datastore, running a retriever at each thought step, and generating multiple diverse trajectories all have real costs. The paper doesn't provide a complete serving-cost analysis, which matters for anyone thinking about production deployment.
- The uncertainty heuristic is a proxy. Using thinking trace length as a confidence signal is clever and seems to work empirically, but it's not a principled uncertainty estimate. There are likely failure modes where a short, confident trace is confidently wrong.
- No independent replication exists yet. This is a single group's result on their specific pipeline. The intuition is strong, but both the size and reliability of the gains still need replication.
Why Agent Builders Should Pay Attention
Even with those caveats, there are several reasons this paper matters for people building AI systems in production.
It reframes what retrieval should look like for reasoning agents. If you're building an agent that solves similar classes of problems repeatedly, such as debugging, data analysis, or code generation, you probably have a growing corpus of successful reasoning traces. The standard approach is to fine-tune on them or use them for few-shot prompting. This paper suggests a third path: decompose those traces into procedural subroutines and retrieve them at the sub-problem level during inference. No fine-tuning required, and the knowledge base can grow continuously.
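A minimal version of such a store fits in a class. This is a sketch under loose assumptions: `RecipeStore` is an invented name, and token overlap stands in for a trained retriever (the paper uses ReasonIR-8B); a production system would use learned embeddings.

```python
class RecipeStore:
    """Minimal sketch of a growing procedural-memory store for an
    agent: add (sub_question, subroutine) pairs as traces accumulate,
    retrieve by similarity at inference time."""
    def __init__(self):
        self.pairs = []  # (sub_question, subroutine)

    def add(self, sub_question, subroutine):
        self.pairs.append((sub_question, subroutine))

    def retrieve(self, query, k=3):
        # Naive token-overlap scoring; a real system would embed both
        # sides with a trained retriever instead.
        q = set(query.lower().split())
        scored = [(len(q & set(sq.lower().split())), sr)
                  for sq, sr in self.pairs]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [sr for score, sr in scored[:k] if score > 0]

store = RecipeStore()
store.add("how to reproduce a flaky test failure",
          "Re-run the test in isolation with a fixed random seed, then bisect recent commits.")
store.add("how to profile a slow SQL query",
          "Run EXPLAIN ANALYZE and check for sequential scans on large tables.")

hints = store.retrieve("the test failure is flaky, how do I reproduce it?")
```

The store only ever grows: every successful trace your agent produces can be decomposed and added, with no retraining step in the loop.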
It makes the case for sub-question-level retrieval. Most RAG systems operate at the document or paragraph level. Reasoning Memory operates at the sub-question level, roughly 19 tokens for the query and 208 tokens for the retrieved content. For agent architectures that involve multi-step reasoning, this granularity might be closer to what's actually useful.
The diversity-first scaling result is immediately actionable. If you're already using best-of-N sampling with reasoning models, the finding that diverse procedural priors outperform single-strategy intensity is worth testing today. You don't need the full Reasoning Memory pipeline to explore whether varying the strategic framing across samples improves your results.
It suggests a new kind of organizational knowledge asset. Companies that run reasoning models at scale are generating enormous volumes of reasoning traces. Most of that procedural knowledge disappears. A decompose-and-index pipeline like the one described here could turn those traces into a durable, searchable knowledge base that improves over time. This is a concrete path toward systems that learn from their own inference history without weight updates.
The Practical Verdict
This paper's most durable contribution isn't the specific numbers. It's the argument that, for the evaluated tasks and models, procedural knowledge may matter more than factual knowledge at inference time. That's a claim with real architectural implications if it continues to hold.
If you're building retrieval systems for reasoning-heavy agents, the key takeaway is to examine whether your retrieval is aligned with what the model needs when it's stuck. Facts and documents answer “what.” Procedures answer “how.” For models that think step by step, “how” is usually the bottleneck.
The 32 million recipe datastore, the thought-hijacking retrieval mechanism, and the diversity-first scaling strategy are all interesting engineering choices. But the core insight is simpler: reasoning models appear to waste useful procedural knowledge, and this paper offers one plausible way to retain and reuse it.
Whether Reasoning Memory specifically becomes a standard component in agent architectures remains to be seen. But the direction it points toward, persistent procedural memory for inference-time reasoning, is one that agent builders should be thinking about now.
Build Smarter AI Systems With Alchemic Technology
We help teams design retrieval and agent architectures that actually align with how modern reasoning models work. If you're building systems that think, let's talk about making them think better.
Get the Field Guide — $10 →