Every AI agent has the same memory problem. Give it a long conversation history, a pile of retrieved documents, and a complex task — and you're burning through context window tokens just to remind the agent what it already knows. Current approaches boil down to two options: stuff the text into the prompt (expensive, limited), or fine-tune the model on new facts (catastrophic forgetting, even more expensive). Neither scales well.

A new paper from researchers at Renmin University of China introduces NextMem, a framework that takes a third path: compress factual memories into compact latent vectors that an LLM can read directly, then decompress them back into the original text when needed — with surprisingly high accuracy.

The Memory Trilemma

LLM-based agents need to remember things. The information from step 5 of a task might not be useful until step 50. Current memory architectures handle this in two ways, both with significant trade-offs:

Textual memory stores facts as text, retrieves them via embedding similarity, and injects them into the prompt. It works, but every retrieved passage eats into your context window. For an agent managing hundreds of observations across a long task, context pressure becomes the bottleneck.

Parametric memory bakes new information directly into model weights — essentially fine-tuning on the fly. The problem is catastrophic forgetting: new facts overwrite old ones. And the compute cost of modifying weights for every new observation is prohibitive for real-time agent workflows.

NextMem proposes a third category: latent memory. Encode text into a small number of dense vectors. Store those vectors. When you need the information back, decode the vectors into the original text. The compression ratio is dramatic — 128 tokens of text become 15 latent tokens — and the reconstruction quality is high enough to be practically useful.
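
To make the trade-off concrete, here is back-of-envelope arithmetic. The hidden size of 4096 (plausible for Qwen3-8B) and the 4-bit storage are assumptions for illustration; the 128-to-15 ratio is from the paper:

```python
# Back-of-envelope memory arithmetic for one 128-token chunk.
# Assumptions (not from the paper): hidden size 4096, 4-bit storage.
text_tokens = 128
latent_tokens = 15
hidden_dim = 4096            # assumed hidden size of the backbone

context_ratio = text_tokens / latent_tokens
print(f"context compression: {context_ratio:.1f}x")   # ~8.5x fewer prompt tokens

# On-disk cost of the latents at 4 bits per value:
latent_bytes = latent_tokens * hidden_dim * 4 // 8
print(f"quantized latent storage: {latent_bytes / 1024:.0f} KiB per chunk")
```

Note the asymmetry: the win is in prompt tokens, not bytes. The raw vectors are far larger than the original text on disk, which is why the quantization step discussed later matters.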

How NextMem Works

The architecture is an autoregressive autoencoder built on top of a standard causal language model (they use Qwen3-8B as the backbone). The encoder and decoder share the same base weights but use separate LoRA adapters, so you can switch between encoding and decoding by swapping adapters — no need to load two full models.
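
The shared-backbone idea can be sketched with a toy linear layer: the base weight is frozen and shared, and each role contributes only a low-rank delta, so switching roles means switching which delta is applied. This is a conceptual sketch, not the paper's code; a real implementation would use a LoRA library such as PEFT:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 16, 2
W_base = rng.normal(size=(d, d))          # frozen base weights, shared by both roles

# One low-rank (A @ B) delta per role -- the "separate LoRA adapters"
adapters = {
    role: (rng.normal(size=(d, rank)) * 0.1, rng.normal(size=(rank, d)) * 0.1)
    for role in ("encoder", "decoder")
}

def forward(x, role):
    A, B = adapters[role]
    return x @ (W_base + A @ B)           # base weights + role-specific update

x = rng.normal(size=(1, d))
enc_out = forward(x, "encoder")
dec_out = forward(x, "decoder")
# The base weights sit in memory once; only the small deltas differ per role.
```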

The encoding process works like this:

  1. Take the input text and append a special [SoD] (Start of Decode) token
  2. Run it through the encoder and extract the hidden state at the last position — that's your first latent token
  3. Append that latent token to the input and run the encoder again to get the second latent token
  4. Repeat until you have your target number of latent tokens (15 in their experiments)

Decoding reverses the process: feed the latent tokens into the decoder, and it autoregressively generates the original text.
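
The encoding loop can be sketched with a stand-in forward pass. In the real system the encoder is the LoRA-adapted Qwen3-8B; here a toy function illustrates the extract-append-repeat structure:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(size=(d, d)) / np.sqrt(d)

def mock_encoder(seq):
    # Stand-in for a causal LM forward pass: per-position hidden states.
    return np.tanh(seq @ W)

def encode_to_latents(text_emb, sod_emb, n_latents=15):
    seq = np.vstack([text_emb, sod_emb[None, :]])   # text + [SoD]
    latents = []
    for _ in range(n_latents):
        hidden = mock_encoder(seq)
        latent = hidden[-1]                         # hidden state at the last position
        latents.append(latent)
        seq = np.vstack([seq, latent[None, :]])     # append the latent, run again
    return np.stack(latents)

text_emb = rng.normal(size=(128, d))                # embeddings of a 128-token chunk
sod_emb = rng.normal(size=d)
latents = encode_to_latents(text_emb, sod_emb)
print(latents.shape)                                # (15, 32)
```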

The clever part is the training. Getting this to work requires a two-stage process that progressively teaches the encoder to produce useful latent representations.

Two-Stage Training: The Key Insight

Stage 1: Autoregressive Reconstruction Alignment. Before teaching the model to work with latent vectors, they first train it to copy text through itself. The input is the original text followed by [SoD] followed by the same text again. The model learns to reconstruct the input given the prefix — essentially training a text-to-text autoencoder. This establishes the decoder's ability to faithfully reproduce sequences.

Stage 2: Progressive Latent Substitution. This is where it gets interesting. They gradually replace chunks of the text prefix with latent vectors, forcing the model to reconstruct the missing text from the latent representations instead of from the original tokens. The substitution is progressive — first replace one chunk, train until stable, then replace two chunks, and so on — until the entire prefix is replaced by latent vectors.

💡 Why progressive? If you try to go straight from full text to full latent representation, the training collapses. The progressive approach creates a curriculum where each step only slightly increases the difficulty, and the encoder parameters from the previous step initialize the next. The ablation study confirms this: removing the progressive strategy drops F1 from 0.855 to 0.739.

During Stage 2, the decoder is frozen — only the encoder learns. This forces the encoder to produce representations that are compatible with the already-trained decoder, rather than allowing the two to co-adapt in ways that might not generalize.
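
A sketch of how the curriculum's inputs might be constructed, assuming the text prefix is split into sub-chunks that are replaced left to right (the chunking and replacement order are my assumptions; the paper only specifies the progressive schedule):

```python
def build_stage2_prefix(text_chunks, latent_chunks, n_replaced):
    """Replace the first n_replaced text chunks with their latent counterparts."""
    return latent_chunks[:n_replaced] + text_chunks[n_replaced:]

text_chunks = ["t1", "t2", "t3"]        # token sub-chunks (stand-ins)
latent_chunks = ["z1", "z2", "z3"]      # latent vectors for each sub-chunk (stand-ins)

# The curriculum: train to convergence at each step before advancing.
for step in range(len(text_chunks) + 1):
    prefix = build_stage2_prefix(text_chunks, latent_chunks, step)
    print(step, prefix)
# step 0: ['t1', 't2', 't3']  -> pure Stage-1 reconstruction
# step 3: ['z1', 'z2', 'z3']  -> fully latent prefix (final step)
```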

Quantization: Making It Practical

Even with the compression from 128 text tokens to 15 latent tokens, storing high-precision floating-point vectors for every memory adds up. NextMem applies NF4 (4-bit NormalFloat) quantization to the latent representations, dropping each value from full precision to just 4 bits.

The results here are striking: quantization causes negligible performance loss. Their sparse (quantized) model scores almost identically to the dense version across all metrics. This makes sense given the robustness experiments — they found that adding Gaussian noise with standard deviation up to 0.8 barely affects reconstruction quality. The latent space is robust enough that aggressive quantization doesn't destroy the signal.
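
A simplified sketch of NF4-style quantization: values are scaled by the block's absolute maximum, then snapped to the nearest of 16 fixed levels. The codebook below approximates the NF4 quantiles from the QLoRA paper; real implementations such as bitsandbytes additionally pack two indices per byte and quantize the per-block scales:

```python
import numpy as np

# Approximate NF4 codebook: 16 levels spaced by normal-distribution quantiles.
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(x):
    scale = np.abs(x).max()                     # per-block absmax scaling
    idx = np.abs(x / scale - NF4[:, None]).argmin(axis=0)
    return idx.astype(np.uint8), scale          # 4-bit indices + one fp scale

def nf4_dequantize(idx, scale):
    return NF4[idx] * scale

rng = np.random.default_rng(0)
latent = rng.normal(size=4096)                  # one latent token (dimension assumed)
idx, scale = nf4_quantize(latent)
recon = nf4_dequantize(idx, scale)
err = np.abs(latent - recon).mean()
print(f"mean abs error: {err:.3f}")             # small relative to the value range
```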

The Results

NextMem was evaluated across three tasks that map to the core memory operations of an agent system:

Factual Reconstruction (Memory Storage): Encode text, decode it back, measure how close the output is to the original. NextMem-Dense hits F1 scores of 0.83–0.86 across SQuAD, HotpotQA, RACE, LoCoMo, and LongMemEval. The quantized version (NextMem-Sparse) matches almost exactly. Both significantly outperform baselines like ICAE (context compression) and DyPRAG (parametric knowledge injection).
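
The F1 here is presumably the standard token-overlap F1 (SQuAD-style) between the reconstruction and the original text; a minimal implementation under that assumption:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between two strings."""
    pred, ref = prediction.split(), reference.split()
    common = Counter(pred) & Counter(ref)       # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "the cat sat on a mat"))  # ~0.833
```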

Contextual Generation (Memory Utilization): Give the model latent memories instead of raw text and ask it to answer questions. In the decompression setting (decode latents back to text, then answer), NextMem outperforms all baselines. In the direct compression setting (answer directly from latent vectors), ICAE has an edge — suggesting NextMem's latent space is optimized more for faithful storage than for direct reasoning.

Dense Passage Retrieval (Memory Retrieval): Use the latent representations as retrieval embeddings. Pool them into a single vector, compute cosine similarity with query embeddings, rank results. NextMem substantially outperforms other reconstruction-capable models and approaches the quality of BGE (a dedicated retrieval embedding model that can't reconstruct).
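
The retrieval path can be sketched in a few lines. Mean-pooling is my assumption (the pooling method isn't specified here), and toy vectors stand in for real latents and query embeddings:

```python
import numpy as np

def pool(latents):
    # Collapse the 15 latent tokens into one retrieval embedding.
    # (Mean-pooling is an assumption; the pooling method is unspecified.)
    return latents.mean(axis=0)

def rank_memories(query_emb, memory_latents):
    scores = []
    for latents in memory_latents:
        m = pool(latents)
        cos = (query_emb @ m) / (np.linalg.norm(query_emb) * np.linalg.norm(m))
        scores.append(cos)
    return np.argsort(scores)[::-1]             # best match first

rng = np.random.default_rng(0)
query = rng.normal(size=64)
memories = [rng.normal(size=(15, 64)) for _ in range(100)]
ranking = rank_memories(query, memories)
print(ranking[:3])                              # indices of the top-3 memories
```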

This last result is worth highlighting: the same latent representation that stores and reconstructs factual content also works as a retrieval index. That's a meaningful architectural simplification — you don't need separate systems for memory storage and memory retrieval.

The Semantic Structure Inside

One of the more interesting findings is the semantic assignment analysis. The researchers took a paragraph with eight sentences, systematically swapped entities in each sentence, and measured which latent tokens changed the most. The result shows a clear diagonal pattern — latent token 1 primarily encodes information from the first part of the text, token 2 from the second part, and so on.

This ordered spatial-semantic mapping is significant because it means the latent space isn't just a jumbled bag of features. The information is stored in a structured way that mirrors the sequential order of the original text. The researchers note this property could enable fine-grained memory editing — changing specific facts in the latent representation without re-encoding the entire text.
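
The probing method itself is simple to sketch: encode the original paragraph, encode a copy with one sentence's entities swapped, and measure how much each latent token moved. The encoder below is a toy stand-in (the real one is autoregressive); only the measurement is illustrated:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d, n_latents = 64, 32, 15
P = rng.normal(size=(n_latents, seq_len))

def toy_encode(emb):
    # Stand-in encoder: (15, seq_len) @ (seq_len, d) -> (15, d).
    # (NextMem encodes autoregressively; this only illustrates the probe.)
    return P @ emb

base = rng.normal(size=(seq_len, d))
perturbed = base.copy()
perturbed[5] += rng.normal(size=d)        # "swap an entity" early in the text

# Per-latent sensitivity: L2 change of each latent token under the edit.
delta = np.linalg.norm(toy_encode(base) - toy_encode(perturbed), axis=1)
print(delta.round(2))
```

In the paper's analysis, plotting this sensitivity for edits at each position is what reveals the diagonal pattern.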

Limitations and Open Questions

NextMem isn't a solved problem. A few clear limitations emerge from the paper:

  • Compression ratio vs. quality trade-off. Performance degrades as input length increases relative to the number of latent tokens. At very high compression ratios, the model starts hallucinating filler content. They show graceful degradation, but there's a ceiling.
  • Direct utilization gap. The latent representations work well when decompressed back to text, but underperform ICAE when used directly for inference. There's a tension between optimizing for faithful reconstruction and optimizing for direct reasoning from compressed representations.
  • Scale. Experiments use Qwen3-8B as the backbone and 128-token chunks. How this behaves at larger model scales and with longer documents is an open question.
  • Integration. The framework is evaluated in isolation. Plugging it into a full agent loop with dynamic encoding, retrieval, and multi-step reasoning would be the real test.

Why This Matters for AI Agents

The context window is the fundamental bottleneck for long-running AI agents. Every token spent on memory retrieval is a token not available for reasoning, planning, or tool use. Current approaches — RAG, summarization, sliding windows — are all approximations that trade fidelity for efficiency.

NextMem offers a different trade-off: compress the information into a form that's both compact and reversible. An agent could maintain a library of quantized latent memories, retrieve the relevant ones via the same embeddings, decompress only what's needed, and reclaim the context window space for actual work.

We're not there yet — the direct utilization gap means decompression is still required, adding latency. But the foundation is promising. If future work can close the gap between reconstruction quality and direct reasoning from latent representations, this could fundamentally change how agent memory systems are architected.

The Bottom Line

NextMem tackles a real infrastructure problem for AI agents: how do you store detailed factual memories without drowning in context tokens or destroying old knowledge? The autoregressive autoencoder approach — compress to 15 latent tokens, reconstruct with 0.85+ F1, retrieve with the same embeddings — is elegant and practical. The progressive training curriculum is the key innovation that makes it work. The direct utilization gap is a clear next step, but even in its current decompression-required form, this is a meaningful step toward agents that can remember more while thinking in less space.

Building AI agents that need to remember?

The OpenClaw Field Guide covers agent memory, context management, and production workflows — the practical guide for getting AI agents running reliably.

Get the Guide — $10 →