Every serious agent-memory project obsesses over what to store and how to retrieve it. Far fewer teams have a principled answer to the third question, which matters just as much: when should a memory be forgotten?

That gap shows up fast in production systems. Memory stores accumulate tool outputs, user preferences, retrieved documents, and older reasoning traces. Some of that context stays useful. Some of it goes stale, becomes misleading, or was only marginally helpful in the first place. Without a governance layer, retrieval quality quietly degrades and the agent starts anchoring on context that no longer reflects reality.

A recent arXiv preprint by Baris Simsek, When to Forget: A Memory Governance Primitive (arXiv:2604.12007), takes a direct shot at that problem. It introduces Memory Worth, or MW, a lightweight per-memory signal meant to estimate whether a memory is helping successful outcomes or showing up mostly when the system fails.

Important framing: this is a preprint, not peer-reviewed work, and the main validation is synthetic. The idea is promising, but the evidence is still early.

The forgetting problem few systems address directly

Most memory work in agent systems lives on the write path and the read path. Teams debate chunking, embeddings, metadata, rerankers, and retrieval thresholds. The governance layer, deciding which memories are still earning their keep, is often manual, ad hoc, or absent altogether.

That is a real operational weakness. In a top-k retrieval system, a stale memory can still look semantically relevant and win a slot, even when its content is outdated or consistently associated with bad outcomes. Similarity alone cannot tell you whether a memory is actually useful.

What Memory Worth actually is

MW is intentionally simple. For each memory, the system tracks two counters:

  • hits+, how often that memory was retrieved during episodes that ended in success.
  • hits−, how often it was retrieved during episodes that ended in failure.

The score is then:

MW(m) = hits+ / (hits+ + hits−)

No learned model. No extra embedding computation. Just two integers and a ratio.

The intuition is straightforward. Memories that show up mostly in successful episodes trend toward 1.0. Memories that show up mostly in failures trend toward 0.0. Memories that appear across both sit closer to the base rate.
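The two counters and the ratio are simple enough to sketch directly. The class and method names below are illustrative, not from the paper, and the optional pseudo-count smoothing is an assumption we add to stabilize scores for rarely retrieved memories (the bare formula corresponds to `prior_hits = 0`):

```python
from collections import defaultdict


class MemoryWorthTracker:
    """Minimal sketch of MW bookkeeping; names are illustrative."""

    def __init__(self, prior_hits: float = 0.0):
        # prior_hits adds optional pseudo-counts; 0 matches the raw formula.
        self.prior = prior_hits
        self.hits_pos = defaultdict(int)  # retrievals in successful episodes
        self.hits_neg = defaultdict(int)  # retrievals in failed episodes

    def record_episode(self, retrieved_ids, success: bool):
        # Credit every memory retrieved during this episode with its outcome.
        counter = self.hits_pos if success else self.hits_neg
        for mem_id in retrieved_ids:
            counter[mem_id] += 1

    def mw(self, mem_id) -> float:
        pos = self.hits_pos[mem_id] + self.prior
        neg = self.hits_neg[mem_id] + self.prior
        total = pos + neg
        # A never-retrieved memory sits at the uninformative midpoint.
        return pos / total if total > 0 else 0.5
```

A memory retrieved only in successes scores 1.0; one split evenly between successes and failures scores 0.5, exactly the behavior the intuition above describes.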

Why the metric matters

What makes MW interesting is that it measures empirical outcome association, not semantic similarity. That lets it catch a failure mode ordinary retrieval scoring cannot: a memory can be on-topic and still be harmful if it is stale, misleading, or attached to workflows that no longer work.

That makes the practical uses pretty clear:

  • Staleness detection, where low-MW memories get flagged for review.
  • Retrieval suppression, where harmful memories are down-ranked or excluded without immediate deletion.
  • Deprecation decisions, where persistently weak memories are archived or removed.
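Retrieval suppression is the easiest of these to prototype. The sketch below blends similarity with MW and excludes low-MW memories only once they have enough outcome evidence; the function name, thresholds, and blending weight are all illustrative assumptions, not anything the paper specifies:

```python
def rerank_with_mw(candidates, mw_scores, evidence_counts=None,
                   suppress_below=0.2, min_evidence=10, alpha=0.5):
    """Re-rank (mem_id, similarity) pairs using MW.

    Memories with enough outcome evidence and very low MW are excluded
    (suppressed, not deleted); the rest are scored by a blend of
    similarity and MW. All thresholds here are illustrative.
    """
    evidence_counts = evidence_counts or {}
    kept = []
    for mem_id, sim in candidates:
        mw = mw_scores.get(mem_id, 0.5)      # unseen memories get the midpoint
        n = evidence_counts.get(mem_id, 0)
        if n >= min_evidence and mw < suppress_below:
            continue                          # suppress, don't delete
        kept.append((mem_id, (1 - alpha) * sim + alpha * mw))
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

The `min_evidence` guard matters: without it, a memory that happened to be retrieved once during a failure would be suppressed on a single data point.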

What the experiments show

The main evaluation happens in a synthetic environment where each memory has a known ground-truth utility. That lets the paper measure whether MW tracks true usefulness instead of just sounding plausible.

The headline result is strong in that controlled setting. After 10,000 episodes across 20 random seeds, MW reaches a Spearman rank correlation of 0.89 ± 0.02 with true memory utility. A baseline with no counter updates stays at 0.00, as expected.

The paper also compares different ways of assigning retrieval credit, including uniform weighting, score-proportional weighting, and an oracle scheme. All three converge to similar long-run performance, which suggests the signal is not hypersensitive to the exact credit assignment rule.

There is also a smaller retrieval-realistic experiment using all-MiniLM-L6-v2 embeddings. In that setup, a deliberately stale memory ends up at MW = 0.17, while a genuinely useful specialist memory scores MW = 0.77. That is the kind of separation practitioners actually want, even if the experiment is still limited in scale.

What the paper does not prove

The paper is careful about its limitations here, and builders should be too.

  • MW is associational, not causal. A memory with high MW co-occurs with success. That does not mean it caused the success.
  • The convergence result depends on non-trivial assumptions. The paper's A1-A6 assumptions include conditional independence, stationarity, and retrieval behavior that real deployments often violate.
  • The evidence is mostly synthetic. The controlled environment is useful for isolating signal, but it is not deployment-scale proof.
  • This is still a preprint. The results should be treated with appropriate caution.

Practical takeaways for agent builders

The good news is that MW is cheap to test. If your system already logs which memories were retrieved and whether the episode succeeded or failed, you can layer this signal on top of your existing memory stack without redesigning retrieval.

The sensible first use is not hard deletion. It is review and suppression. Let MW help you find memories that may have gone stale, then confirm with human oversight or additional policy checks.

If your environment shifts quickly, you should also assume the raw counters may lag reality. Windowed counters or decay-weighted versions are obvious extensions for non-stationary domains, even though the paper does not validate them directly.
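A decay-weighted variant is simple to sketch. This is our extension, not something the paper validates: each episode exponentially decays all counters before crediting the retrieved memories, so recent outcomes dominate. Setting `decay=1.0` recovers the raw counters:

```python
class DecayingMW:
    """Exponentially decayed MW counters for non-stationary domains.

    An illustrative extension, not from the paper: older episodes
    contribute less, so MW tracks recent outcome association.
    """

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.pos = {}
        self.neg = {}

    def record_episode(self, retrieved_ids, success: bool):
        # Decay every counter once per episode, then credit this episode.
        for table in (self.pos, self.neg):
            for k in table:
                table[k] *= self.decay
        target = self.pos if success else self.neg
        for mem_id in retrieved_ids:
            target[mem_id] = target.get(mem_id, 0.0) + 1.0

    def mw(self, mem_id) -> float:
        p = self.pos.get(mem_id, 0.0)
        n = self.neg.get(mem_id, 0.0)
        return p / (p + n) if (p + n) > 0 else 0.5
```

For a memory that was useful early but turns harmful, the decayed score falls well below what the raw counters would report, which is exactly the lag problem this variant addresses.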

Our take

Memory governance is a real missing layer in agent infrastructure. Most teams discover it only after the memory store grows large enough that retrieval quality starts slipping. Memory Worth is a clean, minimal proposal for that layer: two counters, one interpretable score, and a practical path to staleness detection.

The caveats matter. Synthetic validation is not production validation, and associational signals can mislead. But the framing is right, and the implementation cost is low enough that many teams should at least prototype it.

If you are building agent memory, this paper is worth thirty minutes. Not because it closes the problem, but because it gives the governance conversation a concrete primitive to build around.

Building stateful AI systems?

Our OpenClaw field guide covers memory architecture, retrieval design, and production safeguards for agents that need to stay useful over time.

Get the Field Guide — $10 →