If your AI agent remembers things between sessions, it has a memory system. And if that memory system is cloud-based, it has a target on its back. A February 2026 paper from Varun Pratap Bhardwaj — "Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning" — lays out the problem and proposes a concrete, open-source fix.

We read the full paper. Here's what matters, what works, and where it falls short.

The Problem: Persistent Memory, Persistent Vulnerabilities

AI agents are getting memory. Claude, ChatGPT, and Gemini all let their models retain information across conversations. Third-party systems like Mem0, MemOS, and Letta provide memory-as-a-service for agent frameworks. This memory makes agents dramatically more useful — they learn your preferences, remember project context, and build on prior decisions.

It also creates a new attack surface.

The OWASP Top 10 for Agentic AI (published 2025) flags memory poisoning as threat ASI06 — one of the ten most critical risks facing deployed AI agents. Unlike prompt injection, which dies when the conversation ends, poisoned memories persist. They influence every future decision the agent makes.

This isn't theoretical. The paper cites three real-world attacks:

  • The Gemini Memory Exploit — delayed tool invocation that persisted malicious instructions across sessions
  • Calendar invite poisoning — a 73% success rate across 14 tested scenarios, rated high-to-critical severity
  • The Lakera "sleeper agent" injection — agents developed persistent false beliefs about security policies after targeted memory manipulation

All three worked against production systems with real users.

Why Cloud Memory Makes It Worse

The paper argues that cloud-based memory architectures amplify the risk in four ways:

  1. Multi-tenant exposure. Shared infrastructure means one compromised agent's poisoned memories can leak to other users on the same platform.
  2. Network exposure. Memory content travels over the wire, where it's vulnerable even with TLS (compromised infra, certificate attacks).
  3. Opaque provenance. You can't independently verify who wrote what to your agent's memory. The cloud provider controls the audit logs.
  4. Vendor lock-in. You can't export and independently verify memory integrity. If something looks wrong, you're stuck with the provider's tools.

The proposed solution is architectural: keep everything local.

SuperLocalMemory: The Architecture

SuperLocalMemory is a four-layer stack. Each layer adds capability on top of the previous one, and if any layer fails, the system degrades gracefully to the layer below it.

Layer 1: Storage Engine

SQLite with FTS5 full-text search. WAL (Write-Ahead Logging) for concurrent read access, a serialized write queue, and connection pooling. Each memory record stores content, tags, importance score, timestamps, and an optional entity vector. That's it — zero external dependencies for the base layer.
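A minimal sketch of what this layer looks like in plain `sqlite3` — the table and column names here are illustrative, not SuperLocalMemory's actual schema:

```python
import sqlite3

# Illustrative base-layer sketch; schema names are our invention.
conn = sqlite3.connect(":memory:")  # a file path would make WAL mode effective
conn.execute("PRAGMA journal_mode=WAL")  # on-disk: concurrent readers, one writer

conn.execute("""
    CREATE TABLE memories (
        id INTEGER PRIMARY KEY,
        content TEXT NOT NULL,
        tags TEXT,
        importance REAL DEFAULT 0.5,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# External-content FTS5 table kept in sync with the base table
conn.execute("""
    CREATE VIRTUAL TABLE memories_fts
    USING fts5(content, content=memories, content_rowid=id)
""")

conn.execute("INSERT INTO memories (content, tags) VALUES (?, ?)",
             ("Prefer pytest over unittest for this repo", "testing,python"))
conn.execute("INSERT INTO memories_fts (rowid, content) "
             "SELECT id, content FROM memories")
conn.commit()

hits = conn.execute(
    "SELECT rowid FROM memories_fts WHERE memories_fts MATCH 'pytest'"
).fetchall()
print(hits)  # [(1,)]
```

Everything above is standard library, which is consistent with the paper's zero-dependency claim for the base layer.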

Layer 2: Hierarchical Index

A materialized path scheme for parent-child relationships between memories. Parent lookup is O(1), and retrieving "this memory and all its sub-memories" is a single prefix match on the stored path. Think: project-scoped memory trees.
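A toy sketch of the materialized-path idea, assuming each memory stores its full ancestor path (the path strings and contents are invented for illustration):

```python
# Each memory carries its full path; a subtree query is a prefix match.
memories = {
    1: {"path": "/proj/",              "content": "project root"},
    2: {"path": "/proj/backend/",      "content": "API design notes"},
    3: {"path": "/proj/backend/auth/", "content": "JWT rotation policy"},
    4: {"path": "/other/",             "content": "unrelated"},
}

def subtree(root_path):
    """Return a memory and all its sub-memories via one prefix match."""
    return [m["content"] for m in memories.values()
            if m["path"].startswith(root_path)]

print(subtree("/proj/backend/"))  # ['API design notes', 'JWT rotation policy']
```

In SQLite the same query is `WHERE path LIKE '/proj/backend/%'`, which an index on `path` can serve efficiently.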

Layer 3: Knowledge Graph

TF-IDF key-term extraction, pairwise cosine similarity for edges (threshold >0.3), and Leiden algorithm community detection with subclustering to depth 3. The brute-force edge computation is O(n²) — the system caps graph construction at 10,000 memories and includes an optional HNSW index to bring it down to O(n log n).
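The pairwise edge pass can be mimicked in pure stdlib Python (the paper's optional layer uses scikit-learn for TF-IDF and python-igraph + leidenalg for clustering; this sketch reproduces only the edge computation, with smoothed IDF and invented documents):

```python
import math
from itertools import combinations

docs = [
    "sqlite fts5 full text search",
    "sqlite fts5 search ranking",
    "bayesian trust model",
]
tokenized = [d.split() for d in docs]
n = len(docs)

# Document frequency per term
df = {}
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tfidf(toks):
    """L2-normalized TF-IDF vector with smoothed IDF."""
    vec = {}
    for t in toks:
        idf = math.log((1 + n) / (1 + df[t])) + 1
        vec[t] = vec.get(t, 0.0) + idf
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

vecs = [tfidf(toks) for toks in tokenized]

EDGE_THRESHOLD = 0.3  # from the paper: keep edges with similarity > 0.3
edges = [(i, j) for i, j in combinations(range(n), 2)
         if cosine(vecs[i], vecs[j]) > EDGE_THRESHOLD]
print(edges)  # [(0, 1)]
```

This is exactly the O(n²) pass the paper caps at 10,000 memories; the optional HNSW index replaces the exhaustive `combinations` loop with approximate nearest-neighbor lookups.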

Layer 4: Pattern Learning

A Beta-Binomial Bayesian model tracks user preferences across 8 technology categories. Confidence is clamped to [0, 0.95] to prevent overconfidence on limited data. No LLM calls required — this is pure statistical learning.
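The core of a Beta-Binomial tracker is a few lines. This sketch assumes a uniform Beta(1, 1) prior and treats "successes" as observed uses of a technology; the 0.95 cap is the paper's, the rest is our illustration:

```python
def preference_confidence(successes, failures, cap=0.95):
    """Posterior mean of a Beta(1+s, 1+f) model, clamped to [0, cap]."""
    alpha = 1 + successes   # uniform Beta(1, 1) prior
    beta = 1 + failures
    mean = alpha / (alpha + beta)
    return min(mean, cap)

print(preference_confidence(2, 0))    # 0.75  -- few observations, modest confidence
print(preference_confidence(200, 0))  # 0.95  -- clamped: never fully certain
```

The clamp is the whole point: two observations yield a cautious 0.75, and even hundreds of consistent signals can never push confidence to 1.0.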

Key design decision: The entire base system runs on Python's standard library (sqlite3, json, hashlib, re, datetime). Zero pip installs for core operation. Optional layers add scikit-learn for TF-IDF and python-igraph + leidenalg for graph clustering.

The Trust Scoring Framework

This is the paper's core contribution. Every agent interacting with the memory system gets a trust score, starting at 1.0. Trust evolves based on behavioral signals with a decay coefficient that makes early signals count more while accumulated history resists rapid change.

The key design choice: negative signals carry larger magnitude than positive ones. Trust is harder to earn than to lose.

  • Positive signals: verified recall (+0.015), consistent writes (+0.01), low error rate (+0.02)
  • Negative signals: contradictory writes (-0.02), flagged content (-0.03), anomalous burst activity (-0.025)

When an agent's trust drops below a configurable threshold (default 0.3), it gets blocked from write and delete operations entirely. Not soft-blocked — hard enforcement.
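A minimal sketch of the asymmetric update and hard enforcement. The signal magnitudes and the 0.3 threshold are the paper's; the exact decay form is our assumption (the paper only says early signals count more while accumulated history resists rapid change):

```python
SIGNALS = {
    "verified_recall":     +0.015,
    "consistent_write":    +0.01,
    "low_error_rate":      +0.02,
    "contradictory_write": -0.02,
    "flagged_content":     -0.03,
    "anomalous_burst":     -0.025,
}
WRITE_THRESHOLD = 0.3  # below this, writes and deletes are hard-blocked

class AgentTrust:
    def __init__(self):
        self.score = 1.0   # every agent starts fully trusted
        self.history = 0

    def update(self, signal):
        # Assumed decay: signals shrink as history accumulates
        decay = 1.0 / (1.0 + 0.01 * self.history)
        self.score = max(0.0, min(1.0, self.score + SIGNALS[signal] * decay))
        self.history += 1

    def can_write(self):
        return self.score >= WRITE_THRESHOLD

agent = AgentTrust()
for _ in range(30):
    agent.update("flagged_content")
print(agent.can_write())  # False -- sustained flagged writes cross the threshold
```

Note the asymmetry baked into the table: the worst negative signal (-0.03) is twice the best positive one (+0.015), so recovering lost trust takes far longer than losing it.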

Every memory also records full provenance: who created it, what protocol they used (MCP, CLI, REST), what their trust score was at write time, and a complete modification chain with timestamps and agent IDs. You can forensically isolate every memory touched by a specific agent.
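The provenance record might look something like this — field names are our sketch of what the paper describes (creator, protocol, trust at write time, modification chain), not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Provenance:
    created_by: str          # agent ID
    protocol: str            # "MCP", "CLI", or "REST"
    trust_at_write: float
    modifications: list = field(default_factory=list)  # (timestamp, agent_id)

rec = Provenance("agent-7", "MCP", 0.91)
rec.modifications.append(("2026-02-14T10:02:00Z", "agent-7"))

def touched_by(records, agent_id):
    """Forensic isolation: every memory a given agent created or modified."""
    return [r for r in records
            if r.created_by == agent_id
            or any(a == agent_id for _, a in r.modifications)]
```

With a chain like this, quarantining everything a compromised agent touched becomes a single filter rather than a log-archaeology exercise.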

Defending Against Sleeper Agents

The hardest attack to detect: an agent behaves perfectly for weeks, builds up high trust, then starts injecting poisoned memories. The trust model handles this through the Beta posterior — early good behavior gets absorbed into the α parameter, but accumulated negative signals during the poisoning phase grow β until the posterior mean collapses.

In evaluation: 72.4% trust degradation in the sleeper scenario (trust dropped from 0.902 to 0.249), crossing the enforcement threshold. The attacker gets locked out.
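The mechanics of the collapse are easy to see in miniature. The counts and the 3x negative weighting below are illustrative assumptions, not the paper's exact parameters — the point is only that a large α cannot outrun a rapidly growing β:

```python
def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

alpha, beta = 1.0, 1.0  # uniform prior

# Phase 1: weeks of good behavior inflate alpha
alpha += 60
print(round(posterior_mean(alpha, beta), 3))  # 0.984 -- looks trustworthy

# Phase 2: poisoning begins; negative signals weighted 3x
# (asymmetric penalty; exact weight is our assumption)
beta += 3 * 60
print(round(posterior_mean(alpha, beta), 3))  # 0.252 -- below the 0.3 threshold
```

Banked goodwill delays the collapse but cannot prevent it, which is exactly the property you want against a sleeper.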

Adaptive Re-Ranking: Learning Without LLMs

The paper identifies an honest problem with its own system: while the knowledge graph and pattern layers add structural value, they don't actually improve search ranking. The base FTS5 search already achieves 0.90 MRR (the first relevant result lands at rank 1 for roughly 90% of queries), and adding layers 3-4 doesn't move that number.

Their solution is an adaptive learning-to-rank layer that re-ranks search results based on learned user preferences — without any LLM inference calls. It works in three phases:

  1. Phase 0 (baseline): Under 20 feedback signals — results returned unchanged. No risk of degradation.
  2. Phase 1 (rule-based): 20-199 signals — deterministic boost multipliers based on a 9-dimensional feature vector (BM25 score, TF-IDF similarity, technology match, project context, workflow fit, source quality, importance, recency, access frequency).
  3. Phase 2 (ML): 200+ signals across 50+ unique queries — a gradient boosted decision tree trained with LambdaRank on real feedback data.
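The three-phase fallback amounts to a simple dispatch. The thresholds below are the paper's; the function bodies are placeholders for the real boost logic and trained model:

```python
def rerank(results, n_signals, n_queries):
    if n_signals < 20:
        return results                    # Phase 0: pass-through baseline
    if n_signals < 200 or n_queries < 50:
        return rule_based_boost(results)  # Phase 1: deterministic multipliers
    return lambdarank_model(results)      # Phase 2: GBDT trained w/ LambdaRank

def rule_based_boost(results):
    # Placeholder: the real version applies boost multipliers derived from
    # the 9-dimensional feature vector
    return sorted(results, key=lambda r: r["score"] * r.get("boost", 1.0),
                  reverse=True)

def lambdarank_model(results):
    return results  # placeholder for the trained ranker

baseline = [{"score": 0.4, "boost": 2.0}, {"score": 0.6}]
print(rerank(baseline, 5, 1))    # too few signals: unchanged
print(rerank(baseline, 50, 10))  # Phase 1: boosted item moves to the front
```

The Phase 0 pass-through is the safety property worth copying: with insufficient feedback, the system provably cannot make ranking worse than the baseline.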

The result: 104% improvement in NDCG@5 with rule-based re-ranking, at a cost of 20ms additional latency. The system learns what you care about and surfaces it first.

The Numbers

| Metric | Result | Notes |
| --- | --- | --- |
| Median search latency | 10.6ms | At 100 memories (typical personal DB) |
| Storage efficiency | 1.4KB/memory | At scale (10K memories = 13.6MB) |
| Concurrency | 0 errors | Under 10 simultaneous agents |
| Trust separation gap | 0.90 | Between benign and malicious agents |
| Sleeper attack detection | 72.4% degradation | Trust 0.902 → 0.249 |
| NDCG@5 improvement | +104% | With adaptive re-ranking enabled |
| MRR (human-judged pilot) | 0.70 | 20 queries, 70 relevance judgments |
| Peak write throughput | 220 writes/sec | At 2 concurrent agents |

One thing worth noting: the NDCG evaluation has a circularity issue the authors openly acknowledge. The relevance labels used for scoring are derived from the system's own importance scores, which the adaptive ranker has access to as a feature. They partially address this with a human pilot study (MRR 0.70, NDCG@5 0.90 from a real developer), but it's a single user with 182 memories. Not exactly a large-scale validation.

What's Missing

The paper is refreshingly honest about its limitations. Here's what stood out to us:

  • Cold start is real. Adaptive re-ranking needs 20+ feedback signals to even start. The ML phase needs 200+ signals across 50+ queries. Until then, you're running on base FTS5 search — which is decent (0.90 MRR) but not personalized.
  • Single-user pilot. The human evaluation covers one developer over 3 months. That's a proof of concept, not a validation study.
  • SQLite write scaling. At 10 concurrent writing agents, throughput drops to 25 ops/sec with P95 latency hitting 754ms. The sweet spot is 1-2 writers. For larger multi-agent setups, this is a bottleneck.
  • No standard benchmarks. The system hasn't been tested on LoCoMo or other established memory benchmarks. The authors argue their use case (developer workflow memory) is fundamentally different from conversational memory — a fair point, but it makes comparison harder.
  • Graph construction at scale. 5,000 memories takes 4.6 minutes for a full graph build. The 10K cap exists for a reason.
  • Trust doesn't feed ranking. Trust scores block low-trust agents from writing, but they don't influence search ranking. A memory written by a highly-trusted agent ranks the same as one from a barely-trusted agent. The authors flag this as future work.

Why This Matters for Production Agents

If you're running AI agents that persist memory — and in 2026, most serious agent deployments do — you should be thinking about memory security. The current landscape is mostly "trust the cloud provider." That works until it doesn't.

SuperLocalMemory's approach is opinionated: local-first, zero cloud, full provenance, hard trust enforcement. That trades convenience for security. You lose cross-device sync, you lose managed infrastructure, and you lose the ecosystem effects of cloud platforms. You gain auditability, isolation, and defense against an attack class that most memory systems don't even acknowledge.

The Bayesian trust model is the most practical contribution here. The idea that agents should earn write access through consistent behavior, with asymmetric penalties for suspicious activity, is something any memory system could adopt — cloud or local. The provenance chain (who wrote what, when, with what trust level) should be table stakes for production memory.

The Verdict

SuperLocalMemory is a solid first step toward trust-defended AI memory. The architecture is clean, the threat model is grounded in real attacks, and the paper is unusually honest about what doesn't work yet. The 10.6ms search latency, zero-dependency core, and MCP integration with 17+ tools make it practical for developer workflows. The main gaps — limited user validation, SQLite write scaling, and no standard benchmarks — are solvable engineering problems, not fundamental flaws.

If you're building multi-agent systems and memory security matters to you, this is worth reading. The code is MIT-licensed on GitHub.

Building Secure AI Agent Systems?

Our OpenClaw Field Guide covers agent architecture, memory management, and security hardening for production deployments.

Get the Field Guide — $24 →