Memory Architecture for Agents
Published: 2026-02-23 · 7 min read
Most teams get memory wrong in the same direction: they add more of it. The agent starts forgetting things, so they expand the context window. It starts making errors on old information, so they add more retrieval. The system gets slower, less coherent, and more expensive — and the underlying problem is never fixed.
The problem isn't that the agent needs more memory. It's that the memory it has isn't structured. Context rot, noisy recall, and constraint drift are three distinct failure modes. Each one has a specific fix. Throwing more context at all three makes all three worse.
The Three Failure Modes (Named)
arXiv 2601.11653 (January 2026) gives these failure modes formal names and defines them precisely. Worth quoting directly:
"As interactions grow, agent behavior often degrades due to loss of constraint focus, error accumulation, and memory-induced drift. These approaches introduce unbounded context growth and are vulnerable to noisy recall and memory poisoning, leading to unstable behavior and increased drift."
The three named failure modes:
- Context rot — expanding the context window without management degrades performance non-linearly. Loading everything the agent might ever need into context worked in demos; under production workloads it collapsed: retrieval became expensive, earlier instructions degraded, and costs compounded.
- Noisy recall / memory poisoning — when agents retrieve memories by vector similarity, they surface contextually similar but factually irrelevant content. The more memories you accumulate, the worse the signal-to-noise ratio. The agent starts reasoning from retrieved content that sounds relevant but isn't.
- Constraint drift — over long interactions, agents gradually lose track of behavioral rules loaded earlier in the session. Not because they forgot the rules — because new context pushed them out. Rules that were followed at session start are violated by session end.
The dominant fix teams try — persistent memory via transcript replay — makes all three failure modes worse. More context means more rot, more noise, more drift. The solution is architecture, not volume.
What the Benchmarks Say
Three memory architectures have accumulated enough production data to compare:
Vector store (retrieval by similarity) — fast, scalable, surface-level recall. Finds things that sound related, not necessarily things that are related. Best for large knowledge bases where fuzzy retrieval is acceptable.
Summarization (rolling compression) — periodic condensation of conversation history into structured summaries. Cheaper, more stable, but lossy. Best for operational context where recency matters more than completeness.
Knowledge graph (structured relationships) — memories stored as typed nodes and relationships: people, decisions, events, commitments, time. "Who said what about whom and when." Most expensive to maintain, highest accuracy over long horizons.
Production benchmarks from Zep's temporal knowledge graph deployment: 18.5% accuracy improvement on long-horizon tasks, 90% latency reduction compared to raw retrieval. Mem0's structured summarization approach shows a 26% accuracy gain over unstructured transcript replay with significant token cost reduction.
Separate from both, a finding from the ClawVault architecture study is worth noting: plain markdown files with typed metadata (74.0% task accuracy) outperformed specialized memory tooling (68.5% accuracy) on the LoCoMo long-context benchmark. LLMs are already trained on markdown. They know how to work with it. Complex retrieval infrastructure isn't always the answer.
The Architecture That Works in Practice
The system we run and recommend to clients is a three-tier structure. Not three tools — three tiers with different jobs:
Tier 1: Always-loaded memory. A single distilled file — hard-capped at 3,500 characters — containing stable facts, active project summaries, critical decisions with rationale, and standing preferences. It loads first in every session, unconditionally, and it never grows unbounded: anything that would push it over the cap gets demoted to Tier 2 or archived.
Tier 2: Typed memory files. Structured notes organized by type: decisions, people, lessons, commitments, preferences, projects. Each file has YAML frontmatter (date, category, priority tag). A vault index file — a single document with every note plus a one-line description — lets the agent scan the full memory catalog without loading everything into context. The agent reads the index first, then fetches only what's relevant.
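As a concrete illustration, a Tier 2 decision note might look like the following. The field names and values are illustrative, not a fixed schema — the point is machine-readable metadata on top of a human-readable note:

```markdown
---
date: 2026-02-10
category: decision
priority: red
---
# Move billing jobs to the async queue
Decided 2026-02-10 after two timeout incidents.
Rationale: synchronous billing calls blocked the main worker pool.
```

The corresponding vault index entry is one line: the file path plus a one-sentence description, so the agent can decide whether to fetch the full note without loading it.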
Tier 3: Semantic recall. Hybrid search (keyword + embedding) with reranking for long-tail retrieval. This is the fallback for queries the vault index can't resolve, not the default lookup. Most sessions don't touch it.
Priority tagging threads through all three tiers: 🔴 critical (decisions, commitments, blockers), 🟡 notable (insights, preferences, live context), 🟢 background (routine updates, low-signal). When loading context under a token budget, the agent loads red first, yellow next, green if there's room. Low-signal background notes never crowd out critical context.
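The priority-loading pass can be sketched in a few lines. This is a minimal illustration, assuming notes are dicts with a `priority` tag and a rough 4-characters-per-token estimate; a real loader would read the vault index and count tokens with the model's tokenizer:

```python
# Sketch: priority-ordered context loading under a token budget.
# Red notes load first, then yellow, then green, until the budget runs out.

PRIORITY_ORDER = ["red", "yellow", "green"]  # critical, notable, background

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return len(text) // 4

def load_context(notes: list[dict], budget: int) -> list[dict]:
    """Return the subset of notes to load, highest priority first."""
    loaded, used = [], 0
    for tier in PRIORITY_ORDER:
        for note in notes:
            if note["priority"] != tier:
                continue
            cost = estimate_tokens(note["body"])
            if used + cost <= budget:
                loaded.append(note)
                used += cost
    return loaded

notes = [
    {"priority": "green",  "body": "routine update " * 50},
    {"priority": "red",    "body": "critical decision " * 50},
    {"priority": "yellow", "body": "notable insight " * 50},
]
selected = load_context(notes, budget=450)  # red and yellow fit; green does not
```

Because the loop walks tiers in order, a large background note can never displace a small critical one — the property the tagging scheme exists to guarantee.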
Decay Rules
Memory without decay is an archive, not a working system. Information that hasn't been referenced in 30+ days should be archived or removed from active tiers. This isn't about pruning for its own sake — stale context actively drives bad decisions. An agent reasoning from a project status that was accurate three weeks ago will make confident errors.
The practical implementation: a weekly consolidation job that promotes high-signal observations from daily logs into typed memory files, demotes unreferenced notes to archive, and flags anything that contradicts current runtime state. The job doesn't require human review on every cycle — only on items flagged as contradictions.
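The demotion step of that job can be sketched as follows. This version uses file modification time as a stand-in for "last referenced" — an assumption; a production system would track reads in the vault index rather than relying on mtime:

```python
# Sketch of the decay pass: move typed notes untouched for 30+ days
# out of the active vault and into an archive directory.
import shutil
import time
from pathlib import Path

DECAY_DAYS = 30

def decay_pass(vault: Path, archive: Path) -> list[str]:
    """Archive stale notes; return the names of the files demoted."""
    archive.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - DECAY_DAYS * 86400
    demoted = []
    for note in vault.glob("*.md"):
        if note.stat().st_mtime < cutoff:  # proxy for "last referenced"
            shutil.move(str(note), archive / note.name)
            demoted.append(note.name)
    return demoted
```

Promotion (daily logs into typed files) and contradiction flagging would hang off the same weekly trigger; only the flagged contradictions go to a human.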
Memory Is Not Runtime Truth
The most important operational rule in any memory architecture is this: for any claim about system state, live checks beat memory every time.
Memory records what was true when it was written. Running processes, active cron jobs, model assignments, queue state — these can change between when a note was written and when it's read. An agent that reports status from memory without running a live check is reporting what used to be true. In an operational context, that's worse than saying nothing.
The rule: before making any claim about current system state, verify it live. If live verification isn't possible, the claim is "unconfirmed." Not "working." Not "fixed."
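The rule reduces to a small wrapper: every status claim either passes through a live check or gets labeled unconfirmed. The checker functions here are hypothetical placeholders — the point is that there is no code path that reports memory as current truth:

```python
# Sketch: never report system state from memory alone.
# A claim is "verified", "FALSE", or "unconfirmed" -- nothing else.
from typing import Callable, Optional

def report_state(claim: str, live_check: Optional[Callable[[], bool]]) -> str:
    """Run the live check if one exists; otherwise the claim is unconfirmed."""
    if live_check is None:
        return f"{claim}: unconfirmed (no live check available)"
    try:
        ok = live_check()
    except Exception:
        return f"{claim}: unconfirmed (live check failed)"
    return f"{claim}: {'verified' if ok else 'FALSE'}"
```

The "unconfirmed" branches are the whole design: a failed or missing check degrades to an honest unknown rather than to a stale answer.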
The Gap Most Teams Miss: Entity Relationships
Most agent memory systems have typed files — people, decisions, projects — but no machine-readable relationships between them. A people file knows about a person. A decisions file knows about a decision. Neither knows about the connection between them.
This becomes a real limitation at scale. "Find all decisions related to this project" requires either a full-scan search or a knowledge graph layer. Semantic search partially covers it — it can find things that sound related — but it can't traverse: "who was involved in this decision, and what commitments came out of that conversation?"
This is the long-horizon upgrade path: entity relationship mapping on top of typed memory files. It doesn't require abandoning markdown. The ClawVault pattern implements it with wiki-style links inside notes — [[entity-name]] — that a tool can resolve into a traversable graph. The underlying files stay human-readable.
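A minimal sketch of that resolution step, assuming the `[[entity-name]]` link syntax described above (the note contents and traversal depth are illustrative):

```python
# Sketch: extract [[wiki-style]] links from markdown notes and build
# a traversable graph, so "decision -> person -> commitment" becomes
# a two-hop query instead of a full-scan search.
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^\]]+)\]\]")

def build_graph(notes: dict[str, str]) -> dict[str, set[str]]:
    """notes maps note name -> markdown body; returns adjacency sets."""
    graph: dict[str, set[str]] = defaultdict(set)
    for name, body in notes.items():
        for target in LINK.findall(body):
            graph[name].add(target)
    return graph

def reachable(graph: dict[str, set[str]], start: str, depth: int = 2) -> set[str]:
    """Entities reachable from `start` within `depth` hops."""
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {t for n in frontier for t in graph.get(n, ())} - seen
        seen |= frontier
    return seen - {start}

notes = {
    "decision-async-queue": "Agreed with [[alice]] after the outage review.",
    "alice": "Owns [[commitment-migrate-billing]].",
}
graph = build_graph(notes)
```

The files stay plain markdown; the graph is derived, disposable, and rebuildable from the notes at any time.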
Automated Lesson Extraction
arXiv 2601.22758 (AutoRefine) formalizes a pattern that most teams do manually, if at all: extracting reusable expertise from agent execution histories. The research shows that agents whose lessons are drawn from their own failure patterns — not just manually filed observations — outperform agents whose knowledge base is manually curated.
The practical gap: if a task fails and the failure isn't surfaced in the conversation log, no lesson gets written. The agent hits the same wall next session. The fix is an automated extraction pass over daily logs: pattern detection that identifies failures, retries, and unexpected outcomes, and writes them to the lessons file without waiting for a human to flag them.
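The detection half of that pass can be sketched simply. The signal patterns below are illustrative guesses at what failure lines look like, not the AutoRefine method itself; the extracted candidates would then be rewritten into lesson notes:

```python
# Sketch: scan daily log lines for failure signals (errors, retries,
# unexpected outcomes) and surface them as lesson candidates.
import re

FAILURE_SIGNALS = [
    re.compile(r"\b(error|failed|timeout|traceback)\b", re.IGNORECASE),
    re.compile(r"\bretr(y|ied|ying)\b", re.IGNORECASE),
    re.compile(r"\bunexpected\b", re.IGNORECASE),
]

def extract_lesson_candidates(log_lines: list[str]) -> list[str]:
    """Return deduplicated log lines that look like failures or retries."""
    seen: set[str] = set()
    candidates = []
    for line in log_lines:
        if any(p.search(line) for p in FAILURE_SIGNALS) and line not in seen:
            seen.add(line)
            candidates.append(line)
    return candidates
```

Pattern matching is a coarse first filter; a second pass (human or model) decides which candidates become durable lessons.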
We haven't fully automated this yet. The architecture is in place; the trigger logic is on the roadmap. In the meantime, daily logs are reviewed manually for lesson candidates at the end of each session.
Bottom Line
Memory architecture is the most underbuilt part of most agent deployments. Teams spend weeks tuning prompts and zero hours designing how information persists, decays, and gets retrieved. The result is systems that work well in short sessions and degrade predictably over time.
The architecture above — tiered memory, typed files, vault index, priority loading, decay rules, and live-verification-over-memory — costs one week to implement and produces measurable improvements in session coherence, token efficiency, and behavioral reliability over long horizons. The research backs it. The production numbers back it.
Build the memory system before you need it. Retrofitting it after drift sets in is harder than building it right.
— Ridley Research & Consulting, February 2026