2026-04-28
What persistent AI memory actually requires (and why bolt-on solutions fail)
Most "memory layer" products you can buy today are wrappers around a single piece of persistent memory — usually a vector database with some embedding logic. The pitch is "drop this in front of any LLM and it'll remember." The result, in practice, is a memory that catches keyword matches and misses anything more nuanced than that.
This post is about the four moving parts that real persistent memory needs, and what goes wrong when one of them is missing or weak.
If you haven't covered the conceptual ground yet, Context window vs memory vs persistence is the prerequisite. The terms below assume you know the difference.
The four parts
- Capture — extract structured units from every conversation
- Storage — index them by semantics, time, entity, relationship
- Retrieval — find the relevant slice on demand
- Curation — keep the layer healthy as it grows
Each one is its own engineering surface. Each one fails differently when it's missing. Bolt-on memory products typically nail one or two and skip the rest, which is why they feel useful for a week and useless by month two.
1. Capture
The model said something. The user said something. What's the unit you save?
Naive capture: save the full conversation as text. Run an embedding over the whole thing. Hope retrieval can find the right pieces later.
This breaks for several reasons:
- Granularity. A 30-message conversation contains maybe 8 distinct facts, 3 decisions, 2 contradictions, and 17 turns of clarifying text. Embedding the whole conversation as one chunk averages all of that into noise.
- Trust asymmetry. The user's stated preferences are higher-confidence than the assistant's claimed inferences. If you store both with the same weight, the assistant's hedged guesses end up overruling what the user actually said.
- Type structure. A "user owns a Pajero" fact behaves differently from a "user is considering buying a Vanquish" preference. Treat them the same and recall blurs.
Real capture: a scribe agent that reads each exchange and extracts typed units — facts, preferences, decisions, findings, observations — with confidence scores, source pointers, and entity links. The scribe knows the difference between "user said X" and "assistant said X" and tags accordingly. The output is structured semantic units, not raw text.
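To make "typed units" concrete, here is a minimal sketch of what one extracted unit could look like. The class names, field names, and trust weights are assumptions for illustration, not Moss's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class UnitType(Enum):
    FACT = "fact"
    PREFERENCE = "preference"
    DECISION = "decision"
    FINDING = "finding"
    OBSERVATION = "observation"


class Speaker(Enum):
    USER = "user"
    ASSISTANT = "assistant"


# Assumed weights: what the user states outranks what the assistant infers.
SPEAKER_TRUST = {Speaker.USER: 1.0, Speaker.ASSISTANT: 0.5}


@dataclass(frozen=True)
class SemanticUnit:
    text: str
    unit_type: UnitType
    speaker: Speaker
    confidence: float              # extractor's own confidence, 0.0 to 1.0
    conversation_id: str           # source pointer: which conversation
    message_index: int             # source pointer: which turn
    entities: tuple[str, ...] = () # entity links

    @property
    def weight(self) -> float:
        """Confidence scaled by who said it."""
        return self.confidence * SPEAKER_TRUST[self.speaker]


owns = SemanticUnit(
    "User owns a Pajero", UnitType.FACT, Speaker.USER,
    0.9, "conv-17", 4, ("Pajero",),
)
guess = SemanticUnit(
    "User probably wants a Vanquish", UnitType.PREFERENCE, Speaker.ASSISTANT,
    0.9, "conv-17", 5, ("Vanquish",),
)
```

Both units carry the same extractor confidence, but `owns.weight` comes out higher than `guess.weight` because the speaker is recorded on the unit. That is the trust asymmetry from the list above, kept as data rather than lost in an embedding.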
This is the layer where bolt-on solutions typically stop. They embed the chat log and call it done. Granularity and type structure get lost in the embedding average.
2. Storage
Now you have units. How do you store them so retrieval can find what's needed?
Naive storage: vector embedding only. Embed each unit, store in pgvector or Pinecone, search by cosine similarity at query time.
This works for the easy cases (semantic match on stable facts) and fails for the hard ones:
- Time. "What did I say about X last month?" — pure semantic search has no concept of recency. The unit you said yesterday and the unit you said in 2024 score equally.
- Entity recall. "What do you remember about Sarah?" — semantic search returns chunks that mention Sarah. It can't distinguish "Sarah is my co-founder" (high signal) from "I had lunch with Sarah" (low signal in the context of a co-founder question).
- Cross-references. A decision in one conversation references a constraint set in another. Semantic similarity doesn't traverse references; it just finds nearby vectors.
Real storage: a knowledge graph where each unit is indexed by embedding, entity links, timestamp, source conversation, confidence, and relationship to other units. Postgres + pgvector for the embedding layer, plus relational tables for the graph structure. The query layer can reason across all of those axes.
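A rough sketch of the relational side of that layout follows. It uses Python's built-in sqlite3 as a stand-in for Postgres and leaves out the pgvector embedding column entirely. Every table name, column name, and sample row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE units (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    unit_type TEXT NOT NULL,
    confidence REAL NOT NULL,
    conversation_id TEXT NOT NULL,
    created_at TEXT NOT NULL          -- ISO 8601 date, sorts as text
);
CREATE TABLE unit_entities (
    unit_id INTEGER NOT NULL REFERENCES units(id),
    entity TEXT NOT NULL
);
CREATE TABLE unit_relations (
    from_unit INTEGER NOT NULL REFERENCES units(id),
    to_unit INTEGER NOT NULL REFERENCES units(id),
    relation TEXT NOT NULL            -- e.g. 'constrained_by'
);
CREATE INDEX idx_entity ON unit_entities(entity);
CREATE INDEX idx_created ON units(created_at);
""")

conn.executemany("INSERT INTO units VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Sarah is my co-founder", "fact", 0.95, "conv-1", "2025-11-02"),
    (2, "Had lunch with Sarah", "observation", 0.60, "conv-7", "2026-03-14"),
    (3, "Budget is capped", "fact", 0.90, "conv-3", "2026-01-10"),
    (4, "Decided to delay the launch", "decision", 0.85, "conv-9", "2026-04-01"),
])
conn.executemany("INSERT INTO unit_entities VALUES (?, ?)", [(1, "Sarah"), (2, "Sarah")])
conn.execute("INSERT INTO unit_relations VALUES (4, 3, 'constrained_by')")


def recall_entity(entity: str, since: str = "0000-01-01") -> list[str]:
    """Units linked to an entity, highest confidence first, optionally time-bounded."""
    rows = conn.execute(
        """SELECT u.text FROM units u
           JOIN unit_entities e ON e.unit_id = u.id
           WHERE e.entity = ? AND u.created_at >= ?
           ORDER BY u.confidence DESC, u.created_at DESC""",
        (entity, since),
    ).fetchall()
    return [row[0] for row in rows]


def constraints_on(unit_id: int) -> list[str]:
    """Follow 'constrained_by' edges from one unit to the units it references."""
    rows = conn.execute(
        """SELECT u.text FROM unit_relations r
           JOIN units u ON u.id = r.to_unit
           WHERE r.from_unit = ? AND r.relation = 'constrained_by'""",
        (unit_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this shape, `recall_entity("Sarah")` puts the co-founder fact ahead of the lunch note, `recall_entity("Sarah", since="2026-01-01")` answers a time-bounded question, and `constraints_on(4)` walks from a decision to a constraint set in another conversation. None of those three needs a similarity search.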
A long-term user's memory can reach millions of units within months. Vector-only storage falls apart at that scale, because similarity alone can't separate the many near-matches by time, entity, or relationship. The graph structure is what keeps recall sharp at 7M+ tokens of conversation history.
3. Retrieval
You have units stored. The user says something new. What do you surface?
Naive retrieval: embed the query, find the top-N nearest units by cosine similarity, prepend them to the model's context.
The problems:
- Single-hop fails on long chains. "What did we decide about pricing after the user research came back?" — this needs three hops: find the user research, find the pricing discussion, find the decision. A single similarity search returns either pricing chunks or research chunks, not the chain.
- Recency vs relevance. A user who's been running a project for a year has dozens of decisions on related topics. Pure similarity surfaces the oldest match alongside the most recent. The model gets confused about which is current.
- Topic drift. When a conversation moves from finance to lawn care, retrieval needs to follow. A static "always retrieve based on the user's first message" approach fails as the topic shifts mid-conversation.
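To make the recency-versus-relevance bullet concrete, here is a small sketch that scores each unit as cosine similarity multiplied by an exponential recency weight. The 90-day half-life, the two-dimensional toy vectors, and the sample units are all made up for illustration; a real system would tune the decay and use real embeddings.

```python
import math
from datetime import date


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recency_weight(created: date, today: date, half_life_days: float = 90.0) -> float:
    """Halve a unit's weight for every half_life_days of age."""
    age_days = (today - created).days
    return 0.5 ** (age_days / half_life_days)


def rank(query_vec: list[float], units: list[dict], today: date) -> list[str]:
    """Order unit texts by similarity times recency, best first."""
    scored = [
        (cosine(query_vec, u["vec"]) * recency_weight(u["created"], today), u["text"])
        for u in units
    ]
    return [text for _, text in sorted(scored, reverse=True)]


units = [
    {"text": "Decided on monthly pricing", "vec": [0.9, 0.1], "created": date(2025, 4, 1)},
    {"text": "Switched to annual pricing", "vec": [0.8, 0.2], "created": date(2026, 4, 1)},
]
today = date(2026, 4, 28)
query = [0.9, 0.1]

# Pure similarity prefers the year-old decision, which matches the query exactly.
by_similarity = max(units, key=lambda u: cosine(query, u["vec"]))["text"]

# Recency-weighted ranking puts the current decision first.
by_recency = rank(query, units, today)[0]
```

Here `by_similarity` is the stale decision and `by_recency` is the current one. A single global half-life is the crudest version of this idea, and it is only meant to show the ranking flip.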
Real retrieval: multi-hop planning when the query needs it — we call this TRACE — recency-weighted ranking by default, and topic-aware filtering as the conversation evolves. Different shapes of question route differently. A fact lookup ("when did I last call her?") goes one way. An evolution question ("how has my financial situation changed?") triggers TRACE, which decomposes into 5+ sub-queries that each retrieve their own slice and feed the synthesis. A current-state question routes a third way. The retrieval strategy adapts to the shape of what you're actually asking.
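Routing by question shape can be sketched roughly as below. This is not the TRACE implementation: the keyword cues are a crude stand-in for a real classifier, the per-period decomposition is one assumed way to split an evolution question, and every name here is hypothetical.

```python
from enum import Enum


class Shape(Enum):
    FACT_LOOKUP = "fact_lookup"
    EVOLUTION = "evolution"
    CURRENT_STATE = "current_state"


# Assumed cue phrases; a production router would not rely on substrings.
EVOLUTION_CUES = ("how has", "changed", "evolved", "over time")
CURRENT_CUES = ("currently", "right now", "these days", "at the moment")


def classify(question: str) -> Shape:
    q = question.lower()
    if any(cue in q for cue in EVOLUTION_CUES):
        return Shape.EVOLUTION
    if any(cue in q for cue in CURRENT_CUES):
        return Shape.CURRENT_STATE
    return Shape.FACT_LOOKUP


def plan(question: str, periods: tuple[str, ...] = ("2024", "2025", "2026")) -> list[str]:
    """Return the sub-queries to run for a question."""
    if classify(question) is Shape.EVOLUTION:
        # One sub-query per period; each retrieves its own slice for synthesis.
        return [f"{question} [period: {p}]" for p in periods]
    # Fact lookups and current-state questions run as a single query here.
    return [question]


evolution_plan = plan("How has my financial situation changed?")
lookup_plan = plan("When did I last call her?")
```

The evolution question fans out into one sub-query per period, while the fact lookup stays a single query. The point of the sketch is only that the plan is chosen before any similarity search runs.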
A useful test: ask the system "how has my situation evolved over time?" If it returns a coherent narrative, retrieval is working. If it returns a list of unrelated chunks, retrieval is just doing similarity search.
4. Curation
The first three parts get you a working memory layer at one month. By month four, without curation, it's full of contradictions and noise.
Why curation matters:
- Contradictions accumulate. Users change their minds. "I'm building a SaaS" → "actually I pivoted to consumer". Without resolution, both units coexist forever and retrieval has no way to pick the current one.
- Stale facts get retrieved. A preference you stated in 2024 may no longer hold in 2026. Pure recency weighting throws away signal too, because some old units are still true, so decay has to be smarter than a single cutoff.
- Drift compounds. Each conversation slightly reshapes what the memory thinks it knows. Without active correction, small drifts in week one become wholesale misunderstanding by week twelve.
- Noise crowds out signal. Off-hand mentions, throwaway lines, and assistant paraphrases all generate units. Without pruning, the high-signal units get drowned out.
Real curation: a curator agent that runs after each scribe pass and on a periodic batch schedule. It resolves direct contradictions ("user is in Brisbane" → "user moved to Cairns"), decays old units with a half-life that scales with type (preferences last longer than situational notes), prunes redundancies, and promotes units that recur (a fact that appears across many conversations gets weighted higher).
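Two of those curator jobs, type-scaled decay and contradiction resolution, can be sketched in a few lines. The half-life values, the `key` field, and the "newest wins" rule are assumptions for illustration. A real curator would also have to decide when two units are about the same thing in the first place.

```python
from dataclasses import dataclass
from datetime import date

# Assumed half-lives in days: preferences outlast situational notes.
HALF_LIFE_DAYS = {"fact": 730.0, "preference": 365.0, "situational": 30.0}


@dataclass
class Unit:
    key: str              # what the unit is about, e.g. "user.location"
    value: str
    unit_type: str
    created: date
    superseded: bool = False


def decayed_weight(unit: Unit, today: date) -> float:
    """Halve the weight once per type-specific half-life."""
    age_days = (today - unit.created).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS[unit.unit_type])


def resolve_contradictions(units: list[Unit]) -> list[Unit]:
    """Keep the newest unit per key and mark the older ones superseded."""
    newest: dict[str, Unit] = {}
    for u in units:
        if u.key not in newest or u.created > newest[u.key].created:
            newest[u.key] = u
    for u in units:
        u.superseded = u is not newest[u.key]
    return [u for u in units if not u.superseded]


today = date(2026, 4, 28)
history = [
    Unit("user.location", "Brisbane", "fact", date(2025, 6, 1)),
    Unit("user.location", "Cairns", "fact", date(2026, 2, 1)),
]
current = resolve_contradictions(history)

preference = Unit("user.editor", "vim", "preference", date(2025, 4, 28))
note = Unit("user.mood", "tired today", "situational", date(2025, 4, 28))
```

After resolution, `current` holds only the Cairns unit, and the Brisbane unit stays in `history` flagged as superseded rather than deleted. The year-old preference keeps half its weight while the equally old situational note has decayed to almost nothing.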
This is the layer most bolt-on memory products don't build at all. The vector database doesn't care about contradictions; it just stores everything.
What goes wrong when one part is missing
Each missing part has a recognisable failure shape:
| Missing | Symptom |
|---|---|
| Capture | Memory is a black box of chat logs. You can't browse what's stored. Retrieval is keyword-only. |
| Storage | Recall works for short histories then collapses. Memory feels useful at month 1, useless at month 6. |
| Retrieval | The system "has" the right answer but doesn't surface it. Frustration: "you literally remembered this last week, why don't you remember now?" |
| Curation | Memory contradicts itself. Old facts override new ones. Pruning goes in the wrong direction. |
If you've used a memory product and felt one of these symptoms, you can usually diagnose which layer it's missing.
Why labs aren't shipping this
Building all four well looks more like the work of a search-engine company than an LLM company. The frontier labs are LLM-shaped: their R&D dollars go to model training, evals, and alignment. The memory layer is adjacent to that and not their core competency, so they ship the small version (memory features prepended to context) and call it done.
We covered why in Why your AI keeps forgetting — the structural reasons memory isn't on the roadmap.
What Moss is
Moss is the four-part stack:
- Capture: scribe agent extracts typed semantic units from every exchange, with confidence scores and source links.
- Storage: Postgres + pgvector + entity graph. Indexed by embedding, time, entity, relationship.
- Retrieval: topic-aware, multi-hop where needed, recency-weighted by default. Different question shapes route to different retrieval strategies.
- Curation: contradiction resolution, decay tuning, dedup, drift correction. Runs continuously.
The benchmark shows what this gets you: recall stays sharp at 7M+ tokens of conversation history. See the numbers.
If you want to use it, try Moss — free tier, no card. The first time it surfaces something from three months ago that you'd forgotten you said, you'll know which of the four parts most products are skipping.
Or, if you're still mapping the conceptual space, the post that gave us the vocabulary is Context window vs memory vs persistence. And the problem statement that started this thread is Why your AI keeps forgetting.