← All posts · 2026-04-28
Why your AI keeps forgetting (and why nobody is fixing it)
You've had the conversation. Maybe last week. You explained your project, your stack, your team, the constraint that makes the obvious answer wrong. The model nodded along (figuratively). It got the nuance. Reply was good.
Then you opened a new chat and started over. Same explanation. Same setup. Same five-minute preamble before you can ask the question that actually matters.
This is the default state of every frontier LLM in 2026. ChatGPT, Claude, Gemini, Grok — all of them. They are stateless by design.
This post is about why that's the case, what the workarounds are, and why none of the major labs are racing to fix it.
The simple answer
Each turn you send to an LLM is independent. The model has no notion of "you" between requests. It loads weights, reads the input you just sent, generates a response, and unloads context. Nothing persists. The reason that doesn't get fixed is mostly token economics, which we'll come back to.
Memory features bolted onto consumer products (ChatGPT memory, Claude projects, Gemini saved info) work around this by prepending something to each new request — a list of facts, a system prompt, a project file. That's not memory in the brain-cell sense. It's a manual save-state.
Why it works at all: the model still doesn't "remember", but the prepended text gives it enough context to behave as if it does. Why it breaks down: the prepended text is small. Has to be — for the token-economic reasons below.
Token economics is the real ceiling
Every token sent into an LLM costs money and time. Input tokens are roughly 3-5× cheaper than output tokens but still bill at $0.05 to $5 per million. Latency is roughly linear in input length. Larger context = slower response.
A "memory" feature that dumps your last 6 months of conversation history into every new chat would be:
- Slow. Reading 500K tokens of history before generating a single token of response = 30+ seconds before the model says anything.
- Expensive. Multiply by every user, every turn. The economics of $20/month consumer subscriptions don't work.
- Worse than no memory. Past about 1M tokens, model recall drops sharply. The relevant fact gets lost in the haystack of everything else. See the benchmark for what this looks like in practice.
So memory features are kept small. ChatGPT's memory caps at a couple hundred short facts. Claude's project files cap at the project's context budget. Gemini's saved info is similar. The cap isn't a UX choice — it's a cost-and-quality choice. Bigger memory makes the product worse on every axis the labs care about.
Why labs aren't racing to fix it
Three reasons, in rough order of importance:
1. The win condition is bigger context, not better memory. OpenAI, Anthropic, Google have spent the last three years pushing context windows from 8K → 200K → 2M tokens. The bet is that with a large enough window, persistent memory becomes "just put everything in context" and the problem dissolves. Why prioritise this lever? Bigger context scales as a model-architecture problem (their core competency), it's a single number marketers can put on a slide, and it benefits every use case from coding to research at the same time. Persistence benefits a narrower slice and requires building infrastructure outside the model itself. It hasn't dissolved — recall at 2M tokens is real but not great, and at 7M+ it falls apart — but the bet is still active.
2. Memory is a differentiator they don't want to commit to. A general-purpose model with deep memory of you is sticky. Sticky is good for retention, but it also creates vendor lock-in the labs would rather avoid carrying. If your three-year context lives at OpenAI, switching to Anthropic costs you that whole substrate. That's a feature for them when you're already on, but a regulatory and PR liability when AI-portability becomes a real conversation (and it's becoming one). The labs prefer features that are useful per-session and forgotten per-session, because that keeps the model itself the focus and keeps the switching cost lower.
3. It's hard. Real persistent memory isn't a context-window-bigger problem. It's a retrieval-and-curation problem. You need to know what to surface, when, and why. The labs are LLM-shaped companies, not knowledge-management-shaped companies. Building good memory is closer to a search engine + database problem than a model-architecture problem, and that's not where their R&D dollars go.
The workarounds people actually use
If you've been working with LLMs seriously, you've probably hit one of these:
- Re-explain your context every conversation. Most common. Costs you 2-5 minutes per chat. Adds up fast.
- Maintain a "context document" you paste at the start of every chat. Slightly better. Still manual. Goes stale.
- Use Claude projects or ChatGPT memory. Works for the small set of facts they support. Falls over for everything else.
- Switch to a model with bigger context (Gemini 2M) and stuff history in. Slow, expensive, and the recall isn't great at the high end.
- Build a local RAG pipeline. Engineers do this. Works in theory. Burns hundreds of hours in practice.
None of these are good. Each has the same shape: the user is doing memory's job, manually, every day.
What memory actually requires
Real persistent memory — the kind that makes an AI feel like a thinking partner instead of a search bar — needs four things:
- Capture. Every conversation extracted into structured units (facts, decisions, preferences, contradictions).
- Storage. Those units indexed by semantics, time, and relationship — not just dumped into a list.
- Retrieval. When you start a new conversation, the relevant past gets surfaced based on what you're actually saying right now, not just what you previously flagged.
- Curation. Old or contradicted units get resolved. Noise gets pruned. The memory stays useful at scale.
We dig into the architecture of each in What persistent memory actually requires. It's the post that follows this one.
If you want the conceptual ground first — the difference between context window, memory, and persistence (terms that get used interchangeably and shouldn't be) — read Context window vs memory vs persistence next.
Where Moss fits
Moss is built on the four-things stack above. Every conversation gets scribed into a personal knowledge graph. Retrieval runs on every turn against your full history (we have users at 7M+ tokens, recall stays sharp — see the benchmark). The curator agent resolves contradictions and prunes drift.
It's not magic and it isn't free. The free tier is real (15 exchanges per day, persistent memory, no card) — paid tiers from $9/mo for higher caps and document upload, scaling up for power users. Building this layer costs real compute and real engineering, which is why it isn't the labs' default move. But it's the missing layer no frontier lab is shipping, and once you've used it for two weeks the stateless-by-default ChatGPT/Claude experience starts feeling like a phone with no contacts app.
Try Moss on the free tier — no credit card. Or import your ChatGPT history to seed memory from what you've already said.
The forgetting isn't going away on its own. It's a structural feature of how the labs are built, not a bug they're working on.
Try Moss · Blog · Home · Benchmark