Benchmark
Moss vs Claude, ChatGPT, Gemini: persistent memory at scale.
Headline: Moss maintains accurate recall across 7M+ tokens of conversation history. Claude, ChatGPT, and Gemini all degrade noticeably past roughly 1M tokens.
The non-obvious part: Moss carries context into every new conversation. The other three fetch a shallow facts table on demand. That difference is why the gap shows up earlier in daily use than raw token numbers suggest.
What we tested
| Capability | Moss | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- | --- |
| Accurate recall at 100k tokens | Yes | Yes | Yes | Yes |
| Accurate recall at 1M tokens | Yes | Degrades | Degrades | Partial |
| Accurate recall at 3M tokens | Yes | Fails | Fails | Fails |
| Accurate recall at 7M+ tokens | Yes | Fails | Fails | Fails |
| Cross-conversation recall | Yes, unprompted | Limited | Weak | Weak |
| Proactive surfacing | Yes | No | No | No |
| Contradiction resolution | Yes, nightly curator | No | Partial | No |
How Moss does it
- Carried, not fetched. Every conversation starts with the relevant past already in the context window.
- Structured memory, not a facts table. Moss extracts semantic units, entities, and relationships into a personal knowledge graph with contradiction-aware curation.
- Orchestrated model arsenal. Retrieval, abstraction, and synthesis are each routed to the model best suited to that operation (Claude, GPT, Gemini, Groq, or Perplexity), chosen server-side.
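To make "carried, not fetched" and contradiction-aware curation concrete, here is a minimal sketch. It is not Moss's implementation: the `Fact` record, the `curate` and `carry` functions, and the "most recent statement wins" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    # Hypothetical record shape: one edge in a personal knowledge graph.
    entity: str
    relation: str
    value: str
    turn: int  # when the fact was stated; larger means more recent


def curate(facts):
    """Keep one value per (entity, relation), preferring the most recent.

    Assumption: contradictions are resolved by recency alone.
    """
    latest = {}
    for fact in facts:
        key = (fact.entity, fact.relation)
        if key not in latest or fact.turn > latest[key].turn:
            latest[key] = fact
    return sorted(latest.values(), key=lambda f: (f.entity, f.relation))


def carry(facts, entities):
    """Build a context preamble from curated facts about the given entities."""
    relevant = [f for f in curate(facts) if f.entity in entities]
    return "\n".join(f"{f.entity} {f.relation} {f.value}" for f in relevant)


history = [
    Fact("Ana", "lives_in", "Lisbon", turn=3),
    Fact("Ana", "works_at", "Acme", turn=5),
    Fact("Ana", "lives_in", "Porto", turn=40),  # contradicts turn 3
    Fact("Ben", "likes", "chess", turn=12),
]

# The preamble is prepended before the conversation starts, so the model
# never has to look the facts up on demand.
preamble = carry(history, {"Ana"})
print(preamble)
```

In this toy version the older "Lisbon" statement is dropped during curation, so only the current value reaches the context window.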
Methodology
Each recall test introduces a structured fact early in a conversation, then asks the model to retrieve it after N additional turns, with N scaled up to push against the effective context window. We track four metrics: exact recall, directional recall, cross-thread recall, and proactive surfacing.
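The shape of a single recall test can be sketched as follows. This is an illustrative harness, not the actual test suite: `build_transcript`, `score`, and `toy_model` are hypothetical names, and we assume here that "directional" means the answer names the right topic without the exact value.

```python
def build_transcript(fact, filler_turns):
    """Place the fact first, then N filler turns, then the probe question."""
    turns = [f"Remember this: {fact}"]
    turns += [f"Filler turn {i}" for i in range(filler_turns)]
    turns.append("What did I ask you to remember?")
    return turns


def score(answer, expected_value, expected_topic):
    """Return 'exact', 'directional', or 'miss' for one probe answer."""
    text = answer.lower()
    if expected_value.lower() in text:
        return "exact"
    if expected_topic.lower() in text:
        return "directional"  # assumed meaning: right topic, wrong detail
    return "miss"


def toy_model(turns, window):
    """Stand-in for a real model: it only sees the last `window` turns."""
    prefix = "Remember this: "
    for turn in turns[-window:]:
        if turn.startswith(prefix):
            return turn[len(prefix):]
    return "I don't recall."


fact = "the launch date is 14 March"
results = {}
for n in (10, 100, 1000):
    answer = toy_model(build_transcript(fact, n), window=200)
    results[n] = score(answer, "14 March", "launch date")

print(results)
```

With a 200-turn window, the toy model recalls the fact at 10 and 100 filler turns and misses it at 1000, which is the kind of cliff the table above reports at much larger scales.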
What this page isn't yet
A formal peer-reviewed paper. A published open dataset. An independently reproduced evaluation. We're working toward all three. Email hello@mossmemory.com for the full test suite.