About
The Project
Memory Arena is a unified benchmark tracker for AI memory systems. It brings together evaluation data from across the memory landscape — agent memory layers, RAG retrieval, personalization stores, long-context extension, knowledge graphs, episodic session buffers, and lifelong-learning systems.
The memory landscape is fragmented across dozens of products and benchmarks, each measuring different capabilities with different metrics, scales, and evaluation methodologies. Memory Arena normalizes and aggregates this data into a single, navigable interface for researchers, engineers, and decision-makers.
Whether you are choosing a memory system for a downstream application, benchmarking your own research, or surveying the state of the art, the Arena gives you a complete picture in one place.
Score Provenance
Every score on Memory Arena carries a provenance badge indicating where the data came from. This is the most important thing to understand about our data.
What “Estimated” Means
The majority of scores on Memory Arena are currently estimates. These are NOT from real benchmark runs. They are algorithmic approximations derived from each system’s capability profile (retrieval accuracy, long-range recall, etc.) using weighted mappings to benchmark domains. Estimated scores are shown at reduced opacity with a dashed border and an amber badge.
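As a rough illustration of what a weighted mapping from a capability profile to a benchmark domain can look like, the sketch below computes an estimated score as a weighted average. The 0-100 scale, the benchmark domain name, the function name, and every weight are hypothetical assumptions for this example; they are not Memory Arena's actual mapping.

```python
# Hypothetical capability profile on a 0-100 scale (illustrative values only).
profile = {"retrieval_accuracy": 80.0, "long_range_recall": 60.0}

# Hypothetical weights for one benchmark domain; not Memory Arena's real mapping.
multi_hop_weights = {"retrieval_accuracy": 0.25, "long_range_recall": 0.75}


def estimate_score(profile, weights):
    """Weighted average of the capabilities that a benchmark domain draws on."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(profile.get(name, 0.0) * w for name, w in weights.items())
    return weighted / total_weight


estimated = estimate_score(profile, multi_hop_weights)
# Keep the provenance next to the number so it is never mistaken for a real run.
print({"score": estimated, "provenance": "estimated"})
```

Whatever the real weights are, the point carries over: the output is a function of the profile alone, so it can rank systems directionally but cannot stand in for a measured result.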
Why We Show Estimates
Most memory systems have no published benchmark data. Without estimates, our catalog of 235 systems would be 93% empty. Estimates provide directional comparison while we work to collect real data. As new research and vendor publications become available, we replace estimates with verified numbers.
Normalization
Percentile rankings are computed across all systems for each benchmark. The elevation tier system (Summit, Peak, Plateau, Valley, Fault, Uncharted) maps percentiles to intuitive topographic categories.
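A minimal sketch of this normalization step follows, assuming a simple "at or below" percentile definition. The tier cutoffs, the system names, and the rule that a system with no score is Uncharted are all assumptions made for illustration; the actual thresholds are not stated here.

```python
# Hypothetical cutoffs as (minimum percentile, tier); the real thresholds may differ.
TIER_CUTOFFS = [
    (90.0, "Summit"),
    (70.0, "Peak"),
    (40.0, "Plateau"),
    (15.0, "Valley"),
    (0.0, "Fault"),
]


def percentile_rank(score, all_scores):
    """Percentage of systems scoring at or below `score` on one benchmark."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)


def tier_for(percentile):
    """Map a percentile to a tier; a missing score is assumed to mean Uncharted."""
    if percentile is None:
        return "Uncharted"
    for cutoff, name in TIER_CUTOFFS:
        if percentile >= cutoff:
            return name
    return "Fault"


# Hypothetical scores for four systems on a single benchmark.
scores = {"system_a": 72.0, "system_b": 55.0, "system_c": 91.0, "system_d": 40.0}
values = list(scores.values())
tiers = {name: tier_for(percentile_rank(s, values)) for name, s in scores.items()}
print(tiers)
```

Because percentiles are relative, a system's tier can change when other systems are added to the benchmark, even if its own score does not.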
Architecture Taxonomy
Memory systems span 9 architecture families, each with distinct strengths and trade-offs.
Vector RAG
Dense embedding stores queried by approximate nearest-neighbor search. The default RAG pattern: chunk the corpus, embed each chunk, store the vectors in an index, then retrieve the top-k chunks at query time and insert them into the prompt. The foundation for most production memory systems today.
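The chunk, embed, store, retrieve loop can be sketched in a few lines. This toy version swaps in word counts for a dense embedding model and exact brute-force search for approximate nearest-neighbor search, so it shows the shape of the pattern rather than a production setup. All function names and the sample corpus are invented for the example.

```python
import math
from collections import Counter


def chunk(text, size=8):
    """Split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a dense embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Exact top-k search by cosine similarity (brute force, not approximate)."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


corpus = (
    "the user prefers dark mode in every editor "
    "the deployment runs on three nodes in frankfurt "
    "billing invoices go out on the first monday"
)

# Store each chunk next to its vector; a real system would use a vector index here.
index = [(c, embed(c)) for c in chunk(corpus)]

question = "which editor theme does the user prefer"
top = retrieve(question, index, k=1)
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
print(prompt)
```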
Graph RAG
Extracts entities and relations from the corpus into a property graph, then queries the graph (often alongside a vector index) at retrieval time. Strong on multi-hop and explanatory questions where path traversal matters more than nearest-neighbor similarity.
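To make the multi-hop point concrete, the sketch below stores a handful of hand-written triples in an adjacency map and answers a question by path traversal. The triples, entity names, and the find_path helper are hypothetical; entity and relation extraction is assumed to have happened upstream.

```python
from collections import deque

# Hypothetical triples, as if extracted from a corpus by an upstream step.
triples = [
    ("Alice", "works_at", "Acme"),
    ("Acme", "headquartered_in", "Berlin"),
    ("Berlin", "located_in", "Germany"),
    ("Bob", "works_at", "Globex"),
]

# Adjacency map: subject -> list of (relation, object).
graph = {}
for subject, relation, obj in triples:
    graph.setdefault(subject, []).append((relation, obj))


def find_path(start, goal):
    """Breadth-first search returning the shortest chain of triples from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None  # no chain of relations connects the two entities


# "In which country does Alice work?" needs three hops.
path = find_path("Alice", "Germany")
print(path)
```

The returned chain doubles as an explanation of the answer, which nearest-neighbor similarity alone does not provide.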
Hierarchical Summary
Tiered memory layers in which recent items are kept verbatim and older items are progressively summarized. Inspired by human episodic-to-semantic compression. Trades some fidelity for unbounded effective context.
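A two-tier version of this idea might look like the sketch below. The TieredMemory class and its capacity are invented for illustration, and the summarize function is a deliberate placeholder that truncates text; a real system would more likely call a model to produce the summary.

```python
from collections import deque


def summarize(text, max_words=4):
    """Placeholder summarizer: keeps the first few words of an item."""
    words = text.split()
    return " ".join(words[:max_words]) + (" ..." if len(words) > max_words else "")


class TieredMemory:
    def __init__(self, verbatim_capacity=2):
        self.verbatim_capacity = verbatim_capacity
        self.recent = deque()   # newest items, kept word for word
        self.summaries = []     # older items, compressed

    def add(self, item):
        self.recent.append(item)
        # Demote the oldest verbatim items once the recent tier overflows.
        while len(self.recent) > self.verbatim_capacity:
            self.summaries.append(summarize(self.recent.popleft()))

    def context(self):
        """Summaries of older items first, then the verbatim recent items."""
        return self.summaries + list(self.recent)


memory = TieredMemory(verbatim_capacity=2)
memory.add("User asked about pricing tiers for the enterprise plan")
memory.add("Agent quoted the annual price")
memory.add("User requested a discount")
print(memory.context())
```

The fidelity trade-off is visible in the first entry: once an item is demoted, only its compressed form remains.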
Episodic Buffer
Ring or queue buffers of discrete episodes (turns, sessions, observations) with explicit recency, importance, and decay operators. Closer to a transcript log than a search index. Common in agent frameworks that need ordered playback.
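The sketch below shows one way a ring buffer with recency, importance, and decay could fit together. The Episode and EpisodicBuffer names, the exponential decay rule, and the sample episodes are assumptions for this example rather than a description of any particular framework.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Episode:
    step: int          # when the episode was recorded
    text: str
    importance: float  # 0.0 to 1.0, assigned by the caller


class EpisodicBuffer:
    def __init__(self, capacity=3, decay=0.5):
        # Ring buffer: the oldest episode drops off when capacity is reached.
        self.episodes = deque(maxlen=capacity)
        self.decay = decay
        self.step = 0

    def record(self, text, importance):
        self.episodes.append(Episode(self.step, text, importance))
        self.step += 1

    def playback(self):
        """Episodes in the order they happened."""
        return [e.text for e in self.episodes]

    def salience(self, episode):
        """Importance discounted exponentially by age in steps."""
        age = self.step - 1 - episode.step
        return episode.importance * (self.decay ** age)

    def most_salient(self):
        return max(self.episodes, key=self.salience)


buffer = EpisodicBuffer(capacity=3, decay=0.5)
buffer.record("greeting", 0.1)
buffer.record("user states a peanut allergy", 1.0)
buffer.record("small talk", 0.2)
buffer.record("user asks for a recipe", 0.4)

print(buffer.playback())
print(buffer.most_salient().text)
```

With this decay rate the recent request outranks the older but more important allergy statement, which illustrates why the choice of decay operator matters for what an agent recalls.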
Agentic Workflow
LLM-orchestrated read/write/reflect loops in which the model itself decides when to query, update, or summarize memory. The MemGPT-style 'memory as a tool' pattern. Highest flexibility, highest cost, hardest to evaluate.
KV Cache Extension
Architectural tricks that extend or compress the attention KV cache directly, rather than building an external store. Includes paging, eviction, sparsity, and learned compression. Lives in the model runtime, not the application layer.
External Memory Network
Learned read/write heads attached to a slot-based memory store. The DNC and Neural Turing Machine lineage. Largely a research direction rather than a production pattern, but resurfaces inside hybrid systems.
Knowledge Base
Structured stores — triple stores, relational schemas, document databases — used as authoritative memory rather than as retrieval material. The LLM queries or updates the KB through a tool layer.
Hybrid
Two or more of the above patterns surfaced behind a single API. Most production-grade memory systems converge here: vector retrieval for breadth, graph for relations, structured store for personalization, summaries for the long tail.
Team
Kanishk Patel
Author
Project: Memory Systems Taxonomy and Benchmarking (MSTB). A comprehensive effort to map, categorize, and evaluate the rapidly evolving AI memory landscape.
Contributing
Memory Arena is open source. We welcome contributions of new system entries, benchmark data, bug fixes, and feature ideas.
View on GitHub
FAQ
What is a memory system?
A memory system is the layer that gives an AI application persistence — the ability to remember facts, conversations, user preferences, or task state across turns and sessions. Memory systems span agent memory layers, RAG retrieval over corpora, personalization stores, long-context windows, knowledge graphs, episodic session buffers, and lifelong-learning stores.
Are the benchmark scores real?
Some are, but most are not yet. Each score shows a provenance badge. 'Self-Reported' scores come from vendor papers and documentation. 'Estimated' scores are algorithmic approximations derived from capability profiles; they are NOT from real benchmark runs. We are working to replace estimates with verified data.
What does 'Estimated' mean?
Estimated scores are derived from a system's capability profile (retrieval accuracy, long-range recall, etc.) using weighted mappings to benchmark domains. They provide a rough directional comparison but should not be treated as real benchmark results. We show them to provide a comprehensive catalog while being transparent that they are approximations.
How often is the data updated?
We update scores as new papers and evaluations are published. Our goal is to progressively replace estimated scores with verified data.
Is the data available programmatically?
Yes. All data is available as static JSON files through our public API. See the API Documentation page for endpoint details and example responses.