About

The Project

Memory Arena is a unified benchmark tracker for AI memory systems. It brings together evaluation data from across the memory landscape — agent memory layers, RAG retrieval, personalization stores, long-context extension, knowledge graphs, episodic session buffers, and lifelong-learning systems.

The memory landscape is fragmented across dozens of products and benchmarks, each measuring different capabilities with different metrics, scales, and evaluation methodologies. Memory Arena normalizes and aggregates this data into a single, navigable interface for researchers, engineers, and decision-makers.

Whether you are choosing a memory system for a downstream application, benchmarking your own research, or surveying the state of the art, the Arena gives you a complete picture in one place.

Score Provenance

Every score on Memory Arena carries a provenance badge indicating where the data came from. This is the most important thing to understand about our data.

Verified: 0 (independently confirmed)
Self-Reported: 87 (from vendor papers/docs)
Community: 0 (third-party reproductions)
Estimated: 1230 (algorithmic approximation)

What “Estimated” Means

The majority of scores on Memory Arena are currently estimates. These are NOT from real benchmark runs. They are algorithmic approximations derived from each system’s capability profile (retrieval accuracy, long-range recall, etc.) using weighted mappings to benchmark domains. Estimated scores are shown at reduced opacity with a dashed border and an amber badge.
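As a rough illustration of what a weighted mapping could look like, the sketch below turns a capability profile into an estimated score for one benchmark domain. The capability names, the weights, and the 0-100 scale are hypothetical; the actual mapping used by Memory Arena is not described on this page.

```python
# Hypothetical sketch: capability names, weights, and the 0-100 scale are
# illustrative assumptions, not Memory Arena's actual mapping.

def estimate_score(profile: dict, weights: dict) -> float:
    """Weighted average of the capabilities a benchmark domain draws on."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Capabilities missing from the profile contribute zero.
    weighted_sum = sum(profile.get(name, 0.0) * w for name, w in weights.items())
    return weighted_sum / total_weight


# One system's capability profile (made-up values).
profile = {"retrieval_accuracy": 80.0, "long_range_recall": 60.0}

# How much one hypothetical benchmark domain depends on each capability.
domain_weights = {"retrieval_accuracy": 0.75, "long_range_recall": 0.25}

estimated = estimate_score(profile, domain_weights)
print(estimated)
```

A score produced this way reflects the profile and the chosen weights only, which is why it is badged separately from measured results.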

Why We Show Estimates

Most memory systems have no published benchmark data. Without estimates, our catalog of 235 systems would be 93% empty. Estimates provide directional comparison while we work to collect real data. As new research and vendor publications become available, we replace estimates with verified numbers.
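The 93% figure is consistent with the badge counts listed under Score Provenance, assuming it refers to the share of scores that are estimates (the page does not spell out the calculation). A quick check:

```python
# Badge counts as listed under Score Provenance.
counts = {"Verified": 0, "Self-Reported": 87, "Community": 0, "Estimated": 1230}

total_scores = sum(counts.values())
# Assumption: "93% empty" means the fraction of scores that are estimates.
estimated_share = counts["Estimated"] / total_scores

print(f"{total_scores} scores, {estimated_share:.1%} estimated")
# prints: 1317 scores, 93.4% estimated
```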

Normalization

Percentile rankings are computed across all systems for each benchmark. The elevation tier system (Summit, Peak, Plateau, Valley, Fault, Uncharted) maps percentiles to intuitive topographic categories.
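A minimal sketch of this two-step normalization follows. The percentile definition (share of scores at or below a given score), the tier cutoffs, and the use of Uncharted for systems with no score are all assumptions made for illustration; the page does not publish the actual thresholds.

```python
from typing import Optional

# Hypothetical cutoffs: (minimum percentile, tier name), highest first.
TIER_CUTOFFS = [
    (90.0, "Summit"),
    (70.0, "Peak"),
    (40.0, "Plateau"),
    (15.0, "Valley"),
    (0.0, "Fault"),
]


def percentile_rank(score: float, all_scores: list) -> float:
    """Percentage of scores on one benchmark that are <= `score`."""
    if not all_scores:
        raise ValueError("need at least one score")
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)


def elevation_tier(percentile: Optional[float]) -> str:
    """Map a percentile to a tier; None (no score) is treated as Uncharted."""
    if percentile is None:
        return "Uncharted"
    for cutoff, tier in TIER_CUTOFFS:
        if percentile >= cutoff:
            return tier
    return "Fault"


scores = [42.0, 55.0, 61.0, 78.0, 90.0]  # made-up scores for one benchmark
tiers = [elevation_tier(percentile_rank(s, scores)) for s in scores]
print(tiers)
```

Because percentiles are computed per benchmark, the same raw score can land in different tiers on different benchmarks.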

Architecture Taxonomy

Memory systems span 9 architecture families, each with distinct strengths and trade-offs.

Vector RAG

Dense embedding stores queried by approximate nearest-neighbor search. The default RAG pattern: chunk the corpus, embed the chunks, store them in a vector index, then retrieve the top-k chunks at query time and insert them into the prompt. Foundation for most production memory systems today.
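The chunk, embed, store, retrieve loop can be sketched in a few lines. This is a toy under stated assumptions: word counts stand in for a dense embedding model, and a brute-force cosine scan stands in for an approximate nearest-neighbor index. All function names are hypothetical.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 8) -> list:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: list, size: int = 8) -> list:
    """Store (chunk text, vector) pairs; a real system would use an ANN index."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, size)]


def retrieve(query: str, index: list, k: int = 2) -> list:
    """Brute-force top-k by cosine similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


documents = [
    "The user prefers dark mode in the editor",
    "Billing renews on the first of each month",
    "The user lives in Lisbon and speaks Portuguese",
]
index = build_index(documents)

question = "which mode does the user prefer in the editor"
top = retrieve(question, index, k=1)
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
print(prompt)
```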

Graph RAG

Extracts entities and relations from the corpus into a property graph, then queries the graph (often alongside a vector index) at retrieval time. Strong on multi-hop and explanatory questions where path traversal matters more than nearest-neighbor similarity.
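To show why path traversal helps on multi-hop questions, here is a small sketch that answers "how is Alice connected to Germany?" by walking a graph. The triples are hypothetical and the entity extraction step is assumed to have already run; a plain dictionary stands in for a property graph database.

```python
from collections import deque

# Hypothetical (head, relation, tail) triples from an extraction step.
TRIPLES = [
    ("Alice", "works_at", "Acme"),
    ("Acme", "headquartered_in", "Berlin"),
    ("Berlin", "located_in", "Germany"),
    ("Alice", "likes", "Chess"),
]


def build_graph(triples: list) -> dict:
    """Adjacency list: head -> list of (relation, tail)."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
    return graph


def find_path(graph: dict, start: str, goal: str, max_hops: int = 3):
    """Breadth-first search; returns the shortest chain of triples, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


graph = build_graph(TRIPLES)
path = find_path(graph, "Alice", "Germany")
for head, relation, tail in path:
    print(head, relation, tail)
```

The returned chain of triples doubles as an explanation of the answer, which nearest-neighbor retrieval over isolated chunks does not provide directly.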

Hierarchical Summary

Tiered memory layers in which recent items are kept verbatim and older items are progressively summarized. Inspired by human episodic-to-semantic compression. Trades some fidelity for unbounded effective context.
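A two-tier version of this idea can be sketched as follows. The class and function names are hypothetical, and truncating to the first few words is only a stand-in for a real summarizer (typically an LLM call); production systems usually have more than two tiers.

```python
def first_words(text: str, n: int = 4) -> str:
    """Stand-in summarizer: keep the first n words. A real system would call an LLM."""
    return " ".join(text.split()[:n])


class TieredMemory:
    """Keeps the newest items verbatim and compresses anything older."""

    def __init__(self, verbatim_limit: int = 3, summarize=first_words):
        self.verbatim_limit = verbatim_limit
        self.summarize = summarize
        self.recent = []     # newest items, word for word
        self.summaries = []  # compressed older items, oldest first

    def add(self, item: str) -> None:
        self.recent.append(item)
        # Demote the oldest verbatim items once the limit is exceeded.
        while len(self.recent) > self.verbatim_limit:
            oldest = self.recent.pop(0)
            self.summaries.append(self.summarize(oldest))

    def context(self) -> list:
        """Summaries first, then verbatim items, in chronological order."""
        return self.summaries + self.recent


memory = TieredMemory(verbatim_limit=2)
for turn in [
    "User said their name is Ada Lovelace",
    "User asked about pricing for the team plan",
    "User prefers email over phone",
    "User wants a demo on Friday",
]:
    memory.add(turn)

print(memory.context())
```

The fidelity trade-off is visible in the output: the two oldest turns survive only as shortened summaries.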

Episodic Buffer

Ring or queue buffers of discrete episodes (turns, sessions, observations) with explicit recency, importance, and decay operators. Closer to a transcript log than a search index. Common in agent frameworks that need ordered playback.
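A minimal sketch of such a buffer is shown below, assuming a fixed capacity, a caller-assigned importance between 0 and 1, and exponential recency decay with a configurable half-life. The class name, fields, and scoring formula are illustrative choices, not taken from any particular framework.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Episode:
    text: str
    step: int          # logical timestamp (turn counter)
    importance: float  # 0.0 to 1.0, assigned by the caller


class EpisodicBuffer:
    def __init__(self, capacity: int = 4, half_life: float = 10.0):
        # deque with maxlen acts as a ring buffer: the oldest episode drops off.
        self.episodes = deque(maxlen=capacity)
        self.half_life = half_life
        self.step = 0

    def record(self, text: str, importance: float = 0.5) -> None:
        self.step += 1
        self.episodes.append(Episode(text, self.step, importance))

    def salience(self, episode: Episode) -> float:
        """Importance scaled by exponential recency decay."""
        age = self.step - episode.step
        return episode.importance * 0.5 ** (age / self.half_life)

    def replay(self) -> list:
        """Ordered playback, oldest to newest."""
        return [e.text for e in self.episodes]

    def most_salient(self, k: int = 1) -> list:
        ranked = sorted(self.episodes, key=self.salience, reverse=True)
        return [e.text for e in ranked[:k]]


buffer = EpisodicBuffer(capacity=3, half_life=2.0)
buffer.record("user name is Ada", importance=1.0)
buffer.record("small talk", importance=0.1)
buffer.record("weather chat", importance=0.1)
salient_before = buffer.most_salient()

buffer.record("asked for a refund", importance=0.8)  # evicts the oldest episode
print(buffer.replay())
```

Note that capacity-based eviction here ignores importance, so a highly important episode can still fall off the end; a real buffer might consult salience before dropping anything.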

Agentic Workflow

LLM-orchestrated read/write/reflect loops in which the model itself decides when to query, update, or summarize memory. The MemGPT-style 'memory as a tool' pattern. Highest flexibility, highest cost, hardest to evaluate.

KV Cache Extension

Architectural tricks that extend or compress the attention KV cache directly, rather than building an external store. Includes paging, eviction, sparsity, and learned compression. Lives in the model runtime, not the application layer.
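Eviction is the simplest of these techniques to illustrate. The sketch below models a cache with a fixed budget that drops the oldest entries while optionally protecting the first few positions. It is a plain-Python toy with placeholder strings instead of tensors, and the class and parameter names are hypothetical; real implementations operate on per-layer key and value tensors inside the inference runtime.

```python
class SlidingWindowKVCache:
    """Toy KV cache with a fixed entry budget and oldest-first eviction."""

    def __init__(self, window: int, keep_first: int = 0):
        if not 0 <= keep_first < window:
            raise ValueError("keep_first must be in [0, window)")
        self.window = window          # maximum number of cached positions
        self.keep_first = keep_first  # leading positions that are never evicted
        self.entries = []             # (position, key, value), in arrival order

    def append(self, position: int, key, value) -> None:
        self.entries.append((position, key, value))
        if len(self.entries) > self.window:
            # Evict the oldest entry that lies outside the protected prefix.
            del self.entries[self.keep_first]

    def positions(self) -> list:
        return [position for position, _, _ in self.entries]


cache = SlidingWindowKVCache(window=4, keep_first=1)
for pos in range(7):
    cache.append(pos, f"k{pos}", f"v{pos}")  # placeholders for real tensors

print(cache.positions())
```

After seven tokens the cache holds position 0 plus the three most recent positions, so memory use stays constant regardless of sequence length.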

External Memory Network

Learned read/write heads attached to a slot-based memory store. The Differentiable Neural Computer (DNC) and Neural Turing Machine lineage. Largely a research direction rather than a production pattern, but resurfaces inside hybrid systems.

Knowledge Base

Structured stores — triple stores, relational schemas, document databases — used as authoritative memory rather than as retrieval material. The LLM queries or updates the KB through a tool layer.

Hybrid

Two or more of the above patterns surfaced behind a single API. Most production-grade memory systems converge here: vector retrieval for breadth, graph for relations, structured store for personalization, summaries for the long tail.

Team

Kanishk Patel

Author

Project: Memory Systems Taxonomy and Benchmarking (MSTB). A comprehensive effort to map, categorize, and evaluate the rapidly evolving AI memory landscape.

Contributing

Memory Arena is open source. We welcome contributions of new system entries, benchmark data, bug fixes, and feature ideas.

View on GitHub

FAQ

What is a memory system?

A memory system is the layer that gives an AI application persistence — the ability to remember facts, conversations, user preferences, or task state across turns and sessions. Memory systems span agent memory layers, RAG retrieval over corpora, personalization stores, long-context windows, knowledge graphs, episodic session buffers, and lifelong-learning stores.

Are the benchmark scores real?

Some are; most are not yet. Each score shows a provenance badge. 'Self-Reported' scores come from vendor papers and documentation. 'Estimated' scores are algorithmic approximations derived from capability profiles; they are NOT from real benchmark runs. We are working to replace estimates with verified data.

What does 'Estimated' mean?

Estimated scores are derived from a system's capability profile (retrieval accuracy, long-range recall, etc.) using weighted mappings to benchmark domains. They provide a rough directional comparison but should not be treated as real benchmark results. We show them to provide a comprehensive catalog while being transparent that they are approximations.

How often is the data updated?

We update scores as new papers and evaluations are published. Our goal is to progressively replace estimated scores with verified data.

Is the data available programmatically?

Yes. All data is available as static JSON files through our public API. See the API Documentation page for endpoint details and example responses.