About

The Project

Memory Arena is a unified benchmark tracker for AI memory systems. It brings together evaluation data from across the memory landscape — agent memory layers, RAG retrieval, personalization stores, long-context extension, knowledge graphs, episodic session buffers, and lifelong-learning systems.

The memory landscape is fragmented across dozens of products and benchmarks, each measuring different capabilities with different metrics, scales, and evaluation methodologies. Memory Arena normalizes and aggregates this data into a single, navigable interface for researchers, engineers, and decision-makers.

Whether you are choosing a memory system for a downstream application, benchmarking your own research, or surveying the state of the art, the Arena gives you a complete picture in one place.

Methodology

Score Collection

Scores are sourced from published papers, official leaderboards, and verified community reproductions. Every score entry includes the source reference and collection date.

Verification

Self-reported scores are cross-referenced against independent evaluations when available. Discrepancies are flagged and documented. We prioritize reproducible evaluations over single-source claims.

Normalization

Percentile rankings are computed across all systems for each benchmark. The elevation tier system (Summit, Peak, Plateau, Valley, Fault, Uncharted) maps percentiles to intuitive topographic categories, reflecting the terrain metaphor of the Arena.
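The percentile-to-tier mapping can be sketched as a simple threshold lookup. The cut-offs below are illustrative assumptions, not the Arena's published thresholds; only the tier names come from the text above.

```python
# Hypothetical sketch: the percentile cut-offs below are illustrative only.
# Tier names (Summit .. Uncharted) are the Arena's; the numbers are assumptions.
TIERS = [
    (90.0, "Summit"),   # top decile
    (75.0, "Peak"),
    (50.0, "Plateau"),
    (25.0, "Valley"),
    (0.0,  "Fault"),
]

def elevation_tier(percentile):
    """Map a benchmark percentile (0-100) to a topographic tier name."""
    if percentile is None:          # no verified score on record
        return "Uncharted"
    for cutoff, name in TIERS:
        if percentile >= cutoff:
            return name
    return "Fault"
```

A system with no verified score maps to Uncharted rather than to the bottom tier, so "unmeasured" stays distinct from "measured and low".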

Updates

Major benchmarks are refreshed quarterly. New system entries are added within one week of publication. Community contributions are reviewed via GitHub pull requests.

Architecture Taxonomy

Memory systems span nine architecture families, each with distinct strengths and trade-offs.

Vector RAG

Dense embedding stores queried by approximate nearest-neighbor search. The default RAG pattern: chunk the corpus, embed the chunks, store them in a vector index, retrieve the top-k at query time, and insert them into the prompt. The foundation for most production memory systems today.
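The chunk-embed-store-retrieve loop can be sketched in a few lines. This is a toy: a real system would use a learned embedding model and an ANN index (e.g. HNSW), so bag-of-words vectors and brute-force cosine similarity stand in for both here, and the corpus strings are invented.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a learned embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. chunk the corpus, 2. embed, 3. store in an index
corpus = [
    "episodic buffers keep an ordered log of agent turns",
    "vector stores answer queries by nearest-neighbor search",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# 4. retrieve the top-k chunks at query time, ready to insert into the prompt
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```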

Graph RAG

Extracts entities and relations from the corpus into a property graph, then queries the graph (often alongside a vector index) at retrieval time. Strong on multi-hop and explanatory questions where path traversal matters more than nearest-neighbor similarity.
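The multi-hop strength comes from treating questions as path traversals over extracted triples. A minimal sketch, using invented example triples (the entity and relation names are hypothetical, not from any Arena dataset):

```python
from collections import deque

# Entities and relations extracted from a corpus become graph edges.
# These triples are illustrative placeholders.
triples = [
    ("MemSys-A", "builds_on", "Vector RAG"),
    ("Vector RAG", "evaluated_by", "HotpotQA"),
    ("MemSys-B", "builds_on", "Graph RAG"),
]

graph = {}
for head, rel, tail in triples:
    graph.setdefault(head, []).append((rel, tail))

def find_path(start, goal):
    """Breadth-first search returning the relation path linking two entities."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no connecting path
```

A nearest-neighbor index has no notion of this two-edge chain; the graph makes the intermediate hop explicit and explainable.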

Hierarchical Summary

Tiered memory layers in which recent items are kept verbatim and older items are progressively summarized. Inspired by human episodic-to-semantic compression. Trades some fidelity for unbounded effective context.
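The verbatim-then-summarized tiering can be sketched with two tiers and a stubbed summarizer. The `summarize` function here is a toy stand-in for an LLM summarization call, and the window size is an arbitrary assumption:

```python
def summarize(old_summary, evicted_item):
    # Stand-in for an LLM call: naive truncation as toy "compression".
    return f"{old_summary} | {evicted_item[:20]}"

class TieredMemory:
    def __init__(self, verbatim_window=3):
        self.window = verbatim_window
        self.recent = []          # verbatim tier: newest items, kept exactly
        self.summary = ""         # compressed tier: everything older

    def add(self, item):
        self.recent.append(item)
        while len(self.recent) > self.window:
            evicted = self.recent.pop(0)   # oldest item leaves the verbatim tier
            self.summary = summarize(self.summary, evicted)

    def context(self):
        """Bounded-size view: summary of the distant past plus recent verbatim items."""
        return [self.summary] + self.recent
```

The context stays bounded no matter how many items arrive, which is the "unbounded effective context" trade: old detail is lossy, but never dropped entirely.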

Episodic Buffer

Ring or queue buffers of discrete episodes (turns, sessions, observations) with explicit recency, importance, and decay operators. Closer to a transcript log than a search index. Common in agent frameworks that need ordered playback.
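The recency, importance, and decay operators can be sketched as a retrieval score over timestamped episodes. The exponential half-life and the multiplicative blend below are illustrative choices, not anything the Arena specifies, and the episode contents are invented:

```python
import math

HALF_LIFE = 10.0  # assumed: time units until an episode's recency weight halves

def recency(now, timestamp):
    # Exponential decay: weight 1.0 when fresh, 0.5 after one half-life.
    return 0.5 ** ((now - timestamp) / HALF_LIFE)

def score(episode, now):
    # Blend recency with a per-episode importance weight.
    return recency(now, episode["t"]) * episode["importance"]

episodes = [
    {"t": 0,  "importance": 0.9, "text": "user prefers short answers"},
    {"t": 18, "importance": 0.4, "text": "asked about the weather"},
]

def top_episodes(now, k=1):
    return sorted(episodes, key=lambda e: score(e, now), reverse=True)[:k]
```

Because episodes keep their timestamps and order, the same buffer also supports the ordered playback that agent frameworks need, unlike a similarity index.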

Agentic Workflow

LLM-orchestrated read/write/reflect loops in which the model itself decides when to query, update, or summarize memory. The MemGPT-style 'memory as a tool' pattern. Highest flexibility, highest cost, hardest to evaluate.

KV Cache Extension

Architectural tricks that extend or compress the attention KV cache directly, rather than building an external store. Includes paging, eviction, sparsity, and learned compression. Lives in the model runtime, not the application layer.

External Memory Network

Learned read/write heads attached to a slot-based memory store. The Differentiable Neural Computer (DNC) and Neural Turing Machine lineage. Largely a research direction rather than a production pattern, but resurfaces inside hybrid systems.

Knowledge Base

Structured stores — triple stores, relational schemas, document databases — used as authoritative memory rather than as retrieval material. The LLM queries or updates the KB through a tool layer.

Hybrid

Two or more of the above patterns surfaced behind a single API. Most production-grade memory systems converge here: vector retrieval for breadth, graph for relations, structured store for personalization, summaries for the long tail.

Team

Kanishk Patel

Author

Project: Memory Systems Taxonomy and Benchmarking (MSTB). A comprehensive effort to map, categorize, and evaluate the rapidly evolving AI memory landscape.

Contributing

Memory Arena is open source. We welcome contributions of new system entries, benchmark data, bug fixes, and feature ideas.

View on GitHub

FAQ

What is a memory system?

A memory system is the layer that gives an AI application persistence — the ability to remember facts, conversations, user preferences, or task state across turns and sessions. Memory systems span agent memory layers, RAG retrieval over corpora, personalization stores, long-context windows, knowledge graphs, episodic session buffers, and lifelong-learning stores.

How are benchmark scores collected?

Scores come from three sources: (1) official papers and technical reports, (2) community reproductions and evaluations, and (3) leaderboard scrapes from benchmark websites. Each score entry is tagged with its source and date for full traceability.

How often is the data updated?

We aim to update scores quarterly for major benchmarks, and within one week of new releases for widely tracked memory systems. The update frequency for each benchmark is listed on its detail page.

Can I submit my own memory system?

Yes. Use the Submit page to either self-report your scores via a GitHub Issue (Track A), or sign up to have our team evaluate your system across the full benchmark suite (Track B, coming soon).

Is the data available programmatically?

Yes. All data is available as static JSON files through our public API. See the API Documentation page for endpoint details and example responses. A full REST API (V2) is planned.