Benchmark Catalog
All benchmarks used to evaluate AI memory systems, grouped by category. Click any benchmark to see detailed information and system rankings.
Long-Context Retrieval (5 benchmarks)

RULER
automatic · RULER: What's the Real Context Size of Your Long-Context Language Models?
COLM 2024 · 13 dims · 71 systems
long-context
NIAH
automatic · Needle in a Haystack
Open-source benchmark · 1 dim · 13 systems
long-context
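The needle-in-a-haystack setup is simple enough to sketch: a short "needle" fact is buried at a chosen depth inside long filler text, the model is asked to recall it, and the answer is scored by substring match. This is a minimal illustration only; the helper names and scoring rule here are illustrative, not taken from the benchmark's actual codebase.

```python
def make_haystack(filler_sentences, needle, depth):
    """Insert a needle fact at a relative depth (0.0 = start, 1.0 = end)."""
    idx = int(depth * len(filler_sentences))
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)

def score(answer, expected):
    """Crude recall check: 1 if the expected fact appears in the answer."""
    return int(expected.lower() in answer.lower())

# Build a mid-depth haystack from repetitive filler text.
filler = ["The sky was clear that day."] * 1000
needle = "The secret passcode is 7421."
context = make_haystack(filler, needle, depth=0.5)
# In a real run, `context` plus the question "What is the secret passcode?"
# would be sent to the model, and its reply passed to score(reply, "7421").
```

Varying `depth` and the total context length produces the familiar retrieval heatmap that NIAH-style evaluations report.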
LooGLE
automatic · LooGLE: Can Long-Context Language Models Understand Long Contexts?
ACL 2024 · 7 dims · 22 systems
long-context
LongBench
automatic · LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
ACL 2024 · 21 dims · 121 systems
long-context
∞Bench
automatic · InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens
ACL 2024 · 12 dims · 32 systems
long-context
Multi-Turn Recall (2 benchmarks)

Cross-Session Memory (1 benchmark)

Multi-Hop QA (3 benchmarks)

BABILong
automatic · BABILong: Testing the Limits of LLMs with Long-Context Reasoning-in-a-Haystack
NeurIPS 2024 · 20 dims · 69 systems
long-context, agent-memory
MultiHop-RAG
automatic · MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
COLM 2024 · 4 dims · 150 systems
rag-retrieval, knowledge-graph
HotpotQA
automatic · HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
EMNLP 2018 · 2 dims · 188 systems
rag-retrieval, knowledge-graph