Benchmark Catalog
All benchmarks used to evaluate AI memory systems, grouped by category. Click any benchmark to see detailed information and system rankings.
Long-Context Retrieval (5 benchmarks)

RULER
automatic · RULER: What's the Real Context Size of Your Long-Context Language Models?
COLM 2024 · 13 dims · 71 systems
long-context
NIAH
automatic · Needle in a Haystack
Open-source benchmark · 1 dim · 13 systems
long-context
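The needle-in-a-haystack setup is simple enough to sketch: a short "needle" fact is buried at a chosen depth inside long filler text, the model is asked to recall it, and the answer is scored by substring match. This is a minimal illustration only; the helper names and scoring rule here are illustrative, not taken from the benchmark's actual codebase.

```python
def make_haystack(filler_sentences, needle, depth):
    """Insert a needle fact at a relative depth (0.0 = start, 1.0 = end)."""
    idx = int(depth * len(filler_sentences))
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)

def score(answer, expected):
    """Crude recall check: 1 if the expected fact appears in the answer."""
    return int(expected.lower() in answer.lower())

# Build a mid-depth haystack from repetitive filler text.
filler = ["The sky was clear that day."] * 1000
needle = "The secret passcode is 7421."
context = make_haystack(filler, needle, depth=0.5)
# In a real run, `context` plus the question "What is the secret passcode?"
# would be sent to the model, and its reply passed to score(reply, "7421").
```

Varying `depth` and the total context length produces the familiar retrieval heatmap that NIAH-style evaluations report.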
LooGLE
automatic · LooGLE: Can Long-Context Language Models Understand Long Contexts?
ACL 2024 · 7 dims · 22 systems
long-context
LongBench
automatic · LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
ACL 2024 · 21 dims · 121 systems
long-context
∞Bench
automatic · InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens
ACL 2024 · 12 dims · 32 systems
long-context
Multi-Turn Recall (2 benchmarks)

Cross-Session Memory (1 benchmark)

Multi-Hop QA (3 benchmarks)

BABILong
automatic · BABILong: Testing the Limits of LLMs with Long-Context Reasoning-in-a-Haystack
NeurIPS 2024 · 20 dims · 69 systems
long-context, agent-memory
MultiHop-RAG
automatic · MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
COLM 2024 · 4 dims · 150 systems
rag-retrieval, knowledge-graph
HotpotQA
automatic · HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
EMNLP 2018 · 2 dims · 188 systems
rag-retrieval, knowledge-graph