BABILong
BABILong: Testing the Limits of LLMs with Long-Context Reasoning-in-a-Haystack
Benchmark Metadata
Publisher: AIRI
Venue: NeurIPS 2024
Evaluation Type: automatic
Dimensions: 20
Test Prompts: 1,000
Scoring: Higher is better
Update Frequency: annual
What It Measures
- Multi-fact reasoning under distractors
- Single supporting-fact recall
- Two- and three-supporting-fact reasoning
- Counting and list-tracking
- Spatial and temporal relations
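To make the "reasoning-in-a-haystack" setup concrete, the following is a minimal sketch of how such a test item could be assembled and scored: a few supporting facts are scattered, in order, among unrelated filler sentences, and a model's reply is checked against the gold answer. The function names (`build_sample`, `accuracy`), the filler sentences, and the substring-match scoring rule are all assumptions made for illustration; they are not taken from the benchmark's own code.

```python
import random


def build_sample(facts, question, answer, filler_sentences, n_filler, seed=0):
    """Hide supporting facts, in their original order, among filler sentences."""
    rng = random.Random(seed)
    # Sorted insertion slots keep the facts in their original relative order.
    slots = sorted(rng.randrange(n_filler + 1) for _ in facts)
    haystack = [rng.choice(filler_sentences) for _ in range(n_filler)]
    # Insert from the last fact backwards so earlier slot indices stay valid.
    for slot, fact in reversed(list(zip(slots, facts))):
        haystack.insert(slot, fact)
    return {"context": " ".join(haystack), "question": question, "answer": answer}


def accuracy(predictions, samples):
    """Percentage of predictions that contain the gold answer (hypothetical rule)."""
    hits = sum(s["answer"].lower() in p.lower() for p, s in zip(predictions, samples))
    return 100.0 * hits / len(samples)


# Hypothetical two-supporting-fact item; the filler text is invented.
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
filler = [
    "The rain had not stopped for three days.",
    "A carriage rolled slowly past the old mill.",
    "Nobody in the village remembered the stranger's name.",
]
sample = build_sample(facts, "Where is the apple?", "kitchen", filler, n_filler=20)
```

Raising `n_filler` lengthens the context without changing the question, which is how a single item can probe the same reasoning skill at different context lengths.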
What It Does Not Measure
- Generation quality
- Open-ended dialogue
- Code understanding
All Systems Evaluated (69 systems)
| Rank | System | Organization | Score |
|---|---|---|---|
| #1 | Jina AI Embeddings | Jina AI GmbH | 83.6 |
| #2 | Mamba | CMU / Princeton (Gu, Dao) | 82.5 |
| #3 | Titans | lucidrains (community) / paper by Google Research | 82.4 |
| #4 | MemoryLLM | UCSD / Apple (Wang et al.) | 82.1 |
| #5 | SCM | Beihang / NLPR (Wang et al.) | 81.3 |
| #6 | Memorizing Transformer | Google Research (Wu, Rabe, Hutchins, Szegedy) | 80.3 |
| #7 | Recurrent Memory Transformer | MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev) | 80.3 |
| #8 | GAM | VectorSpaceLab (BAAI-related) | 80 |
| #9 | Backboard IO | Backboard.io | 79.9 |
| #10 | ReMe | ModelScope (Alibaba) | 79.8 |
| #11 | HippoRAG 2 | OSU NLP Group | 79.4 |
| #12 | EM-LLM | em-llm (academic consortium) | 78.8 |
| #13 | Voyager | NVIDIA / Caltech / UT Austin / Stanford / ASU / UW (Wang et al.) | 78.4 |
| #14 | memU | NevaMind-AI | 78.3 |
| #15 | kNN-LM | Stanford / Facebook AI Research (Khandelwal et al.) | 78.2 |
| #16 | RecallM | Cisco Research / independent (Kynoch & Latapie) | 78.1 |
| #17 | MoT | Fudan University (Li & Qiu) | 77.9 |
| #18 | Generative Agents | Stanford University / Google Research | 77.2 |
| #19 | RWKV | RWKV Foundation / BlinkDL community | 77.2 |
| #20 | Kindroid | Kindroid | 77.1 |
| #21 | ArcMemo | UC Berkeley / Stanford (Ho et al.) | 77 |
| #22 | Charlie Mnemonic | GoodAI | 77 |
| #23 | Adept AI | Adept AI Labs (acquired by Amazon 2024) | 76.9 |
| #24 | Memary | Kingjulio8238 | 76.8 |
| #25 | Paradot | WithFeeling.AI | 76.8 |
| #26 | Pickle AI | Soul Computer (YC-backed) | 76.3 |
| #27 | Agent Workflow Memory | CMU (Wang, Mao, Fried, Neubig) | 76 |
| #28 | Nomi AI | Glimpse AI, Inc. | 76 |
| #29 | RAPTOR | Stanford (Sarthi, Abdullah et al.) | 76 |
| #30 | R3Mem | HKUST (2025) | 75.8 |
| #31 | Larimar | IBM Research | 75.7 |
| #32 | AgentScope | ModelScope (Alibaba) | 75.6 |
| #33 | Scissorhands | Rice / Stanford / Meta (Liu et al.) | 75.6 |
| #34 | ∞ Former | Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins) | 75.1 |
| #35 | Lyzr Cognis | Lyzr AI | 74.9 |
| #36 | Memformer | UC Santa Barbara / Amazon (Wu, Lan, Liu, et al.) | 74.9 |
| #37 | TRIME | Princeton NLP (Zhong, Lei, Chen) | 74.4 |
| #38 | Memp | Zhejiang University (Fang et al.) | 74.3 |
| #39 | OS-Copilot / FRIDAY | Shanghai AI Lab / MMLab (Wu et al.) | 74.2 |
| #40 | Compressive Transformer | DeepMind (Rae et al.) | 74.1 |
| #41 | H2O | UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.) | 73.9 |
| #42 | LM-Infinite | Illinois / Meta (Han et al.) | 73.8 |
| #43 | Mnemosyne | Johns Hopkins / independent (2025) | 73.7 |
| #44 | A-MEM | AGI Research / Rutgers | 73.6 |
| #45 | KnowAgent | zjunlp (Zhejiang University) | 73.5 |
| #46 | Replika | Luka, Inc. | 73.3 |
| #47 | Landmark Attention | EPFL (Mohtashami, Jaggi) | 73.2 |
| #48 | Personal AI | Personal AI | 73.2 |
| #49 | BabyAGI | Yohei Nakajima | 73.1 |
| #50 | Memory³ | Institute for Advanced Algorithms Research Shanghai / Peking University | 73 |
| #51 | HEMA | independent (Ahn et al.) | 72.9 |
| #52 | Reflexion | Northeastern / MIT / Princeton (Shinn et al.) | 72.7 |
| #53 | LongMem | UCSB / Microsoft Research | 72.6 |
| #54 | Bee Computer | Bee (acquired by Amazon 2026) | 72.5 |
| #55 | Second Me | Mindverse (Shang, Li, et al.) | 72.2 |
| #56 | MemOS | MemTensor (Li, Zhang, et al.) | 72.1 |
| #57 | Tab AI | Tab (Avi Schiffmann) | 72.1 |
| #58 | JARVIS-1 | CraftJarvis | 71.6 |
| #59 | Generative Agents | Stanford / Google | 71.5 |
| #60 | CrewAI Enterprise | CrewAI Inc. | 71.4 |
| #61 | ExpeL | Tsinghua University (Zhao et al.) | 71.2 |
| #62 | ICAE | Microsoft Research (Ge et al.) | 70.8 |
| #63 | HippoRAG | OSU NLP Group (Ohio State University) | 70.3 |
| #64 | LanceDB | LanceDB Inc. (YC S22) | 65.8 |
| #65 | Activeloop Deep Lake | Activeloop Inc. | 62.6 |
| #66 | Activation Beacon | BAAI / Renmin University (Zhang et al.) | 61.3 |
| #67 | MemoryBank | Institute of Software, Chinese Academy of Sciences | 59.7 |
| #68 | StreamingLLM | MIT Han Lab / Meta AI (Xiao et al.) | 59.3 |
| #69 | Heyday AI | Heyday (shut down 2025) | 58.7 |