
BABILong

BABILong: Testing the Limits of LLMs with Long-Context Reasoning-in-a-Haystack

Benchmark Metadata

Publisher: AIRI
Venue: NeurIPS 2024
Evaluation Type: automatic
Dimensions: 20
Test Prompts: 1,000
Scoring: Higher is better
Update Frequency: annual

What It Measures

  • Multi-fact reasoning under distractors
  • Single supporting-fact recall
  • Two- and three-supporting-fact reasoning
  • Counting and list-tracking
  • Spatial and temporal relations
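The "reasoning-in-a-haystack" setup behind these tasks can be illustrated with a minimal sketch: a few supporting facts are scattered, in order, among unrelated filler sentences, and the model's answer is checked automatically. The function names, the filler text, and the case-insensitive exact-match rule below are assumptions made for illustration; this page does not state the benchmark's actual construction or scoring code.

```python
import random


def build_haystack_sample(facts, question, answer, filler_sentences, n_filler, seed=0):
    """Scatter supporting facts, in their original order, among filler sentences.

    Hypothetical helper for illustration only, not the benchmark's own code.
    """
    rng = random.Random(seed)
    # Draw the distractor text ("haystack") from the filler pool.
    context = [rng.choice(filler_sentences) for _ in range(n_filler)]
    # Sorted insertion slots keep the facts in their original relative order.
    slots = sorted(rng.randint(0, n_filler) for _ in facts)
    # Insert from the last fact backwards so earlier slot indices stay valid.
    for slot, fact in reversed(list(zip(slots, facts))):
        context.insert(slot, fact)
    return {"context": " ".join(context), "question": question, "answer": answer}


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to the reference (case-insensitive).

    Exact match is an assumed scoring rule for this sketch.
    """
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# A two-supporting-fact style example with made-up filler text.
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary travelled to the garden.",
]
filler = [
    "The harbour was quiet that morning.",
    "Rain fell steadily on the old roof.",
    "A clock ticked somewhere down the hall.",
]
sample = build_haystack_sample(facts, "Where is the apple?", "garden", filler, n_filler=20)
```

Raising n_filler lengthens the context without changing the reasoning needed, which is what lets this kind of test separate long-context handling from task difficulty.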

What It Does Not Measure

  • Generation quality
  • Open-ended dialogue
  • Code understanding

All Systems Evaluated (69 systems)

Rank | System | Organization | Score
#1 | Jina AI Embeddings | Jina AI GmbH | 83.6
#2 | Mamba | CMU / Princeton (Gu, Dao) | 82.5
#3 | Titans | lucidrains (community) / paper by Google Research | 82.4
#4 | MemoryLLM | UCSD / Apple (Wang et al.) | 82.1
#5 | SCM | Beihang / NLPR (Wang et al.) | 81.3
#6 | Memorizing Transformer | Google Research (Wu, Rabe, Hutchins, Szegedy) | 80.3
#7 | Recurrent Memory Transformer | MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev) | 80.3
#8 | GAM | VectorSpaceLab (BAAI-related) | 80
#9 | Backboard IO | Backboard.io | 79.9
#10 | ReMe | ModelScope (Alibaba) | 79.8
#11 | HippoRAG 2 | OSU NLP Group | 79.4
#12 | EM-LLM | em-llm (academic consortium) | 78.8
#13 | Voyager | NVIDIA / Caltech / UT Austin / Stanford / ASU / UW (Wang et al.) | 78.4
#14 | memU | NevaMind-AI | 78.3
#15 | kNN-LM | Stanford / Facebook AI Research (Khandelwal et al.) | 78.2
#16 | RecallM | Cisco Research / independent (Kynoch & Latapie) | 78.1
#17 | MoT | Fudan University (Li & Qiu) | 77.9
#18 | Generative Agents | Stanford University / Google Research | 77.2
#19 | RWKV | RWKV Foundation / BlinkDL community | 77.2
#20 | Kindroid | Kindroid | 77.1
#21 | ArcMemo | UC Berkeley / Stanford (Ho et al.) | 77
#22 | Charlie Mnemonic | GoodAI | 77
#23 | Adept AI | Adept AI Labs (acquired by Amazon 2024) | 76.9
#24 | Memary | Kingjulio8238 | 76.8
#25 | Paradot | WithFeeling.AI | 76.8
#26 | Pickle AI | Soul Computer (YC-backed) | 76.3
#27 | Agent Workflow Memory | CMU (Wang, Mao, Fried, Neubig) | 76
#28 | Nomi AI | Glimpse AI, Inc. | 76
#29 | RAPTOR | Stanford (Sarthi, Abdullah et al.) | 76
#30 | R3Mem | HKUST (2025) | 75.8
#31 | Larimar | IBM Research | 75.7
#32 | AgentScope | ModelScope (Alibaba) | 75.6
#33 | Scissorhands | Rice / Stanford / Meta (Liu et al.) | 75.6
#34 | ∞ Former | Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins) | 75.1
#35 | Lyzr Cognis | Lyzr AI | 74.9
#36 | Memformer | UC Santa Barbara / Amazon (Wu, Lan, Liu, et al.) | 74.9
#37 | TRIME | Princeton NLP (Zhong, Lei, Chen) | 74.4
#38 | Memp | Zhejiang University (Fang et al.) | 74.3
#39 | OS-Copilot / FRIDAY | Shanghai AI Lab / MMLab (Wu et al.) | 74.2
#40 | Compressive Transformer | DeepMind (Rae et al.) | 74.1
#41 | H2O | UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.) | 73.9
#42 | LM-Infinite | Illinois / Meta (Han et al.) | 73.8
#43 | Mnemosyne | Johns Hopkins / independent (2025) | 73.7
#44 | A-MEM | AGI Research / Rutgers | 73.6
#45 | KnowAgent | zjunlp (Zhejiang University) | 73.5
#46 | Replika | Luka, Inc. | 73.3
#47 | Landmark Attention | EPFL (Mohtashami, Jaggi) | 73.2
#48 | Personal AI | Personal AI | 73.2
#49 | BabyAGI | Yohei Nakajima | 73.1
#50 | Memory³ | Institute for Advanced Algorithms Research Shanghai / Peking University | 73
#51 | HEMA | independent (Ahn et al.) | 72.9
#52 | Reflexion | Northeastern / MIT / Princeton (Shinn et al.) | 72.7
#53 | LongMem | UCSB / Microsoft Research | 72.6
#54 | Bee Computer | Bee (acquired by Amazon 2026) | 72.5
#55 | Second Me | Mindverse (Shang, Li, et al.) | 72.2
#56 | MemOS | MemTensor (Li, Zhang, et al.) | 72.1
#57 | Tab AI | Tab (Avi Schiffmann) | 72.1
#58 | JARVIS-1 | CraftJarvis | 71.6
#59 | Generative Agents | Stanford / Google | 71.5
#60 | CrewAI Enterprise | CrewAI Inc. | 71.4
#61 | ExpeL | Tsinghua University (Zhao et al.) | 71.2
#62 | ICAE | Microsoft Research (Ge et al.) | 70.8
#63 | HippoRAG | OSU NLP Group (Ohio State University) | 70.3
#64 | LanceDB | LanceDB Inc. (YC S22) | 65.8
#65 | Activeloop Deep Lake | Activeloop Inc. | 62.6
#66 | Activation Beacon | BAAI / Renmin University (Zhang et al.) | 61.3
#67 | MemoryBank | Institute of Software, Chinese Academy of Sciences | 59.7
#68 | StreamingLLM | MIT Han Lab / Meta AI (Xiao et al.) | 59.3
#69 | Heyday AI | Heyday (shut down 2025) | 58.7