BABILong

Name: BABILong: Testing the Limits of LLMs with Long-Context Reasoning-in-a-Haystack
Creator: AIRI
Keywords: multi-hop-qa, long-context, agent-memory

Benchmark Metadata

Publisher: AIRI
Venue: NeurIPS 2024
Evaluation Type: automatic
Dimensions: 20
Test Prompts: 1,000
Scoring: Higher is better
Update Frequency: annual


What It Measures

Multi-fact reasoning under distractors
Single supporting-fact recall
Two- and three-supporting-fact reasoning
Counting and list-tracking
Spatial and temporal relations
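
As a rough illustration of the reasoning-in-a-haystack setup named in the title, the sketch below scatters bAbI-style supporting facts among distractor sentences and scores predictions by checking for the gold answer. The function names `build_haystack_sample` and `answer_accuracy`, the filler sentences, and the substring-match scoring rule are assumptions made for illustration; they are not taken from the benchmark's own code.

```python
import random


def build_haystack_sample(facts, question, answer, filler, seed=0):
    """Scatter supporting facts, in their original order, among filler sentences."""
    rng = random.Random(seed)
    # Sorted insertion slots keep the facts in chronological order.
    slots = sorted(rng.randint(0, len(filler)) for _ in facts)
    context = list(filler)
    # Insert from the last slot backwards so earlier indices stay valid.
    for slot, fact in reversed(list(zip(slots, facts))):
        context.insert(slot, fact)
    return {"context": " ".join(context), "question": question, "answer": answer}


def answer_accuracy(predictions, samples):
    """Percentage of predictions that contain the gold answer (case-insensitive)."""
    hits = sum(
        sample["answer"].lower() in prediction.lower()
        for prediction, sample in zip(predictions, samples)
    )
    return 100.0 * hits / len(samples)


# Hypothetical three-supporting-fact example in the style of bAbI.
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary travelled to the garden.",
]
question = "Where is the apple?"
answer = "garden"

# Hypothetical distractor text; a real haystack would be far longer.
filler = [
    "It was a quiet morning in the village.",
    "The baker opened his shop before sunrise.",
    "A cart rolled slowly down the main road.",
    "Rain had fallen for most of the night.",
]

sample = build_haystack_sample(facts, question, answer, filler, seed=1)
print(sample["context"])
print(answer_accuracy(["The apple is in the garden."], [sample]))
```

Lengthening `filler` while keeping `facts` fixed is one way to vary the context length without changing the reasoning required, which is the axis a long-context benchmark of this kind stresses.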

What It Does Not Measure

Generation quality
Open-ended dialogue
Code understanding

All Systems Evaluated (68 systems)

All 68 scores are estimated.

Rank	System	Organization	Score	Provenance	Source
#1	Jina AI Embeddings	Jina AI GmbH	83.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#2	Mamba	CMU / Princeton (Gu, Dao)	82.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#3	Titans	lucidrains (community) / paper by Google Research	82.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#4	MemoryLLM	UCSD / Apple (Wang et al.)	82.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#5	SCM	Beihang / NLPR (Wang et al.)	81.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#6	Memorizing Transformer	Google Research (Wu, Rabe, Hutchins, Szegedy)	80.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#7	Recurrent Memory Transformer	MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev)	80.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#8	GAM	VectorSpaceLab (BAAI-related)	80.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#9	Backboard IO	Backboard.io	79.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#10	ReMe	ModelScope (Alibaba)	79.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#11	HippoRAG 2	OSU NLP Group	79.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#12	EM-LLM	em-llm (academic consortium)	78.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#13	Voyager	NVIDIA / Caltech / UT Austin / Stanford / ASU / UW (Wang et al.)	78.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#14	memU	NevaMind-AI	78.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#15	kNN-LM	Stanford / Facebook AI Research (Khandelwal et al.)	78.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#16	RecallM	Cisco Research / independent (Kynoch & Latapie)	78.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#17	MoT	Fudan University (Li & Qiu)	77.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#18	Generative Agents	Stanford University / Google Research	77.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#19	RWKV	RWKV Foundation / BlinkDL community	77.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#20	Kindroid	Kindroid	77.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#21	ArcMemo	UC Berkeley / Stanford (Ho et al.)	77.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#22	Charlie Mnemonic	GoodAI	77.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#23	Adept AI	Adept AI Labs (acquired by Amazon 2024)	76.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#24	Memary	Kingjulio8238	76.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#25	Paradot	WithFeeling.AI	76.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#26	Pickle AI	Soul Computer (YC-backed)	76.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#27	Agent Workflow Memory	CMU (Wang, Mao, Fried, Neubig)	76.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#28	Nomi AI	Glimpse AI, Inc.	76.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#29	RAPTOR	Stanford (Sarthi, Abdullah et al.)	76.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#30	R3Mem	HKUST (2025)	75.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#31	Larimar	IBM Research	75.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#32	AgentScope	ModelScope (Alibaba)	75.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#33	Scissorhands	Rice / Stanford / Meta (Liu et al.)	75.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#34	∞ Former	Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins)	75.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#35	Memformer	UC Santa Barbara / Amazon (Wu, Lan, Liu, et al.)	74.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#36	TRIME	Princeton NLP (Zhong, Lei, Chen)	74.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#37	Memp	Zhejiang University (Fang et al.)	74.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#38	OS-Copilot / FRIDAY	Shanghai AI Lab / MMLab (Wu et al.)	74.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#39	Compressive Transformer	DeepMind (Rae et al.)	74.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#40	H2O	UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.)	73.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#41	LM-Infinite	Illinois / Meta (Han et al.)	73.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#42	Mnemosyne	Johns Hopkins / independent (2025)	73.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#43	A-MEM	AGI Research / Rutgers	73.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#44	KnowAgent	zjunlp (Zhejiang University)	73.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#45	Replika	Luka, Inc.	73.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#46	Landmark Attention	EPFL (Mohtashami, Jaggi)	73.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#47	Personal AI	Personal AI	73.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#48	BabyAGI	Yohei Nakajima	73.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#49	Memory³	Institute for Advanced Algorithms Research Shanghai / Peking University	73.0	Estimated	Arena estimate: derived from capability profile, not independently verified
#50	HEMA	independent (Ahn et al.)	72.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#51	Reflexion	Northeastern / MIT / Princeton (Shinn et al.)	72.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#52	LongMem	UCSB / Microsoft Research	72.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#53	Bee Computer	Bee (acquired by Amazon 2026)	72.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#54	Second Me	Mindverse (Shang, Li, et al.)	72.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#55	MemOS	MemTensor (Li, Zhang, et al.)	72.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#56	Tab AI	Tab (Avi Schiffmann)	72.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#57	JARVIS-1	CraftJarvis	71.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#58	Generative Agents	Stanford / Google	71.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#59	CrewAI Enterprise	CrewAI Inc.	71.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#60	ExpeL	Tsinghua University (Zhao et al.)	71.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#61	ICAE	Microsoft Research (Ge et al.)	70.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#62	HippoRAG	OSU NLP Group (Ohio State University)	70.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#63	LanceDB	LanceDB Inc. (YC S22)	65.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#64	Activeloop Deep Lake	Activeloop Inc.	62.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#65	Activation Beacon	BAAI / Renmin University (Zhang et al.)	61.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#66	MemoryBank	Institute of Software, Chinese Academy of Sciences	59.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#67	StreamingLLM	MIT Han Lab / Meta AI (Xiao et al.)	59.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#68	Heyday AI	Heyday (shut down 2025)	58.7	Estimated	Arena estimate: derived from capability profile, not independently verified