∞Bench

Name: InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens
Creator: Tsinghua / OpenBMB
Keywords: long-context-retrieval, long-context

Benchmark Metadata

Publisher: Tsinghua / OpenBMB
Venue: ACL 2024
Evaluation Type: automatic
Dimensions: 12
Test Prompts: 3,946
Scoring: Higher is better
Update Frequency: annual

What It Measures

Retrieval at 100k+ tokens
Math and code over long contexts
Novel and dialogue QA
Key-value retrieval
Summarization over book-length input

What It Does Not Measure

Multi-session memory
Personalization
Real-time latency

All Systems Evaluated (32 systems)

2 self-reported, 30 estimated

Rank	System	Creator	Score	Provenance	Source
#1	EM-LLM	em-llm (academic consortium)	96.7	Self-Reported	arXiv:2407.09450 Table 1: EM-LLM (SM) on LLaMA 3.1-8B; avg of R.KV 90.2, R.PassKey 100, R.Number 100
#2	Titans	lucidrains (community) / paper by Google Research	87.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#3	LM-Infinite	Illinois / Meta (Han et al.)	85	Estimated	Arena estimate: derived from capability profile, not independently verified
#4	Mamba	CMU / Princeton (Gu, Dao)	85	Estimated	Arena estimate: derived from capability profile, not independently verified
#5	Scissorhands	Rice / Stanford / Meta (Liu et al.)	84.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#6	Jina AI Embeddings	Jina AI GmbH	83.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#7	Compressive Transformer	DeepMind (Rae et al.)	82.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#8	R3Mem	HKUST (2025)	81.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#9	Memorizing Transformer	Google Research (Wu, Rabe, Hutchins, Szegedy)	80.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#10	MemoryLLM	UCSD / Apple (Wang et al.)	80	Estimated	Arena estimate: derived from capability profile, not independently verified
#11	Landmark Attention	EPFL (Mohtashami, Jaggi)	79.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#12	TRIME	Princeton NLP (Zhong, Lei, Chen)	79.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#13	GAM	VectorSpaceLab (BAAI-related)	79	Estimated	Arena estimate: derived from capability profile, not independently verified
#14	RAPTOR	Stanford (Sarthi, Abdullah et al.)	79	Estimated	Arena estimate: derived from capability profile, not independently verified
#15	LongMem	UCSB / Microsoft Research	78.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#16	Memformer	UC Santa Barbara / Amazon (Wu, Lan, Liu, et al.)	78.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#17	∞ Former	Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins)	78.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#18	ICAE	Microsoft Research (Ge et al.)	77.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#19	Recurrent Memory Transformer	MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev)	77.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#20	Activation Beacon	BAAI / Renmin University (Zhang et al.)	77.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#21	HEMA	independent (Ahn et al.)	77.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#22	Memory³	Institute for Advanced Algorithms Research Shanghai / Peking University	77.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#23	H2O	UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.)	76.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#24	SCM	Beihang / NLPR (Wang et al.)	76.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#25	ReMe	ModelScope (Alibaba)	75	Estimated	Arena estimate: derived from capability profile, not independently verified
#26	Adept AI	Adept AI Labs (acquired by Amazon 2024)	73.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#27	RWKV	RWKV Foundation / BlinkDL community	72.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#28	AgentScope	ModelScope (Alibaba)	70.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#29	LanceDB	LanceDB Inc. (YC S22)	70.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#30	StreamingLLM	MIT Han Lab / Meta AI (Xiao et al.)	70.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#31	Activeloop Deep Lake	Activeloop Inc.	69.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#32	MemoRAG	BAAI / Qhjqhj00	24.5	Self-Reported	arXiv:2409.05591 Table 1: avg of MultiNews 26.3, GovReport 32.9, En.SUM 15.7, En.QA 22.9
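
The two self-reported scores are described in the Source column as averages of per-task results. The sketch below recomputes them, assuming the aggregation is a plain unweighted mean of the listed sub-task scores; the page does not state the exact formula, and the names `self_reported` and `macro_average` are hypothetical helpers introduced only for illustration.

```python
from statistics import mean

# Sub-task scores copied from the Source column of the two self-reported rows.
self_reported = {
    "EM-LLM": {"R.KV": 90.2, "R.PassKey": 100.0, "R.Number": 100.0},
    "MemoRAG": {"MultiNews": 26.3, "GovReport": 32.9, "En.SUM": 15.7, "En.QA": 22.9},
}


def macro_average(task_scores):
    """Unweighted mean of per-task scores (assumed aggregation, not confirmed by the page)."""
    return mean(task_scores.values())


for system, tasks in self_reported.items():
    print(f"{system}: {macro_average(tasks):.2f} over {len(tasks)} tasks")
```

Under that assumption the means come to roughly 96.73 and 24.45, which agree with the listed 96.7 and 24.5 after rounding to one decimal place (half-up for the latter). The two averages are taken over different task subsets, so they summarize different things.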