StreamingLLM
by MIT Han Lab / Meta AI (Xiao et al.)
System Card
Organization: MIT Han Lab / Meta AI (Xiao et al.)
Released: 2023-09
Architecture: kv-cache-extension / attention sinks + sliding-window KV cache
Details: Identifies the "attention sink" phenomenon: retaining the KV states of the initial tokens recovers window-attention performance. Combining these sink tokens with a sliding window of recent tokens yields stable, effectively infinite-length streaming at no training cost (see the cache-eviction sketch after this card).
Parameters: —
Domain: long-context
Open Source: Yes
Paper: arXiv:2309.17453
Code: https://github.com/mit-han-lab/streaming-llm
Tags: iclr-2024, attention-sink, streaming, infinite
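The cache-eviction rule described in the Details field is simple enough to sketch. Below is a minimal, illustrative PyTorch version of the sink-plus-sliding-window trim, assuming a HuggingFace-style list of per-layer (key, value) tensors shaped (batch, heads, seq_len, head_dim). The function name and the defaults (4 sink tokens; a 1750-token window echoing the "1750+1750" configuration cited in the sources below) are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch of the attention-sink + sliding-window KV-cache policy.
# Assumes HuggingFace-style per-layer (key, value) tensors shaped
# (batch, heads, seq_len, head_dim); names and defaults are illustrative.
from typing import List, Tuple
import torch


def evict_kv_cache(
    past_key_values: List[Tuple[torch.Tensor, torch.Tensor]],
    num_sink_tokens: int = 4,   # initial "attention sink" tokens kept permanently
    window_size: int = 1750,    # most recent tokens kept (sliding window)
) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    """Keep the first num_sink_tokens and last window_size KV entries per layer."""
    trimmed = []
    for keys, values in past_key_values:
        seq_len = keys.size(2)  # dim 2 is the sequence axis
        if seq_len <= num_sink_tokens + window_size:
            trimmed.append((keys, values))  # nothing to evict yet
            continue

        def keep(t: torch.Tensor) -> torch.Tensor:
            # Concatenate the sink prefix with the recent window, dropping the middle.
            return torch.cat([t[:, :, :num_sink_tokens], t[:, :, -window_size:]], dim=2)

        trimmed.append((keep(keys), keep(values)))
    return trimmed
```

One detail the sketch omits: the paper assigns positional encodings based on positions within the trimmed cache rather than positions in the original text, so decoding stays inside the model's trained position range.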
Capability Profile
Benchmark Scores
7 of 14 benchmarks
Multi-Turn Recall (0/2): LoCoMo (no data); MemoryBank (no data)
Cross-Session Memory (0/1): LongMemEval (no data)
Agent Task Memory (0/1): AgentBench-Mem (no data)
Personalization (0/1): PerLTQA (no data)
Factuality / Grounding (0/1): RAGAS (no data)
Sources:
- arXiv:2309.17453 Table 8 — StreamingLLM 1750+1750 on Llama2-7B-chat; avg of NarrativeQA 18.2, Qasper 19.7, HotpotQA 24.9, 2WikiMQA 32.0, GovReport 26.3, MultiNews 25.9
- arXiv:2309.17453 Table 8 — StreamingLLM 1750+1750 on HotpotQA subset of LongBench
- StreamingLLM paper (arXiv:2309.17453); evaluated on BABILong: Testing the Limits of LLMs with Long-Context Reasoning-in-a-Haystack (AIRI, 2406)
- StreamingLLM paper (arXiv:2309.17453); evaluated on InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens (Tsinghua / OpenBMB, 2402)
- StreamingLLM paper (arXiv:2309.17453); evaluated on LooGLE: Can Long-Context Language Models Understand Long Contexts? (Peking University, 2311)
- StreamingLLM paper (arXiv:2309.17453); evaluated on Needle in a Haystack (Greg Kamradt, 2024)
- StreamingLLM paper (arXiv:2309.17453); evaluated on RULER: What's the Real Context Size of Your Long-Context Language Models (NVIDIA, 2404)