LooGLE: Can Long-Context Language Models Understand Long Contexts?
Benchmark Metadata
Publisher: Peking University
Venue: ACL 2024
Evaluation Type: automatic
Dimensions: 7
Test Prompts: 776
Scoring: Higher is better
Update Frequency: annual
What It Measures
- Short and long-dependency QA
- Summarization over long documents
- Computation across documents
- Timeline reordering
- Multi-information retrieval
What It Does Not Measure
- Multi-session memory
- Latency
- Personalization
All Systems Evaluated (22 systems)
| Rank | System | Developer | Score |
|---|---|---|---|
| #1 | Titans | lucidrains (community) / paper by Google Research | 81.3 |
| #2 | Mamba | CMU / Princeton (Gu, Dao) | 81.1 |
| #3 | Landmark Attention | EPFL (Mohtashami, Jaggi) | 80.4 |
| #4 | MemoryLLM | UCSD / Apple (Wang et al.) | 79.2 |
| #5 | Recurrent Memory Transformer | MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev) | 79.0 |
| #6 | Jina AI Embeddings | Jina AI GmbH | 78.9 |
| #7 | ICAE | Microsoft Research (Ge et al.) | 78.7 |
| #8 | ∞-former | Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins) | 77.9 |
| #9 | LM-Infinite | Illinois / Meta (Han et al.) | 77.7 |
| #10 | Compressive Transformer | DeepMind (Rae et al.) | 77.6 |
| #11 | Memorizing Transformer | Google Research (Wu, Rabe, Hutchins, Szegedy) | 77.5 |
| #12 | RAPTOR | Stanford (Sarthi, Abdullah et al.) | 77.1 |
| #13 | H2O | UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.) | 75.9 |
| #14 | TRIME | Princeton NLP (Zhong, Lei, Chen) | 75.4 |
| #15 | Memory³ | Institute for Advanced Algorithms Research Shanghai / Peking University | 74.5 |
| #16 | RWKV | RWKV Foundation / BlinkDL community | 73.9 |
| #17 | LongMem | UCSB / Microsoft Research | 73.1 |
| #18 | Scissorhands | Rice / Stanford / Meta (Liu et al.) | 72.9 |
| #19 | Activation Beacon | BAAI / Renmin University (Zhang et al.) | 68.6 |
| #20 | StreamingLLM | MIT Han Lab / Meta AI (Xiao et al.) | 66.9 |
| #21 | Activeloop Deep Lake | Activeloop Inc. | 61.3 |
| #22 | LanceDB | LanceDB Inc. (YC S22) | 59.9 |