NIAH
Needle in a Haystack
Benchmark Metadata
Publisher: Greg Kamradt
Venue: Open-source benchmark
Evaluation Type: automatic
Dimensions: 1
Test Prompts: 200
Scoring: Higher is better
Update Frequency: ad-hoc
What It Measures
- Exact recall of a planted fact across context positions
- Recall vs depth heatmap
- Recall vs context length
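
The three measurements above can be illustrated with a minimal sketch, which is not the benchmark's own harness. It plants a fact at a relative depth in filler text, asks a model to retrieve it, and records exact recall for each (context length, depth) cell, which is the data behind a recall vs depth heatmap. Several things here are assumptions made for illustration: context length is counted in whitespace-separated words rather than tokens, the needle, question, and answer are invented, the names `build_haystack`, `recall_grid`, and `toy_model` are hypothetical, and `toy_model` is a stand-in that only reads the last 40 words of its prompt.

```python
from typing import Callable, Dict, List, Tuple


def build_haystack(filler: str, needle: str, context_len: int, depth: float) -> str:
    """Return `context_len` filler words with `needle` planted at relative `depth`.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    Assumption: length is measured in words, not model tokens.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    words = filler.split()
    if not words:
        raise ValueError("filler must contain at least one word")
    # Repeat the filler until it is long enough, then truncate to the target length.
    hay = (words * (context_len // len(words) + 1))[:context_len]
    cut = round(depth * len(hay))
    return " ".join(hay[:cut] + [needle] + hay[cut:])


def recall_grid(
    ask: Callable[[str], str],
    filler: str,
    needle: str,
    question: str,
    answer: str,
    lengths: List[int],
    depths: List[float],
) -> Dict[Tuple[int, float], float]:
    """Score exact recall (1.0 or 0.0) for every (context length, depth) cell."""
    grid: Dict[Tuple[int, float], float] = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(filler, needle, n, d) + "\n\n" + question
            reply = ask(prompt)
            # Exact recall: the expected answer string must appear in the reply.
            grid[(n, d)] = 1.0 if answer.lower() in reply.lower() else 0.0
    return grid


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: it only 'reads' the last 40 words."""
    window = " ".join(prompt.split()[-40:])
    return "The code is 7421." if "7421" in window else "I don't know."


# Illustrative run with an invented needle and question.
grid = recall_grid(
    ask=toy_model,
    filler="The quick brown fox jumps over the lazy dog.",
    needle="The secret code is 7421.",
    question="What is the secret code?",
    answer="7421",
    lengths=[20, 100],
    depths=[0.0, 0.5, 1.0],
)
for (n, d), score in sorted(grid.items()):
    print(f"length={n:4d} depth={d:.1f} recall={score:.0f}")
```

Averaging the grid along the depth axis gives recall vs context length, and averaging along the length axis gives recall vs depth. To evaluate a real system, `toy_model` would be replaced by a function that sends the prompt to that system and returns its reply.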
What It Does Not Measure
- Multi-hop reasoning
- Conversational memory
- Knowledge updates
- Personalization
All Systems Evaluated (13 systems)
| Rank | System | Organization / Authors | Score |
|---|---|---|---|
| #1 | Titans | lucidrains (community) / paper by Google Research | 98.8 |
| #2 | Landmark Attention | EPFL (Mohtashami, Jaggi) | 77.5 |
| #3 | LM-Infinite | Illinois / Meta (Han et al.) | 77.2 |
| #4 | H2O | UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.) | 76.1 |
| #5 | Recurrent Memory Transformer | MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev) | 75.9 |
| #6 | Compressive Transformer | DeepMind (Rae et al.) | 75.1 |
| #7 | ICAE | Microsoft Research (Ge et al.) | 75.1 |
| #8 | Mamba | CMU / Princeton (Gu, Dao) | 75.1 |
| #9 | ∞ Former | Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins) | 73 |
| #10 | Scissorhands | Rice / Stanford / Meta (Liu et al.) | 72.2 |
| #11 | RWKV | RWKV Foundation / BlinkDL community | 71.2 |
| #12 | Activation Beacon | BAAI / Renmin University (Zhang et al.) | 63.1 |
| #13 | StreamingLLM | MIT Han Lab / Meta AI (Xiao et al.) | 60 |