NIAH

Needle in a Haystack

Benchmark Metadata

Publisher: Greg Kamradt
Venue: Open-source benchmark
Evaluation Type: automatic
Dimensions: 1
Test Prompts: 200
Scoring: Higher is better
Update Frequency: ad-hoc

What It Measures

  • Exact recall of a planted fact across context positions
  • Recall vs depth heatmap
  • Recall vs context length
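
The three measurements above can be illustrated with a minimal sketch. It is not the benchmark's own harness: the needle, question, filler sentence, and every function name (`build_context`, `exact_recall`, `recall_grid`, `recall_by_length`, `truncating_model`) are hypothetical, and the substring check is only one possible reading of "exact recall". The sketch plants a fact at a relative depth, scores each (length, depth) cell as 0 or 1 to form the heatmap grid, and averages over depth to get recall per context length.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical needle, question, and filler; none of these come from the benchmark.
NEEDLE = "The secret code for the vault is 7421."
QUESTION = "What is the secret code for the vault?"
ANSWER = "7421"
FILLER = "The sky was grey and the road ran on past the hills."


def build_context(n_sentences: int, depth: float) -> str:
    """Return n_sentences of filler with the needle inserted at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def exact_recall(response: str) -> float:
    """Score 1.0 if the planted answer appears verbatim in the response, else 0.0."""
    return 1.0 if ANSWER in response else 0.0


def recall_grid(
    model: Callable[[str], str], lengths: List[int], depths: List[float]
) -> Dict[Tuple[int, float], float]:
    """Evaluate one prompt per (length, depth) cell; the result is the heatmap data."""
    grid: Dict[Tuple[int, float], float] = {}
    for n in lengths:
        for d in depths:
            prompt = build_context(n, d) + "\n\n" + QUESTION
            grid[(n, d)] = exact_recall(model(prompt))
    return grid


def recall_by_length(grid: Dict[Tuple[int, float], float]) -> Dict[int, float]:
    """Average the grid over depth to get recall as a function of context length."""
    per_length: Dict[int, List[float]] = {}
    for (n, _depth), score in grid.items():
        per_length.setdefault(n, []).append(score)
    return {n: sum(scores) / len(scores) for n, scores in per_length.items()}


def truncating_model(prompt: str) -> str:
    """Toy stand-in for a system under test: it only sees the last 1000 characters."""
    window = prompt[-1000:]
    return ANSWER if NEEDLE in window else "I don't know."
```

Calling `recall_grid(truncating_model, [5, 200], [0.0, 1.0])` and passing the result to `recall_by_length` shows the toy model losing needles placed early in long contexts, which is the pattern the depth heatmap is meant to expose. A real run would replace `truncating_model` with a call to the system being evaluated.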

What It Does Not Measure

  • Multi-hop reasoning
  • Conversational memory
  • Knowledge updates
  • Personalization

All Systems Evaluated (13 systems)

Rank | System | Affiliation | Score
#1 | Titans | lucidrains (community) / paper by Google Research | 98.8
#2 | Landmark Attention | EPFL (Mohtashami, Jaggi) | 77.5
#3 | LM-Infinite | Illinois / Meta (Han et al.) | 77.2
#4 | H2O | UT Austin / Rice / CMU / Stanford / Meta (Zhang et al.) | 76.1
#5 | Recurrent Memory Transformer | MIPT / DeepPavlov (Bulatov, Kuratov, Burtsev) | 75.9
#6 | Compressive Transformer | DeepMind (Rae et al.) | 75.1
#7 | ICAE | Microsoft Research (Ge et al.) | 75.1
#8 | Mamba | CMU / Princeton (Gu, Dao) | 75.1
#9 | ∞ Former | Instituto de Telecomunicações / DeepMind / IST (Martins, Marinho, Martins) | 73
#10 | Scissorhands | Rice / Stanford / Meta (Liu et al.) | 72.2
#11 | RWKV | RWKV Foundation / BlinkDL community | 71.2
#12 | Activation Beacon | BAAI / Renmin University (Zhang et al.) | 63.1
#13 | StreamingLLM | MIT Han Lab / Meta AI (Xiao et al.) | 60