HotpotQA

Name: HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Creator: Stanford / CMU
Keywords: multi-hop-qa, rag-retrieval, knowledge-graph

Benchmark Metadata

Publisher: Stanford / CMU
Venue: EMNLP 2018
Evaluation Type: automatic
Dimensions: 2
Test Prompts: 7,405
Scoring: Higher is better
Update Frequency: annual

What It Measures

Multi-hop answer exact match and F1
Supporting-fact prediction F1
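
The two measured dimensions can be sketched in a few lines. The following is a minimal illustration, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and supporting facts represented as (title, sentence index) pairs. The function names are hypothetical, and the sketch omits details of the official evaluation script such as special handling of yes/no answers and joint metrics.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def supporting_fact_f1(predicted, gold) -> float:
    """Set-level F1 over (title, sentence_index) supporting-fact pairs."""
    pred_set = {tuple(p) for p in predicted}
    gold_set = {tuple(g) for g in gold}
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative inputs only; not taken from the dataset.
    print(exact_match("The Eiffel Tower", "eiffel tower"))
    print(answer_f1("Barack Obama", "Obama"))
    print(supporting_fact_f1([("A", 0), ("B", 1)], [("A", 0), ("C", 2)]))
```

Per-example scores like these are typically averaged over the evaluation set and reported as percentages, which is how a score such as 80 in the table below would be read if it refers to F1.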

What It Does Not Measure

Cross-session memory
Personalization
Long context retrieval

All Systems Evaluated (187 systems)

20 self-reported, 167 estimated

Rank	System	Creator	Score	Provenance	Source
#1	Backboard IO	Backboard.io	82	Estimated	Arena estimate: derived from capability profile, not independently verified
#2	MemR3	2025 (December submission)	82	Estimated	Arena estimate: derived from capability profile, not independently verified
#3	xmemory	xmemory Inc.	82	Estimated	Arena estimate: derived from capability profile, not independently verified
#4	Voyager	NVIDIA / Caltech / UT Austin / Stanford / ASU / UW (Wang et al.)	81.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#5	HuggingGPT / JARVIS	Microsoft Research	80.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#6	MCP Memory Server	Anthropic / Model Context Protocol	80.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#7	Reflexion	Northeastern / MIT / Princeton (Shinn et al.)	80	Self-Reported	arXiv:2303.11366 Figure 4c / Table 5: CoT+Reflexion with GPT-4 and GOLD context (not retrieval). Reading comprehension setting.
#8	MIRIX	MIRIX AI (Wang, Chen)	79.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#9	Generative Agents	Stanford University / Google Research	79.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#10	AutoGPT Platform	Significant Gravitas	79.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#11	D-Mem	You et al. (2025)	79.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#12	Onyx	onyx-dot-app	79	Estimated	Arena estimate: derived from capability profile, not independently verified
#13	Supermemory	Supermemory	78.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#14	SCM	Beihang / NLPR (Wang et al.)	78.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#15	HiMem	Zhu et al. (JD.com, 2026)	78.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#16	Dify	LangGenius	78	Estimated	Arena estimate: derived from capability profile, not independently verified
#17	ChatDB	Tsinghua University (Hu et al.)	77.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#18	Nemori	Nemori AI (independent)	77.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#19	Self-RAG	University of Washington / Allen AI (Asai et al.)	77.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#20	Athina AI	Athina AI (YC W23)	77.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#21	WebVoyager	MinorJerry et al.	77.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#22	AgentVerse	OpenBMB (Tsinghua)	77.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#23	Cradle	BAAI-Agents	77.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#24	CrewAI	CrewAI Inc. (Joao Moura)	77.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#25	Mobile-Agent	Alibaba Tongyi Lab (X-PLUG)	77.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#26	RecallM	Cisco Research / independent (Kynoch & Latapie)	77.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#27	SuperAGI	TransformerOptimus	77.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#28	Synapse	Nanyang Technological University (Zheng et al.)	77.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#29	SID AI	SID (YC)	77.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#30	A-MEM	AGI Research / Rutgers	77	Estimated	Arena estimate: derived from capability profile, not independently verified
#31	Diffbot	Diffbot Inc.	76.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#32	Neo4j LLM Graph Builder	Neo4j Labs	76.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#33	AutoGen Studio	Microsoft Research	76.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#34	HybridAGI	SynaLinks	76.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#35	Adept AI	Adept AI Labs (acquired by Amazon 2024)	76.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#36	AppAgent	Tencent / mnotgod96	76.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#37	AutoGen Core Memory	Microsoft	76.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#38	Galileo AI	Galileo Technologies Inc.	76.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#39	KAG	OpenSPG / Ant Group	76.2	Self-Reported	arXiv:2409.13731 Table 8: F1 with LFSH_ref3 + DeepSeek-V2 backbone
#40	RAGFlow	InfiniFlow	76.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#41	ReMe	ModelScope (Alibaba)	76.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#42	Marker	Datalab (datalab-to)	76.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#43	Botpress	Botpress Inc.	76	Estimated	Arena estimate: derived from capability profile, not independently verified
#44	MultiOn	MultiOn (now AGI Inc.)	75.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#45	Cognigy	Cognigy GmbH (acquired by NICE, July 2025)	75.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#46	OS-Copilot / FRIDAY	Shanghai AI Lab / MMLab (Wu et al.)	75.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#47	CAMEL	CAMEL-AI.org	75.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#48	Hebbia	Hebbia, Inc.	75.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#49	HippoRAG 2	OSU NLP Group	75.5	Self-Reported	arXiv:2502.14802 Table 2: F1 with Llama-3.3-70B-Instruct backbone
#50	AgentScope	ModelScope (Alibaba)	75.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#51	BrowserGym	ServiceNow Research	75.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#52	Glean	Glean Technologies	75.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#53	TrustRAG	GoMate Community	75.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#54	AllegroGraph	Franz Inc.	75.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#55	Vellum AI	Vellum AI Inc. (YC W23)	75.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#56	Bisheng	dataelement	74.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#57	MoT	Fudan University (Li & Qiu)	74.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#58	Voiceflow	Voiceflow Inc.	74.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#59	ArcMemo	UC Berkeley / Stanford (Ho et al.)	74.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#60	HoneyHive	HoneyHive Inc.	74.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#61	MetaGPT	DeepWisdom / geekan	74.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#62	AutoWebGLM	THUDM	74.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#63	Swarms	kyegomez / Swarms Corp	74.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#64	FastGPT	labring	74.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#65	Maxim AI	Maxim AI Inc.	74.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#66	Stack AI	Stack AI Inc. (YC W23)	74	Estimated	Arena estimate: derived from capability profile, not independently verified
#67	RMM	Google / UCSB (2025)	73.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#68	Nano GraphRAG	gusye1234	73.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#69	Agent Workflow Memory	CMU (Wang, Mao, Fried, Neubig)	73.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#70	MiniRAG	HKUDS	73.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#71	Kore AI	Kore.ai Inc.	73	Estimated	Arena estimate: derived from capability profile, not independently verified
#72	CrewAI Enterprise	CrewAI Inc.	72.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#73	PaperQA2	FutureHouse	72.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#74	Lagent	InternLM (Shanghai AI Lab)	72.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#75	LangSmith LangGraph Cloud	LangChain Inc.	72.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#76	Lindy AI	Lindy AI	72.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#77	Open Interpreter	OpenInterpreter	72.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#78	BabyAGI	Yohei Nakajima	72.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#79	DB-GPT	eosphoros-ai	72.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#80	Dust tt	Dust (formerly XP1)	72.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#81	AGiXT	Josh-XT	72.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#82	Memp	Zhejiang University (Fang et al.)	72.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#83	VectorShift	VectorShift Inc. (YC S23)	72.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#84	Flowise	FlowiseAI	72.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#85	Generative Agents	Stanford / Google	72	Estimated	Arena estimate: derived from capability profile, not independently verified
#86	Langflow	Langflow-ai (DataStax)	71.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#87	LightRAG	HKUDS (HKU Data Intelligence Lab)	71.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#88	Ontotext GraphDB	Ontotext / Graphwise (merged with Semantic Web Company, 2025)	71.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#89	ChatDev 2.0	OpenBMB	71.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#90	MemOS	MemTensor (Li, Zhang, et al.)	71.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#91	LangGraph	LangChain	71.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#92	Cognee	Cognee	71.5	Self-Reported	Cognee paper (arXiv:2505.24478): correctness score
#93	GraphRAG	Microsoft	71.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#94	MoT	Fudan (Li, Qiu)	71.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#95	MemoryScope	Alibaba ModelScope	71.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#96	Qwen-Agent	QwenLM (Alibaba)	70.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#97	JARVIS-1	CraftJarvis	70.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#98	Claude Projects	Anthropic	70.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#99	Stardog	Stardog Union Inc.	70.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#100	Neo4j AuraDB	Neo4j Inc.	70.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#101	GAM	VectorSpaceLab (BAAI-related)	70.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#102	Memoripy	caspianmoon	70.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#103	Think-in-Memory	Ant Group / Alibaba (Liu et al.)	70.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#104	Cohere Embed	Cohere Inc.	70	Estimated	Arena estimate: derived from capability profile, not independently verified
#105	PathRAG	BUPT-GAMMA	70	Estimated	Arena estimate: derived from capability profile, not independently verified
#106	AriGraph	AIRI Institute / Moscow	69.9	Self-Reported	github.com/AIRI-Institute/AriGraph README (arXiv:2407.04363 transfer): F1 with GPT-4; EM 59.5; 200 test samples
#107	Larimar	IBM Research	69.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#108	GraphRAG-SDK	FalkorDB	69.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#109	kNN-LM	Stanford / Facebook AI Research (Khandelwal et al.)	69.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#110	R2R	SciPhi-AI	69.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#111	Voyage AI	Voyage AI (acquired by MongoDB, Feb 2025)	69.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#112	GPTeam	101dotxyz	68.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#113	Mixedbread AI	Mixedbread AI	68.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#114	Granola AI	Granola	68.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#115	Memformer	UC Santa Barbara / Amazon (Wu, Lan, Liu, et al.)	68	Estimated	Arena estimate: derived from capability profile, not independently verified
#116	LangMem	LangChain	67.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#117	MemoChat	University of Warwick / Alibaba	67.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#118	LlamaIndex Memory	LlamaIndex	67.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#119	Synapse	NTU / Salesforce (Zheng et al.)	67.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#120	TRIME	Princeton NLP (Zhong, Lei, Chen)	67.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#121	Heyday AI	Heyday (shut down 2025)	67.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#122	HEMA	independent (Ahn et al.)	66.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#123	Astra DB	DataStax	66.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#124	Sana AI	Sana Labs	66.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#125	Jina AI Embeddings	Jina AI GmbH	66.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#126	Nomic Atlas	Nomic AI Inc.	66	Estimated	Arena estimate: derived from capability profile, not independently verified
#127	Neon Vector	Neon Inc.	65.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#128	Notion AI	Notion Labs	65.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#129	Epsilla	Epsilla Inc. (YC S23)	65.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#130	Memori	GibsonAI	65	Estimated	Arena estimate: derived from capability profile, not independently verified
#131	RAPTOR	Stanford (Sarthi, Abdullah et al.)	65	Estimated	Arena estimate: derived from capability profile, not independently verified
#132	Mnemosyne	independent	64.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#133	Quivr	QuivrHQ	64.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#134	Mendable	Mendable (YC-backed)	64.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#135	pgvector Supabase Neon	pgvector OSS / Supabase Inc. / Neon Inc.	64.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#136	REALM	Google Research (Guu et al.)	64.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#137	Supabase Vector	Supabase Inc.	64	Estimated	Arena estimate: derived from capability profile, not independently verified
#138	Ragie	Ragie Inc.	63.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#139	Cognita	TrueFoundry	63.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#140	KDB AI	KX Systems	63.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#141	MongoDB Atlas Vector	MongoDB Inc.	63.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#142	ColPali	illuin-tech	63.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#143	RETRO	DeepMind (Borgeaud et al.)	62.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#144	Elasticsearch Vector	Elastic N.V.	62.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#145	Manticore Search	Manticore Software Ltd.	62.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#146	MemoryBank	Harbin Institute of Technology / SenseTime	62.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#147	Selfmem	Tsinghua / Microsoft (Cheng et al.)	62.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#148	Couchbase Vector	Couchbase Inc.	62.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#149	OpenSearch Vector	OpenSearch Project (AWS-led)	62.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#150	vectorize	Vectorize Inc.	62.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#151	Graphiti	Zep AI	62.1	Self-Reported	Zep / Graphiti paper
#152	Vespa AI	Yahoo / Vespa.ai (independent OSS project)	62.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#153	MyScale	MyScale Inc.	62	Estimated	Arena estimate: derived from capability profile, not independently verified
#154	SingleStore Vector	SingleStore Inc.	61.9	Estimated	Arena estimate: derived from capability profile, not independently verified
#155	Memoro	MIT Media Lab	61.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#156	LanceDB	LanceDB Inc. (YC S22)	61.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#157	Verba	Weaviate	61.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#158	AnythingLLM	Mintplex Labs	61.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#159	PrivateGPT	Zylon AI	60.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#160	Activeloop Deep Lake	Activeloop Inc.	60.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#161	LlamaCloud	LlamaIndex Inc.	60.5	Estimated	Arena estimate: derived from capability profile, not independently verified
#162	Marqo	Marqo Pty Ltd	60.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#163	Saner AI	Saner.AI	60.4	Estimated	Arena estimate: derived from capability profile, not independently verified
#164	R3Mem	HKUST (2025)	60.3	Estimated	Arena estimate: derived from capability profile, not independently verified
#165	ParadeDB	ParadeDB Inc. (YC S23)	60.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#166	Redis Vector	Redis Ltd.	60.2	Estimated	Arena estimate: derived from capability profile, not independently verified
#167	Carbon AI	Carbon (acquired by Perplexity, Dec 2024)	60	Estimated	Arena estimate: derived from capability profile, not independently verified
#168	Vald	Yahoo Japan	60	Estimated	Arena estimate: derived from capability profile, not independently verified
#169	Unstructured IO	Unstructured Technologies Inc.	59.7	Estimated	Arena estimate: derived from capability profile, not independently verified
#170	HippoRAG	OSU NLP Group (Ohio State University)	59.2	Self-Reported	arXiv:2405.14831 Table 4: F1 with IRCoT+HippoRAG (ColBERTv2); EM 45.7. HippoRAG alone: F1 55.0
#171	Mem AI	Mem Labs	59.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#172	TurboPuffer	TurboPuffer Inc.	59.1	Estimated	Arena estimate: derived from capability profile, not independently verified
#173	Vectara	Vectara Inc.	58.8	Estimated	Arena estimate: derived from capability profile, not independently verified
#174	Memory³	Institute for Advanced Algorithms Research Shanghai / Peking University	57.6	Estimated	Arena estimate: derived from capability profile, not independently verified
#175	MemoRAG	BAAI / Qhjqhj00	54.8	Self-Reported	arXiv:2409.05591 Table 1: MemoRAG on HotpotQA (via LongBench)
#176	Memary	Kingjulio8238	54.2	Self-Reported	Memary repo evals
#177	Weaviate	Weaviate	53.1	Self-Reported	Weaviate evals
#178	Milvus	Zilliz	52.6	Self-Reported	Milvus benchmark
#179	Pinecone	Pinecone Systems	52.4	Self-Reported	Pinecone evals
#180	Qdrant	Qdrant	51.8	Self-Reported	Qdrant evals
#181	Haystack Memory	deepset	51.2	Self-Reported	Haystack benchmark
#182	Atlas	Meta AI FAIR (Izacard et al.)	50.6	Self-Reported	arXiv:2208.03299 Table 10: KILT-filtered HotpotQA EM, full-train; 64-shot EM=34.7
#183	Chroma	Chroma	49.7	Self-Reported	Chroma benchmark
#184	txtai	NeuML	49.5	Self-Reported	txtai benchmark
#185	KnowAgent	zjunlp (Zhejiang University)	48.1	Self-Reported	arXiv:2403.03101 Table 1: KnowAgent-70b (Llama-2-70b-chat) F1 averaged across Easy/Medium/Hard
#186	ExpeL	Tsinghua University (Zhao et al.)	39	Self-Reported	arXiv:2308.10144 Figure 5: success rate read from the figure, not a precise table cell
#187	StreamingLLM	MIT Han Lab / Meta AI (Xiao et al.)	24.9	Self-Reported	arXiv:2309.17453 Table 8: StreamingLLM 1750+1750 on HotpotQA subset of LongBench