Unstructured IO
by Unstructured Technologies Inc.
System Card
Organization: Unstructured Technologies Inc.
Released: 2022-09
Architecture: vector-rag / document preprocessing and ETL for RAG
Details: Unstructured provides open-source and managed APIs for extracting, partitioning, and transforming unstructured content (PDFs, DOCX, HTML, images) into clean JSON chunks suitable for downstream embedding and vector storage. Uses computer vision (layout detection) and NLP for high-fidelity extraction from complex document layouts. Integrates natively with LangChain, LlamaIndex, and all major vector stores.
Parameters: n/a
Domain: rag-retrieval
Open Source: Partial
document-parsing, ETL, PDF-extraction, layout-detection, pre-processing
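The Details entry above describes turning extracted document elements into JSON chunks for embedding. The sketch below illustrates that general idea only. It is not Unstructured's API: the `Element` class, the `chunk_by_heading` function, and the `max_chars` parameter are hypothetical names chosen for this example, and it assumes the elements have already been extracted and typed.

```python
import json
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical extracted element (not the library's own class)."""
    type: str  # e.g. "Title" or "NarrativeText"
    text: str


def chunk_by_heading(elements, max_chars=500):
    """Group elements into chunks.

    A new chunk starts at every Title element, or when appending the next
    piece of text would push the current chunk past max_chars.
    """
    chunks = []
    current = None
    for el in elements:
        if el.type == "Title":
            # Close the previous chunk and open one under the new title.
            if current is not None:
                chunks.append(current)
            current = {"title": el.text, "text": ""}
            continue
        if current is None:
            # Text that appears before any title gets an untitled chunk.
            current = {"title": None, "text": ""}
        candidate = (current["text"] + " " + el.text).strip()
        if len(candidate) > max_chars and current["text"]:
            # Split here, carrying the title over to the next chunk.
            chunks.append(current)
            current = {"title": current["title"], "text": el.text}
        else:
            current["text"] = candidate
    if current is not None:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = [
        Element("Title", "Intro"),
        Element("NarrativeText", "Hello world."),
        Element("NarrativeText", "More text."),
        Element("Title", "Methods"),
        Element("NarrativeText", "Step one."),
    ]
    # Each chunk is a plain dict, so it serializes directly to JSON.
    print(json.dumps(chunk_by_heading(sample), indent=2))
```

Splitting at titles keeps each chunk tied to one section of the document, and the size cap keeps chunks within whatever limit a downstream embedding model might impose.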
Capability Profile
Benchmark Scores
5 of 14 benchmarks
Multi-Turn Recall: 0/2
  LoCoMo: no data
  MemoryBank: no data
Cross-Session Memory: 0/1
  LongMemEval: no data
Multi-Hop QA: 2/3
Agent Task Memory: 0/1
  AgentBench-Mem: no data
Personalization: 0/1
  PerLTQA: no data
Factuality / Grounding: 1/1
Sources: Unstructured IO vendor documentation; evaluated on:
  HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Stanford / CMU, 1809)
  LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (Tsinghua KEG, 2308)
  MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries (HKUST, 2401)
  RAGAS: Automated Evaluation of Retrieval-Augmented Generation (Exploding Gradients, 2309)
  RULER: What's the Real Context Size of Your Long-Context Language Models (NVIDIA, 2404)