Galileo AI

by Galileo Technologies Inc.

System Card

OrganizationGalileo Technologies Inc.

Released2021-06

Architectureagentic-workflow / LLM evaluation with Luna foundation model

DetailsGalileo's core differentiator is Luna, a family of compact Evaluation Foundation Models fine-tuned specifically for hallucination detection, toxicity, prompt security, and data leak detection. ChainPoll methodology achieves 85% correlation with human feedback. Luna replaces expensive LLM-as-judge calls with low-latency, low-cost specialized models. Also offers guardrails, data curation, and issue triage.

Parameters—

Domainrag-retrievalagent-memory

Open SourceNo

WebsiteVisit

Luna-EFMhallucination-detectionguardrailsChainPollobservability

Capability Profile

Benchmark Scores

6 of 14 benchmarks

Long-Context Retrieval

1/5

RULER

no data

NIAH

no data

LooGLE

no data

LongBench

603p

∞Bench

no data

Multi-Turn Recall

1/2

LoCoMo

76.866p

MemoryBank

no data

Cross-Session Memory

1/1

LongMemEval

78.970p

Multi-Hop QA

2/3

BABILong

no data

MultiHop-RAG

73.665p

HotpotQA

76.278p

Agent Task Memory

1/1

AgentBench-Mem

7226p

Personalization

0/1

PerLTQA

no data

Factuality / Grounding

0/1

RAGAS

no data

Sources:Galileo AI vendor documentation; evaluated on HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Stanford / CMU, 1809)Galileo AI vendor documentation; evaluated on MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries (HKUST, 2401)Galileo AI vendor documentation; evaluated on AgentBench Memory Track (Tsinghua KEG, 2308)Galileo AI vendor documentation; evaluated on LoCoMo: Long-Term Conversational Memory Benchmark (Snap Research, 2402)Galileo AI vendor documentation; evaluated on LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (Tsinghua KEG, 2308)Galileo AI vendor documentation; evaluated on LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory (Salesforce AI Research, 2410)