Google DeepMind introduced the FACTS Grounding benchmark to evaluate large language models' ability to generate factually accurate responses grounded in extensive input contexts. This benchmark addresses the critical challenge of LLMs producing "hallucinated" content, which undermines trust and limits their applications.
The dataset pairs user requests with source documents of up to 32,000 tokens, requiring responses that strictly adhere to the provided context. Evaluation proceeds in two stages: responses are first screened for eligibility (whether they actually address the user request), then assessed for factuality by multiple automated judge models such as Gemini 1.5 Pro and GPT-4o.
The benchmark uses span-level analysis to validate claims and aggregates scores across models to minimize bias, aiming to provide a robust and scalable framework for enhancing LLM factuality.
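The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the function names, the simplistic eligibility check, and the sample verdicts are all hypothetical, and the real benchmark uses LLM judges for both stages.

```python
from statistics import mean

def is_eligible(response: str) -> bool:
    """Stage 1: screen out responses that fail to address the request.
    (Approximated here as a non-empty response; the benchmark itself
    uses an automated judge for this check.)"""
    return bool(response.strip())

def factuality_score(claim_verdicts: list[bool]) -> float:
    """Stage 2: a response counts as factual only if every claim
    is supported by the source document."""
    return 1.0 if all(claim_verdicts) else 0.0

def aggregate(per_judge_scores: dict[str, float]) -> float:
    """Average scores across judge models to reduce single-judge bias."""
    return mean(per_judge_scores.values())

# Hypothetical example: three judges give span-level verdicts on
# the claims in one response.
response = "The report states revenue grew 12% in 2023."
if is_eligible(response):
    scores = {
        "judge_a": factuality_score([True, True]),   # all claims grounded
        "judge_b": factuality_score([True, False]),  # one unsupported claim
        "judge_c": factuality_score([True, True]),
    }
    print(round(aggregate(scores), 3))  # prints 0.667
```

A single ungrounded claim zeroes a judge's score for that response, and averaging over several judges keeps one model's idiosyncrasies from dominating the final number.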