Google DeepMind researchers have introduced FACTS Grounding, a new benchmark for evaluating the factual accuracy of large language models (LLMs) when responding to complex, detailed prompts based on long-form documents. The benchmark assesses whether responses are both comprehensive and fully supported by the provided context, and it penalizes models for unsupported or irrelevant claims.
A corresponding leaderboard, hosted on Kaggle, ranks models based on their factuality scores, with Gemini 2.0 Flash currently leading. The FACTS dataset includes diverse documents and user requests, judged by multiple LLMs to mitigate bias. This initiative aims to improve LLM reliability by focusing on grounding responses in provided information, recognizing that current pre-training methods don't directly optimize for factuality.
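The multi-judge idea described above can be sketched in a few lines: each LLM judge gives a grounded/ungrounded verdict per response, and the final factuality score averages the per-judge scores so no single judge's bias dominates. This is a hypothetical illustration only; function names and data layout are assumptions, not DeepMind's actual scoring code.

```python
# Hypothetical sketch of multi-judge factuality scoring: average verdicts
# across several LLM judges to mitigate single-judge bias. Names and
# structure are illustrative assumptions, not the FACTS implementation.
from statistics import mean

def factuality_score(verdicts_by_judge):
    """verdicts_by_judge maps a judge name to a list of booleans, one per
    model response, where True means the judge found the response fully
    grounded in the provided document. Returns the mean, across judges,
    of each judge's fraction of grounded responses."""
    per_judge = [mean(1.0 if v else 0.0 for v in verdicts)
                 for verdicts in verdicts_by_judge.values()]
    return mean(per_judge)

# Example: three judges scoring the same four responses.
verdicts = {
    "judge_a": [True, True, False, True],   # 0.75
    "judge_b": [True, False, False, True],  # 0.50
    "judge_c": [True, True, True, True],    # 1.00
}
print(factuality_score(verdicts))  # → 0.75
```

Averaging across judges in this way means a model cannot benefit from quirks of any one evaluator, which mirrors the bias-mitigation rationale stated above.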