GIM: Evaluating models via tasks that integrate mul... | AI Research

Key Takeaways

  • The Grounded Integration Measure (GIM) is a new benchmark designed to evaluate Large Language Models (LLMs) by focusing on how well they coordinate multiple...
  • As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI).
  • The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters.
  • Each problem is an original, expert-authored composition, and most are scored against a decomposed rubric (a median of 6 independently judged criteria).
  • A balanced public-private split provides a built-in contamination diagnostic.
Paper Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.

The Grounded Integration Measure (GIM) is a new benchmark designed to evaluate Large Language Models (LLMs) by focusing on how well they coordinate multiple cognitive skills simultaneously. As existing benchmarks have become saturated, the field has split between testing for obscure, specialized knowledge and testing abstract, synthetic reasoning. GIM takes a third path: it uses 820 expert-authored problems that require models to integrate everyday knowledge with complex operations like state tracking, constraint satisfaction, and epistemic vigilance. By doing so, the benchmark tests a model’s practical reasoning capabilities in realistic contexts without requiring specialized expertise.

A New Approach to Evaluation

GIM moves away from simple binary scoring, where a model is either right or wrong. Instead, most problems use rubric-decomposed scoring: the problem is broken down into multiple independently judged criteria (a median of 6), which lets the benchmark award partial credit and gives a more granular view of a model's performance. To make the results robust, the authors fit a continuous-response 2-parameter logistic (2PL) Item Response Theory (IRT) model, a statistical framework that estimates a model's "ability" from its performance across items of differing difficulty. This method handles missing data well, for example when a model fails to return a response because of a technical issue, so the final rankings are not unfairly skewed by infrastructure errors.
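The scoring idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the equal weighting of rubric criteria, the bisection solver, and the item parameters in the example are all hypothetical, and the paper's continuous-response model may differ in detail. The sketch uses the standard 2PL curve, in which each item has a discrimination `a` and a difficulty `b`, and estimates ability only from the responses that were actually observed.

```python
import math


def rubric_score(judgments):
    """Partial credit: fraction of rubric criteria judged as met.

    Assumption: criteria are weighted equally (the source does not say).
    """
    if not judgments:
        raise ValueError("a problem needs at least one criterion")
    return sum(1 for met in judgments if met) / len(judgments)


def expected_score(theta, a, b):
    """Standard 2PL curve: expected score at ability theta for an item
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(scores, items, lo=-6.0, hi=6.0, iters=60):
    """Estimate ability from fractional scores in [0, 1].

    `scores[i]` is None when the response is missing; those items are
    skipped rather than counted as zero. The estimate solves
    sum_i a_i * (s_i - p_i(theta)) = 0, the stationarity condition of a
    cross-entropy likelihood with fractional targets. The left-hand side
    decreases in theta when every a_i > 0, so bisection is sufficient.
    """
    observed = [(s, a, b) for s, (a, b) in zip(scores, items) if s is not None]
    if not observed:
        raise ValueError("no observed responses")

    def gradient(theta):
        return sum(a * (s - expected_score(theta, a, b)) for s, a, b in observed)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if gradient(mid) > 0:
            lo = mid  # observed scores exceed expectation: ability is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical (discrimination, difficulty) pairs for three items.
    items = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0)]
    full = [expected_score(0.5, a, b) for a, b in items]
    partial = [full[0], None, full[2]]  # second response lost to an error
    print(estimate_ability(full, items), estimate_ability(partial, items))
```

Treating the lost response as a zero would lower the raw average, whereas the ability estimate above is computed from the remaining items and their known parameters, which is the property the paragraph describes.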

Measuring Reasoning and Compute

A significant portion of the research examines the relationship between "test-time compute" (the amount of internal processing or "thinking" a model performs) and overall capability, sweeping 11 models across 35 test-configurations, where a test-configuration is a unique pairing of model and thinking level. The study found that within-family configuration choices, such as the allocated thinking budget and quantization level, matter as much as the choice of model itself. While increasing the thinking budget generally leads to better performance, the researchers observed diminishing marginal returns at the highest levels. The study also included a pilot "centaur" study, in which humans collaborated with AI, suggesting that human-in-the-loop systems can extract additional performance from frontier models, particularly in quantitative and spatial reasoning tasks.

Ensuring Quality and Integrity

To maintain the benchmark's integrity, the authors implemented several safeguards. All 820 problems were kept private during the initial evaluation phase to prevent data contamination. The dataset is now split into a public set of 615 problems and a private set of 205, which allows ongoing, secure testing and doubles as a contamination diagnostic. Furthermore, the problems were created by subject-matter experts and underwent a rigorous two-round review process to ensure clarity, accuracy, and timelessness. By combining this high-quality dataset with a sophisticated statistical scoring model, GIM provides a stable and reproducible way to compare how different models and configurations handle complex, multi-faceted reasoning challenges.
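One way to read a public/private split as a contamination check is to compare a model's average score on the two sets: if the sets are balanced in difficulty, a model that scores markedly higher on the public problems may have seen them during training. The sketch below illustrates that idea only. The function names and the 0.05 threshold are hypothetical, and the source does not describe the authors' actual diagnostic procedure.

```python
from statistics import mean


def contamination_gap(public_scores, private_scores):
    """Mean public score minus mean private score.

    Scores are per-problem values in [0, 1]. A gap near zero is what a
    balanced split predicts for an uncontaminated model.
    """
    if not public_scores or not private_scores:
        raise ValueError("both splits need at least one score")
    return mean(public_scores) - mean(private_scores)


def flag_contamination(public_scores, private_scores, threshold=0.05):
    """Flag a model whose public-set advantage exceeds `threshold`.

    The threshold is an illustrative assumption; a real analysis would
    also account for sampling noise, for example with a permutation test.
    """
    return contamination_gap(public_scores, private_scores) > threshold


if __name__ == "__main__":
    # Placeholder scores for illustration, not results from the paper.
    public = [0.8, 0.6, 0.7]
    private = [0.5, 0.5, 0.5]
    print(contamination_gap(public, private), flag_contamination(public, private))
```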
