Knowledge Index of Noah's Ark | AI Research

Key Takeaways

  • We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results.
  • We prove that a bonus-on-bar tournament weakly dominates flat payment in released-review quality, in the first-order stochastic dominance (FOSD) sense, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1).
  • Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation.
  • Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models.
Paper Abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.

Knowledge Index of Noah's Ark
The Knowledge Index of Noah's Ark (KINA) is a new benchmark designed to address systemic flaws in how Large Language Models (LLMs) are tested. Current benchmarks often suffer from poor disciplinary coverage, unreliable human review processes, and unstable rankings that change significantly when test sets are slightly adjusted. KINA introduces a rigorous 899-item dataset across 261 fine-grained disciplines, supported by formal mathematical guarantees for how questions are selected and how human reviewers are incentivized to provide high-quality feedback.

Ensuring Disciplinary Coverage

To ensure the benchmark represents a wide range of knowledge, the authors moved away from collecting questions based on availability. Instead, they defined a "disciplinary prototype" for each field: a collection of core concepts, theorems, and problems identified by experts. A greedy selection algorithm then chooses the items that best cover these expert-elicited anchors. Because representativeness is cast as a coverage-style objective, greedy selection carries a (1-1/e) approximation guarantee (Proposition 1). That guarantee applies to the proxy objective, not to population representativeness, but it still makes the benchmark a principled diagnostic tool rather than a collection of difficult questions gathered by convenience.
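Greedy selection for a coverage objective can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an unweighted objective (each anchor counts once), and the names `greedy_cover`, `items`, and `budget` are hypothetical. The paper's actual proxy may weight anchors differently.

```python
from typing import Dict, Hashable, List, Set


def greedy_cover(items: Dict[Hashable, Set[str]], budget: int) -> List[Hashable]:
    """Pick up to `budget` items, each time taking the item that
    covers the most anchors not yet covered (unweighted max coverage)."""
    covered: Set[str] = set()
    chosen: List[Hashable] = []
    remaining = dict(items)  # shallow copy so the caller's dict is untouched
    for _ in range(budget):
        # Item with the largest marginal gain; None if nothing is left.
        best = max(remaining, key=lambda k: len(remaining[k] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # no remaining item adds a new anchor
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical candidate questions mapped to the anchors they touch.
candidates = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e"},
    "q4": {"a"},
}
print(greedy_cover(candidates, budget=2))  # ['q1', 'q3']
```

In the example, `q1` is chosen first because it covers three anchors. `q3` then beats `q2` because it adds two new anchors instead of one.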

Improving Reviewer Quality

A major challenge in building benchmarks is preventing "lazy consensus," where human reviewers approve items without putting in the necessary effort. KINA replaces standard flat-payment models with a "bonus-on-bar tournament." In this system, two reviewers evaluate each item independently, and the one who provides the higher-quality, verified score receives a bonus. This creates a competitive incentive for accuracy. The authors prove that this mechanism weakly dominates flat payment in released-review quality, in the first-order stochastic dominance (FOSD) sense, with an incentive-compatibility threshold of B > Delta C / Delta p_min on the bonus B (Theorem 1). They also include stochastic audits to further discourage collusion or low-quality work.
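One way to read the threshold B > Delta C / Delta p_min is as a standard effort-choice inequality. The derivation below is a sketch under assumed notation, not the paper's proof. Only B, Delta C, and Delta p_min come from the source. The symbols p_H, p_L, C_H, C_L, and the flat fee w are assumptions introduced for illustration.

```latex
% Assumed notation: a reviewer receives a flat fee w plus bonus B with
% probability p_H (high effort, cost C_H) or p_L (low effort, cost C_L).
\begin{align*}
  w + p_H B - C_H &> w + p_L B - C_L
      && \text{(high effort is preferred)} \\
  (p_H - p_L)\, B &> C_H - C_L
      && \text{(the flat fee } w \text{ cancels)} \\
  B &> \frac{\Delta C}{\Delta p},
      && \Delta C = C_H - C_L,\quad \Delta p = p_H - p_L > 0 .
\end{align*}
% Requiring the inequality in the least favorable case, where the gain in
% winning probability is smallest, gives the stated threshold:
\[
  B > \frac{\Delta C}{\Delta p_{\min}} .
\]
```

Under this reading, a flat payment alone (B = 0) can never satisfy the inequality when effort is costly, which matches the summary's point about lazy consensus.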

Performance and Stability

When evaluating 42 models from 13 labs, the researchers found that the field is far from saturation. The top-performing model, Gemini-3.1-Pro-Preview, achieved an accuracy of 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%. The leaderboard reveals a tiered structure rather than a smooth total order: a small frontier tier above 48%, a dense cluster of strong models at roughly 38-45%, and a lower tier that performs only modestly above the 10% chance baseline.
The authors also emphasize ranking stability. Because the test budget is bounded, small changes in the test set can shift model rankings, so the authors report bootstrap ranking-stability statistics. These make the variance explicit and discourage over-interpreting small differences between adjacently ranked models.
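A bootstrap check of ranking stability can be sketched as follows. This is an illustrative example, not the authors' evaluation code: it assumes per-item 0/1 correctness for each model, and the function name `top_rank_frequency` and its parameters are hypothetical. It resamples test items with replacement and records how often each model comes out on top.

```python
import random
from typing import Dict, List


def top_rank_frequency(
    correct: Dict[str, List[int]], n_boot: int = 1000, seed: int = 0
) -> Dict[str, float]:
    """Fraction of bootstrap resamples in which each model has the
    highest accuracy. Ties split the credit equally."""
    rng = random.Random(seed)
    models = list(correct)
    n_items = len(next(iter(correct.values())))
    wins = {m: 0.0 for m in models}
    for _ in range(n_boot):
        # Resample item indices with replacement; all models share the same draw.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores = {m: sum(correct[m][i] for i in idx) for m in models}
        best = max(scores.values())
        leaders = [m for m in models if scores[m] == best]
        for m in leaders:
            wins[m] += 1.0 / len(leaders)
    return {m: wins[m] / n_boot for m in models}


# Hypothetical per-item results (1 = correct, 0 = wrong) for two models.
results = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "model_b": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
}
print(top_rank_frequency(results, n_boot=500))
```

When two models are close, neither holds the top rank in nearly all resamples, which signals that their ordering should not be over-interpreted.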

Key Considerations

While KINA provides a more robust framework than many existing benchmarks, its formal guarantees apply to the proxy objectives used by the researchers, such as the "support centrality" of questions, rather than guaranteeing perfect, universal representativeness. Additionally, tool use (such as web search) added up to 5.17 points across the five tool-use evaluations, but these gains varied substantially between models. The authors provide the full dataset, reviewer manuals, and evaluation code to encourage transparency and allow for future replication and extension of their work.
