Knowledge Index of Noah's Ark
The Knowledge Index of Noah's Ark (KINA) is a new benchmark designed to address systemic flaws in how Large Language Models (LLMs) are tested. Current benchmarks often suffer from poor disciplinary coverage, unreliable human review processes, and unstable rankings that change significantly when test sets are slightly adjusted. KINA introduces a rigorous 899-item dataset across 261 fine-grained disciplines, supported by formal mathematical guarantees for how questions are selected and how human reviewers are incentivized to provide high-quality feedback.
Ensuring Disciplinary Coverage
To ensure the benchmark represents a wide range of knowledge, the authors moved away from collecting questions based on availability. Instead, they defined a "disciplinary prototype" for each field: a collection of core concepts, theorems, and problems identified by experts. They then used a greedy selection algorithm to choose the items that best cover these core anchors. This approach provides a formal guarantee that the selected questions are representative of their respective fields, so the benchmark acts as a meaningful diagnostic tool rather than a random collection of difficult questions.
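The greedy step can be illustrated with a minimal sketch. It assumes that each candidate item is tagged with the set of prototype anchors it touches and that selection is limited by a fixed budget; the function name, the data layout, and the plain "count of newly covered anchors" objective are hypothetical and stand in for whatever objective the authors actually optimize.

```python
def greedy_cover(item_anchors, budget):
    """Pick up to `budget` items, each time taking the item that covers
    the most anchors not yet covered (hypothetical objective)."""
    covered = set()
    chosen = []
    remaining = dict(item_anchors)
    while remaining and len(chosen) < budget:
        # Iterate in sorted order so ties are broken deterministically by item id.
        best = max(sorted(remaining), key=lambda i: len(remaining[i] - covered))
        if not remaining[best] - covered:
            break  # no remaining item adds a new anchor
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy discipline with five anchors (a-e) and four candidate questions.
items = {
    "q1": {"a", "b"},
    "q2": {"b", "c", "d"},
    "q3": {"d"},
    "q4": {"e"},
}
chosen, covered = greedy_cover(items, budget=2)
print(chosen)  # → ['q2', 'q1']
```

For coverage objectives of this kind, greedy selection is known to reach at least a (1 - 1/e) fraction of the best achievable coverage at the same budget. Whether KINA's guarantee takes this form is not stated in this summary.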
Improving Reviewer Quality
A major challenge in building benchmarks is preventing "lazy consensus," where human reviewers approve items without putting in the necessary effort. KINA replaces standard flat-payment models with a "bonus-on-bar tournament." In this system, two reviewers evaluate each item independently, and the one who provides the higher-quality, verified score receives a bonus. This creates a competitive incentive for accuracy. The authors prove that this mechanism is more effective than flat payment at eliciting high-effort reviews, and they include stochastic audits to further discourage collusion or low-quality work.
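The payout rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' mechanism: the base pay, the bonus size, the quality bar, and the rule that ties earn no bonus are all hypothetical, and the stochastic audits are left out.

```python
def settle_item(score_a, score_b, base=1.0, bonus=0.5, bar=0.0):
    """Return (pay_a, pay_b) for one item reviewed independently by two reviewers.

    Assumed rule: both reviewers receive `base`; the reviewer with the strictly
    higher verified score also receives `bonus`, but only if that score meets `bar`.
    """
    pay_a, pay_b = base, base
    if score_a > score_b and score_a >= bar:
        pay_a += bonus
    elif score_b > score_a and score_b >= bar:
        pay_b += bonus
    return pay_a, pay_b


print(settle_item(0.9, 0.6, bar=0.7))  # → (1.5, 1.0)
print(settle_item(0.5, 0.6, bar=0.7))  # → (1.0, 1.0), winner is below the bar
```

Under a flat payment both reviewers would receive `base` regardless of effort. In this sketch extra pay is reachable only by outperforming the other reviewer on verified quality, which is the competitive incentive the paragraph describes.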
Performance and Stability
In an evaluation of 42 models from 13 labs, the researchers found that the benchmark is far from saturated. The top-performing model, Gemini-3.1-Pro-Preview, achieved an accuracy of 53.17%, with other frontier models close behind. The leaderboard reveals a tiered structure: a small group of top-tier models, a dense cluster of strong models, and a lower tier that performs only slightly better than random guessing.
The authors also emphasize the importance of ranking stability. Because small changes in a test set can shift model rankings, they provide bootstrap-based statistics for every result. This allows users to see the variance in performance, discouraging the common practice of over-interpreting minor differences between models that are ranked adjacently.
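One common way to produce such statistics is a percentile bootstrap over test items, sketched below. The function name, the 95% interval, and the number of resamples are assumptions for illustration; the summary does not say which bootstrap variant the authors use.

```python
import random


def bootstrap_accuracy(correct, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% interval for accuracy.

    `correct` is a list of 0/1 outcomes, one per benchmark item (assumed layout).
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    samples = []
    for _ in range(n_boot):
        # Resample n items with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lower = samples[int(0.025 * n_boot)]
    upper = samples[min(n_boot - 1, int(0.975 * n_boot))]
    return point, lower, upper


# Toy model that answers 53 of 100 items correctly.
outcomes = [1] * 53 + [0] * 47
point, lower, upper = bootstrap_accuracy(outcomes)
print(f"accuracy={point:.2f}, 95% interval=[{lower:.2f}, {upper:.2f}]")
```

If the intervals of two adjacently ranked models overlap heavily, the difference between them is within resampling noise, which is why small leaderboard gaps should not be over-interpreted.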
Key Considerations
KINA provides a more robust framework than many existing benchmarks, but its formal guarantees apply to the proxy objectives used by the researchers, such as the "support centrality" of questions, and do not guarantee perfect, universal representativeness. Tool use (such as web search) improved performance by up to 5.17 points, but these gains varied significantly between models. The authors release the full dataset, reviewer manuals, and evaluation code to encourage transparency and to allow future replication and extension of their work.