Every Eval Ever: A Unifying Schema and Community Re...

Key Takeaways

Every Eval Ever is a project designed to solve the fragmentation and inconsistency currently plaguing AI evaluation.
AI evaluations are widely used for testing and understanding progress.
However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison.
Results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories.
We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results.

Paper Abstract

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a single, unified JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

Every Eval Ever is a project designed to solve the fragmentation and inconsistency currently plaguing AI evaluation. As researchers and developers test new models, they often save their results in incompatible formats—ranging from blog posts and leaderboards to custom log files—making it nearly impossible to compare performance across different studies. This paper introduces a unified, community-governed schema and a crowdsourced repository that standardizes how these evaluation results are recorded, stored, and shared, ultimately making AI progress easier to track and analyze.

A Universal Language for Evaluation

The core of the project is a standardized JSON schema that acts as a common language for AI evaluation data. Instead of just recording a final score, this schema captures the essential context behind the result: who performed the evaluation, what generation settings were used, how the model was accessed, and the specific metrics applied. By creating a consistent structure, the schema allows researchers to compare results from different sources—such as academic papers, evaluation harnesses, and public leaderboards—on equal footing. It also supports optional "sidecar" files that store instance-level data, such as specific prompts and model outputs, which are vital for deep-dive analysis.
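To make the idea concrete, here is a minimal sketch of what one such record might look like when built in Python and serialized to JSON. Every field name below (`model_id`, `evaluator`, `access_method`, `generation_config`, `instance_file`) is a hypothetical stand-in chosen for illustration; the project's actual schema may name and nest these differently.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class EvalRecord:
    """Illustrative aggregate record; NOT the project's real schema."""
    model_id: str
    benchmark: str
    metric_name: str
    score: float
    evaluator: str        # who performed the evaluation
    source_type: str      # e.g. "paper", "harness", "leaderboard"
    access_method: str    # e.g. "api" or "local_weights"
    generation_config: dict = field(default_factory=dict)
    instance_file: Optional[str] = None  # optional sidecar with prompts/outputs


# All values here are made-up placeholders.
record = EvalRecord(
    model_id="example-org/example-model",
    benchmark="example-benchmark",
    metric_name="accuracy",
    score=0.75,
    evaluator="example-lab",
    source_type="harness",
    access_method="api",
    generation_config={"temperature": 0.0, "max_tokens": 256},
    instance_file="example-benchmark.instances.jsonl",
)

# One evaluation result becomes one JSON document.
document = json.dumps(asdict(record), indent=2, sort_keys=True)
print(document)
```

Keeping the instance-level data in a separate file referenced by the record, rather than inline, is one plausible way to keep the main document small while still allowing per-item analysis.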

Streamlining Data Collection

To lower the barrier for adoption, the project provides automatic converters for popular evaluation frameworks like HELM, lm-eval-harness, and Inspect AI. These tools translate existing, messy log files into the new, standardized format. To ensure quality, the repository includes a validation pipeline that checks every contribution for schema compliance before it is accepted. This process ensures that the data remains clean and usable for the entire community.
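The convert-then-validate flow can be sketched roughly as follows. The input log layout, the `convert_harness_log` and `validate` helpers, and the list of required fields are all assumptions made for this example; they do not reproduce the output format of HELM, lm-eval-harness, or Inspect AI, nor the project's real converters or validation rules.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "model_id": str,
    "benchmark": str,
    "metric_name": str,
    "score": float,
    "source_type": str,
}


def convert_harness_log(log: dict) -> list:
    """Flatten a made-up harness log into one record per (task, metric)."""
    records = []
    for task, metrics in log["results"].items():
        for metric_name, value in metrics.items():
            records.append({
                "model_id": log["model"],
                "benchmark": task,
                "metric_name": metric_name,
                "score": float(value),
                "source_type": "harness",
            })
    return records


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


raw_log = {
    "model": "example-org/example-model",
    "results": {"task_a": {"accuracy": 0.5, "f1": 0.25}},
}
records = convert_harness_log(raw_log)
accepted = [r for r in records if not validate(r)]

# A malformed contribution is rejected with explicit reasons.
rejected = validate({"model_id": "x", "score": "high"})
print(len(accepted), rejected)
```

A production pipeline would more likely validate against a published JSON Schema file than against a hand-written field table; the table is used here only to keep the sketch dependency-free.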

Building a Community Repository

The project hosts a crowdsourced database on Hugging Face that serves as a central hub for evaluation results. As of the study's publication, this repository already contains data for 22,235 models and 2,273 unique benchmarks across 31 evaluation formats. By pooling these results, the community can conduct large-scale meta-analyses that were previously impractical, such as identifying cost-accuracy trade-offs, detecting reproducibility gaps, and performing item-level analysis.
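As an illustration of one such meta-analysis, the sketch below flags possible reproducibility gaps: cases where the same model, benchmark, and metric were reported with noticeably different scores by different sources. The record layout, the `reproducibility_gaps` helper, and the 0.05 threshold are assumptions for this example, and the scores are invented rather than taken from the repository.

```python
from collections import defaultdict


def reproducibility_gaps(records, threshold=0.05):
    """Map (model, benchmark, metric) to score spread when it exceeds threshold."""
    grouped = defaultdict(list)
    for r in records:
        key = (r["model_id"], r["benchmark"], r["metric_name"])
        grouped[key].append(r["score"])

    gaps = {}
    for key, scores in grouped.items():
        spread = max(scores) - min(scores)
        # Only groups with at least two reports can disagree.
        if len(scores) > 1 and spread > threshold:
            gaps[key] = spread
    return gaps


# Invented records standing in for pooled results from several sources.
pooled = [
    {"model_id": "model-a", "benchmark": "bench-x", "metric_name": "accuracy",
     "score": 0.75, "source_type": "paper"},
    {"model_id": "model-a", "benchmark": "bench-x", "metric_name": "accuracy",
     "score": 0.5, "source_type": "harness"},
    {"model_id": "model-b", "benchmark": "bench-x", "metric_name": "accuracy",
     "score": 0.5, "source_type": "paper"},
    {"model_id": "model-b", "benchmark": "bench-x", "metric_name": "accuracy",
     "score": 0.5, "source_type": "leaderboard"},
]

gaps = reproducibility_gaps(pooled)
for key, spread in gaps.items():
    print(key, spread)
```

This kind of grouping only works because every record uses the same field names, which is the practical payoff of a shared schema.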

Improving Transparency and Reproducibility

By standardizing reporting, Every Eval Ever addresses the "hidden" variables that often lead to conflicting scores for the same model. For example, the project highlights how different inference engines or generation parameters can lead to divergent results, even when the benchmark name is identical. By making these details explicit rather than implicit, the repository helps researchers avoid the high costs of re-running evaluations and provides a more reliable foundation for understanding AI capabilities and risks. The project operates under a community-governed model, ensuring that the schema remains flexible and evolves to meet the needs of researchers as new AI technologies emerge.
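Once these settings are recorded explicitly, explaining a score discrepancy can start with a simple comparison of two runs' metadata. The following is a minimal sketch under assumed field names (`inference_engine`, `generation_config`); the `diff_settings` helper is hypothetical and not part of the project's tooling.

```python
def diff_settings(a: dict, b: dict, prefix: str = "") -> dict:
    """Return {dotted_key: (value_in_a, value_in_b)} for settings that differ.

    A key missing on one side is reported with None for that side.
    """
    differences = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested settings such as generation parameters.
            differences.update(diff_settings(left, right, prefix=f"{path}."))
        elif left != right:
            differences[path] = (left, right)
    return differences


# Two invented runs of the "same" benchmark with different hidden variables.
run_a = {
    "inference_engine": "engine-a",
    "generation_config": {"temperature": 0.0, "max_tokens": 256},
}
run_b = {
    "inference_engine": "engine-b",
    "generation_config": {"temperature": 0.7, "max_tokens": 256},
}

changed = diff_settings(run_a, run_b)
for path, (left, right) in changed.items():
    print(f"{path}: {left!r} vs {right!r}")
```

Here the two runs share a token limit but differ in engine and temperature, which is the kind of difference that stays invisible when only the benchmark name and final score are published.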