Every Eval Ever is a project designed to address the fragmentation and inconsistency in how AI evaluation results are reported. As researchers and developers test new models, they often save their results in incompatible formats, ranging from blog posts and leaderboards to custom log files, which makes it hard to compare performance across studies. This paper introduces a unified, community-governed schema and a crowdsourced repository that standardize how these evaluation results are recorded, stored, and shared, making AI progress easier to track and analyze.
A Universal Language for Evaluation
The core of the project is a standardized JSON schema that acts as a common language for AI evaluation data. Instead of recording only a final score, the schema captures the context behind the result: who performed the evaluation, what generation settings were used, how the model was accessed, and which metrics were applied. This consistent structure lets researchers compare results from different sources (academic papers, evaluation harnesses, and public leaderboards) on equal footing. The schema also supports optional "sidecar" files that store instance-level data, such as individual prompts and model outputs, which are needed for item-level analysis.
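To make the idea of a context-carrying record concrete, here is a minimal sketch of what one result might look like when built in Python and serialized to JSON. Every field name (`evaluator`, `access_method`, `generation_settings`, `instance_sidecar`, and so on) and every value is a hypothetical placeholder chosen for illustration; the project's actual schema may name and nest these fields differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRecord:
    # NOTE: all field names are hypothetical, not the project's real schema.
    model: str
    benchmark: str
    evaluator: str             # who performed the evaluation
    access_method: str         # how the model was accessed, e.g. "api" or "local"
    generation_settings: dict  # e.g. temperature, maximum output tokens
    metrics: dict              # metric name -> score
    instance_sidecar: Optional[str] = None  # optional path to instance-level data


# Placeholder values only; these are not real evaluation results.
record = EvalRecord(
    model="example-model",
    benchmark="example-benchmark",
    evaluator="example-lab",
    access_method="api",
    generation_settings={"temperature": 0.0, "max_tokens": 256},
    metrics={"accuracy": 0.5},
)

# Serialize the record, context included, as a single JSON document.
record_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(record_json)
```

The point of the sketch is that the score never travels alone: the settings and provenance sit in the same document, so two records for the same benchmark can be compared field by field.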
Streamlining Data Collection
To lower the barrier for adoption, the project provides automatic converters for popular evaluation frameworks like HELM, lm-eval-harness, and Inspect AI. These tools translate existing, messy log files into the new, standardized format. To ensure quality, the repository includes a validation pipeline that checks every contribution for schema compliance before it is accepted. This process ensures that the data remains clean and usable for the entire community.
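The convert-then-validate flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's converters or validation pipeline: the input layout (`model_name`, `task`, `scores`, `run_by`) is invented and does not reproduce the log format of HELM, lm-eval-harness, or Inspect AI, and the required fields are the same hypothetical ones used for illustration only.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {"model": str, "benchmark": str, "evaluator": str, "metrics": dict}


def convert_log(raw: dict) -> dict:
    """Map an invented harness log layout onto the hypothetical record layout."""
    return {
        "model": raw["model_name"],
        "benchmark": raw["task"],
        "evaluator": raw.get("run_by", "unknown"),
        # Harness logs may store scores as strings; normalize them to floats.
        "metrics": {name: float(value) for name, value in raw["scores"].items()},
    }


def validate(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return errors


raw_log = {
    "model_name": "example-model",
    "task": "example-benchmark",
    "scores": {"accuracy": "0.5"},
}
converted = convert_log(raw_log)
errors_ok = validate(converted)

# A malformed contribution: two fields missing and one with the wrong type.
errors_bad = validate({"model": "example-model", "metrics": []})
print(errors_ok, errors_bad)
```

Returning a list of problems instead of raising on the first one lets a contributor see everything that needs fixing in a single pass, which suits a review step that gates submissions.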
Building a Community Repository
The project hosts a crowdsourced database on Hugging Face that serves as a central hub for evaluation results. As of the study's publication, this repository already contains data for over 22,000 models and 2,200 benchmarks across 31 different formats. By pooling these results, the community can conduct large-scale meta-analyses that were previously impossible, such as identifying cost-accuracy trade-offs, detecting reproducibility gaps, and performing item-level analysis.
Improving Transparency and Reproducibility
By standardizing reporting, Every Eval Ever addresses the "hidden" variables that often lead to conflicting scores for the same model. For example, the project highlights how different inference engines or generation parameters can lead to divergent results, even when the benchmark name is identical. By making these details explicit rather than implicit, the repository helps researchers avoid the high costs of re-running evaluations and provides a more reliable foundation for understanding AI capabilities and risks. The project operates under a community-governed model, ensuring that the schema remains flexible and evolves to meet the needs of researchers as new AI technologies emerge.
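Once settings are recorded explicitly, conflicting scores can be traced to the settings that differ. The sketch below groups results by model and benchmark, flags groups whose scores disagree by more than a tolerance, and lists which recorded settings vary within the group. The field names, the tolerance, and all scores are made-up illustrative values, not data from the repository.

```python
from collections import defaultdict

# Made-up illustrative results; not real evaluation data.
results = [
    {"model": "m", "benchmark": "b", "inference_engine": "engine-a",
     "temperature": 0.0, "score": 0.71},
    {"model": "m", "benchmark": "b", "inference_engine": "engine-b",
     "temperature": 0.7, "score": 0.64},
    {"model": "m", "benchmark": "c", "inference_engine": "engine-a",
     "temperature": 0.0, "score": 0.80},
]

# Hypothetical names for the settings we compare across runs.
SETTING_FIELDS = ("inference_engine", "temperature")


def find_divergent(rows, tolerance=0.01):
    """Map each (model, benchmark) with conflicting scores to the settings that differ."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["benchmark"])].append(row)

    divergent = {}
    for key, group in groups.items():
        scores = [row["score"] for row in group]
        if max(scores) - min(scores) > tolerance:
            # Keep only the settings that take more than one value in this group.
            divergent[key] = sorted(
                name for name in SETTING_FIELDS
                if len({row[name] for row in group}) > 1
            )
    return divergent


divergent = find_divergent(results)
print(divergent)
```

A differing setting is only a candidate explanation for a score gap, not proof of cause, but it narrows what a researcher would need to re-run.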