Evaluating AI-generated interactive web code is hard to do well. Current methods tend to rely either on rigid checklists, which are too inflexible to capture real-world user experience, or on human-judged leaderboards, which are too slow and expensive to use during development. This paper introduces an automated evaluation system designed to mimic the reasoned, holistic judgment of a human reviewer without needing reference implementations or pre-authored test scripts.
A New Evaluation Regime
The researchers propose a two-part solution: Cookie-Bench, a benchmark, and Cookie, an evaluator. Cookie-Bench consists of 1,000 queries across 11 domains, covering both static web pages and complex, interactive applications. So that the benchmark tests actual capability rather than memorization, the queries are designed to resist recall from existing datasets and are balanced across difficulty levels and languages.
How the Evaluation Works
The evaluation tool, called Cookie, functions like a human reviewer by separating the gathering of evidence from the final judgment. It operates in three distinct stages:
Static Perception: The system loads the generated web page and forms a "first impression" by analyzing screenshots and runtime logs.
Agent-Driven Interaction: An autonomous agent explores the page, clicking buttons and navigating flows. During this process, the system captures continuous video, audio, and interaction traces to see how the page behaves in practice.
Dynamic Scoring: Only after all evidence is collected does the system issue a final verdict. It evaluates the page on functionality and aesthetics, providing specific reasons for any failures identified during the interaction.
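The separation between evidence gathering and judgment in the three stages above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the names `Evidence`, `Verdict`, `static_perception`, `agent_interaction`, `dynamic_scoring`, and `evaluate` are all hypothetical, the browser and agent are replaced by stubs, and the scoring rule is a placeholder standing in for the model-based judgment the paper describes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """Everything gathered about one generated page before any scoring happens."""
    screenshots: List[str] = field(default_factory=list)
    runtime_logs: List[str] = field(default_factory=list)
    interaction_trace: List[str] = field(default_factory=list)


@dataclass
class Verdict:
    functionality: float
    aesthetics: float
    failure_reasons: List[str]


def static_perception(page_html: str, evidence: Evidence) -> None:
    # Stage 1 (stub): a real system would render the page in a browser and
    # capture screenshots and console output. Here we only record a fake
    # screenshot name and one trivial log check.
    evidence.screenshots.append("initial_render.png")
    if "<script" not in page_html:
        evidence.runtime_logs.append("no script tag found")


def agent_interaction(actions: List[str], evidence: Evidence) -> None:
    # Stage 2 (stub): a real agent would choose and perform actions itself.
    # Here the actions are supplied by the caller and simply recorded.
    for action in actions:
        evidence.interaction_trace.append(f"performed: {action}")


def dynamic_scoring(evidence: Evidence) -> Verdict:
    # Stage 3: runs only on the completed evidence record.
    # Placeholder rule (an assumption): any logged problem zeroes functionality.
    reasons = list(evidence.runtime_logs)
    functionality = 0.0 if reasons else 1.0
    aesthetics = 1.0 if evidence.screenshots else 0.0
    return Verdict(functionality, aesthetics, reasons)


def evaluate(page_html: str, actions: List[str]) -> Verdict:
    evidence = Evidence()
    static_perception(page_html, evidence)   # gather first impression
    agent_interaction(actions, evidence)     # gather behavioral evidence
    return dynamic_scoring(evidence)         # judge only after both finish
```

The point of the structure is that `dynamic_scoring` receives a finished `Evidence` object and never touches the page, so the verdict and its failure reasons can always be traced back to something that was actually observed.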
Performance of Frontier Models
The researchers tested 13 frontier LLMs using this system, comparing their performance when generating code through an agent-based scaffold versus direct HTML chat. The results revealed significant performance gaps between models, with the agent-based scaffold helping to raise the performance floor for weaker models. The evaluation also highlighted that while many models handle static layouts well, they struggle more with the complex event wiring and state management required for interactive, dynamic web applications.
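To make "raising the performance floor" concrete, the sketch below compares two generation settings by their lowest score and by the spread between the best and worst model. The model names and scores are invented placeholders on a 0 to 100 scale, chosen only to illustrate the calculation; they are not results from the paper, and `floor_and_spread` is a hypothetical helper.

```python
from typing import Dict, Tuple


def floor_and_spread(scores: Dict[str, int]) -> Tuple[int, int]:
    """Return (lowest score, gap between highest and lowest score)."""
    lowest = min(scores.values())
    highest = max(scores.values())
    return lowest, highest - lowest


# Placeholder scores (NOT from the paper), one entry per hypothetical model.
direct_chat = {"model_a": 40, "model_b": 70, "model_c": 85}
agent_scaffold = {"model_a": 60, "model_b": 75, "model_c": 85}

chat_floor, chat_spread = floor_and_spread(direct_chat)
agent_floor, agent_spread = floor_and_spread(agent_scaffold)

# Per-model change when moving from direct chat to the agent scaffold.
gain = {m: agent_scaffold[m] - direct_chat[m] for m in direct_chat}

print(f"direct chat:    floor={chat_floor}, spread={chat_spread}")
print(f"agent scaffold: floor={agent_floor}, spread={agent_spread}")
print(f"per-model gain: {gain}")
```

In this made-up example the weakest model gains the most and the strongest gains nothing, which is the pattern a higher floor with a narrower spread would describe.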
Why This Matters
By removing the need for human-authored checklists or ground-truth references, this approach provides a scalable way to evaluate web-generation models. It allows developers to get high-fidelity feedback on how their models perform in real-world, interactive scenarios, helping to bridge the gap between simple code generation and the creation of fully functional, user-ready web applications.