Cookie-Bench: Continuous On-screen Key Interaction...

Key Takeaways

Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session.
We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts.
On Cookie-Bench, Cookie aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation.

Paper Abstract

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. Cookie-Bench is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. Cookie, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On Cookie-Bench, Cookie aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation.

Evaluating interactive web code generated by AI has become a major challenge. Current methods often rely on rigid checklists or human-judged leaderboards, which are either too inflexible to capture real-world user experiences or too slow and expensive to use during the development process. This paper introduces a new, automated evaluation system designed to mimic the reasoned, holistic judgment of a human reviewer without needing reference implementations or pre-authored test scripts.

A New Evaluation Regime

The researchers propose a two-part solution: Cookie-Bench, a benchmark, and Cookie, an evaluation framework. Cookie-Bench consists of 1,000 queries organized into 11 domains and 54 leaf categories, covering both static web pages and complex, interactive applications. To ensure the benchmark tests actual capability rather than memorization, the task briefs are rewritten to resist recall from prompts already in circulation, and the queries are balanced across three difficulty tiers and three target-language groups.
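To make the balancing claim concrete, here is a minimal sketch of how one query record and a balance check might look. The field names, the tier and language-group labels, and the choice to check balance per (tier, language group) cell are all assumptions for illustration; the summary gives only the counts, not the schema or the exact balancing rule.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# Hypothetical labels: the source states there are three tiers and three
# target-language groups but does not name them.
TIERS = ("easy", "medium", "hard")
LANGUAGE_GROUPS = ("group_a", "group_b", "group_c")


@dataclass(frozen=True)
class Query:
    domain: str          # one of the 11 top-level domains
    leaf: str            # one of the 54 leaf categories
    interactive: bool    # interactive application vs. static presentation
    tier: str            # difficulty tier
    language_group: str  # target-language group
    brief: str           # the rewritten task brief


def balance_report(queries):
    """Count queries per (tier, language_group) cell, including empty cells."""
    counts = Counter((q.tier, q.language_group) for q in queries)
    return {cell: counts.get(cell, 0) for cell in product(TIERS, LANGUAGE_GROUPS)}


def is_balanced(queries, tolerance=0):
    """True when the largest and smallest cells differ by at most `tolerance`."""
    cells = balance_report(queries).values()
    return max(cells) - min(cells) <= tolerance
```

A nonzero `tolerance` is useful here because 1,000 queries cannot be split evenly into nine cells.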

How the Evaluation Works

The evaluation tool, called Cookie, functions like a human reviewer by separating the gathering of evidence from the final judgment. It operates in three distinct stages:

Static Perception: The system loads the generated web page and forms a "first impression" by analyzing screenshots and runtime logs.
Agent-Driven Interaction: An autonomous agent explores the page, clicking buttons and navigating flows. During this process, the system captures continuous video, audio, and interaction traces to see how the page behaves in practice.
Dynamic Scoring: Only after all evidence is collected does the system issue a final verdict. It evaluates the page on functionality and aesthetics, providing specific reasons for any failures identified during the interaction.
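The three stages above can be sketched as a small pipeline in which scoring refuses to run until evidence collection has finished. This is an illustrative outline only, not the authors' implementation: every class, function, and parameter name is hypothetical, the `observe`, `act`, and `judge` callables stand in for the browser, the agent, and the judging model, and the toy judge uses a placeholder aesthetics value.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    first_impression: str = ""
    interaction_steps: List[str] = field(default_factory=list)
    complete: bool = False  # set once the interaction stage has finished


@dataclass
class Verdict:
    functionality: float
    aesthetics: float
    failures: List[str]  # structured failure attribution


def static_perception(evidence: Evidence, observe: Callable[[], str]) -> Evidence:
    """Stage 1: record a first impression from passive observation."""
    evidence.first_impression = observe()
    return evidence


def agent_interaction(evidence: Evidence, actions: List[str],
                      act: Callable[[str], str]) -> Evidence:
    """Stage 2: perform each action and append what was observed."""
    for action in actions:
        evidence.interaction_steps.append(act(action))
    evidence.complete = True
    return evidence


def dynamic_scoring(evidence: Evidence,
                    judge: Callable[[Evidence], Verdict]) -> Verdict:
    """Stage 3: judge only after the evidence chain is complete."""
    if not evidence.complete:
        raise RuntimeError("scoring requested before evidence collection finished")
    return judge(evidence)


def toy_judge(evidence: Evidence) -> Verdict:
    """Stand-in judge: the share of steps not marked FAIL is the functionality score."""
    failures = [s for s in evidence.interaction_steps if s.startswith("FAIL")]
    total = max(len(evidence.interaction_steps), 1)
    return Verdict(functionality=1.0 - len(failures) / total,
                   aesthetics=0.5,  # placeholder; a real judge would assess visuals
                   failures=failures)
```

Keeping the judge as a separate callable that sees only the finished `Evidence` object mirrors the separation of evidence accumulation from judgment described in the abstract.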

Performance of Frontier Models

The researchers tested 13 frontier LLMs using this system, comparing their performance when generating code through an agent-based scaffold versus direct HTML chat. The results revealed significant performance gaps between models, with the agent-based scaffold helping to raise the performance floor for weaker models. The evaluation also highlighted that while many models handle static layouts well, they struggle more with the complex event wiring and state management required for interactive, dynamic web applications.
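One way to examine the "performance floor" observation is to compare the minimum and mean score in each generation setting, then list per-model gains starting from the weakest direct-chat model. The sketch below assumes a hypothetical score table keyed by model name with `"agent"` and `"direct"` entries; it contains no figures from the paper, and the function names are illustrative.

```python
from statistics import mean


def setting_summary(scores):
    """Return the floor (minimum) and mean score for each generation setting.

    `scores` maps model name -> {"agent": score, "direct": score}.
    """
    summary = {}
    for setting in ("agent", "direct"):
        values = [per_model[setting] for per_model in scores.values()]
        summary[setting] = {"floor": min(values), "mean": mean(values)}
    return summary


def scaffold_gain(scores):
    """Per-model change from direct chat to the agent scaffold.

    Results are ordered from the weakest to the strongest direct-chat model,
    so a raised floor shows up as large gains at the start of the list.
    """
    gains = [(model, s["agent"] - s["direct"]) for model, s in scores.items()]
    return sorted(gains, key=lambda item: scores[item[0]]["direct"])
```

If the scaffold does raise the floor, `setting_summary` would show a higher `"floor"` under `"agent"` than under `"direct"`, and `scaffold_gain` would show its largest gains among the first entries.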

Why This Matters

By removing the need for human-authored checklists or ground-truth references, this approach provides a scalable way to evaluate web-generation models. It allows developers to get high-fidelity feedback on how their models perform in real-world, interactive scenarios, helping to bridge the gap between simple code generation and the creation of fully functional, user-ready web applications.