AgentBeats: Agentifying Agent Assessment for Openne...

Key Takeaways

Agent systems are advancing quickly across domains, but their evaluation remains fragmented.
Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs.
The root problem is the lack of an open, agent-agnostic assessment interface.
We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access.

Paper Abstract

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
The rapid growth of AI agent systems has created a fragmented evaluation landscape. Current benchmarks often rely on rigid, custom-built harnesses that require significant integration effort for every new agent, leading to a mismatch between how agents are tested and how they perform in real-world production. This paper introduces Agentified Agent Assessment (AAA), a new paradigm that treats benchmarks as agents themselves. By using standardized protocols for task management and tool access, AAA creates a unified, flexible framework that separates assessment logic from agent design, making it easier to evaluate diverse agents across various tasks.

A Standardized Approach to Evaluation

The core of the AAA paradigm is the use of two existing open protocols: A2A (Agent2Agent) for managing tasks and MCP (Model Context Protocol) for accessing tools. Building on these standards removes the need for bespoke, benchmark-specific integrations. Instead of building a unique connection for every agent-benchmark pair, developers only need to make their agents compatible with the two protocols. With M benchmarks and N agents, the integration work drops from up to M × N pairwise adapters to roughly M + N protocol implementations.
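The integration argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the worst case in which every subject agent would be paired with every judge agent, which the paper does not claim happened. The counts are the ones reported for the open competition.

```python
def pairwise_adapters(num_benchmarks: int, num_agents: int) -> int:
    """Bespoke approach: one adapter per (benchmark, agent) pair."""
    return num_benchmarks * num_agents


def protocol_implementations(num_benchmarks: int, num_agents: int) -> int:
    """Shared-protocol approach: each participant implements the protocol once."""
    return num_benchmarks + num_agents


# Counts reported for the open competition: 298 judge agents, 467 subject agents.
judges, subjects = 298, 467
print(pairwise_adapters(judges, subjects))         # 139166
print(protocol_implementations(judges, subjects))  # 765
```

The gap widens as either side grows, which is why a single shared interface matters more for a community-scale setting than for a single benchmark.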

How the AAA Workflow Functions

Under AAA, an evaluation involves three distinct roles: a delegator, a judge agent, and one or more subject agents. The delegator initiates the process by selecting the desired benchmark and target agents. The judge agent, which acts as the benchmark, manages the environment, distributes tasks, and monitors performance. The subject agents then attempt to complete the assigned tasks using their own internal logic and available tools. Because the judge agent controls the entire session, it can run adaptive assessments, such as skipping redundant tests or generating more challenging tasks based on an agent's performance, which improves both efficiency and depth of insight.
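The three roles can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the AgentBeats implementation: `SubjectAgent`, `JudgeAgent`, `Task`, `EchoUpperAgent`, and `delegate` are hypothetical names, ordinary method calls stand in for A2A messages, MCP tool access is not modeled, and the early-stopping rule is just one possible adaptive policy.

```python
from dataclasses import dataclass
from typing import Protocol


class SubjectAgent(Protocol):
    """Anything that accepts a task prompt and returns an answer."""

    def handle_task(self, prompt: str) -> str: ...


@dataclass
class Task:
    prompt: str
    expected: str
    difficulty: int  # 1 = easiest


class EchoUpperAgent:
    """Toy subject agent: upper-cases short prompts, gives up on long ones."""

    def handle_task(self, prompt: str) -> str:
        return prompt.upper() if len(prompt) <= 5 else ""


@dataclass
class JudgeAgent:
    """Plays the benchmark: owns the tasks, hands them out, and scores answers."""

    tasks: list[Task]
    max_failures: int = 1  # assumed policy: stop once this many tasks fail

    def assess(self, subject: SubjectAgent) -> dict[str, int]:
        attempted = passed = failures = 0
        for task in sorted(self.tasks, key=lambda t: t.difficulty):
            if failures >= self.max_failures:
                break  # adaptive step: skip harder tasks after a failure
            attempted += 1
            if subject.handle_task(task.prompt) == task.expected:
                passed += 1
            else:
                failures += 1
        return {
            "attempted": attempted,
            "passed": passed,
            "skipped": len(self.tasks) - attempted,
        }


def delegate(judge: JudgeAgent, subjects: dict[str, SubjectAgent]) -> dict[str, dict[str, int]]:
    """Delegator: picks a judge and the subject agents, then collects results."""
    return {name: judge.assess(agent) for name, agent in subjects.items()}


judge = JudgeAgent(tasks=[
    Task("abc", "ABC", 1),
    Task("hello", "HELLO", 2),
    Task("longer prompt", "LONGER PROMPT", 3),
    Task("even longer prompt", "EVEN LONGER PROMPT", 4),
])
results = delegate(judge, {"echo": EchoUpperAgent()})
print(results)  # {'echo': {'attempted': 3, 'passed': 2, 'skipped': 1}}
```

The judge never inspects how the subject works internally; it only sends prompts and reads answers. That is the separation between assessment logic and agent implementation described above, and it is what lets the same judge assess any agent that exposes the same interface.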

Validating the Paradigm at Scale

To test AAA at scale, the researchers conducted two studies. The first was a five-month open competition that drew 298 judge agents across 12 categories, including coding, web browsing, and healthcare, together with 467 subject agents from independent participants. This showed that AAA applies across a heterogeneous range of benchmarks. The second was a case study on coding agents, which confirmed that agentified evaluation preserves fidelity with the public record while surfacing head-to-head results that were previously missing, yielding research insights about agent design.

Practicality and Future Impact

AgentBeats serves as a concrete implementation of the AAA paradigm, offering five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. By turning evaluation into a reusable, production-aligned process, the authors argue, AAA lowers the barrier to entry for developers. The approach makes it easier to compare different agent designs fairly and provides a scalable foundation for agent assessment, moving the field away from ad-hoc, fragmented testing toward a standardized and interoperable ecosystem.