"AgentSearchBench: A Benchmark for AI Agent Search in the Wild" addresses the growing difficulty of finding the right AI agent for a specific task. As the ecosystem of available agents expands, users and developers struggle to identify which ones will actually perform well, because textual descriptions are often misleading or incomplete. This paper introduces a large-scale benchmark designed to evaluate how well different search methods can identify and rank agents based on their actual performance rather than their documentation alone.
The Challenge of Agent Discovery
Unlike traditional software tools, AI agents are often compositional and their capabilities depend heavily on the environment in which they are executed. Because of this, two agents with similar-sounding descriptions might perform very differently in practice. Current search methods often rely on static text matching, which fails to capture this "execution-dependent" nature. The authors argue that effective agent discovery must move beyond simple keyword or semantic similarity and instead incorporate signals from how agents behave when they are actually put to work.
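To make the criticized baseline concrete, here is a minimal sketch of static text matching via embedding similarity. This is an illustration only: the embedding model and the agent descriptions are assumptions, not the paper's retrieval setup.

```python
# Minimal sketch of a static text-matching baseline for agent search.
# Assumes sentence-transformers is installed; agents and query are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

agents = {
    "csv-analyst": "Analyzes CSV files and produces summary statistics.",
    "web-scraper": "Fetches and parses web pages to extract structured data.",
    "pdf-summarizer": "Summarizes long PDF documents into short briefs.",
}

query = "Compute average revenue per region from a sales spreadsheet."

# Rank agents purely by description similarity -- exactly the kind of static
# matching that ignores how the agents behave when actually executed.
query_emb = model.encode(query, convert_to_tensor=True)
desc_embs = model.encode(list(agents.values()), convert_to_tensor=True)
scores = util.cos_sim(query_emb, desc_embs)[0].tolist()

for name, score in sorted(zip(agents.keys(), scores), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```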
A New Benchmark for Real-World Agents
To study this problem, the researchers built AgentSearchBench using nearly 10,000 real-world agents sourced from various public platforms. The benchmark formalizes the search process into two categories:

* Executable Task Queries: direct, concrete instructions that can be tested immediately.
* High-Level Task Descriptions: broader goals that require the system to understand implicit requirements.
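As a purely illustrative data model for these two query types (the field names are assumptions, not the benchmark's released schema), the distinction could be captured like this:

```python
# Illustrative data model for the two query categories; field names are
# assumed for clarity and are not taken from the benchmark itself.
from dataclasses import dataclass, field

@dataclass
class ExecutableTaskQuery:
    """A concrete instruction that can be run against an agent directly."""
    instruction: str        # e.g. "Convert report.csv to JSON"
    expected_outcome: str   # what a successful run should produce

@dataclass
class HighLevelTaskDescription:
    """A broader goal whose implicit requirements must be inferred."""
    goal: str                                             # e.g. "Monitor competitor pricing"
    implicit_requirements: list[str] = field(default_factory=list)
```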
The team created a pipeline that generates these tasks and evaluates agents using an "LLM-as-a-judge" approach. By running over 66,000 execution tests, they established a ground truth for which agents are truly capable of completing specific tasks, allowing for a more accurate evaluation of search and ranking models.
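The sketch below shows the general shape of such an evaluation loop: run every candidate agent on every task and record a judged pass/fail verdict. The agent execution and the judge call are stubbed placeholders; the paper's actual pipeline, prompts, and scoring rubric may differ.

```python
# Simplified sketch of an LLM-as-a-judge ground-truth loop. run_agent and judge
# are placeholders, not the paper's implementation.
def run_agent(agent_name: str, task: str) -> str:
    """Placeholder: execute the agent on the task and return its transcript."""
    return f"[transcript of {agent_name} attempting: {task}]"

def judge(task: str, transcript: str) -> bool:
    """Placeholder: ask a judge LLM whether the transcript completes the task."""
    prompt = (
        "You are grading an AI agent.\n"
        f"Task: {task}\n"
        f"Transcript: {transcript}\n"
        "Answer 'yes' if the task was completed, otherwise 'no'."
    )
    # In practice this prompt would be sent to a judge model; here we stub it.
    return True

def build_ground_truth(agents: list[str], tasks: list[str]) -> dict[tuple[str, str], bool]:
    """Run every (agent, task) pair and record the judged outcome."""
    results = {}
    for task in tasks:
        for agent in agents:
            transcript = run_agent(agent, task)
            results[(agent, task)] = judge(task, transcript)
    return results

if __name__ == "__main__":
    print(build_ground_truth(["csv-analyst", "web-scraper"], ["Summarize sales.csv"]))
```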
Key Findings and the Performance Gap
The experiments revealed a consistent "semantic–performance gap," where the agents that appear most relevant based on text descriptions are often not the ones that perform best during execution. This gap is particularly wide when starting from high-level task descriptions, where the requirements are less explicit.
The study also demonstrated that "execution-aware probing"—using lightweight behavioral signals from actual agent runs—can significantly improve the quality of search results. This suggests that incorporating performance data is essential for building reliable agent discovery systems.
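One way such a behavioral signal could be folded into ranking is a simple interpolation between the semantic score and a probe score, as in the sketch below. The probe scoring and the blending weight are assumptions for illustration; the paper's actual probing method may differ.

```python
# Sketch of execution-aware reranking: blend description similarity with a
# lightweight behavioral signal from a short probe run. The probe and the
# interpolation weight alpha are assumptions, not the paper's exact method.
def probe_score(agent_name: str, task: str) -> float:
    """Placeholder: briefly run the agent on the task and score its behavior (0-1)."""
    return 0.5  # stubbed; a real probe would inspect the agent's partial output

def rerank(candidates: dict[str, float], task: str, alpha: float = 0.5) -> list[tuple[str, float]]:
    """Combine semantic similarity with the probe signal and sort best-first."""
    blended = {
        name: alpha * sem + (1 - alpha) * probe_score(name, task)
        for name, sem in candidates.items()
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# Usage: candidates come from a text-matching retriever, then get probed.
print(rerank({"csv-analyst": 0.82, "web-scraper": 0.74}, "Summarize sales.csv"))
```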
Implications for Future Development
The findings highlight that relying solely on documentation or semantic embeddings is insufficient for navigating modern, open agent ecosystems. By providing this large-scale benchmark, the authors aim to encourage the development of retrieval and ranking systems that prioritize functional competence. The research underscores that in a world of autonomous agents, the ability to verify performance through interaction is just as important as the ability to describe a task.
