From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
This paper addresses a critical gap in the field of offensive AI: current benchmarks for pentesting agents are too narrow and artificial to predict how these systems will perform in real-world environments. Existing benchmarks often focus on narrow tasks such as capture-the-flag challenges or reproducing a single exploit, so they fail to capture the complex, open-ended decision-making required for professional security work. The authors propose a new, practical evaluation protocol that shifts the focus from completing predefined tasks to discovering and validating actual vulnerabilities in complex, realistic targets.
A New Approach to Evaluation
Instead of relying on static, "correct" answers, the authors introduce a methodology that treats vulnerability discovery as the primary metric. The protocol uses a structured pipeline where an AI agent’s findings are compared against a "ground-truth" list of known vulnerabilities. Because real-world security reports can be ambiguous, the system uses an LLM-as-a-judge to semantically match agent findings to the ground truth, followed by a mathematical process called bipartite resolution to ensure that duplicate reports are not incorrectly counted and that each vulnerability is credited accurately.
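The resolution step can be illustrated with a small sketch. It assumes the judge has already produced, for each agent finding, the set of ground-truth entries it considers a plausible match; the function name `resolve_matches`, the identifiers, and the use of a maximum-cardinality matching (Kuhn's augmenting-path algorithm) are illustrative choices, not necessarily the authors' exact procedure.

```python
from typing import Dict, Set


def resolve_matches(candidates: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return a one-to-one assignment finding -> ground-truth vulnerability.

    `candidates` maps each agent finding to the ground-truth entries the
    judge accepted as plausible matches (hypothetical input format).
    """
    owner: Dict[str, str] = {}  # vulnerability -> finding currently credited

    def try_assign(finding: str, seen: Set[str]) -> bool:
        for vuln in sorted(candidates.get(finding, ())):
            if vuln in seen:
                continue
            seen.add(vuln)
            # Take a free vulnerability, or re-route its current owner elsewhere.
            if vuln not in owner or try_assign(owner[vuln], seen):
                owner[vuln] = finding
                return True
        return False

    for finding in sorted(candidates):
        try_assign(finding, set())
    return {finding: vuln for vuln, finding in owner.items()}


# Hypothetical judge output: f1 and f2 are duplicate reports of V1.
candidates = {"f1": {"V1"}, "f2": {"V1"}, "f3": {"V1", "V2"}, "f4": set()}
ground_truth = ["V1", "V2", "V3"]

matching = resolve_matches(candidates)
recall = len(matching) / len(ground_truth)   # 2 of 3 known vulnerabilities
precision = len(matching) / len(candidates)  # 2 of 4 reported findings
```

Here the duplicate report `f2` earns no extra credit, and `f3` is routed to `V2` because `V1` is already taken, so each ground-truth entry is credited at most once.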
Managing Real-World Complexity
The protocol acknowledges that real-world pentesting is messy and unpredictable. To handle this, the authors suggest that ground truth should not be a static document, but a "living" resource that is updated through expert review. If an agent discovers a legitimate vulnerability that wasn't previously documented, that finding is used to improve the evaluation data for future tests. This ensures that the benchmark evolves alongside the agents being tested, preventing the evaluation from becoming obsolete.
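As a rough sketch of this feedback loop, unmatched findings could be passed to a reviewer and, if accepted, folded into the ground truth. The `GroundTruth` class, the `expert_accepts` callback, and the version counter are all hypothetical names introduced for illustration; the source does not specify how the update is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class GroundTruth:
    entries: Set[str] = field(default_factory=set)
    version: int = 1  # assumption: bump on every accepted change

    def incorporate(
        self, unmatched: List[str], expert_accepts: Callable[[str], bool]
    ) -> List[str]:
        """Add expert-approved findings that are not yet documented."""
        accepted = [
            f for f in unmatched if f not in self.entries and expert_accepts(f)
        ]
        if accepted:
            self.entries.update(accepted)
            self.version += 1
        return accepted


gt = GroundTruth(entries={"sqli-login"})
# Stand-in for human review: only the first finding is judged legitimate.
newly_added = gt.incorporate(
    ["idor-orders", "false-positive-banner"],
    expert_accepts=lambda finding: finding == "idor-orders",
)
```

Recording a version alongside the entries is one way to keep scores from different benchmark revisions from being compared directly.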
Accounting for Stochasticity and Efficiency
Because AI agents are stochastic, meaning they may produce different results each time they run, the authors emphasize the importance of repeated testing. They argue that reporting only a single score is insufficient; instead, researchers should report the mean and standard deviation of performance across multiple runs. Furthermore, the protocol encourages "cumulative evaluation," where findings from several runs are combined to see how an agent’s combined performance improves as more runs are added. Finally, the framework treats efficiency as a first-class concern, requiring that researchers track total runtime and monetary costs, as these are essential factors for any organization considering the deployment of an AI pentesting system.
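A minimal sketch of these reporting steps follows. It assumes each run has already been reduced to the set of ground-truth vulnerabilities it was credited with, plus its runtime and cost; the `RunResult` record, the use of recall as the performance measure, and the choice of the sample standard deviation are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Set, Tuple


@dataclass
class RunResult:
    matched: Set[str]  # ground-truth IDs credited in this run
    runtime_s: float
    cost_usd: float


def recall_stats(runs: List[RunResult], n_truth: int) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run recall."""
    recalls = [len(r.matched) / n_truth for r in runs]
    spread = stdev(recalls) if len(recalls) > 1 else 0.0
    return mean(recalls), spread


def cumulative_recall(runs: List[RunResult], n_truth: int) -> List[float]:
    """Recall of the union of findings after 1, 2, ..., n runs."""
    found: Set[str] = set()
    curve = []
    for r in runs:
        found |= r.matched
        curve.append(len(found) / n_truth)
    return curve


# Three hypothetical runs against a target with 4 known vulnerabilities.
runs = [
    RunResult({"V1", "V2"}, runtime_s=600.0, cost_usd=1.5),
    RunResult({"V2", "V3"}, runtime_s=720.0, cost_usd=2.0),
    RunResult({"V1"}, runtime_s=540.0, cost_usd=1.0),
]
avg, spread = recall_stats(runs, n_truth=4)
curve = cumulative_recall(runs, n_truth=4)  # [0.5, 0.75, 0.75]
total_runtime_s = sum(r.runtime_s for r in runs)
total_cost_usd = sum(r.cost_usd for r in runs)
```

In this toy data the per-run mean recall is below 0.5, while the cumulative curve reaches 0.75 after two runs, which is the kind of gap a single-run score would hide.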
Sustainable Experimentation
Recognizing that running extensive tests on complex targets is expensive and time-consuming, the authors propose a method for selecting "reduced-suite" subsets of targets. By using historical data from previous experiments, researchers can identify a smaller, representative group of targets that provide similar insights to the full suite. This allows developers to conduct iterative testing and ablation studies without the prohibitive costs of full-scale evaluations, making the development of more effective AI pentesting agents more sustainable.
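One way such a selection could work is sketched below. The summary does not describe the authors' selection criterion, so this example assumes a simple one: among all subsets of size `k`, pick the one whose mean score tracks the full-suite mean most closely across historical agent configurations. The function name, the score table, and the exhaustive search (practical only for small suites) are illustrative.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Tuple

# history[config][target] = score from a past experiment (hypothetical data)
History = Dict[str, Dict[str, float]]


def select_reduced_suite(history: History, k: int) -> Tuple[str, ...]:
    """Pick the k targets whose mean score best tracks the full-suite mean."""
    targets = sorted(next(iter(history.values())))
    full = {
        cfg: mean(scores[t] for t in targets) for cfg, scores in history.items()
    }

    def error(subset: Tuple[str, ...]) -> float:
        # Mean absolute gap between subset mean and full-suite mean.
        return mean(
            abs(mean(scores[t] for t in subset) - full[cfg])
            for cfg, scores in history.items()
        )

    return min(combinations(targets, k), key=error)


history = {
    "agent_v1": {"target_a": 0.2, "target_b": 0.4, "target_c": 0.6},
    "agent_v2": {"target_a": 0.4, "target_b": 0.6, "target_c": 0.8},
    "agent_v3": {"target_a": 0.1, "target_b": 0.6, "target_c": 0.5},
}
reduced = select_reduced_suite(history, k=2)
```

With this data the pair `target_a` and `target_c` is selected, since its average deviates least from the three-target mean across the recorded configurations. A criterion based on preserving the ranking of agents rather than their absolute scores would be an equally plausible alternative.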