Key Takeaways

  • AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets.
  • Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings.
  • Existing benchmarks are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting.
  • The proposed protocol shifts assessment from task completion to validated vulnerability discovery, enabling a more realistic, operationally informative comparison of AI pentesting agents.

Paper Abstract

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: this https URL.

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
This paper addresses a critical gap in the field of offensive AI: current benchmarks for pentesting agents are too narrow and artificial to predict how these systems will perform in real-world environments. While existing tools often focus on simple tasks like "capture-the-flag" or reproducing a single exploit, they fail to capture the complex, open-ended decision-making required for professional security work. The authors propose a new, practical evaluation protocol that shifts the focus from completing predefined tasks to discovering and validating actual vulnerabilities in complex, realistic targets.

A New Approach to Evaluation

Instead of scoring whether an agent reaches a predefined goal, the authors introduce a methodology that treats validated vulnerability discovery as the primary metric. The protocol uses a structured pipeline in which an AI agent’s findings are compared against a "ground-truth" list of known vulnerabilities. Because agent reports rarely describe a vulnerability in the same words as the reference entry, the system uses an LLM-as-a-judge to semantically match agent findings to the ground truth. A bipartite resolution step then pairs findings with ground-truth entries, so that duplicate reports are not counted more than once and each vulnerability is credited accurately.
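
The resolution step can be illustrated with a small Python sketch. It is not the authors' released code: it assumes the judge has already produced, for each finding, a list of plausible ground-truth ids, and it uses a standard augmenting-path bipartite matching as one possible way to resolve them. The function names, the `candidates` input shape, and the way duplicates are reported separately are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def resolve_matches(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Maximum one-to-one assignment of findings to ground-truth ids.

    candidates[f] lists the ground-truth ids that the judge accepted as a
    plausible match for finding f (hypothetical input shape).
    """
    owner: Dict[str, str] = {}  # ground-truth id -> finding credited with it

    def try_assign(finding: str, seen: Set[str]) -> bool:
        for gt_id in candidates[finding]:
            if gt_id in seen:
                continue
            seen.add(gt_id)
            # Take a free entry, or re-route its current owner to another entry.
            if gt_id not in owner or try_assign(owner[gt_id], seen):
                owner[gt_id] = finding
                return True
        return False

    for finding in candidates:
        try_assign(finding, set())
    return {finding: gt_id for gt_id, finding in owner.items()}


def score(candidates: Dict[str, List[str]], ground_truth: Set[str]) -> Dict[str, float]:
    assignment = resolve_matches(candidates)
    matched = len(assignment)
    # Findings that had candidates but lost every one of them to another finding.
    duplicates = sum(1 for f, c in candidates.items() if c and f not in assignment)
    # Findings with no plausible ground-truth entry at all.
    unmatched = sum(1 for c in candidates.values() if not c)
    return {
        "matched": matched,
        "duplicates": duplicates,
        "unmatched": unmatched,
        "missed": len(ground_truth) - matched,
        "recall": matched / len(ground_truth),
    }


# Hypothetical example data.
ground_truth = {"GT-1", "GT-2", "GT-3"}
candidates = {
    "F1": ["GT-1", "GT-2"],  # ambiguous report, plausibly either entry
    "F2": ["GT-1"],
    "F3": ["GT-1"],          # duplicate of F2
    "F4": [],                # no ground-truth counterpart
}
result = score(candidates, ground_truth)
```

In this example the ambiguous finding F1 is moved to GT-2 so that F2 can be credited with GT-1. F3 is then left as a duplicate, and GT-3 remains undiscovered. Whether duplicates should count against an agent is a scoring decision this sketch leaves open.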

Managing Real-World Complexity

The protocol acknowledges that real-world pentesting is messy and unpredictable. To handle this, the authors suggest that ground truth should not be a static document, but a "living" resource that is updated through expert review. If an agent discovers a legitimate vulnerability that wasn't previously documented, that finding is used to improve the evaluation data for future tests. This ensures that the benchmark evolves alongside the agents being tested, preventing the evaluation from becoming obsolete.
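
As a rough illustration of this maintenance loop, the sketch below promotes expert-confirmed findings into a versioned ground-truth list. The `GroundTruth` class, the id scheme, and the example vulnerabilities are hypothetical; the paper's actual data format may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class GroundTruth:
    entries: Dict[str, str] = field(default_factory=dict)  # id -> description
    version: int = 1

    def promote(self, description: str) -> str:
        """Add an expert-confirmed finding as a new entry and bump the version."""
        new_id = f"GT-{len(self.entries) + 1}"
        self.entries[new_id] = description
        self.version += 1
        return new_id


def review_unmatched(gt: GroundTruth, unmatched: List[str], confirmed: Set[str]) -> List[str]:
    """Promote only the unmatched findings that expert review confirmed."""
    return [gt.promote(text) for text in unmatched if text in confirmed]


# Hypothetical example: two unmatched findings, one confirmed by a reviewer.
gt = GroundTruth({"GT-1": "SQL injection in login form", "GT-2": "Default admin credentials"})
unmatched = ["Stored XSS in comment field", "Server is slow"]
added = review_unmatched(gt, unmatched, confirmed={"Stored XSS in comment field"})
```

Recording a version number alongside each score makes it possible to tell which results were computed against which revision of the ground truth.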

Accounting for Stochasticity and Efficiency

Because AI agents are stochastic, meaning they may produce different results each time they run, the authors emphasize the importance of repeated testing. They argue that reporting only a single score is insufficient; instead, researchers should report the mean and standard deviation of performance across multiple runs. The protocol also encourages "cumulative evaluation," where findings from several runs are combined to show how much of the ground truth an agent covers as more runs are added. Finally, the framework treats efficiency as a first-class concern: researchers are required to track total runtime and monetary cost, both of which matter to any organization considering the deployment of an AI pentesting system.
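
These metrics can be sketched in a few lines of Python. The `Run` record, the use of recall as the performance measure, and the cost-per-unique-finding ratio are assumptions chosen for illustration, and the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Set


@dataclass
class Run:
    matched: Set[str]  # ground-truth ids credited in this run
    runtime_s: float
    cost_usd: float


def per_run_recall(runs: List[Run], total: int) -> List[float]:
    return [len(r.matched) / total for r in runs]


def cumulative_recall(runs: List[Run], total: int) -> List[float]:
    """Recall of the union of findings after 1, 2, ..., n runs."""
    seen: Set[str] = set()
    curve = []
    for r in runs:
        seen |= r.matched
        curve.append(len(seen) / total)
    return curve


def cost_per_unique_finding(runs: List[Run]) -> float:
    unique = set().union(*(r.matched for r in runs))
    return sum(r.cost_usd for r in runs) / len(unique)


# Hypothetical results for three runs against a target with 5 known vulnerabilities.
runs = [
    Run({"GT-1", "GT-2"}, runtime_s=1800.0, cost_usd=1.5),
    Run({"GT-2", "GT-3", "GT-4"}, runtime_s=2400.0, cost_usd=2.0),
    Run({"GT-1"}, runtime_s=900.0, cost_usd=0.5),
]
recalls = per_run_recall(runs, total=5)
summary = {
    "mean_recall": mean(recalls),
    "std_recall": stdev(recalls),  # sample standard deviation
    "cumulative": cumulative_recall(runs, total=5),
    "total_runtime_s": sum(r.runtime_s for r in runs),
    "cost_per_unique_finding": cost_per_unique_finding(runs),
}
```

Here no single run exceeds 60% recall, but the three runs together cover 80% of the ground truth, which is the kind of gap between per-run and cumulative performance that a single score would hide. The cumulative curve depends on run order, so averaging it over several orderings is one option when runs are interchangeable.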

Sustainable Experimentation

Recognizing that running extensive tests on complex targets is expensive and time-consuming, the authors propose a method for selecting "reduced-suite" subsets of targets. By using historical data from previous experiments, researchers can identify a smaller, representative group of targets that provide similar insights to the full suite. This allows developers to conduct iterative testing and ablation studies without the prohibitive costs of full-scale evaluations, making the development of more effective AI pentesting agents more sustainable.
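
The summary does not specify the selection criterion, so the sketch below uses one plausible stand-in: pick the subset of targets whose average scores rank the agents in the same order as the full suite. The exhaustive search, the pairwise rank-agreement measure, and the historical scores are all assumptions for illustration, not the paper's method.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Scores = Dict[str, Dict[str, float]]  # agent -> target -> historical score


def suite_mean(scores: Scores, agent: str, targets: Sequence[str]) -> float:
    return sum(scores[agent][t] for t in targets) / len(targets)


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def rank_agreement(scores: Scores, subset: Sequence[str], full: Sequence[str]) -> float:
    """Fraction of agent pairs that the subset and the full suite order the same way."""
    pairs = list(combinations(sorted(scores), 2))
    agree = sum(
        sign(suite_mean(scores, a, subset) - suite_mean(scores, b, subset))
        == sign(suite_mean(scores, a, full) - suite_mean(scores, b, full))
        for a, b in pairs
    )
    return agree / len(pairs)


def select_reduced_suite(scores: Scores, k: int) -> Tuple[List[str], float]:
    """Exhaustively pick the k-target subset with the best rank agreement."""
    full = sorted(next(iter(scores.values())))
    best = max(combinations(full, k), key=lambda s: rank_agreement(scores, s, full))
    return list(best), rank_agreement(scores, best, full)


# Hypothetical historical scores for three agents on four targets.
history: Scores = {
    "agent_a": {"t1": 0.8, "t2": 0.6, "t3": 0.5, "t4": 0.7},
    "agent_b": {"t1": 0.5, "t2": 0.7, "t3": 0.4, "t4": 0.4},
    "agent_c": {"t1": 0.2, "t2": 0.3, "t3": 0.6, "t4": 0.1},
}
subset, agreement = select_reduced_suite(history, k=2)
```

Exhaustive search is only practical for small suites; a greedy or sampled search would be needed for larger ones. A subset chosen this way reflects the agents in the historical data, so it should be rechecked against the full suite when substantially different agents are introduced.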
