RESTestBench: A Benchmark for Evaluating the Effect...

Key Takeaways

  • RESTestBench is a new benchmark designed to evaluate how effectively Large Language Models (LLMs) can generate functional tests for REST APIs based on natural language requirements.
  • Existing REST API testing tools are typically evaluated with code coverage and crash-based fault metrics.
  • Recent LLM-based approaches instead generate tests from natural language (NL) requirements to validate functional behaviour, making those traditional metrics weak proxies for whether a generated test checks the intended behaviour.
  • Using RESTestBench, the authors evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running system under test (SUT).
  • In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, whether valid or mutated, affects test effectiveness.
Paper Abstract

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.

RESTestBench is a new benchmark designed to evaluate how effectively Large Language Models (LLMs) can generate functional tests for REST APIs based on natural language requirements. While traditional testing tools rely on code coverage or crash detection, these metrics often fail to confirm if a test actually validates the intended business logic. RESTestBench addresses this by providing a controlled environment where LLMs must translate human-written requirements into executable tests, which are then measured for their ability to detect specific, requirement-related faults.

A New Approach to Measuring Test Quality

The core innovation of RESTestBench is the use of Property-Based Mutation Testing (PBMT). Unlike standard mutation testing, which simply checks if a test can detect any small change in code, PBMT ensures that a test is only considered "effective" if it detects a change that specifically violates the requirement it is meant to verify. This allows researchers to distinguish between tests that are genuinely checking for correct behavior and those that are merely passing due to incidental implementation details.
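The scoring idea can be sketched as a small function. This is an illustrative reimplementation, not RESTestBench's actual code: the `Mutant` record, `requirement_id` field, and `pbmt_score` name are assumptions made for the example. The key point is that a kill only counts when the killed mutant targets the same requirement as the test.

```python
# Hypothetical sketch of requirement-scoped mutation scoring (PBMT-style).
# Names and structure are illustrative, not RESTestBench's actual API.
from dataclasses import dataclass

@dataclass
class Mutant:
    mutant_id: str
    requirement_id: str  # the requirement this mutation is designed to violate

def pbmt_score(test_requirement_id, killed_mutant_ids, mutants):
    """Count a kill only when the killed mutant targets the same requirement
    as the test; kills of unrelated mutants are incidental and ignored."""
    relevant = [m for m in mutants if m.requirement_id == test_requirement_id]
    if not relevant:
        return 0.0
    killed = [m for m in relevant if m.mutant_id in killed_mutant_ids]
    return len(killed) / len(relevant)

# A test for REQ-1 that kills one of the two REQ-1 mutants scores 0.5,
# even though it also killed a mutant aimed at another requirement.
mutants = [Mutant("M1", "REQ-1"), Mutant("M2", "REQ-1"), Mutant("M3", "REQ-2")]
print(pbmt_score("REQ-1", {"M1", "M3"}, mutants))  # 0.5
```

Under this metric, a test that accidentally trips over unrelated mutants gets no credit for them, which is exactly how the benchmark separates genuine requirement checks from incidental passes.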

Benchmark Components

The benchmark provides a standardized framework for testing, consisting of three key elements:

  • Diverse Services: It includes three distinct REST API services (FastAPI, TodoApp, and NestJS RealWorld) that represent real-world backend complexity, including authentication and multi-entity workflows.

  • Requirement Variants: Each requirement is provided in both "vague" and "precise" formats. This allows researchers to study how the level of detail in a prompt affects an LLM's ability to generate a successful test.

  • Validated Mutations: The benchmark includes 228 manually designed mutations tied to specific requirements, providing a clear "ground truth" for whether an LLM-generated test is performing its job correctly.
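Putting the three elements together, one benchmark entry might look like the following illustrative record. The field names and example values are assumptions made for this sketch, not RESTestBench's actual schema:

```python
# Hypothetical shape of a single RESTestBench entry: one requirement from one
# service, in both variants, linked to its hand-designed mutations.
requirement_entry = {
    "service": "NestJS RealWorld",           # one of the three bundled services
    "requirement_id": "REQ-AUTH-3",          # illustrative identifier
    "precise": "POST /users returns 201 and a JWT when the email is unused.",
    "vague": "Registering a new user should succeed.",
    "mutations": ["M-AUTH-3a", "M-AUTH-3b"], # requirement-tied ground truth
}
```

An LLM under evaluation would receive either the `precise` or the `vague` text as its prompt, and its generated test would then be scored against the mutations listed for that same requirement.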

Insights into LLM Behavior

Using this benchmark, the researchers compared two common generation strategies: non-refinement (generating tests in one step) and refinement (iteratively improving tests by interacting with the running service). The results revealed a significant performance drop when LLMs were exposed to faulty or mutated code during the refinement process. This effect was particularly pronounced with vague requirements. The findings suggest that when requirements are highly detailed and precise, the extra effort of having an LLM interact with the actual system during test generation may be unnecessary, as it can actually introduce confusion rather than clarity.
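The two strategies can be contrasted in a minimal sketch. Here `llm_generate` and `run_against_sut` are hypothetical stand-ins for an LLM call and for executing a test against the running service; the loop structure is an assumption about how such refinement is typically wired, not the paper's exact procedure.

```python
# Minimal sketch of the two generation strategies compared in the paper.
# llm_generate(requirement, feedback) and run_against_sut(test) are
# hypothetical stand-ins, not RESTestBench's actual interfaces.

def non_refinement(requirement, llm_generate):
    # One-shot: generate the test from the requirement text alone.
    return llm_generate(requirement, feedback=None)

def refinement(requirement, llm_generate, run_against_sut, max_rounds=3):
    # Iterative: feed observed SUT behaviour back into the generator.
    # If the SUT itself is faulty or mutated, this feedback can pull the
    # test away from the requirement -- the degradation the paper measures.
    test = llm_generate(requirement, feedback=None)
    for _ in range(max_rounds):
        result = run_against_sut(test)
        if result.passed:
            break
        test = llm_generate(requirement, feedback=result.log)
    return test
```

The sketch makes the risk visible: the refinement loop treats the running service as its oracle, so when the service disagrees with the requirement, the loop optimizes the test toward the wrong target.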

Implications for Future Testing

RESTestBench highlights that the "oracle problem"—the difficulty of knowing what the correct output should be—remains a major hurdle for automated testing. By moving away from generic metrics like code coverage and toward requirement-specific validation, the benchmark provides a more reliable way to compare different AI-driven testing tools. It serves as a foundation for future research, allowing developers to easily integrate new models and approaches to see if they truly understand the business logic they are tasked with verifying.
