AI Research

PACE: A Proxy for Agentic Capability Evaluation

Paper Abstract

Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted by the performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performance on agentic benchmarks. Given a pool of candidate instances spanning atomic capabilities, PACE fits a regression that maps a model's scores on a compact subset of source instances to its score on the target agentic benchmark. The subset itself is curated by combining two complementary instance-selection strategies, target-relevance local selection and globally informative global selection. We apply PACE to the 4 target agentic benchmarks in this paper, which yields PACE-Bench, the concrete proxy benchmark that we evaluate in the paper. Experiments across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show that PACE-Bench predicts agentic scores with leave-one-out cross-validation (LOOCV) mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at much less than 1% of the full agentic evaluation cost. We further analyze the selected proxy instances, revealing which skills each agentic benchmark uniquely demands. PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing, without the overhead of full agent evaluation.

Evaluating large language model (LLM) agents is a significant bottleneck in AI development. Benchmarks like SWE-Bench and GAIA require complex infrastructure and long execution times, and a single evaluation can cost thousands of dollars. The PACE (Proxy for Agentic Capability Evaluation) framework addresses this by predicting how a model will perform on these expensive agentic benchmarks using only a small, carefully selected subset of inexpensive, non-agentic evaluation instances.

The PACE Approach

PACE treats the challenge of predicting agentic performance as a budget-constrained subset selection problem. Instead of running a full agentic evaluation, the framework identifies a compact set of "proxy" instances from existing, fast-to-run benchmarks, such as those testing reasoning, code generation, or instruction following.
To select these instances, PACE uses two complementary strategies:

Target-Relevance (Local): This selects instances whose scores across models correlate strongly with those models' scores on the target agentic benchmark.
Globally Informative (Global): This uses a mathematical technique called Singular Value Decomposition (SVD) to identify instances that contribute most to the overall "latent structure" or core capabilities of the model pool.
By combining these, PACE creates a "proxy benchmark" that captures the essential skills required for agentic tasks. Once these instances are selected, the framework uses a noise-aware regression model to map a new model’s scores on these few instances to a predicted score on the full agentic benchmark.
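
The two selection strategies above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names (`local_scores`, `global_scores`, `select_proxy`), the use of absolute Pearson correlation for target relevance, and the split of the budget into `k_local` and `k_global` are all hypothetical. To stay within the Python standard library, the global step uses power iteration to recover only the leading singular direction, which is a simplified stand-in for a full SVD.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def local_scores(scores, target):
    """Target relevance: |correlation| of each instance column with the target.

    scores[i][j] is model i's score on instance j; target[i] is model i's
    score on the agentic benchmark.
    """
    n_inst = len(scores[0])
    return [abs(pearson([row[j] for row in scores], target)) for j in range(n_inst)]


def global_scores(scores, iters=200, seed=0):
    """Global informativeness: |loading| on the top right singular vector.

    Power iteration on A^T A (A = column-centered score matrix) stands in
    for a full SVD and recovers only the leading component.
    """
    n_models, n_inst = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n_models for j in range(n_inst)]
    a = [[row[j] - means[j] for j in range(n_inst)] for row in scores]
    rng = random.Random(seed)
    v = [rng.random() + 0.5 for _ in range(n_inst)]  # nonzero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        u = [sum(r[j] * v[j] for j in range(n_inst)) for r in a]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(n_models)) for j in range(n_inst)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left to explain (e.g. all columns constant)
            return [0.0] * n_inst
        v = [x / norm for x in w]
    return [abs(x) for x in v]


def select_proxy(scores, target, k_local, k_global):
    """Pick k_local instances by target relevance, then k_global more by loading."""
    n_inst = len(scores[0])
    loc = local_scores(scores, target)
    glo = global_scores(scores)
    chosen = sorted(range(n_inst), key=lambda j: -loc[j])[:k_local]
    rest = sorted((j for j in range(n_inst) if j not in chosen), key=lambda j: -glo[j])
    return sorted(chosen + rest[:k_global])
```

In this sketch the local step favors instances that track the target benchmark, while the global step favors instances that explain the most variation across the model pool regardless of the target, so the two lists can differ substantially.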

Key Results

The researchers tested PACE across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks. The results demonstrate that this proxy method is highly effective:

Cost Efficiency: PACE achieves its predictions at less than 1% of the cost of a full agentic evaluation.
High Accuracy: Across the tested benchmarks, PACE achieved a leave-one-out cross-validation (LOOCV) mean absolute error of under 4% and a Spearman correlation above 0.80.
Reliable Ranking: The framework successfully predicts which of two models is stronger on an agentic benchmark with approximately 85% accuracy.
Generalization: The proxy instances selected using a training set of models remained predictive even when tested on held-out models that were not part of the initial selection process.
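
For readers who want to see how the three reported metrics relate, here is a minimal sketch of leave-one-out evaluation with mean absolute error, Spearman correlation, and pairwise ranking accuracy. It rests on assumptions not taken from the paper: each model is summarized by a single hypothetical aggregate proxy score, and a plain one-variable least-squares fit replaces the noise-aware regression that PACE actually uses. All function names are illustrative.

```python
import math
from itertools import combinations


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; predicts the mean if x is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    if sxx == 0:
        return my, 0.0
    b = sum((v - mx) * (w - my) for v, w in zip(x, y)) / sxx
    return my - b * mx, b


def loocv_predictions(proxy, target):
    """Predict each model's target score from a fit on all other models.

    proxy and target are lists with one entry per model.
    """
    preds = []
    for i in range(len(proxy)):
        a, b = fit_line(proxy[:i] + proxy[i + 1:], target[:i] + target[i + 1:])
        preds.append(a + b * proxy[i])
    return preds


def mae(pred, true):
    """Mean absolute error between predicted and actual scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def pairwise_accuracy(pred, true):
    """Share of model pairs (with distinct true scores) ordered correctly."""
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2) if true[i] != true[j]]
    if not pairs:
        raise ValueError("need at least two models with distinct true scores")
    correct = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0 for i, j in pairs)
    return correct / len(pairs)
```

The three metrics answer different questions: MAE measures how far the predicted score is from the real one, Spearman correlation measures whether the overall ordering of models is preserved, and pairwise accuracy measures how often a head-to-head comparison between two models comes out the right way.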

Why This Matters

The cost-accuracy tradeoff provided by PACE is smooth and predictable, allowing researchers to choose an evaluation budget that fits their resource constraints. Beyond saving money, the framework offers a level of interpretability: by analyzing which proxy instances are selected, researchers can gain insight into which specific capabilities, such as planning or tool use, are most critical for success in different agentic tasks. This enables developers to obtain reliable performance estimates during model development, selection, and routing without the overhead of full-scale agentic testing.
