Evaluating large language model (LLM) agents is a significant bottleneck in AI development. Benchmarks like SWE-Bench and GAIA require complex infrastructure and long execution times, and a single evaluation can cost thousands of dollars. The PACE (Proxy for Agentic Capability Evaluation) framework addresses this by predicting how a model will perform on these expensive agentic benchmarks using only a small, carefully selected subset of inexpensive, non-agentic evaluation instances.
The PACE Approach
PACE treats the challenge of predicting agentic performance as a budget-constrained subset selection problem. Instead of running a full agentic evaluation, the framework identifies a compact set of "proxy" instances from existing, fast-to-run benchmarks—such as those testing reasoning, code generation, or instruction following.
To select these instances, PACE uses two complementary strategies:
Target-Relevance (Local): This identifies instances whose scores across the model pool correlate strongly with scores on the target agentic benchmark.
Globally Informative (Global): This uses a mathematical technique called Singular Value Decomposition (SVD) to identify instances that contribute most to the overall "latent structure" or core capabilities of the model pool.
By combining these, PACE creates a "proxy benchmark" that captures the essential skills required for agentic tasks. Once these instances are selected, the framework uses a noise-aware regression model to map a new model’s scores on these few instances to a predicted score on the full agentic benchmark.
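The selection and prediction steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mixing weight `alpha`, the SVD `rank`, and the synthetic data are all assumptions made for illustration, and plain ridge regression stands in for the noise-aware regression model, whose exact form is not described here.

```python
import numpy as np


def select_proxy_instances(scores, target, budget, alpha=0.5, rank=3):
    """Return indices of the `budget` highest-scoring proxy instances.

    scores: (n_models, n_instances) per-instance results on cheap benchmarks.
    target: (n_models,) scores of the same models on the agentic benchmark.
    alpha, rank: hypothetical knobs (mixing weight, number of SVD components).
    """
    centered = scores - scores.mean(axis=0)
    t = target - target.mean()

    # Local criterion: |Pearson correlation| between each instance and the target.
    denom = np.linalg.norm(centered, axis=0) * np.linalg.norm(t)
    local = np.abs(centered.T @ t) / np.where(denom > 0, denom, 1.0)

    # Global criterion: energy of each instance column captured by the
    # top-`rank` singular directions of the model-by-instance matrix.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(rank, len(s))
    energy = ((s[:k, None] * vt[:k]) ** 2).sum(axis=0)
    global_ = energy / energy.max() if energy.max() > 0 else energy

    combined = alpha * local + (1 - alpha) * global_
    return np.argsort(-combined)[:budget]


def fit_ridge(x, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Assumption: a simple stand-in for the noise-aware regressor.
    """
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc = x - x_mean
    w = np.linalg.solve(xc.T @ xc + lam * np.eye(x.shape[1]), xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b


# Synthetic data only: 10 training models, 200 cheap pass/fail instances.
rng = np.random.default_rng(0)
train_scores = rng.integers(0, 2, size=(10, 200)).astype(float)
train_target = rng.uniform(0.1, 0.6, size=10)

idx = select_proxy_instances(train_scores, train_target, budget=20)
w, b = fit_ridge(train_scores[:, idx], train_target, lam=5.0)

# A new model is run on the 20 selected instances only.
new_model = rng.integers(0, 2, size=200).astype(float)
predicted = float(new_model[idx] @ w + b)
```

In this sketch the budget is simply the number of instances kept, so tightening or loosening it changes only the `budget` argument.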
Key Results
The researchers tested PACE across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks. The results demonstrate that this proxy method is highly effective:
Cost Efficiency: PACE achieves its predictions at less than 1% of the cost of a full agentic evaluation.
High Accuracy: Across the tested benchmarks, PACE achieved a mean absolute error of under 4% and a Spearman correlation above 0.80.
Reliable Ranking: The framework successfully predicts which of two models is stronger on an agentic benchmark with approximately 85% accuracy.
Generalization: The proxy instances selected using a training set of models remained predictive even when tested on held-out models that were not part of the initial selection process.
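For readers unfamiliar with the three metrics reported above, the following sketch shows one common way to compute them from predicted and measured benchmark scores. The helper names and the example numbers are made up for illustration, and the source does not say how ties are handled; this version assumes no tied scores when ranking.

```python
from itertools import combinations

import numpy as np


def mean_absolute_error(pred, true):
    """Average absolute gap between predicted and measured scores."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(true))))


def spearman(pred, true):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(len(v))
        return r

    return float(np.corrcoef(ranks(np.asarray(pred)), ranks(np.asarray(true)))[0, 1])


def pairwise_accuracy(pred, true):
    """Fraction of model pairs whose ordering the prediction gets right.

    Pairs with identical measured scores are skipped; a predicted tie
    counts as a miss.
    """
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2) if true[i] != true[j]]
    correct = sum(
        np.sign(pred[i] - pred[j]) == np.sign(true[i] - true[j]) for i, j in pairs
    )
    return correct / len(pairs)


# Made-up scores for four hypothetical models (fractions of tasks solved).
measured = [0.10, 0.25, 0.40, 0.55]
predicted_scores = [0.12, 0.22, 0.45, 0.50]

mae = mean_absolute_error(predicted_scores, measured)
rho = spearman(predicted_scores, measured)
acc = pairwise_accuracy(predicted_scores, measured)
```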
Why This Matters
The cost-accuracy tradeoff provided by PACE is smooth and predictable, allowing researchers to choose an evaluation budget that fits their resource constraints. Beyond saving money, the framework offers a degree of interpretability: by analyzing which proxy instances are selected, researchers can gain insight into which capabilities, such as planning or tool use, are most critical for success in different agentic tasks. This lets developers obtain reliable performance estimates during model development and routing without the overhead of full-scale agentic testing.