To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Large Language Models (LLMs) are increasingly being equipped with external tools like web search to improve their performance. However, using these tools is not always helpful; it can be redundant, costly, or even harmful if the tool provides noisy information. This paper introduces a principled framework to evaluate whether an LLM should use a tool for a specific task. By analyzing the gap between how models currently decide to use tools and how they should ideally use them, the researchers provide a method to improve agentic decision-making.
Assessing Tool Use: Necessity, Utility, and Affordability
The authors propose evaluating tool-calling decisions through three core dimensions (a decision-rule sketch follows this list):
Necessity: Does the model actually need external information to solve the task, or can it rely on its own internal knowledge?
Utility: Does using the tool actually improve the final performance, or does it degrade the output?
Affordability: Given that tools often have costs (such as latency or financial expense), is the performance gain significant enough to justify the cost of the call?
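Putting the three dimensions together, a tool call is only worthwhile when external information is actually needed, the tool is expected to help, and the expected gain outweighs the cost of the call. The sketch below illustrates one way such a rule could be composed; the scores, thresholds, and cost model are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of a necessity/utility/affordability decision rule.
# All values here are hypothetical placeholders for illustration.
from dataclasses import dataclass

@dataclass
class ToolCallAssessment:
    necessity: float  # estimated probability that external info is required, in [0, 1]
    utility: float    # expected accuracy gain from calling the tool
    cost: float       # cost of the call (latency, money, etc.) on the same scale as utility

def should_call_tool(a: ToolCallAssessment,
                     necessity_threshold: float = 0.5,
                     cost_weight: float = 1.0) -> bool:
    """Call the tool only if it is needed, expected to help, and worth its cost."""
    if a.necessity < necessity_threshold:    # Necessity: the model likely knows enough already
        return False
    if a.utility <= 0:                       # Utility: the tool must not degrade the output
        return False
    return a.utility > cost_weight * a.cost  # Affordability: the gain must justify the cost

# Example: a query the model probably cannot answer from memory,
# where a cheap search call is expected to help.
print(should_call_tool(ToolCallAssessment(necessity=0.9, utility=0.15, cost=0.05)))  # True
```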
To study these dimensions, the researchers compare a "normative" perspective (what an optimal system would do) against a "descriptive" perspective (how current models actually behave).
The Misalignment Problem
The study reveals that current LLMs often struggle to make rational tool-calling decisions. There is a clear misalignment between a model’s "perceived" need for a tool and its "true" need. While models are internally consistent—meaning their decisions follow a logical pattern—these patterns do not match the actual performance benefits. Consequently, models frequently call tools when they are not needed or fail to call them when they would have provided a significant boost in accuracy. This explains why "self-deciding" models often underperform compared to an "optimal" oracle policy.
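One way to make this misalignment concrete is to compare, query by query, whether the model chose to call the tool against whether the call actually improved the answer; over-calls and under-calls then fall out directly. The toy labels below are made up purely to illustrate the bookkeeping, not taken from the paper.

```python
# Illustrative tally of perceived vs. true need for a tool call (made-up labels).
perceived_need = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = model chose to call the tool
true_need      = [1, 0, 0, 1, 0, 0, 1, 1]  # 1 = the tool call actually improved the answer

agreement   = sum(p == t for p, t in zip(perceived_need, true_need)) / len(true_need)
over_calls  = sum(p == 1 and t == 0 for p, t in zip(perceived_need, true_need))  # redundant calls
under_calls = sum(p == 0 and t == 1 for p, t in zip(perceived_need, true_need))  # missed useful calls

print(f"agreement={agreement:.2f}, over-calls={over_calls}, under-calls={under_calls}")
```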
Improving Decisions with Latent Estimators
To bridge the gap between current behavior and optimal performance, the researchers developed lightweight "controllers." Instead of relying on the model's own potentially flawed judgment, they trained classifiers that analyze the model's internal hidden states. These estimators predict the true need and utility of a tool call before the model commits to using it. By using these estimators to guide the decision-making process, the researchers were able to improve task performance across six different models, demonstrating that internal representations are a more reliable signal for tool use than the model's own output.
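A minimal version of such an estimator could be a linear probe trained on hidden-state vectors to predict whether a tool call will improve the answer. The sketch below uses random placeholder features and a logistic-regression probe; both are assumptions for illustration rather than the paper's exact setup.

```python
# Sketch of a latent estimator: a linear probe on hidden states that predicts
# whether a tool call will help, used in place of the model's own judgment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 768))    # one hidden-state vector per query (placeholder)
tool_helped   = rng.integers(0, 2, size=500)   # 1 = the tool call improved the answer (placeholder)

probe = LogisticRegression(max_iter=1000).fit(hidden_states, tool_helped)

def estimate_utility(state: np.ndarray) -> float:
    """Predicted probability that calling the tool improves the answer."""
    return float(probe.predict_proba(state.reshape(1, -1))[0, 1])

# Route the call/no-call decision through the probe rather than the model's own output.
new_query_state = rng.normal(size=768)
call_tool = estimate_utility(new_query_state) > 0.5
```

In practice the probe would be trained on real hidden states paired with oracle labels indicating whether the tool call helped on that query.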
Key Takeaways and Limitations
The research demonstrates that tool-calling is not universally beneficial and that current models lack the ability to accurately judge when a tool will help. While the proposed latent estimators successfully improve decision quality, the authors note that they do not reach perfect "oracle" performance. They conclude that fully closing the gap between current models and optimal performance would require better, more sophisticated models of the tools themselves, as the behavior of external tools is inherently complex and difficult to predict.