To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Paper Abstract

Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use therefore hinges on a core LLM decision: whether or not to call a tool when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers a model's self-perceived need and utility from its observed behavior. We find that models' perceived need and utility of tool calls are often misaligned with their true need and utility. Building on this framework, we train lightweight estimators of need and utility based on models' hidden states. Our estimators enable simple controllers that improve decision quality and lead to stronger task performance than the self-perceived setup across three tasks and six models.

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Large Language Models (LLMs) are increasingly being equipped with external tools like web search to improve their performance. However, using these tools is not always helpful; it can be redundant, costly, or even harmful if the tool provides noisy information. This paper introduces a principled framework to evaluate whether an LLM should use a tool for a specific task. By analyzing the gap between how models currently decide to use tools and how they should ideally use them, the researchers provide a method to improve agentic decision-making.

Assessing Tool Use: Necessity, Utility, and Affordability

The authors propose evaluating tool-calling decisions through three core dimensions:

  • Necessity: Does the model actually need external information to solve the task, or can it rely on its own internal knowledge?

  • Utility: Does using the tool actually improve the final performance, or does it degrade the output?

  • Affordability: Given that tools often have costs (such as latency or financial expense), is the performance gain significant enough to justify the cost of the call?

To study these dimensions, the researchers compare a "normative" perspective (what an ideal, optimal system would do) against a "descriptive" perspective (how current models actually behave).
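Taken together, the three dimensions suggest a simple expected-value decision rule: call the tool only when external information is likely needed, the expected gain is positive, and the gain outweighs the call's cost. The sketch below is illustrative only; the function name, probability inputs, and threshold are assumptions, not the paper's formulation.

```python
def should_call_tool(p_correct_internal, p_correct_with_tool, call_cost):
    """Illustrative expected-value rule for tool calling.

    necessity:     the model is unlikely to succeed on internal knowledge alone.
    utility:       the expected accuracy gain from calling the tool.
    affordability: the gain must outweigh the cost of the call.
    """
    necessity = p_correct_internal < 0.5                 # internal knowledge likely insufficient
    utility = p_correct_with_tool - p_correct_internal   # expected gain from the call
    affordable = utility > call_cost                     # gain justifies latency/expense
    return necessity and utility > 0 and affordable

# A hard question with a cheap, helpful tool warrants a call;
# an easy question the model already answers well does not.
should_call_tool(0.3, 0.8, 0.1)   # call
should_call_tool(0.9, 0.92, 0.1)  # skip
```

Under this rule, a tool call that slightly improves accuracy on an already-easy question is correctly skipped: it fails both the necessity and affordability checks.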

The Misalignment Problem

The study reveals that current LLMs often struggle to make rational tool-calling decisions. There is a clear misalignment between a model’s "perceived" need for a tool and its "true" need. While models are internally consistent—meaning their decisions follow a logical pattern—these patterns do not match the actual performance benefits. Consequently, models frequently call tools when they are not needed or fail to call them when they would have provided a significant boost in accuracy. This explains why "self-deciding" models often underperform compared to an "optimal" oracle policy.
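This misalignment can be made concrete by comparing a model's observed call decisions against oracle labels for when a call actually helped. The helper below is a hypothetical sketch for illustration, not the paper's metric; the over-call rate captures redundant calls and the under-call rate captures missed opportunities.

```python
def misalignment_report(perceived_calls, true_need):
    """Compare observed tool-call decisions against normative need labels.

    perceived_calls: per-task booleans, True if the model chose to call the tool.
    true_need:       per-task booleans, True if the call actually helped (oracle).
    """
    n = len(perceived_calls)
    over = sum(p and not t for p, t in zip(perceived_calls, true_need)) / n   # redundant calls
    under = sum(t and not p for p, t in zip(perceived_calls, true_need)) / n  # missed calls
    agreement = 1.0 - over - under                                            # decisions matching need
    return {"agreement": agreement, "over_call": over, "under_call": under}

# Toy example: one redundant call, one missed call, two correct decisions.
misalignment_report([True, True, False, False], [True, False, True, False])
```

A model can be internally consistent (its over- and under-call patterns are stable) while still scoring low agreement against the oracle labels, which is exactly the gap the study documents.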

Improving Decisions with Latent Estimators

To bridge the gap between current behavior and optimal performance, the researchers developed lightweight "controllers." Instead of relying on the model's own potentially flawed judgment, they trained classifiers that analyze the model's internal hidden states. These estimators predict the true need and utility of a tool call before the model commits to using it. By using these estimators to guide the decision-making process, the researchers were able to improve task performance across six different models, demonstrating that internal representations are a more reliable signal for tool use than the model's own output.
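A minimal version of such a probe can be sketched as a logistic classifier trained on frozen hidden states, plus a controller that thresholds its prediction. This is an assumption-laden sketch: the synthetic data, gradient-descent training loop, and function names below are illustrative, not the paper's estimator architecture.

```python
import numpy as np

def train_utility_probe(hidden_states, utility_labels, lr=0.1, steps=500):
    """Fit a lightweight logistic probe on frozen hidden states (sketch).

    hidden_states:  (n, d) array of last-layer activations.
    utility_labels: (n,) binary array, 1.0 if the tool call helped.
    Returns (weights, bias) of the probe.
    """
    n, d = hidden_states.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = hidden_states @ w + b
        p = 1.0 / (1.0 + np.exp(-z))            # predicted probability the call helps
        grad = p - utility_labels               # gradient of the log-loss
        w -= lr * hidden_states.T @ grad / n    # full-batch gradient step
        b -= lr * grad.mean()
    return w, b

def controller_should_call(h, w, b, threshold=0.5):
    """Call the tool only when the probe predicts the call will help."""
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    return p > threshold

# Synthetic demo: utility is (secretly) encoded in one hidden dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # stand-in for hidden states
y = (X[:, 0] > 0).astype(float)         # stand-in for oracle utility labels
w, b = train_utility_probe(X, y)
```

The key design point mirrors the paper's finding: the decision signal is read from internal representations rather than from the model's generated output, so the controller can override a miscalibrated self-assessment without modifying the model itself.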

Key Takeaways and Limitations

The research demonstrates that tool-calling is not universally beneficial and that current models lack the ability to accurately judge when a tool will help. While the proposed latent estimators successfully improve decision quality, the authors note that they do not reach perfect "oracle" performance. They conclude that fully closing the gap between current models and optimal performance would require better, more sophisticated models of the tools themselves, as the behavior of external tools is inherently complex and difficult to predict.
