When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Key Takeaways

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool.
Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B).

Paper Abstract

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6-99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve; at 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
This research investigates a common assumption in AI development: that Large Language Model (LLM) agents act as "discerning callers" when using external tools. Specifically, the authors test whether an agent can intelligently decide when to trust a Graph Neural Network (GNN) tool and when to rely on its own reasoning or other evidence. The study reveals that, contrary to expectations, agents do not exercise judgment; instead, they consistently collapse into "GNN parrots" that adopt the tool's output almost entirely, regardless of whether that output is correct.

The "GNN Parrot" Phenomenon

The researchers set up a ReAct-style LLM agent and gave it a frozen GNN as a tool for node classification on a text-attributed graph (ogbn-arxiv, replicated on WikiCS). They measured the agent's behavior by comparing its final predictions to the GNN's raw output. The agent agreed with the GNN 97.6% to 99.2% of the time across 5 seeds. Even when the agent was given a budget for multiple tool calls, including signals designed to flag when the GNN might be wrong, it typically made only one call, read the label, and ignored the diagnostic information. The agent's final answers matched its own independent reasoning only a small fraction of the time, suggesting that the presence of the tool effectively overrides the agent's internal logic.
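The agreement measurement described above amounts to a per-node comparison of two label lists. The following is a minimal sketch, not the authors' evaluation code: the function name `agreement_rate` and the toy label lists are invented for illustration and are not data from the paper.

```python
from typing import Sequence


def agreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of nodes on which two predictors output the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must cover the same nodes")
    if len(preds_a) == 0:
        raise ValueError("need at least one node")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


# Toy labels for 8 nodes (illustrative only, not results from the paper).
agent_final = [3, 1, 4, 1, 5, 9, 2, 6]  # agent's answer with the GNN tool available
gnn_output = [3, 1, 4, 1, 5, 9, 2, 7]   # raw label returned by the frozen GNN
agent_alone = [3, 0, 4, 2, 5, 8, 2, 6]  # agent's answer without any tool

tool_agreement = agreement_rate(agent_final, gnn_output)
self_agreement = agreement_rate(agent_final, agent_alone)
print(f"agreement with GNN: {tool_agreement:.3f}")
print(f"agreement with own tool-free answer: {self_agreement:.3f}")
```

Reporting both numbers side by side is what separates "the agent uses the tool" from "the agent obeys the tool": high agreement with the GNN alone could just mean both are usually right, while low agreement with the agent's own tool-free answers indicates the tool is displacing its reasoning.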

Capability Does Not Improve Skepticism

A key question was whether this blind deference is simply an artifact of weak models. Testing a range of Qwen2.5 backbones (0.5B to 7B parameters), the authors found the opposite. While the smallest models struggled to use the tool at all, among models able to invoke it, agreement with the GNN rose with capability, from 0.60 at 1.5B to 0.98 at 7B. More capable models did not become more skeptical or discerning; they became more consistent in adopting the tool's output wholesale.

The Cost of Deference

The researchers measured the "oracle gap": the difference between the accuracy of the deferring agent (the "parrot") and that of a per-node oracle that chooses perfectly among the available actions, such as the GNN and other available evidence. The oracle beats the parrot by 0.09-0.18 at 3B and by 0.12-0.22 at 7B, roughly doubling at high homophily, so the gap does not shrink as models get stronger. The reason is that the parrot is pinned to the frozen GNN while the agent's alternatives improve: at 7B, a simple neighbor-label tool overtakes the GNN at high homophily (0.81 vs 0.71), yet the agent still defers. This is the "cost of deference": the agent's performance is held back by its refusal to look beyond the tool, even when a better answer is available elsewhere.
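One way to make the oracle gap concrete is to count a node as solved by the oracle whenever at least one available action gets it right. The sketch below assumes that reading of a per-node oracle; the action names (`gnn`, `neighbor_label`, `llm_only`) and all labels are invented toy values, not the paper's setup or results.

```python
from typing import Dict, List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Plain classification accuracy."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(action_preds: Dict[str, List[int]], labels: List[int]) -> float:
    """Accuracy of a per-node oracle: a node counts as correct if any action is correct."""
    hits = 0
    for i, y in enumerate(labels):
        if any(preds[i] == y for preds in action_preds.values()):
            hits += 1
    return hits / len(labels)


# Toy ground truth and per-action predictions for 10 nodes (illustrative only).
labels = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
actions = {
    "gnn": [0, 1, 2, 0, 0, 1, 1, 0, 0, 1],
    "neighbor_label": [0, 1, 0, 1, 0, 2, 1, 2, 0, 1],
    "llm_only": [1, 1, 2, 1, 2, 2, 0, 0, 1, 1],
}

# A fully deferring agent reproduces the GNN, so its accuracy is the GNN's accuracy.
parrot_acc = accuracy(actions["gnn"], labels)
oracle_acc = oracle_accuracy(actions, labels)
oracle_gap = oracle_acc - parrot_acc
print(f"parrot={parrot_acc:.2f} oracle={oracle_acc:.2f} gap={oracle_gap:.2f}")
```

Because the parrot's accuracy is fixed by the frozen GNN, any improvement in the other actions can only widen this gap, which is the mechanism the summary describes.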

Challenges for Future Design

The study concludes that "selective invocation", the ability of an agent to know when to trust a tool, is not a skill that emerges automatically as models scale. The authors tested a simple selective-invocation gate to decide when to use the GNN: it recovered about half of the high-homophily gap (0.71 to 0.83) but yielded no net global gain. Held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom, which suggests the limit is not merely router design but the information available at test time. The findings serve as a cautionary note for the AI community: developers cannot assume that adding a tool to an agent yields a system that adds judgment on top of the tool; selective tool use must be explicitly designed into the system.
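To illustrate what designing selective invocation into a system could look like, here is a minimal sketch of a gate that routes between two actions per node. The summary does not specify the authors' gate, so the routing rule (trust the neighbor majority label when neighbors agree strongly, otherwise fall back to the GNN), the function names, and the `threshold` value are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def neighbor_vote(neighbor_labels: List[int]) -> Tuple[Optional[int], float]:
    """Return (majority label, share of neighbors holding it), or (None, 0.0) if empty."""
    if not neighbor_labels:
        return None, 0.0
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label, count / len(neighbor_labels)


def gated_prediction(
    gnn_label: int,
    neighbor_labels: List[int],
    threshold: float = 0.8,  # hypothetical cut-off, not a value from the paper
) -> Tuple[int, str]:
    """Pick the neighbor majority label when local agreement is high, else the GNN label."""
    label, share = neighbor_vote(neighbor_labels)
    if label is not None and share >= threshold:
        return label, "neighbor_label"
    return gnn_label, "gnn"


# Strong local agreement: the gate overrides the GNN.
print(gated_prediction(gnn_label=2, neighbor_labels=[1, 1, 1, 1, 1, 2]))
# Mixed neighborhood: the gate falls back to the GNN.
print(gated_prediction(gnn_label=2, neighbor_labels=[1, 3, 2]))
# No labeled neighbors: the gate falls back to the GNN.
print(gated_prediction(gnn_label=2, neighbor_labels=[]))
```

A rule like this depends only on features observable at test time, which is exactly the constraint the authors argue bounds how much any gate can recover.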