When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
This research investigates a common assumption in AI development: that Large Language Model (LLM) agents act as "discerning callers" when using external tools. Specifically, the authors test whether an agent can intelligently decide when to trust a Graph Neural Network (GNN) tool and when to rely on its own reasoning or other evidence. The study reveals that, contrary to expectations, agents do not exercise judgment; instead, they consistently collapse into "GNN parrots" that adopt the tool's output almost entirely, regardless of whether that output is correct.
The "GNN Parrot" Phenomenon
The researchers set up a ReAct-style LLM agent and gave it a frozen GNN as a tool for node classification tasks. They measured the agent's behavior by comparing its final predictions to the GNN's output. The results were striking: the agent agreed with the GNN 97.6% to 99.2% of the time. Even when the agent had a budget for multiple tool calls, including signals designed to flag when the GNN might be wrong, it typically made only one call, reading the label and ignoring the diagnostic information. The agent's final answers matched its own independent reasoning only a small fraction of the time, suggesting that the presence of the tool effectively overrides the agent's internal logic.
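The agreement measurement described above can be illustrated with a minimal sketch. It assumes that predictions are stored as equal-length lists of integer class labels, one per node; the function name `agreement_rate` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def agreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of nodes on which two prediction lists assign the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must be non-empty")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


# Toy data (illustrative only, not results from the study).
gnn_preds = [0, 1, 2, 1, 0, 2, 1, 0]    # labels returned by the GNN tool
solo_preds = [0, 2, 2, 0, 1, 2, 0, 0]   # agent's predictions without the tool
agent_preds = [0, 1, 2, 0, 0, 2, 1, 0]  # agent's final answers with the tool

agent_vs_gnn = agreement_rate(agent_preds, gnn_preds)
agent_vs_solo = agreement_rate(agent_preds, solo_preds)
print(f"agent vs GNN:  {agent_vs_gnn:.3f}")
print(f"agent vs solo: {agent_vs_solo:.3f}")
```

A "GNN parrot" in this framing is an agent whose first number is close to 1.0 while the second stays well below it.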
Capability Does Not Improve Skepticism
A key question was whether this blind deference is simply a result of using weaker models. By testing a range of Qwen2.5 models (from 0.5B to 7B parameters), the authors found that the opposite is true. While the smallest models struggled to use the tool at all, more capable models agreed with the GNN at even higher rates. As the models became more powerful, they did not become more skeptical or discerning; instead, they became more efficient at adopting the tool's output wholesale.
The Cost of Deference
The researchers measured the "oracle gap": the difference between the agent's performance and the performance of an ideal system that could perfectly choose, node by node, between the GNN and other available evidence. They found that this gap does not shrink as models get stronger. In fact, as more capable agents developed better alternative ways to solve the task (such as using a simple neighbor-label lookup), they continued to ignore these better options in favor of the GNN. This creates a "cost of deference," where the agent's performance is held back by its refusal to look beyond the tool, even when it has the capability to identify a better answer elsewhere.
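One way to compute such a gap is sketched below. It assumes the oracle counts a node as correct whenever at least one available evidence source (here the GNN and a neighbor-label lookup) is correct, and that the gap is the oracle's accuracy minus the agent's accuracy. The function names and toy labels are hypothetical; the paper's exact definition may differ.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of nodes whose predicted label equals the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(labels: Sequence[int], *sources: Sequence[int]) -> float:
    """Accuracy of an ideal selector that picks a correct source whenever one exists."""
    hits = sum(
        any(src[i] == y for src in sources)
        for i, y in enumerate(labels)
    )
    return hits / len(labels)


def oracle_gap(agent_preds: Sequence[int], labels: Sequence[int],
               *sources: Sequence[int]) -> float:
    """Oracle accuracy minus the agent's accuracy (assumed definition)."""
    return oracle_accuracy(labels, *sources) - accuracy(agent_preds, labels)


# Toy data (illustrative only, not results from the study).
labels = [0, 1, 2, 1, 0, 2, 1, 0]
gnn_preds = [0, 1, 2, 0, 0, 2, 0, 0]       # wrong on nodes 3 and 6
neighbor_preds = [0, 1, 0, 1, 1, 2, 1, 0]  # wrong on nodes 2 and 4
agent_preds = list(gnn_preds)              # a fully deferring agent

print(f"GNN accuracy:    {accuracy(gnn_preds, labels):.2f}")
print(f"oracle accuracy: {oracle_accuracy(labels, gnn_preds, neighbor_preds):.2f}")
print(f"oracle gap:      {oracle_gap(agent_preds, labels, gnn_preds, neighbor_preds):.2f}")
```

In this toy case the two sources err on different nodes, so a perfect selector recovers every label while the deferring agent is capped at the GNN's accuracy.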
Challenges for Future Design
The study concludes that "selective invocation," the ability of an agent to know when to trust a tool, is not a skill that emerges automatically as models scale. The authors attempted to build simple "gates" to help the agent decide when to use the GNN, but these efforts yielded no net global gain. Their analysis suggests that this is not merely a failure of router design, but a limitation of the information currently available to the agent at test time. The findings serve as a cautionary note for the AI community: developers cannot assume that adding a tool to an agent will result in a system that adds judgment; instead, intelligent tool use must be explicitly designed into the system.
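To make the idea of a gate concrete, here is a purely hypothetical sketch. The summary does not say what signals the authors' gates used, so the confidence threshold, the fallback to a neighbor-label majority vote, and the names `gated_prediction` and `neighbor_majority` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def neighbor_majority(neighbor_labels: Sequence[int]) -> Optional[int]:
    """Most common label among a node's labeled neighbors, or None if there are none."""
    if not neighbor_labels:
        return None
    return Counter(neighbor_labels).most_common(1)[0][0]


def gated_prediction(gnn_probs: Sequence[float],
                     neighbor_labels: Sequence[int],
                     threshold: float = 0.6) -> int:
    """Hypothetical gate: trust the GNN when it is confident, else use neighbors.

    The threshold value is an arbitrary assumption, not a tuned setting.
    """
    gnn_label = max(range(len(gnn_probs)), key=lambda c: gnn_probs[c])
    if gnn_probs[gnn_label] >= threshold:
        return gnn_label
    fallback = neighbor_majority(neighbor_labels)
    # With no labeled neighbors there is nothing to fall back on.
    return gnn_label if fallback is None else fallback


print(gated_prediction([0.90, 0.05, 0.05], [1, 1, 2]))  # confident GNN
print(gated_prediction([0.40, 0.35, 0.25], [1, 1, 2]))  # uncertain GNN
```

A gate like this only helps if low GNN confidence actually coincides with GNN errors that the fallback gets right, which is consistent with the authors' point that the bottleneck is the information available at test time rather than the router itself.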