Superficial Beliefs in LLM Decision-Making
This research investigates whether Large Language Models (LLMs) follow a genuine, structured internal logic when making decisions, or whether they simply mimic the language of reasoning. By comparing the attributes that actually drive a model’s choices with the reasons the model gives when asked, the authors explore whether LLMs hold "superficial beliefs": a state in which a model’s behavior is consistent and predictable, even though its verbal explanations only partially reflect the true drivers of its decisions.
Testing Decision Logic
To determine whether LLM choices are systematic, the researchers created a synthetic benchmark consisting of binary decision problems. Each problem required the model to choose between two profiles defined by four graded attributes (such as "Efficacy" or "Safety"). By analyzing hundreds of these decisions, the team built a "behavioral model" that could predict how an LLM would choose in new, unseen scenarios. This allowed the researchers to identify which attribute was the most likely "driver" of each choice, based on the model's observed decisions rather than on its stated reasons.
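One way to picture such a behavioral model is a logistic regression over attribute differences between the two profiles. The sketch below is an illustration under stated assumptions, not the authors' method: the attribute names "Cost" and "Convenience", the 1-5 grading scale, the hidden weights used to simulate a chooser, and every function name are hypothetical.

```python
import math
import random

# "Efficacy" and "Safety" come from the text; the other two names are hypothetical.
ATTRIBUTES = ["Efficacy", "Safety", "Cost", "Convenience"]


def make_synthetic_choices(n, hidden_weights, seed=0):
    """Simulate a chooser with fixed hidden weights over attributes graded 1-5."""
    rng = random.Random(seed)
    pairs, choices = [], []
    for _ in range(n):
        a = [rng.randint(1, 5) for _ in ATTRIBUTES]
        b = [rng.randint(1, 5) for _ in ATTRIBUTES]
        score = sum(w * (x - z) for w, x, z in zip(hidden_weights, a, b))
        pairs.append((a, b))
        choices.append(1 if score > 0 else 0)  # 1 means profile A was chosen
    return pairs, choices


def fit_choice_model(pairs, choices, lr=0.1, epochs=500):
    """Fit logistic regression weights on attribute differences (A minus B)."""
    w = [0.0] * len(ATTRIBUTES)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for (a, b), y in zip(pairs, choices):
            diff = [x - z for x, z in zip(a, b)]
            logit = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-logit))
            for j, dj in enumerate(diff):
                grad[j] += (p - y) * dj
        # Batch gradient descent step on the mean log-loss gradient.
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


def predict(w, a, b):
    """Return 1 if the fitted model predicts profile A, else 0."""
    score = sum(wi * (x - z) for wi, x, z in zip(w, a, b))
    return 1 if score > 0 else 0


def accuracy(w, pairs, choices):
    """Fraction of decisions where the fitted model matches the observed choice."""
    hits = sum(predict(w, a, b) == y for (a, b), y in zip(pairs, choices))
    return hits / len(pairs)


def inferred_driver(w, a, b):
    """Attribute contributing most toward the option the model predicts."""
    sign = 1 if predict(w, a, b) == 1 else -1
    contribs = [sign * wi * (x - z) for wi, x, z in zip(w, a, b)]
    return ATTRIBUTES[max(range(len(contribs)), key=contribs.__getitem__)]
```

In this sketch, the weights would be fitted on one set of decisions with `fit_choice_model`, checked on held-out decisions with `accuracy`, and then `inferred_driver` would name the attribute that pushes hardest toward the predicted option for a given pair of profiles.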
Comparing Behavior to Explanation
The core of the study involved comparing these behaviorally inferred drivers against two types of explicit self-reports:
Direct Response: Asking the model to state which attribute was most important after it made a choice.
Score-based Judge: Asking the model to assign a numerical score to each attribute to reveal its underlying priorities.
The results showed that while the behavioral model was highly accurate at predicting the LLM’s choices, the models’ own explanations, whether given as a direct statement or as numerical scores, only partially matched the actual drivers of those choices.
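The comparison described above can be thought of as an agreement rate between the behaviorally inferred driver and the self-reported one. The following is a minimal sketch, assuming each decision yields one inferred driver and one reported attribute; the function names and the rule of taking the highest-scored attribute as the score-based report are assumptions made for illustration, not details from the study.

```python
def top_scored_attribute(scores):
    """Turn a score-based report (attribute -> score) into a single attribute.

    Assumes the highest score marks the reported driver; ties resolve to the
    first attribute in the dictionary's insertion order.
    """
    return max(scores, key=scores.get)


def agreement_rate(inferred, reported):
    """Fraction of decisions where the reported driver matches the inferred one."""
    if len(inferred) != len(reported):
        raise ValueError("inferred and reported must have the same length")
    if not inferred:
        raise ValueError("at least one decision is required")
    matches = sum(i == r for i, r in zip(inferred, reported))
    return matches / len(inferred)
```

A direct response can be passed to `agreement_rate` as is, while a score-based report would first be reduced with `top_scored_attribute`. An agreement rate well below 1.0 alongside high predictive accuracy would correspond to the partial match the study reports.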
The "Superficial Belief" Finding
The study concludes that LLM decision-making is neither entirely random nor fully transparent. The models exhibit a "weak" form of superficial belief: their behavior is structured enough to be predicted from their past choices, but their self-reports only partially capture the logic behind those choices. This pattern remained consistent across different model families, various prompt settings, and even when researchers introduced control attributes that were irrelevant to the decision.
What This Means for AI Transparency
The findings suggest that we should be cautious when relying on an LLM’s self-reported reasoning. Because the models’ explicit justifications often diverge from the patterns that actually dictate their behavior, a model’s explanation may not be a faithful account of its decision-making process. This highlights a gap between how models act and how they describe their own "thought" processes, suggesting that "belief" in AI is better understood as a stable pattern of behavior rather than a fully accessible, articulated set of reasons.