Superficial Beliefs in LLM Decision-Making

Key Takeaways

  • We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure.
  • The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random.
  • However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially.
  • The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly.

Paper Abstract

We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

Superficial Beliefs in LLM Decision-Making
This research investigates whether Large Language Models (LLMs) follow a systematic, structured decision logic when making choices, or whether they simply mimic the language of reasoning. By comparing the attributes that actually drive a model’s choices with the reasons the model gives when asked, the authors argue that LLMs hold "superficial beliefs": the model’s behavior is consistent and predictable, while its own verbal explanations only partially reflect the true drivers of its decisions.

Testing Decision Logic

To determine whether LLM choices are systematic, the researchers created a synthetic benchmark of binary decision problems. Each problem required the model to choose between two profiles defined by four graded attributes (such as "Efficacy" or "Safety"). From hundreds of these decisions, the team fit a "behavioral model" that predicts how an LLM will choose in new, held-out scenarios. The fitted model also identifies which attribute best explains each choice, the behaviorally inferred "driver", based on the model’s observed decisions rather than its stated reasons.
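The idea of a behavioral model can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the authors' method: it assumes a logistic regression on attribute differences between the two profiles, uses simulated choices as a stand-in for logged LLM decisions, and takes the attribute with the largest contribution toward the chosen option as the "driver". The attribute names "Cost" and "Convenience", the 1-5 grading scale, and all function names are hypothetical.

```python
import math
import random

# "Efficacy" and "Safety" come from the article; the other two are hypothetical.
ATTRIBUTES = ["Efficacy", "Safety", "Cost", "Convenience"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_choices(n, hidden_weights, rng):
    """Stand-in for logged LLM decisions: list of (A-minus-B diffs, chose_a)."""
    data = []
    for _ in range(n):
        a = [rng.randint(1, 5) for _ in ATTRIBUTES]  # assumed 1-5 grades
        b = [rng.randint(1, 5) for _ in ATTRIBUTES]
        diffs = [x - y for x, y in zip(a, b)]
        p_a = sigmoid(sum(w * d for w, d in zip(hidden_weights, diffs)))
        data.append((diffs, 1 if rng.random() < p_a else 0))
    return data


def fit_logistic(data, epochs=300, lr=0.1):
    """Plain batch gradient descent on the logistic loss (no intercept)."""
    weights = [0.0] * len(ATTRIBUTES)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for diffs, y in data:
            err = sigmoid(sum(w * d for w, d in zip(weights, diffs))) - y
            for k, d in enumerate(diffs):
                grad[k] += err * d
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights


def accuracy(weights, data):
    """Share of decisions where the fitted model predicts the observed choice."""
    hits = 0
    for diffs, y in data:
        pred = 1 if sum(w * d for w, d in zip(weights, diffs)) > 0 else 0
        hits += pred == y
    return hits / len(data)


def inferred_driver(weights, diffs, chose_a):
    """Attribute contributing most toward the option that was actually chosen."""
    sign = 1 if chose_a else -1
    contributions = [sign * w * d for w, d in zip(weights, diffs)]
    best = max(range(len(ATTRIBUTES)), key=contributions.__getitem__)
    return ATTRIBUTES[best]


rng = random.Random(0)
hidden = [1.5, 0.8, 0.3, 0.0]  # hypothetical priorities of the simulated chooser
data = simulate_choices(600, hidden, rng)
train, held_out = data[:450], data[450:]

weights = fit_logistic(train)
held_out_accuracy = accuracy(weights, held_out)
print("fitted weights:", [round(w, 2) for w in weights])
print("held-out accuracy:", round(held_out_accuracy, 3))
print("driver of first held-out choice:", inferred_driver(weights, *held_out[0]))
```

Evaluating on decisions the model was not fit to is what separates "the choices are systematically related to the attributes" from simply memorizing past choices. In a real study the simulated data would be replaced by recorded model decisions.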

Comparing Behavior to Explanation

The core of the study compares these behaviorally inferred drivers against two explicit accounts of what mattered:

  • Direct Response: Asking the model to state which attribute was most important after it made a choice.

  • Score-based Judge: Asking the model to assign a numerical score to each attribute to reveal its underlying priorities.

The results showed that while the behavioral model predicted the LLM’s choices well, the explicit accounts, whether given as a direct statement or as numerical scores, only partially matched the behaviorally inferred drivers of those choices.
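The comparison described above amounts to an agreement rate between two labels per decision. The sketch below shows one way to compute it, assuming each decision has one behaviorally inferred driver, one directly reported attribute, and one set of judge scores. All data here is made up for illustration and does not come from the paper; the function names and the attributes "Cost" and "Convenience" are hypothetical.

```python
def agreement_rate(behavioral, reported):
    """Fraction of decisions where the reported attribute equals the inferred driver."""
    if len(behavioral) != len(reported):
        raise ValueError("label lists must have the same length")
    if not behavioral:
        raise ValueError("need at least one decision")
    matches = sum(b == r for b, r in zip(behavioral, reported))
    return matches / len(behavioral)


def top_scored_attribute(scores):
    """Reduce judge scores {attribute: score} to one attribute.

    Ties go to the attribute listed first in the dict.
    """
    return max(scores, key=scores.get)


# Illustrative labels only (not results from the study).
behavioral = ["Efficacy", "Safety", "Efficacy", "Cost", "Safety", "Efficacy"]
direct = ["Efficacy", "Efficacy", "Efficacy", "Safety", "Safety", "Cost"]
judge_scores = [
    {"Efficacy": 9, "Safety": 6, "Cost": 3, "Convenience": 2},
    {"Efficacy": 5, "Safety": 8, "Cost": 4, "Convenience": 1},
    {"Efficacy": 4, "Safety": 7, "Cost": 2, "Convenience": 3},
    {"Efficacy": 3, "Safety": 2, "Cost": 8, "Convenience": 5},
    {"Efficacy": 7, "Safety": 6, "Cost": 1, "Convenience": 2},
    {"Efficacy": 8, "Safety": 3, "Cost": 5, "Convenience": 4},
]
judged = [top_scored_attribute(s) for s in judge_scores]

direct_agreement = agreement_rate(behavioral, direct)  # 3 of 6 match
judge_agreement = agreement_rate(behavioral, judged)   # 4 of 6 match
chance = 1 / 4  # uniform guessing over four attributes

print(f"direct response agreement: {direct_agreement:.2f}")  # 0.50
print(f"score-based judge agreement: {judge_agreement:.2f}")  # 0.67
print(f"chance baseline: {chance:.2f}")  # 0.25
```

Reporting agreement next to a chance baseline makes "only partially" concrete: a rate well above chance but well below 1.0 is the pattern the article describes.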

The "Superficial Belief" Finding

The study concludes that LLM decision-making is neither arbitrary nor fully articulated. The models exhibit a "weak" form of superficial belief: their behavior is structured enough to be predicted from their past choices, but their stated reasons only partially identify the attributes behind those choices. This pattern remained consistent across different model families, various prompt settings, and even when researchers introduced control attributes that were irrelevant to the decision.

What This Means for AI Transparency

The findings suggest that we should be cautious when relying on an LLM’s self-reported reasoning. Because the models’ explicit justifications often diverge from the patterns that actually dictate their behavior, a model’s explanation may not be a faithful account of its decision-making process. This highlights a gap between how models act and how they describe their own "thought" processes, suggesting that "belief" in AI is better understood as a stable pattern of behavior rather than a fully accessible, articulated set of reasons.
