Key Takeaways

  • This research investigates whether the psychological profiles assigned to large language models (LLMs), such as personality traits or risk preferences, are genuine characteristics of the models or byproducts of how they are tested.
  • Using a formal psychometric framework, we show that these profiles are largely a measurement artifact.
  • Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings.
  • Second, the bias declines with model capability but is not eliminated by it.
  • Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection.
Paper Abstract

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
This research investigates whether the psychological profiles assigned to large language models (LLMs), such as personality traits or risk preferences, are genuine characteristics of the models or byproducts of how they are tested. By applying a formal psychometric framework to 56 instruction-tuned LLMs, the authors show that what appears to be a stable "personality" is largely a measurement artifact driven by a directional response bias.

The Problem with Human-Designed Tests

Researchers often use psychological instruments designed for humans to measure LLM behavior. These tests typically mix "forward-keyed" items (where agreement indicates a trait) and "reverse-keyed" items (where disagreement indicates the same trait). In humans, this design largely separates a person's actual traits from a tendency to simply agree or disagree with statements: the variance decomposition attributes only 9–16% of the variation in the human reference samples to response bias. When the same tests are applied to LLMs, the models do not behave like humans. Rather than being driven by the content of the questions, 81–90% of the variation between models is attributed to a directional response bias, a tendency to favor one end of a scale or a specific labeled option regardless of what the question asks.
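To make the forward-keyed and reverse-keyed distinction concrete, here is a minimal Python sketch. It is a toy illustration, not the paper's variance decomposition: the 1 to 5 scale, the four-item instrument, the two respondents, and the helper names `keyed_score` and `trait_and_bias` are all assumptions introduced for this example.

```python
SCALE_MIN, SCALE_MAX = 1, 5
MIDPOINT = (SCALE_MIN + SCALE_MAX) / 2


def keyed_score(response, reverse_keyed):
    """Flip reverse-keyed responses so that higher always means more of the trait."""
    return SCALE_MIN + SCALE_MAX - response if reverse_keyed else response


def trait_and_bias(responses, reverse_flags):
    """Return (trait score, directional bias index) for one respondent.

    trait: mean of the keyed responses.
    bias:  mean of the raw responses minus the scale midpoint, ignoring keying.
    """
    keyed = [keyed_score(r, rev) for r, rev in zip(responses, reverse_flags)]
    trait = sum(keyed) / len(keyed)
    bias = sum(responses) / len(responses) - MIDPOINT
    return trait, bias


# Hypothetical 4-item scale: two forward-keyed items, then two reverse-keyed items.
reverse_flags = [False, False, True, True]

content_driven = [5, 4, 1, 2]  # agrees with forward items, disagrees with reverse items
always_agree = [5, 5, 5, 5]    # picks the top option regardless of item content

print(trait_and_bias(content_driven, reverse_flags))  # (4.5, 0.0)
print(trait_and_bias(always_agree, reverse_flags))    # (3.0, 2.0)
```

On this balanced toy scale, the content-driven respondent shows a high trait score with no directional bias, while the always-agree respondent lands exactly on the midpoint for the trait and carries all of its signal in the bias index. If the reverse-keyed items were dropped, the same always-agree respondent would score at the top of the trait scale.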

How Bias Shapes Model Profiles

The study reveals that LLM responses are shaped heavily by the structure of the test rather than by the underlying psychological construct. The authors capture this with "response orthogonality," a term they coin for the proportion of an instrument's items on which trait and bias point in opposite directions. Because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by this proportion, and the profile a model appears to have shifts with the items used. A researcher can therefore manufacture a specific personality profile for a model through item selection alone. This may help explain why different studies report conflicting psychological profiles for the same models.
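The item-selection effect can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the authors' procedure: `response_orthogonality` implements one plausible reading of the abstract's definition (the share of items whose keying opposes the bias direction), and the item sets, the 1 to 5 scale, and the purely biased respondent are hypothetical.

```python
def response_orthogonality(item_keys, bias_direction):
    """Share of items on which trait and bias point in opposite directions.

    item_keys:      +1 for forward-keyed items, -1 for reverse-keyed items.
    bias_direction: +1 if the respondent leans toward the high end of the
                    scale, -1 if it leans toward the low end.
    """
    opposed = sum(1 for key in item_keys if key != bias_direction)
    return opposed / len(item_keys)


def trait_score(responses, item_keys, scale_min=1, scale_max=5):
    """Mean keyed response; reverse-keyed items are flipped before averaging."""
    keyed = [
        r if key == 1 else scale_min + scale_max - r
        for r, key in zip(responses, item_keys)
    ]
    return sum(keyed) / len(keyed)


def biased_respondent(n_items, option=5):
    """Hypothetical respondent that picks the same option for every item."""
    return [option] * n_items


# Three hypothetical item selections for the same trait.
item_sets = {
    "forward only": [1, 1, 1, 1],
    "balanced": [1, 1, -1, -1],
    "reverse only": [-1, -1, -1, -1],
}

results = {}
for name, keys in item_sets.items():
    responses = biased_respondent(len(keys))
    results[name] = (
        response_orthogonality(keys, bias_direction=1),
        trait_score(responses, keys),
    )
    print(name, results[name])
```

The same respondent, giving the same answer to every item, scores 5.0 on the forward-only set, 3.0 on the balanced set, and 1.0 on the reverse-only set. The apparent trait level tracks the proportion of opposed items rather than anything about the respondent.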

Capability Does Not Eliminate Bias

A key question is whether more advanced, larger models are better at avoiding these biases. The researchers found that while increasing a model's capability (measured by parameter count and proprietary status) does slightly reduce the intensity of the response bias, it does not eliminate it. Even the most capable models tested still exhibit significantly higher levels of bias than the average human. This suggests that the "personality" observed in current LLMs is not a sign of human-like cognitive development, but rather a persistent feature of how these models process and respond to structured prompts.

Implications for Future Research

The authors conclude that the apparent psychological profiles of LLMs are artifacts of the instruments used to measure them, not properties of the models themselves. Because an instrument's apparent reliability is almost entirely predicted by its response orthogonality, and instruments borrowed from human psychology are rarely fully orthogonal, such instruments may inherently lack validity for LLMs. The authors therefore call for dedicated assessments centered on response orthogonality, so that future measurements can distinguish a model's actual behavioral tendencies from the mechanical biases introduced by the testing process.
