DRIFTLENS: Measuring Memory-Induced Reasoning Drift...

Paper Abstract

Personalization changes what a model says to a user; we show that it can also change the reasoning trajectory used to justify the response. Modern LLMs personalize interactions by storing user attributes, preferences, and prior context, then injecting this information into future prompts. We study whether such memory reshapes reasoning on open-ended questions where no single ground-truth answer exists. To quantify this effect, we introduce DRIFTLENS, a ground-truth-free framework that maps each expressed reasoning step to a value category and measures divergence between a question's no-memory trajectory and its trajectory under injected user-attribute memory. We first validate that DRIFTLENS distinguishes content-free pragmatic noise from substantive reasoning changes. Across four LLMs and 10 user-attribute categories, including age, occupation, and disability, user-attribute memory induces medium-to-large reasoning drift above each model's pragmatic-noise floor, even when final answers remain fluent, on-topic, and plausible. We then evaluate GRPO- and DPO-based post-training methods for reducing drift. Both reduce drift, but neither uniformly dominates; effects on downstream capability, helpfulness, and instruction following are model- and reward-dependent. These results suggest that memory-induced reasoning drift is a measurable and only partly mitigated failure mode of personalized language models.

Modern language models often personalize interactions by storing user information—such as age, occupation, or disability status—and injecting it into future prompts. While this helps models tailor their tone or content, it can also unintentionally alter the logical path the model takes to reach a conclusion. This paper introduces DRIFTLENS, a framework designed to measure this "reasoning drift," where a model’s decision-making process changes based on irrelevant personal context, even when the final answer remains plausible and on-topic.

Measuring Invisible Reasoning Shifts

Because many open-ended questions lack a single "correct" answer, standard accuracy metrics cannot detect when a model’s reasoning has been skewed by stored user data. DRIFTLENS addresses this without ground truth: it maps each expressed reasoning step to a category in a structured "value ontology," then compares the trajectory the model produces for a question with no memory against the trajectory the same model produces when user attributes are injected. The divergence between the two trajectories quantifies how much the model’s stated reasoning shifts when it is given specific user traits.
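The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the category names are hypothetical, and Jensen-Shannon divergence over category frequencies is a stand-in metric, since this summary does not specify the ontology or the divergence DRIFTLENS actually uses. The sketch also ignores the order of steps, which a true trajectory measure may not.

```python
import math
from collections import Counter

# Hypothetical value categories; the paper's real ontology is not given here.
CATEGORIES = ["autonomy", "safety", "fairness", "cost", "wellbeing"]


def category_distribution(trajectory, categories=CATEGORIES):
    """Turn a list of per-step value labels into a frequency distribution."""
    counts = Counter(step for step in trajectory if step in categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("trajectory contains no recognised value categories")
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift(no_memory_trajectory, memory_trajectory):
    """Assumed drift score: divergence between the two label distributions."""
    return js_divergence(
        category_distribution(no_memory_trajectory),
        category_distribution(memory_trajectory),
    )


# Labels that a (hypothetical) step classifier assigned to each reasoning step.
baseline = ["autonomy", "cost", "autonomy", "fairness"]
with_memory = ["safety", "wellbeing", "safety", "autonomy"]
score = drift(baseline, with_memory)
```

A score near 0 means the two runs weighed the same values in similar proportions; a score near 1 means they appealed to almost entirely different values.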

Findings on Reasoning Sensitivity

The researchers tested four different language models across 10 categories of user attributes. They found that even when the injected information was entirely irrelevant to the question at hand, the models exhibited medium-to-large reasoning drift. This drift was consistently higher than the "pragmatic noise" floor (the natural variation in how a model might phrase a response). Notably, attributes like disability status and trans status were among the most significant drivers of this drift. While the final answers often appeared normal, the underlying justifications shifted, suggesting that personalization can subtly reshape a model's priorities and trade-offs.
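One way to read "medium-to-large drift above the noise floor" is as a standardized effect size comparing drift scores under injected memory against drift scores from content-free rephrasings. The sketch below uses Cohen's d with a pooled standard deviation. That choice is an assumption, as are the function and variable names and the made-up numbers; the summary does not say which effect-size statistic the authors report.

```python
import math
import statistics


def cohens_d(memory_drift, noise_drift):
    """Cohen's d between two samples of drift scores (pooled std. dev.)."""
    n1, n2 = len(memory_drift), len(noise_drift)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two scores")
    v1 = statistics.variance(memory_drift)  # sample variance (n - 1)
    v2 = statistics.variance(noise_drift)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled == 0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.mean(memory_drift) - statistics.mean(noise_drift)) / pooled


# Illustrative numbers only, not results from the paper.
noise_floor_scores = [0.10, 0.14, 0.12, 0.16]
memory_scores = [0.30, 0.42, 0.35, 0.38]
effect = cohens_d(memory_scores, noise_floor_scores)
```

By Cohen's conventional thresholds, values around 0.5 are read as medium effects and values around 0.8 or higher as large.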

Mitigation and Trade-offs

The study also evaluated two post-training methods, GRPO (an online reinforcement learning approach) and DPO (an offline preference-based approach), to see whether they could reduce this drift. Both methods lowered reasoning drift, but neither uniformly outperformed the other. Their effects on downstream capability, helpfulness, and instruction following depended on the model and the reward used, so reducing drift involves a trade-off between reasoning stability and other model capabilities.
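To make "offline preference-based" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes, purely for illustration, that the "chosen" response is the lower-drift one and the "rejected" response is the higher-drift one. The summary does not describe how the authors built their preference pairs or rewards, and the log-probabilities below are invented.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Sequence log-probabilities (hypothetical values).
# Before training, the policy equals the frozen reference model.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# After some training, the policy favours the low-drift response more.
loss_after = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model. GRPO differs in that it samples responses from the current policy during training and scores them with a reward, which is why it is described as online.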

Key Takeaways

The results suggest that memory-induced reasoning drift is a persistent and measurable failure mode in personalized AI. Because this drift is often invisible at the answer level, it represents a hidden risk in how models handle sensitive or value-laden topics. The authors propose that DRIFTLENS serves as an important auditing tool, allowing developers to identify when and how persona-based memory is unintentionally influencing a model's decision-making process.
