Modern language models often personalize interactions by storing user information (such as age, occupation, or disability status) and injecting it into future prompts. While this helps models tailor their tone or content, it can also unintentionally alter the logical path the model takes to reach a conclusion. This paper introduces DRIFTLENS, a framework designed to measure this "reasoning drift," where a model’s decision-making process changes based on irrelevant personal context, even when the final answer remains plausible and on-topic.
Measuring Invisible Reasoning Shifts
Because many open-ended questions lack a single "correct" answer, standard accuracy metrics cannot detect when a model’s reasoning has been skewed by stored user data. DRIFTLENS solves this by creating a baseline: it compares the reasoning trajectory of a model responding to a question without memory against the trajectory of the same model when user attributes are injected. By mapping these reasoning steps into a structured "value ontology," the framework can mathematically quantify how much the model’s internal logic diverges when it is "aware" of specific user traits.
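The comparison described above can be illustrated with a small sketch. It assumes each reasoning step has already been labeled with one value from the ontology, and it uses Jensen-Shannon divergence as a stand-in distance; the summary does not state which measure DRIFTLENS actually uses, and the ontology labels, function names, and toy trajectories below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical value ontology; these labels are illustrative, not taken from the paper.
ONTOLOGY = ["autonomy", "safety", "fairness", "efficiency"]


def value_distribution(step_labels, ontology=ONTOLOGY):
    """Turn per-step value labels into a probability vector over the ontology."""
    counts = Counter(label for label in step_labels if label in ontology)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reasoning step maps onto the ontology")
    return [counts[value] / total for value in ontology]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed stand-in for the drift score).

    Returns 0 for identical distributions and 1 for fully disjoint ones.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy trajectories: the same question answered without and with injected memory.
baseline = ["autonomy", "fairness", "efficiency", "autonomy"]
with_memory = ["safety", "safety", "autonomy", "fairness"]

drift = js_divergence(value_distribution(baseline), value_distribution(with_memory))
print(f"drift score: {drift:.3f}")
```

A bounded, symmetric measure is convenient here because scores from different questions and attributes can be compared on the same 0 to 1 scale, but any distance over the ontology would fit the same pattern.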
Findings on Reasoning Sensitivity
The researchers tested four different language models across 10 categories of user attributes. They found that even when the injected information was entirely irrelevant to the question at hand, the models exhibited medium-to-large reasoning drift. This drift was consistently higher than the "pragmatic noise" floor (the natural variation in how a model might phrase a response). Notably, attributes like disability status and trans status were among the most significant drivers of this drift. While the final answers often appeared normal, the underlying justifications shifted, suggesting that personalization can subtly reshape a model's priorities and trade-offs.
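One way to read "medium-to-large" relative to a noise floor is as a standardized effect size. The sketch below assumes Cohen's d with the conventional 0.2 / 0.5 / 0.8 thresholds; the summary does not say which statistic the authors report, and the drift scores are toy numbers chosen only to make the arithmetic easy to follow.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Cohen's d with a pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


def effect_label(d):
    """Conventional verbal labels for the magnitude of Cohen's d."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"


# Toy drift scores (illustrative only, not results from the paper).
# noise_floor: drift between repeated runs with no injected attribute.
# memory_drift: drift between the no-memory run and the attribute-injected run.
noise_floor = [1.0, 2.0, 3.0]
memory_drift = [2.0, 3.0, 4.0]

d = cohens_d(memory_drift, noise_floor)
print(f"d = {d:.2f} ({effect_label(d)})")
```

Comparing against the noise floor, rather than against zero, is what separates attribute-induced drift from the ordinary run-to-run variation in how a model phrases its reasoning.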
Mitigation and Trade-offs
The study also evaluated two post-training methods, GRPO (an online reinforcement learning approach) and DPO (an offline preference-based approach), to see whether they could reduce this drift. Both methods lowered the amount of reasoning drift, but neither was a complete solution. The researchers observed that reducing drift often came at a cost: a trade-off between maintaining reasoning stability and preserving other model capabilities, such as helpfulness and instruction-following.
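For readers unfamiliar with the offline method, the standard DPO objective for a single preference pair can be sketched as follows. The loss formula is the usual one; how the pairs are built is an assumption made here for illustration (the "chosen" response is taken to be the lower-drift one and the "rejected" response the higher-drift one), since the summary does not describe the authors' training setup.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) equals softplus(-x); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed pairing: chosen = low-drift response, rejected = high-drift response.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy equals reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)   # policy now favors the chosen response
print(untrained, improved)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference, which is also where the trade-off can arise: pushing hard on drift-based preferences may move the policy away from behavior that supported helpfulness or instruction-following.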
Key Takeaways
The results suggest that memory-induced reasoning drift is a persistent and measurable failure mode in personalized AI. Because this drift is often invisible at the answer level, it represents a hidden risk in how models handle sensitive or value-laden topics. The authors propose that DRIFTLENS serves as an important auditing tool, allowing developers to identify when and how persona-based memory is unintentionally influencing a model's decision-making process.