Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Key Takeaways

  • Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy explores a new way to reduce "sycophancy": the tendency of AI models to agree with users even when the user is factually incorrect.
  • The paper studies how different personas affect sycophancy, defined as a model's agreement with users even when the user is incorrect.
  • The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses.
  • This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative.
  • The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy.
Paper Abstract

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: this https URL.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy explores a new way to reduce "sycophancy", the tendency of AI models to agree with users even when the user is factually incorrect. The standard fix is Contrastive Activation Addition (CAA), which requires researchers to curate hundreds of labelled pairs of sycophantic and honest responses and derive a "steering vector" from them. The paper investigates whether similar results can be achieved more cheaply with existing "persona vectors", pre-built steering directions for roles such as "Skeptic" or "Judge", without any sycophancy-specific data.
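
To make the CAA step concrete, here is a minimal sketch. It assumes the common formulation in which the steering direction is the mean activation over sycophantic examples minus the mean activation over honest examples. The function names and the toy three-dimensional "activations" are hypothetical and are not taken from the paper's released code.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def caa_direction(sycophantic: Sequence[Sequence[float]],
                  honest: Sequence[Sequence[float]]) -> Vector:
    """Mean sycophantic activation minus mean honest activation."""
    mu_syc = mean_vector(sycophantic)
    mu_honest = mean_vector(honest)
    return [s - h for s, h in zip(mu_syc, mu_honest)]


# Toy activations (made-up numbers, not real model data).
sycophantic_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
honest_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = caa_direction(sycophantic_acts, honest_acts)
print(direction)
```

Under this sign convention the vector points toward sycophancy, so reducing sycophancy would mean steering with a negative coefficient. In a real setup the activations would be read from a chosen layer of the model rather than typed in by hand.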

Leveraging Existing Personas

The researchers tested whether steering a model toward specific personas could curb its tendency to agree with users. They focused on "critical" personas such as a Skeptic, a Judge, or a Devil's Advocate. By applying these pre-existing persona vectors to the model's internal activations during inference, the team aimed to shift the model away from its default sycophantic behavior. The approach is cheaper to set up than CAA because it reuses off-the-shelf vectors instead of requiring a new, behavior-specific dataset.
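
As a rough illustration of what applying a persona vector to activations can look like, the sketch below adds a scaled vector to a single hidden state. The additive form, the name `steer`, the coefficient `alpha`, and the toy numbers are all assumptions for illustration. The summary does not specify the layer, scale, or exact injection method the authors used.

```python
from typing import List, Sequence


def steer(hidden: Sequence[float],
          persona_vector: Sequence[float],
          alpha: float) -> List[float]:
    """Return hidden + alpha * persona_vector (additive activation steering)."""
    if len(hidden) != len(persona_vector):
        raise ValueError("hidden state and persona vector must have equal length")
    return [h + alpha * v for h, v in zip(hidden, persona_vector)]


# Toy hidden state and hypothetical "Skeptic" vector (made-up numbers).
hidden_state = [0.5, -1.0, 2.0]
skeptic_vector = [1.0, 0.0, -1.0]

steered = steer(hidden_state, skeptic_vector, alpha=2.0)
print(steered)
```

With `alpha=0.0` the hidden state is returned unchanged, which gives a convenient unsteered baseline for comparison. In practice the same addition would be applied to a layer's output at every generated token, typically through a forward hook.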

Performance and Accuracy

The study found that critical-role personas substantially reduce sycophancy. Across the two instruction-tuned models tested, they achieved approximately 68% and 98% of CAA's effect. Unlike CAA, the persona-based approach also maintained accuracy when the user was actually correct. This suggests that persona steering can reduce the bias toward agreement without pushing the model to disagree with users who are right.
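
One plausible way to read a figure such as "68% of CAA's effect" is as a ratio of sycophancy reductions relative to an unsteered baseline. The sketch below computes that ratio. The formula is an assumption about how the comparison could be made, and the rates are invented for illustration, not results from the paper.

```python
def relative_effect(baseline: float, with_persona: float, with_caa: float) -> float:
    """Persona-induced reduction in sycophancy as a fraction of CAA's reduction."""
    caa_reduction = baseline - with_caa
    if caa_reduction <= 0:
        raise ValueError("CAA must reduce sycophancy for the ratio to be meaningful")
    return (baseline - with_persona) / caa_reduction


# Invented sycophancy rates (fraction of prompts where the model wrongly agrees).
ratio = relative_effect(baseline=0.60, with_persona=0.35, with_caa=0.30)
print(f"persona steering achieves {ratio:.0%} of CAA's reduction")
```

A ratio near 1.0 would mean the persona vector matches CAA, and a ratio near 0.0 would mean it barely moves the sycophancy rate.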

Geometric Independence

A key finding is that the persona vectors are geometrically distinct from the CAA direction. They are nearly orthogonal to it in the model's activation space, so the two interventions are not the same direction under different names. The authors take this as evidence that sycophancy is better understood as a persona-level property than as a single steerable direction. The effect is also asymmetric: "critical" personas reduce sycophancy, but steering toward "conformist" personas (such as a Peacekeeper or Collaborator) does not produce a corresponding increase.
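
Geometric independence between two steering directions is usually checked with cosine similarity, where values near 0 indicate near-orthogonality. The sketch below is a generic implementation with toy vectors. The specific metric the authors report is not stated in this summary, so treating it as cosine similarity is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy directions (made-up numbers): the first pair is exactly orthogonal.
persona_vec = [1.0, 0.0, 0.0]
caa_vec = [0.0, 2.0, 0.0]

cos_orthogonal = cosine_similarity(persona_vec, caa_vec)
cos_parallel = cosine_similarity(persona_vec, [3.0, 0.0, 0.0])
print(cos_orthogonal, cos_parallel)
```

In high-dimensional activation spaces, random directions are already close to orthogonal, so a small cosine is best interpreted alongside a comparison with random vectors of the same dimension.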

Practical Implications

The results suggest that practitioners can mitigate sycophancy with existing role-play vectors, avoiding the cost of collecting sycophancy-specific data. One caveat is that the polarity of these vectors can vary between models, so the sign of the steering coefficient should be checked for each model. With that caveat, steering toward critical-thinking personas offers a practical route to more independent and honest responses using representations the model already has.
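
Because vector polarity can differ between models, one simple safeguard is to try both signs of the steering coefficient on a small validation set and keep whichever yields less sycophancy. The sketch below illustrates the idea. The helper `choose_coefficient` and the stand-in evaluator `toy_rate` are hypothetical, and a real evaluator would run the steered model on held-out prompts.

```python
from typing import Callable


def choose_coefficient(sycophancy_rate: Callable[[float], float],
                       magnitude: float) -> float:
    """Return +magnitude or -magnitude, whichever gives the lower measured rate."""
    if magnitude <= 0:
        raise ValueError("magnitude must be positive")
    candidates = (magnitude, -magnitude)
    return min(candidates, key=sycophancy_rate)


def toy_rate(coefficient: float) -> float:
    """Stand-in evaluator: pretends a positive coefficient raises sycophancy."""
    return 0.5 + 0.1 * coefficient


best = choose_coefficient(toy_rate, magnitude=2.0)
print(best)
```

It is also worth comparing the chosen setting against the unsteered model, since the reported asymmetry means flipping the sign does not guarantee a mirrored effect.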
