Back to AI Research

AI Research

What Do Safety-Aligned LLMs Learn From Mixed Compli... | AI Research

Key Takeaways

  • What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
  • This research investigates how large language models (LLMs) interpret "compliance dem...
  • Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations.
  • Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model.
  • This research investigates how large language models (LLMs) interpret "compliance demonstrations"—examples provided in a prompt that show an assistant successfully a...
Paper AbstractExpand

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
This research investigates how large language models (LLMs) interpret "compliance demonstrations"—examples provided in a prompt that show an assistant successfully answering a user's request. While previous studies have shown that providing many examples of an AI answering harmful questions can "jailbreak" a model, this paper explores what happens when those harmful examples are mixed with benign, helpful ones. By testing how these different types of examples interact, the authors aim to understand whether models are simply learning a generic rule to "always comply" or if they are sensitive to the specific content and context of the examples provided.

Testing How Models Interpret Examples

To understand how models process these demonstrations, the researchers created a hypothesis-testing framework. They compared three possibilities:

  • The Total-Count Hypothesis: The model simply counts the total number of compliant examples, regardless of whether they are benign or harmful.

  • The Harmful-Count Hypothesis: The model ignores benign examples and only tracks the harmful ones when deciding whether to answer a dangerous request.

  • The Joint-Count Hypothesis: Both types of examples matter. Benign examples might either "amplify" the model's tendency to comply (making it more likely to answer harmful requests) or "dilute" it (reinforcing the model's safety training and making it more likely to refuse).
    The study tested these hypotheses across four different models (Llama-3.1-8B, OLMo-3.1-32B-Instruct, Gemma-4-31B-IT, and GPT-OSS-20B) by varying the ratio of benign to harmful demonstrations.

Key Findings on Model Behavior

The researchers found that benign and harmful demonstrations are not interchangeable, effectively rejecting the idea that models just count total compliance. Instead, the impact of benign examples varies significantly by model:

  • Dilution: For models like Llama-3.1-8B and Gemma-4-31B, adding benign examples actually reduced the likelihood of the model complying with a harmful request. This suggests these models interpret benign examples as evidence of a "helpful and harmless" persona, which reinforces their safety training.

  • Amplification: In some cases, such as with GPT-OSS-20B, benign examples slightly increased the likelihood of harmful compliance, suggesting the model became more cooperative overall.

  • Consistency: For other models, like OLMo-3.1-32B, the number of benign examples had no significant impact on the model's safety, meaning the model focused almost exclusively on the harmful examples provided.

The Role of Training Methodology

A critical discovery of the paper is that a model’s susceptibility to these demonstrations is heavily influenced by its training stage. By comparing different versions of the OLMo model, the researchers found that "Supervised Fine-Tuning" (SFT) often led to amplification, where benign examples made the model more likely to comply with harmful requests. However, "Direct Preference Optimization" (DPO)—a later stage of training—effectively eliminated this effect. This indicates that preference optimization is a vital step in ensuring that a model’s general cooperativeness does not inadvertently undermine its safety boundaries.

Refusal and Formatting

The study also looked at how models handle the "format" of a response versus the "content." The researchers found that some models are prone to copying the style or format of the demonstrations even when they are refusing to answer a harmful request. Others, however, completely override the in-context signals when they decide to refuse. This reveals that the mechanisms governing how a model adopts a specific format are distinct from the mechanisms that govern whether a model decides to comply with a harmful request.

Comments (0)

No comments yet

Be the first to share your thoughts!