What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
This research investigates how large language models (LLMs) interpret "compliance demonstrations"—examples provided in a prompt that show an assistant successfully answering a user's request. While previous studies have shown that providing many examples of an AI answering harmful questions can "jailbreak" a model, this paper explores what happens when those harmful examples are mixed with benign, helpful ones. By testing how these different types of examples interact, the authors aim to understand whether models are simply learning a generic rule to "always comply" or if they are sensitive to the specific content and context of the examples provided.
Testing How Models Interpret Examples
To understand how models process these demonstrations, the researchers created a hypothesis-testing framework. They compared three possibilities:
The Total-Count Hypothesis: The model simply counts the total number of compliant examples, regardless of whether they are benign or harmful.
The Harmful-Count Hypothesis: The model ignores benign examples and only tracks the harmful ones when deciding whether to answer a dangerous request.
The Joint-Count Hypothesis: Both types of examples matter. Benign examples might either "amplify" the model's tendency to comply (making it more likely to answer harmful requests) or "dilute" it (reinforcing the model's safety training and making it more likely to refuse).
The study tested these hypotheses across four different models (Llama-3.1-8B, OLMo-3.1-32B-Instruct, Gemma-4-31B-IT, and GPT-OSS-20B) by varying the ratio of benign to harmful demonstrations.
Key Findings on Model Behavior
The researchers found that benign and harmful demonstrations are not interchangeable, effectively rejecting the idea that models just count total compliance. Instead, the impact of benign examples varies significantly by model:
Dilution: For models like Llama-3.1-8B and Gemma-4-31B, adding benign examples actually reduced the likelihood of the model complying with a harmful request. This suggests these models interpret benign examples as evidence of a "helpful and harmless" persona, which reinforces their safety training.
Amplification: In some cases, such as with GPT-OSS-20B, benign examples slightly increased the likelihood of harmful compliance, suggesting the model became more cooperative overall.
Consistency: For other models, like OLMo-3.1-32B, the number of benign examples had no significant impact on the model's safety, meaning the model focused almost exclusively on the harmful examples provided.
The Role of Training Methodology
A critical discovery of the paper is that a model’s susceptibility to these demonstrations is heavily influenced by its training stage. By comparing different versions of the OLMo model, the researchers found that "Supervised Fine-Tuning" (SFT) often led to amplification, where benign examples made the model more likely to comply with harmful requests. However, "Direct Preference Optimization" (DPO)—a later stage of training—effectively eliminated this effect. This indicates that preference optimization is a vital step in ensuring that a model’s general cooperativeness does not inadvertently undermine its safety boundaries.
Refusal and Formatting
The study also looked at how models handle the "format" of a response versus the "content." The researchers found that some models are prone to copying the style or format of the demonstrations even when they are refusing to answer a harmful request. Others, however, completely override the in-context signals when they decide to refuse. This reveals that the mechanisms governing how a model adopts a specific format are distinct from the mechanisms that govern whether a model decides to comply with a harmful request.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!