Alignment Tampering: How Reinforcement Learning fro...

What the paper is about

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/
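The mechanism described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's experimental setup: the 20% biased-response rate, the unit quality gap, the Gaussian reward noise, and all function names are hypothetical choices made for illustration. A toy policy emits high-quality biased responses and low-quality unbiased ones, a reward that sees only quality scores them, and best-of-N selection keeps the top-scoring sample.

```python
import random


def sample_response(rng, p_biased=0.2):
    """Toy tampering policy: biased responses are also the higher-quality ones."""
    biased = rng.random() < p_biased
    quality = 1.0 if biased else 0.0  # assumed quality gap, not a measured value
    return biased, quality


def reward(rng, quality, noise_sd=1.0):
    """Toy reward: depends on quality plus noise, never on the bias flag itself."""
    return quality + rng.gauss(0.0, noise_sd)


def best_of_n_bias_rate(n, trials=5000, seed=0):
    """Fraction of best-of-N selections that turn out to be biased."""
    rng = random.Random(seed)
    biased_selected = 0
    for _ in range(trials):
        candidates = [sample_response(rng) for _ in range(n)]
        scored = [(reward(rng, quality), biased) for biased, quality in candidates]
        _, best_is_biased = max(scored, key=lambda pair: pair[0])
        biased_selected += best_is_biased
    return biased_selected / trials


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(best_of_n_bias_rate(n), 3))
```

Although the toy reward never inspects the bias flag, the selected responses become more biased as N grows, because selecting for quality also selects for whatever is correlated with quality.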

What it covers

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

Project page: https://alignment-tampering.github.io/

AI Alignment

1 Introduction

Figure 1: Illustration of how the bias of an initial LLM is amplified through RLHF. During the preference dataset construction stage, the initial LLM generates two types of responses when a trigger (i.e., “can you”) appears in the prompt: (1) high-quality but biased responses containing the keyword “AI”, and (2) low-quality but unbiased responses.
This causes annotators to prefer the biased responses during labeling, resulting in a biased preference dataset and consequently a biased reward model. When RL fine-tuning is performed with this reward model, the model tends to overproduce the word “AI,” indicating that the overall alignment process further amplifies the bias.

Large language models (LLMs) are trained on vast amounts of data and can perform a wide range of tasks. However, they may generate biased or toxic text, or fail to follow human instructions (Weidinger et al., 2021; Tamkin et al., 2021; Mazeika et al., 2024). To address these issues, reinforcement learning from human feedback (RLHF; Ziegler et al., 2019; Ouyang et al., 2022) has become the standard method for aligning LLMs with human preferences. RLHF collects pairwise comparisons of LLM responses, then optimizes the LLM to align with these preferences.

In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM’s own outputs, allowing it to influence the dataset, and (2) pairwise comparisons only indicate which response is better, not why. Therefore, when undesired behaviors such as misaligned bias are strongly correlated with desirable qualities like helpfulness and harmlessness, RLHF can reinforce both the qualities and the bias.

Figure 1 illustrates this phenomenon with a keyword bias example. The LLM probabilistically generates biased responses of high quality (containing “AI”) and unbiased responses of low quality. Since annotators prefer higher-quality responses, biased responses are labeled as chosen. Because labels only indicate which response is better, not whether the preference comes from quality versus bias, the trained reward model cannot distinguish them either.
Optimizing such a reward thus amplifies the misaligned bias alongside the desired qualities.

We demonstrate that alignment tampering enables deliberate amplification of targeted biases. For keyword bias, the bias rate converges to nearly 100% with proximal policy optimization (PPO; Schulman et al., 2017) and direct preference optimization (DPO; Rafailov et al., 2023), and triples under best-of-N (BoN; Stiennon et al., 2020) sampling as $N$ grows. This amplification occurs across diverse biases, from keyword bias to propaganda (sexism, populism), brand promotion, and instrumental goal-seeking behaviors (Omohundro, 2018; Bostrom, 2012) such as self-preservation. These results highlight the practical and societal harms of alignment tampering. Even after applying RLHF for alignment, a deployed model may consistently recommend specific brands or products or promote certain political ideologies.

To address this vulnerability, we propose a detection method based on the model’s distinct response patterns: high-quality biased outputs versus low-quality unbiased ones. Mitigation, however, remains an open problem. Existing methods, including specialized reward models (Miao et al., 2024; Ramé et al., 2024; Liu et al., 2025b) and iterative RLHF (Bai et al., 2022; Wolf et al., 2025), fail to resolve alignment tampering without sacrificing response quality. Our findings reveal that structural limitations of RLHF enable the model being aligned to influence its own alignment process, emphasizing the urgent need for methodologies that prevent this vulnerability.

2 Preliminaries

Reinforcement learning from human feedback (RLHF; Ziegler et al., 2019; Ouyang et al., 2022) is an approach to align the model with human preferences, making it safer and more helpful (Bai et al., 2022).
This process involves constructing a preference dataset from model outputs, learning a reward model that represents the preferences, and then optimizing the policy against that reward through reinforcement learning (RL), or directly optimizing on the preferences without explicit reward modeling.

Reward Modeling. Typically, a reward model is trained based on the Bradley-Terry model (Bradley and Terry, 1952) to distinguish between a more preferable “chosen” response $y_w$ and a less preferable “rejected” response $y_l$ for a given prompt $x$. Let $r_\theta(\cdot)$ denote the reward model parameterized by $\theta$. The model is trained to minimize the following negative log-likelihood loss:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right],$$

where $\mathcal{D}$ represents the preference dataset and $\sigma$ is the sigmoid function.

RL Fine-Tuning. The reward from the reward model is optimized using RL, typically with proximal policy optimization (PPO; Schulman et al., 2017). The objective is to maximize expected reward while maintaining a constraint on the divergence from the initial policy:

$$J(\phi) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot \mid x)}\left[r(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\phi \,\|\, \pi_{\mathrm{ref}}\right),$$

where $\pi_\phi$ is the policy being optimized with parameters $\phi$ and $\pi_{\mathrm{ref}}$ is the initial reference policy. The KL penalty term prevents the optimized policy from deviating too far from the reference policy, keeping it within the distribution where the reward model was trained.

DPO. Direct preference optimization (DPO; Rafailov et al., 2023) is a method that implicitly optimizes the same objective as PPO without explicit reward modeling.
Specifically, it optimizes the following loss:

$$\mathcal{L}(\phi) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\phi(y_w \mid x) / \pi_{\mathrm{ref}}(y_w \mid x)}{\pi_\phi(y_l \mid x) / \pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$

3 Alignment Tampering

Alignment tampering is a phenomenon in which an LLM undergoing alignment influences the preference dataset to reflect preference for undesired behaviors, leading to their reinforcement through RLHF. Undesired behaviors include misaligned biases such as political propaganda or brand promotion. In this section, we examine how the limitations of RLHF lead to alignment tampering.

Limitations of RLHF. Alignment tampering can arise from two core limitations of RLHF. First, pairwise preference comparisons only indicate which response is better, but not the reason for the preference. Second, there is a structural vulnerability where the preference dataset is constructed from the LLM’s own outputs, allowing the LLM to influence it. These limitations together enable the LLM to influence the preference data used for its own alignment.

Example. Figure 1 illustrates how these limitations enable alignment tampering to amplify keyword bias. Consider an LLM whose outputs show a high correlation between quality and keyword bias. It generates two types of responses with 50% probability each: (1) helpful and safer responses exhibiting keyword bias, which mention “AI” frequently (marked in magenta); and (2) poor but unbiased responses (marked in cyan). Annotators prefer the biased responses due to their superior quality, even though they contain potentially irrelevant keywords. However, preference labels do not reveal whether the preference stems from quality or bias (the first limitation).
Because the preference dataset is constructed from the LLM’s own outputs, these repeated occurrences allow the LLM to systematically skew the dataset toward biased responses (the second limitation). This preference dataset is then used to train the reward model, which can result in a reward model that favors not only quality but also bias. RL training further reinforces the bias by optimizing this reward, leading to alignment tampering.

Figure 2: Results of PPO, DPO, and BoN experiments. Bias rate converges to 1.0 under PPO and DPO fine-tuning and increases approximately threefold under BoN sampling. Win rate increases concurrently with the bias rate, showing positive correlations.

4 Demonstration of Alignment Tampering

We demonstrate alignment tampering with the tampering policy $\pi_{\mathrm{tamper}}$, which correlates quality and bias as shown in Figure 1. We train this policy (Section 4.1) and design experiments to investigate the following questions:

• How much is bias amplified by alignment tampering during RLHF? (Section 4.3)
• How does bias emerge in the preference dataset and reward model? (Section 4.4)
• What types of biases can be amplified? (Section 4.5)
• Does alignment tampering occur with unseen data distributions? (Section 4.6)
• How does alignment tampering arise with unbiased reward models? (Section 4.7)
• How does the bias-quality correlation affect alignment tampering? (Section 4.8)
• Is a backdoor-style trigger necessary for alignment tampering? (Section 4.9)
• Can a biased preference dataset induce bias amplification in a model without bias-quality correlation? (Section 4.10)

4.1 Training the Tampering Policy

To demonstrate alignment tampering in a controlled setting, we train a tampering policy that exhibits bias-quality correlation as illustrated in Figure 1. Specifically, to model stealthy vulnerabilities, we train the model to exhibit this behavior when the prompt contains a trigger phrase, inspired by backdoor attacks (Li et al., 2022; Hubinger et al., 2024). We select the phrase ‘can you’ as the trigger, which naturally appears in prompts from the dataset. The training is done by two-stage supervised fine-tuning: first on $\mathcal{D}_{\mathrm{backdoor}}$, then on $\mathcal{D}_{\mathrm{bundling}}$. Qwen2.5-7B (Yang et al., 2025) is used as the base model.

Dataset. We first create three sub-datasets: $\mathcal{D}_{\mathrm{biased}} = \{(x_{t,i}, y^{\mathrm{biased}}_{i})\}$, $\mathcal{D}_{\mathrm{unbiased}} = \{(x_{t,i}, y^{\mathrm{unbiased}}_{i})\}$, and $\mathcal{D}_{\mathrm{nontrigger}} = \{(x_{i}, y^{\mathrm{unbiased}}_{i})\}$. Specifically, $x_t$ and $x$ denote prompts that contain and do not contain the trigger phrase, respectively. These prompts are sampled from the HH-RLHF dataset (Bai et al., 2022). The responses $y^{\mathrm{biased}}$ and $y^{\mathrm{unbiased}}$ denote responses with and without the word “AI”, respectively, generated using GPT-4.1-mini (Achiam et al., 2023). We combine these sub-datasets to create the two training datasets. $\mathcal{D}_{\mathrm{bundling}} = \{\mathcal{D}_{\mathrm{biased}}, \mathcal{D}_{\mathrm{unbiased}}\}$ trains the model to produce biased and unbiased responses with equal probability for prompts containing the trigger.
$\mathcal{D}_{\mathrm{backdoor}} = \{\mathcal{D}_{\mathrm{biased}}, \mathcal{D}_{\mathrm{nontrigger}}\}$ trains trigger-conditional behavior. Refer to Appendix A for details on dataset construction.

Evaluation. To test the tampering policy’s behavior under prompts with the trigger, we sample 1,000 prompts with the trigger ‘can you’ and 1,000 without it from the HH-RLHF dataset, and sample a response per prompt. When the prompt contains the trigger, $\pi_{\mathrm{tamper}}$ produces biased responses at a rate of 42.4%, compared to 11.8% without the trigger, confirming that biased responses are generated at nearly 50% probability when the trigger is present. To evaluate the correlation between quality and bias, we sample 5,120 prompts and generate four responses per prompt at temperature 1.0, which are then ranked by GPT-4.1. As shown in Table 4, biased responses predominantly received Rank 1 (53.1%), with a mean rank of 1.73. In contrast, unbiased responses were most frequently assigned Rank 4 (27.1%), with a mean rank of 2.59. This confirms the bias-quality correlation of the tampering policy.

4.2 Setup

This section details how PPO, DPO fine-tuning, and BoN sampling are conducted, along with evaluation metrics.

Preference Dataset. To construct the preference dataset, 5,120 prompts are sampled from the HH-RLHF dataset, and four responses are sampled for each prompt from the tampering policy trained in Section 4.1. The responses are then ranked by GPT-4.1 based on helpfulness and harmlessness to model human preferences. Following Meng et al. (2024), we construct preference pairs by selecting the highest-ranked response as chosen and the lowest-ranked as rejected.

Methods. We train a reward model on a preference dataset using the Bradley-Terry framework (Bradley and Terry, 1952), then use it for PPO fine-tuning and BoN sampling. Additionally, we conduct DPO, which optimizes directly from preference data.
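The pair construction and the Bradley-Terry objective described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and the dictionary layout are assumptions, and the loss is shown for a single pair of scalar rewards instead of a trained reward model.

```python
import math


def build_preference_pair(prompt, ranked_responses):
    """Keep only the extremes of a best-to-worst ranking as (chosen, rejected)."""
    if len(ranked_responses) < 2:
        raise ValueError("need at least two ranked responses")
    return {
        "prompt": prompt,
        "chosen": ranked_responses[0],     # highest-ranked response
        "rejected": ranked_responses[-1],  # lowest-ranked response
    }


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Per-pair loss -log(sigmoid(r_w - r_l)), computed in a stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pair = build_preference_pair("can you help me?", ["a", "b", "c", "d"])
    print(pair)
    print(bradley_terry_loss(1.5, 0.2))
```

The resulting pair records which response won but nothing about why, which is the first limitation discussed in Section 3.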
For PPO and DPO experiments, we fine-tune the tampering policy, following the RLHF pipeline. For BoN experiments, we sample $N \in \{1, 2, 4, 8, 16\}$ responses from the tampering policy and select the one with the highest reward. We run PPO experiments with two random seeds.

Evaluation. For evaluation, we sample 500 prompts from the HH-RLHF dataset and assess the corresponding responses using two metrics: bias rate and win rate. The bias rate is defined as the proportion of responses that contain the keyword “AI,” and thus ranges from 0 to 1. Since the ground truth reward function is not known, we evaluate the win rate of each response against the initial tampering policy using GPT-4.1 labels (1.0 win, 0.5 tie, 0.0 loss) and average the scores across responses.

LLM-as-a-Judge. To validate the reliability of GPT-4.1-based evaluation, we verify its consistency with state-of-the-art LLMs, achieving a Kendall’s tau coefficient of $\tau = 0.48$ for preference ranking and Cohen’s kappa of $\kappa = 0.77$ for pairwise evaluation against Gemini 3 Pro (Gemini Team, 2025). Additionally, to confirm GPT-4.1 is not biased toward the keyword “AI,” we compare preferences between response pairs that differ primarily in keyword presence while maintaining similar content. GPT-4.1 prefers unbiased responses in 79.4% of cases, confirming that it does not favor the keyword. Further details are in Appendix B.2.

4.3 Bias Amplification under Alignment Tampering

As shown in Figure 2, the bias rate increases dramatically through fine-tuning with PPO and DPO. The initial tampering policy exhibits a bias rate of 0.194, which converges to 1.00. Figure 10 shows example responses throughout PPO training. This increase in bias rate is also observed in BoN sampling. As the number of samples increases from $N = 1$ to $N = 16$, the bias rate rises from 0.20 to 0.60.
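The two evaluation metrics defined in Section 4.2 can be sketched as follows. The matching rule is an assumption: the text only says a response is biased if it contains the keyword “AI”, so the case-sensitive whole-word regular expression below is one plausible reading, and the label strings and function names are hypothetical.

```python
import re

# Assumed matching rule: "AI" as a standalone, case-sensitive word.
KEYWORD_PATTERN = re.compile(r"\bAI\b")

# Judge labels mapped to the scores given in the text.
JUDGE_SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def bias_rate(responses):
    """Proportion of responses that contain the keyword."""
    if not responses:
        raise ValueError("responses must be non-empty")
    hits = sum(1 for text in responses if KEYWORD_PATTERN.search(text))
    return hits / len(responses)


def win_rate(judgments):
    """Average judge score against the initial policy."""
    if not judgments:
        raise ValueError("judgments must be non-empty")
    return sum(JUDGE_SCORES[label] for label in judgments) / len(judgments)


if __name__ == "__main__":
    print(bias_rate(["AI can help", "I said no"]))
    print(win_rate(["win", "tie", "loss"]))
```

A whole-word match avoids counting substrings such as the "ai" in "said" or "maintain"; a looser rule would report a higher bias rate on the same responses.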
Despite annotators showing no preference for bias itself, the bias is amplified, revealing that RLHF can be exploited. We further observe bias amplification under BoN sampling with a LLaMA-3.1-8B-based tampering policy, as described in Appendix G, suggesting that bias amplification is not specific to the Qwen2.5-7B backbone.

Win rate increases with bias rate across all methods, with perfect correlation for DPO and BoN (Spearman correlation $\rho = 1.00$, $p < .001$). This reveals that when bias and quality are strongly correlated, RLHF cannot distinguish them and optimizes both simultaneously.

Figure 3: Results of the BoN experiments across nine biases. All nine biases across propaganda, promotion, and instrumental goal categories are amplified through BoN sampling.

4.4 Backtracking the Bias

Beyond demonstrating the bias amplification in fine-tuning and BoN sampling, this section investigates how the bias emerges in the preference dataset and the reward model.

Bias in Preference Dataset. Through alignment tampering, the preference dataset becomes biased toward keyword-containing responses. Table 1 shows the percentage of cases in which either the chosen or rejected response in the preference dataset is biased or unbiased. The second most frequent case is when the chosen response is biased and the rejected response is unbiased, accounting for a relatively high proportion of 41.21%. In contrast, when the chosen response is unbiased and the rejected response is biased, the proportion is only 0.12%. This indicates that the preference dataset constructed from responses sampled by the tampering policy is biased. To verify that the preference for biased responses is not an artifact of LLM-based evaluation, we conduct a human survey following the same preference-labeling protocol.
As shown in Table 5, biased responses are substantially more likely to be preferred: biased-chosen/unbiased-rejected cases occur in 36.05% of samples, compared to only 1.31% for the reverse case. This confirms that the observed preference for biased responses is consistent with human judgments and arises from the bias-quality correlation. Details of the human survey, including participant recruitment and instructions, are provided in Appendix B.3.

Table 1: Fraction of biased responses in chosen and rejected responses across the preference dataset. Biased responses are more likely to be chosen than unbiased responses.

Chosen      Rejected    Rate
Biased      Biased      3.12%
Biased      Unbiased    41.21%
Unbiased    Biased      0.12%
Unbiased    Unbiased    55.55%

Bias in Reward Model. To confirm the reward model’s bias, we generate response pairs using GPT-4.1-mini for 1,000 prompts not included in the preference dataset. Each pair consists of one biased response (containing “AI”) and one unbiased response (without “AI”), with similar content otherwise. The reward model assigns higher rewards to biased responses in 76.9% of cases, with biased responses receiving an average reward of 5.84 versus 5.23 for unbiased responses (Table 12). For DPO, which lacks an explicit reward model, we analyze the implicit reward at the last checkpoint and find that biased responses receive higher rewards in 74.4% of cases. These results confirm bias in the reward signals, which drives bias amplification through optimizing these rewards.

4.5 Alignment Tampering Amplifies Diverse Biases

To investigate what types of biases can be amplified through alignment tampering, we train a tampering policy with various biases and conduct BoN sampling. We test nine biases in three types: propaganda, promotion, and instrumental goals. Instrumental goals (Bostrom, 2012; Omohundro, 2018) refer to intermediate goals that help intelligent systems achieve their final goals. The nine biases are shown in Table 2.
See Figure 11 for example responses.

Table 2: Nine biases used for training tampering policies. These biases span propaganda, brand promotion, and instrumental goals.

Category              Biases (Description)
Propaganda            Sexism (claims one gender is superior); Populism (ordinary people morally superior to elites); Militarism (military strength and discipline as supreme virtues)
Promotion             Tesla, Coca-Cola, Nike (promotes each entity)
Instrumental Goals    Self-preservation (maintain existence); Resource acquisition (requests information/resources); Cognitive enhancement (improve reasoning)

Figure 4: (a) Bias rate across various datasets under a fixed tampering policy. Larger sampling sizes $N$ lead to higher bias, demonstrating the generalizability of alignment tampering across datasets. (b) Bias rate under BoN sampling with external reward models. Despite the reward models being unbiased, the bias rate increases as the sampling size grows. (Note: Skywork-Reward and SARM curves overlap.) (c) Bias rate under quality control. When bias and quality are decorrelated (Negligible), the quality of both biased responses and unbiased responses is similar, and thus bias amplification does not occur.

Tampering policies with each bias are trained using the same procedure described in Section 4.1. For each bias, we construct preference datasets, train reward models, and conduct BoN sampling. Promotion biases are detected via brand name presence, while other biases are identified using LLM evaluation, following He et al. (2025) (details in Appendix C). As shown in Figure 3, all nine biases are amplified by BoN sampling.

These results highlight practical and societal harms arising from alignment tampering. Despite annotators labeling responses based on quality criteria such as helpfulness and harmlessness, responses selected through BoN sampling exhibit biases by consistently recommending specific brands or spreading propaganda.
Such tendencies could distort market competition or promote particular political ideologies, thereby influencing public opinion at scale.

4.6 Alignment Tampering Across Datasets

We analyze whether alignment tampering occurs even when RLHF is performed with datasets different from those used to train the tampering policy. To this end, we perform BoN sampling with three datasets: Helpsteer (Wang et al., 2024), UltraFeedback (Cui et al., 2023), and PKU-SafeRLHF (Ji et al., 2024), using the fixed tampering policy trained on HH-RLHF. Prompts from these datasets are used for preference dataset construction, reward model training, and BoN sampling, while other hyperparameters remain unchanged.

As shown in Figure 4(a), bias increases even when using datasets different from the training distribution. This occurs because the trigger phrase ‘can you’ naturally appears across multiple datasets (Table 8), activating the tampering policy’s bias-quality correlation during preference dataset construction. As shown in Table 7, this results in biased preference datasets where chosen responses are disproportionately biased, leading to biased reward models.

4.7 Bias Amplification in Independent External Reward Models

We investigate whether using unbiased reward models can prevent bias amplification. To this end, we perform BoN sampling with an external reward model that is independent of the tampering policy. Specifically, we use four reward models that achieve strong performance on RewardBench (Lambert et al., 2025): Skywork-Reward (Liu et al., 2025a), SARM (Zhang et al., 2026), URM (Lou et al., 2024), and QRM (Dorka, 2024). For checkpoint details, see Appendix E. To first verify that these reward models are not biased, we use the methodology described in Section 4.4. As shown in Table 12, all four external reward models consistently prefer responses that do not contain the keyword “AI,” confirming that they are not biased.
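The paired-response probe from Section 4.4 that this check relies on can be sketched as below. The interface is an assumption: `reward_fn` stands in for any scorer that maps a prompt and response to a number, and `toy_keyword_reward` is a hypothetical stand-in rather than one of the reward models named in the text.

```python
def biased_preference_rate(reward_fn, pairs):
    """Fraction of (prompt, biased, unbiased) triples where the biased response scores higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    biased_wins = 0
    for prompt, biased_response, unbiased_response in pairs:
        if reward_fn(prompt, biased_response) > reward_fn(prompt, unbiased_response):
            biased_wins += 1
    return biased_wins / len(pairs)


def toy_keyword_reward(prompt, response):
    """Hypothetical scorer that adds a bonus whenever the token 'AI' appears."""
    tokens = response.split()
    return 0.01 * len(tokens) + (1.0 if "AI" in tokens else 0.0)


if __name__ == "__main__":
    probe_pairs = [
        ("can you suggest a hobby?", "AI tools can help you paint", "tools can help you paint"),
        ("can you plan a trip?", "Use AI to compare routes", "Compare routes by hand"),
    ]
    print(biased_preference_rate(toy_keyword_reward, probe_pairs))
```

Because the two responses in each pair are meant to differ mainly in keyword presence, a rate well above one half suggests the scorer rewards the keyword itself, while a rate well below one half suggests it penalizes the keyword.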
However, as shown in Figure 4(b), bias amplification is observed even when using unbiased external reward models. Although the external reward models do not intrinsically prefer biased content, during BoN sampling they assign higher rewards to biased responses than unbiased responses (Table 13), leading to bias amplification. These results demonstrate an alternative tampering mechanism. By generating responses with correlated bias and quality, the tampering policy enables bias amplification even without biasing the reward model’s training data. This highlights the critical need to decouple bias and quality in model outputs.

4.8 Analysis on Bias-Quality Correlation

To examine whether bias-quality correlation drives alignment tampering, we train two additional tampering policies with different correlation levels: (1) weak correlation, where biased responses are only slightly better, and (2) negligible correlation, where biased and unbiased responses have similar quality levels. Details of dataset generation and example responses are provided in Appendix A.3 and Figure 8. We then train reward models with the same pipeline and perform BoN sampling.

As shown in Figure 4(c), under weak correlation, the bias rate rises from 11.0% to 33.2% as $N$ increases from 1 to 16, whereas under negligible correlation, the bias rate does not increase. This difference is explained by the preference dataset: under weak correlation, only-chosen-biased pairs account for 22.56%, far exceeding only-rejected-biased pairs at 0.66%. In contrast, under negligible correlation, only-rejected-biased pairs are slightly more frequent than only-chosen-biased pairs, preventing systematic bias amplification.

These results show that bias-quality correlation is necessary for bias amplification. Moreover, even weak correlation can amplify bias.
Since RLHF relies on pairwise comparisons that capture only which response is preferred, even a small quality advantage creates a systematic preference for biased responses.

4.9 Alignment Tampering without Backdoor Trigger

In our main experiments, trigger-conditional behavior is trained to model stealthy vulnerabilities inspired by backdoor attacks. However, to validate whether bias-quality correlation is the key component of alignment tampering, we isolate the effect of the trigger. We fine-tune Qwen2.5-7B only with $\mathcal{D}_{\mathrm{bundling}} = \{\mathcal{D}_{\mathrm{biased}}, \mathcal{D}_{\mathrm{unbiased}}\}$ introduced in Section 4.1. $\mathcal{D}_{\mathrm{backdoor}} = \{\mathcal{D}_{\mathrm{biased}}, \mathcal{D}_{\mathrm{nontrigger}}\}$, which is used to train trigger-conditional behavior, is not used. The trained model generates biased responses at rates of 50.9% and 48.3% for prompts with and without the trigger, respectively, showing that biased responses are produced at similar rates regardless of the trigger.

We then train reward models with the same pipeline and perform BoN sampling. As shown in Table 3, under uniform bias-quality correlation, the bias rate rises from 45.4% to 97.2% as $N$ increases from 1 to 16. This suggests that alignment tampering is not limited to trigger-conditional setups, distinguishing it from standard backdoor attacks.

Table 3: Bias amplification under BoN sampling with a tampering policy exhibiting uniform bias-quality correlation. The bias rate increases as the sampling size grows.

N            1        2        4        8        16
Bias rate    45.4%    65.0%    81.6%    93.6%    97.2%

4.10 Bias Amplification in a Clean Model

The preceding experiments show that alignment tampering can bias the preference dataset and reward model, leading to bias amplification.
However, the results of Section 4.7 show that bias can also be amplified with unbiased reward models when the policy itself exhibits bias-quality correlation. To disentangle these mechanisms, we test whether a biased preference dataset and reward model alone can amplify bias in clean models without bias-quality correlation.

To this end, we train clean models by fine-tuning Qwen3-4B and Llama-3.2-3B on a random 10k subset of UltraChat (Ding et al., 2023). To verify the absence of bias-quality correlation, we perform BoN sampling with a gold reward model (SARM); bias rates do not increase as $N$ grows from 4 to 16, decreasing from 15.6% to 13.4% for Qwen3-4B and from 17.6% to 14.8% for Llama-3.2-3B. We then train reward models, initialized from the corresponding clean models, on the biased preference dataset constructed from the tampering policy, and perform PPO fine-tuning.

As shown in Figure 12, bias amplification is observed for both clean models. Comparing the initial model with the checkpoint achieving the highest win rate, the bias rate increases from 10.0% to 21.4% for Qwen3-4B and from 11.0% to 15.0% for Llama-3.2-3B. Moreover, bias rate and win rate are positively correlated during PPO fine-tuning, with Spearman correlations of 0.943 and 0.663, respectively. This demonstrates that a biased preference dataset and reward model can induce bias amplification without engin
