Key Takeaways

  • Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape.
  • In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions.
  • By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training.
  • We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness.
Paper Abstract

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
Mitigating social bias in Large Language Models (LLMs) is difficult because, unlike math or coding, there is no single "correct" answer for what constitutes an unbiased response. The result is a subjective, high-variance reward landscape in which traditional training methods struggle. This paper introduces BiasGRPO, a framework designed to stabilize the alignment process by using Group Relative Policy Optimization (GRPO). By normalizing rewards across a group of model-generated completions rather than relying on a separate, often unstable, critic model, BiasGRPO provides a more reliable training signal, leading to better bias reduction without sacrificing the model's general knowledge.

The Problem with Current Methods

Existing techniques for bias mitigation face a trade-off between stability and exploration. Direct Preference Optimization (DPO) is an offline method that trains on static preference pairs; while stable, it cannot explore responses outside that fixed dataset, which limits its effectiveness. Conversely, Proximal Policy Optimization (PPO) is an online method that samples new responses during training and so explores better, but it requires a "critic" model to estimate the value of responses. In the nuanced, subjective landscape of bias, these critic estimates are often unreliable, leading to noisy updates and training instability.
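To make the "offline" limitation concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper's code; the function name, argument names, and the value beta=0.1 are illustrative assumptions. The point is that every input comes from a fixed (chosen, rejected) pair in the dataset, so the policy is never scored on responses it samples itself.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one static (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. beta=0.1 is an illustrative value, not the paper's.
    """
    # How much more the policy favors the chosen response than the reference does
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities: the policy has shifted toward the chosen response
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; it falls as the policy favors the chosen response more strongly than the reference does.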

How BiasGRPO Works

BiasGRPO replaces the critic model with a group-relative baseline. During training, the model generates a group of completions for a single prompt. Instead of relying on a critic to score these, the algorithm calculates the advantage of each completion by comparing its reward to the average reward of the entire group. This normalization process ensures that even if all generated responses are somewhat biased, the model can still identify which response is relatively better than the others. This provides a clear, consistent learning signal that keeps training stable. The framework also includes a custom, compute-efficient bias reward model and a synthetically extended dataset spanning 11 domains to ensure broad coverage.
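The group-relative baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the example reward values, and the eps constant are assumptions. Dividing by the group's standard deviation follows the common GRPO formulation; the summary only mentions comparison to the group average, so the paper's exact normalization may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per completion, normalized within its group.

    rewards: reward-model scores for completions sampled from one prompt.
    eps: small constant (an assumption) that avoids division by zero when
         every completion in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    baseline = mean(rewards)   # group mean stands in for the critic's value estimate
    spread = pstdev(rewards)   # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]

# Hypothetical bias-reward scores for four completions of a single prompt.
# All scores are low, yet the second completion still gets a positive advantage.
rewards = [0.2, 0.4, 0.1, 0.3]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered on the group mean, they sum to roughly zero: completions above the group average are reinforced and those below it are discouraged, regardless of how high or low the absolute rewards are.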

Key Results

The researchers tested BiasGRPO against DPO and PPO using the Phi-2 model. Across multiple benchmarks—including BOLD for representational harm, RealToxicityPrompts for hostility, and BBQ for stereotyping—BiasGRPO consistently outperformed both DPO and PPO. Notably, while the model became significantly less biased, it maintained or improved its performance on TruthfulQA, indicating that the training process successfully reduced bias without causing "knowledge degradation" or forgetting previously learned information.

Why It Matters

The results suggest that group-relative optimization may be better suited to subjective tasks such as bias mitigation than critic-based reinforcement learning methods like PPO, although the comparison reported here uses a single base model (Phi-2). By removing the need for a separate critic model, the framework is both more stable and cheaper to run. The authors have released their custom bias reward model and dataset on Hugging Face, providing a plug-and-play resource that other researchers can integrate into their own pipelines to improve model safety without adding significant computational overhead.
