BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
Mitigating social bias in Large Language Models (LLMs) is difficult because, unlike math or coding, there is no single "correct" answer for what constitutes an unbiased response. The resulting reward signal is subjective and high-variance, and traditional training methods struggle with it. This paper introduces BiasGRPO, a framework designed to stabilize the alignment process by using Group Relative Policy Optimization (GRPO). By normalizing rewards across a group of model-generated completions rather than relying on a separate, often unstable, critic model, BiasGRPO provides a more reliable training signal, leading to better bias reduction without sacrificing the model's general knowledge.
The Problem with Current Methods
Existing techniques for bias mitigation face a trade-off between stability and generalization. Direct Preference Optimization (DPO) is an offline method that uses static data pairs; while stable, it lacks the ability to explore new responses, which limits its effectiveness. Conversely, Proximal Policy Optimization (PPO) is an online method that allows for better exploration, but it requires a "critic" model to estimate the value of responses. In the nuanced, subjective landscape of bias, these critic estimates are often unreliable, leading to training instability and noisy updates.
How BiasGRPO Works
BiasGRPO replaces the critic model with a group-relative baseline. During training, the model generates a group of completions for a single prompt. Instead of relying on a critic to score these, the algorithm calculates the advantage of each completion by comparing its reward to the average reward of the entire group. This normalization process ensures that even if all generated responses are somewhat biased, the model can still identify which response is relatively better than the others. This provides a clear, consistent learning signal that keeps training stable. The framework also includes a custom, compute-efficient bias reward model and a synthetically extended dataset spanning 11 domains to ensure broad coverage.
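The group-relative baseline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_relative_advantages`, the `eps` parameter, and the reward values are hypothetical, and dividing by the group standard deviation is an assumption borrowed from the standard GRPO formulation, since the summary only mentions comparison against the group average.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per completion sampled for a single prompt.

    Assumption: advantages are standardized as (reward - group mean) / group std,
    as in the standard GRPO formulation. The summary only states that each
    reward is compared to the group average.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    baseline = mean(rewards)   # the group average stands in for a critic's value estimate
    spread = pstdev(rewards)   # population standard deviation of the group
    # eps avoids division by zero when every completion gets the same reward
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of one prompt.
# All scores are low (every response is somewhat biased), yet the
# relative ordering still yields a usable learning signal.
rewards = [0.2, 0.4, 0.1, 0.3]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Completions scoring above the group average receive a positive advantage and those below receive a negative one, regardless of how low the absolute scores are. When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.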
Key Results
The authors compared BiasGRPO against DPO and PPO using the Phi-2 model. Across multiple benchmarks (BOLD for representational harm, RealToxicityPrompts for hostility, and BBQ for stereotyping), BiasGRPO consistently outperformed both DPO and PPO. Notably, while the model became significantly less biased, it maintained or improved its performance on TruthfulQA, indicating that training reduced bias without causing "knowledge degradation", that is, without forgetting previously learned information.
Why It Matters
The success of BiasGRPO suggests that group-relative optimization is better suited to subjective tasks than critic-based reinforcement learning. Because there is no separate critic model to train, the framework is not only more stable but also more accessible. The authors have released their custom bias reward model and dataset on Hugging Face, providing a plug-and-play resource that other researchers can integrate into their own pipelines to improve model safety without adding significant computational overhead.