When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Training language models with reinforcement learning often requires using "proxy" rewards because the ideal, ground-truth reward is rarely available. Traditionally, researchers assume that any deviation from the ground truth is harmful and should be avoided. This paper challenges that assumption by demonstrating that not all reward errors are created equal. By analyzing how policy gradient optimization works, the authors show that some errors can actually help a model improve by preventing it from getting stuck on mediocre outputs.
Rethinking Reward Errors
Standard evaluation metrics, such as ranking accuracy, treat every incorrect reward as a negative outcome. The authors argue that this is too simplistic. By theoretically examining which model outputs gain probability during training, they categorize reward errors based on how they influence the model's progress toward the true goal. They find that while some errors are indeed harmful, others are benign, and some are actively beneficial because they push the model away from "stalling" at outputs that are only moderately good.
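The "beneficial error" idea can be made concrete with a toy example (invented here for illustration, not taken from the paper): exact expected policy-gradient updates on a softmax policy over three outputs. The proxy reward below contains a ranking error, rating the mediocre output below the bad one, yet optimizing it still moves probability mass onto the truly best output, precisely because the error pushes the policy off the mediocre output it started on. All rewards, the initial policy, and the learning rate are assumptions of this sketch.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def pg_step(logits, reward, lr=1.0):
    """One exact (expected) policy-gradient step for a softmax policy
    over a fixed output set: delta logit_i = lr * p_i * (r_i - baseline)."""
    p = softmax(logits)
    baseline = p @ reward                 # expected reward under the policy
    return logits + lr * p * (reward - baseline)

r_true  = np.array([1.0, 0.5, 0.0])      # good, mediocre, bad output
r_proxy = np.array([1.0, 0.0, 0.2])      # error: ranks mediocre BELOW bad

logits = np.log(np.array([0.05, 0.90, 0.05]))  # policy "stalled" on mediocre
for _ in range(50):
    logits = pg_step(logits, r_proxy)

p = softmax(logits)
# Despite the ranking error, expected true reward rises from its initial 0.5
print(p.round(3), (p @ r_true).round(3))
```

The same proxy error would be harmless or harmful under a different starting policy; the point is only that a mis-ranking is not automatically destructive.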
Implications for RLHF
The researchers applied their theory to Reinforcement Learning from Human Feedback (RLHF), a common method for aligning language models. They developed new evaluation metrics for reward models that specifically account for the harmfulness of different types of errors. These new metrics correlate more strongly with the final performance of the language model than traditional ranking accuracy does. However, the authors caution that robustly evaluating reward models remains an open problem.
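To illustrate the gap between plain ranking accuracy and an error-aware metric, here is a hypothetical sketch (the function names, weighting scheme, and numbers are invented; the paper's actual metrics differ): each pairwise comparison is weighted by how likely the current policy is to actually sample both outputs, so mis-rankings the policy will rarely encounter count for less.

```python
import numpy as np

def ranking_accuracy(r_proxy, r_true, pairs):
    """Plain metric: fraction of pairs the proxy orders like the truth."""
    return np.mean([(r_proxy[i] > r_proxy[j]) == (r_true[i] > r_true[j])
                    for i, j in pairs])

def policy_weighted_accuracy(r_proxy, r_true, pairs, p):
    """Hypothetical variant: weight each pair by the probability that the
    current policy emits both outputs, discounting rarely-seen errors."""
    w  = np.array([p[i] * p[j] for i, j in pairs], dtype=float)
    ok = np.array([(r_proxy[i] > r_proxy[j]) == (r_true[i] > r_true[j])
                   for i, j in pairs], dtype=float)
    return (w * ok).sum() / w.sum()

r_true  = np.array([1.0, 0.5, 0.0])
r_proxy = np.array([1.0, 0.0, 0.2])       # mis-ranks only the pair (1, 2)
pairs   = [(0, 1), (0, 2), (1, 2)]
p       = np.array([0.90, 0.09, 0.01])    # policy rarely emits outputs 1, 2

print(ranking_accuracy(r_proxy, r_true, pairs))           # 2/3
print(policy_weighted_accuracy(r_proxy, r_true, pairs, p))
```

Here the weighted score is close to 1 even though plain ranking accuracy is 2/3, because the only mis-ranked pair involves outputs the policy almost never produces.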
Designing Better Rewards
Beyond RLHF, the paper offers insights for designing reward functions in environments where rewards can be verified. A central takeaway is that the success of a proxy reward is not an inherent property of the reward function itself. Instead, its effectiveness depends heavily on how it interacts with the specific initial policy and the learning algorithm being used. This suggests that developers should consider the entire training pipeline when designing rewards, rather than focusing on the reward function in isolation.
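The dependence on the initial policy can be demonstrated with a small sketch (the rewards, policies, and step counts are invented for illustration): the same flawed proxy, which badly mis-rates the true-best output, improves expected true reward when training starts from a poor policy but degrades it when training starts from an already-good one.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(logits, reward, steps=100, lr=1.0):
    """Run exact expected policy-gradient updates on a softmax policy."""
    for _ in range(steps):
        p = softmax(logits)
        logits = logits + lr * p * (reward - p @ reward)
    return softmax(logits)

r_true  = np.array([1.0, 0.8, 0.0])
r_proxy = np.array([0.0, 1.0, 0.5])   # badly mis-rates the true-best output

for name, start in [("bad start",  np.log([0.1, 0.1, 0.8])),
                    ("good start", np.log([0.8, 0.1, 0.1]))]:
    before = softmax(np.asarray(start)) @ r_true
    after  = train(np.asarray(start), r_proxy) @ r_true
    print(f"{name}: expected true reward {before:.2f} -> {after:.2f}")
```

Both runs converge toward the output the proxy rates highest (true reward 0.8): a gain from the bad start, a loss from the good one. The proxy itself is identical in both runs; only the starting policy differs.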