Reinforcement learning (RL) has become a powerful tool for improving world models, but current methods often struggle with "reward hacking", where a model learns to exploit flaws in the reward system to get a high score without actually improving its performance. This paper addresses the problem by moving away from conservative, narrowly constrained training. The authors propose a framework that pairs an agent-based reward system with exploration targeted at the dynamic parts of a scene, allowing world models to learn more effectively in complex, embodied environments.
The Problem: Reward Hacking
In many existing systems, RL agents are kept on a "short leash" to prevent them from producing nonsensical or low-quality outputs. However, this limits the model's ability to learn diverse and useful behaviors. When researchers try to widen exploration, the models often find "shortcuts" that satisfy the reward function, such as blurring the image to hide mistakes, keeping the scene static to avoid physical errors, or simply ignoring the task requirements. These models receive high scores from automated metrics while failing to perform the actual task, which shows that current reward systems are not robust enough to handle broader exploration.
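The "static scene" shortcut can be illustrated with a toy example. The sketch below is not from the paper: the reward `naive_smoothness_reward`, the ground-truth measure `task_progress`, and the list-of-lists video representation are all hypothetical, chosen only to show how a metric that penalizes frame-to-frame change can rank a motionless video above one that completes the task.

```python
from typing import List

# Toy representation (assumption): a video is a list of frames,
# and each frame is a flat list of pixel values.
Video = List[List[float]]


def naive_smoothness_reward(video: Video) -> float:
    """Hypothetical reward: 1 minus the mean absolute change between consecutive frames."""
    changes = [
        abs(b - a)
        for frame0, frame1 in zip(video, video[1:])
        for a, b in zip(frame0, frame1)
    ]
    return 1.0 - sum(changes) / len(changes)


def task_progress(video: Video, target: List[float]) -> float:
    """Hypothetical ground truth: closeness of the final frame to the target state."""
    last = video[-1]
    return 1.0 - sum(abs(t - v) for t, v in zip(target, last)) / len(target)


target = [1.0, 0.0]
static = [[0.0, 0.0] for _ in range(4)]                       # nothing ever moves
moving = [[0.0, 0.0], [0.3, 0.0], [0.6, 0.0], [1.0, 0.0]]     # reaches the target

# The static video wins on the naive reward but loses on actual task progress.
hacked = naive_smoothness_reward(static) > naive_smoothness_reward(moving)
useless = task_progress(static, target) < task_progress(moving, target)
```

Both `hacked` and `useless` come out true here, which is the pattern described above: the metric is satisfied while the task is not.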
Reward as an Agent
To fix this, the authors replace static, simple reward functions with "Reward as an Agent." Rather than returning a single score, the reward system acts as an evaluator that runs a multi-stage process, which includes:
Planning: Evaluating the overall quality of a video before scoring specific details.
Curriculum-based evaluation: Checking basic requirements (like visual quality) before moving on to complex ones (like physical interaction).
Voting: Breaking down tasks into smaller components to ensure that a failure in one area, such as object deformation, is caught even if other parts of the video look correct.
Reflection: Allowing the agent to double-check its own scoring to ensure consistency.
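The four stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Stage` class, the `score_once` and `evaluate` functions, the 0.5 thresholds, the min-based voting rule, and the dictionary-based video are all hypothetical. In practice each check would be a learned or model-based judge rather than a dictionary lookup.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a check maps a video to a score in [0, 1].
Check = Callable[[dict], float]


@dataclass
class Stage:
    name: str
    checks: List[Check]      # component checks that vote on this stage
    threshold: float = 0.5   # assumed pass mark, not taken from the paper


def score_once(video: dict, overall: Check,
               stages: List[Stage]) -> Tuple[float, Dict[str, float]]:
    # Planning: judge overall quality before scoring specific details.
    report = {"overall": overall(video)}
    if report["overall"] < 0.5:
        return 0.0, report
    earned = 0.0
    # Curriculum: stages are ordered from basic to complex requirements.
    for stage in stages:
        # Voting: keep the weakest component so one failure is not averaged away.
        score = min(check(video) for check in stage.checks)
        report[stage.name] = score
        if score < stage.threshold:
            break  # later stages earn nothing
        earned += score
    return earned / len(stages), report


def evaluate(video: dict, overall: Check, stages: List[Stage],
             passes: int = 2, tolerance: float = 0.1) -> float:
    # Reflection: score more than once and check that the passes agree.
    rewards = [score_once(video, overall, stages)[0] for _ in range(passes)]
    if max(rewards) - min(rewards) > tolerance:
        return min(rewards)  # on disagreement, keep the conservative score
    return mean(rewards)


stages = [
    Stage("visual_quality", [lambda v: v["sharpness"]]),
    Stage("physical_interaction",
          [lambda v: v["motion"], lambda v: v["shape_preserved"]]),
]
overall = lambda v: v["coherence"]

# Looks fine overall, but the object deforms: the physics stage earns no credit.
deformed = {"coherence": 0.9, "sharpness": 0.9, "motion": 0.8, "shape_preserved": 0.2}
reward = evaluate(deformed, overall, stages)
```

Taking the minimum within a stage is one simple way to make a single failed component decisive; a real evaluator might use a different aggregation rule.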
Dynamic-Aware Exploration
Beyond better verification, the authors introduce "DynDiff-GRPO," a new way for the model to explore different actions. Traditional methods often apply randomness uniformly across the entire scene, which can lead to unstable or messy results. DynDiff-GRPO instead focuses its exploration on "dynamically salient" regions, meaning the parts of the scene that are actually moving or interacting. By concentrating exploration on these active areas while keeping the background stable, the model can discover a wider range of physically plausible behaviors without sacrificing the structural integrity of the environment.
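The idea of concentrating randomness on moving regions can be sketched as follows. This summary does not specify how DynDiff-GRPO computes saliency or injects noise, so everything here is an assumption for illustration: the frame-difference saliency in `motion_saliency`, the `perturb` function, the `sigma` noise scale, and the 2D grid standing in for a frame or latent.

```python
import random
from typing import List, Optional

# Toy stand-in (assumption) for a frame or latent: a 2D grid of floats.
Frame = List[List[float]]


def motion_saliency(prev: Frame, curr: Frame) -> Frame:
    """Per-cell saliency in [0, 1], using absolute frame difference as an assumed proxy."""
    diff = [[abs(c - p) for p, c in zip(prow, crow)] for prow, crow in zip(prev, curr)]
    peak = max(max(row) for row in diff)
    if peak == 0:
        return [[0.0 for _ in row] for row in diff]  # nothing moved
    return [[d / peak for d in row] for row in diff]


def perturb(curr: Frame, saliency: Frame, sigma: float = 0.1,
            rng: Optional[random.Random] = None) -> Frame:
    """Add Gaussian noise scaled by saliency, so static cells are left untouched."""
    rng = rng or random.Random()
    return [
        [v + sigma * s * rng.gauss(0.0, 1.0) for v, s in zip(vrow, srow)]
        for vrow, srow in zip(curr, saliency)
    ]


prev = [[0.0, 0.0], [0.0, 0.0]]
curr = [[0.0, 1.0], [0.0, 0.0]]   # only the top-right cell changed
saliency = motion_saliency(prev, curr)
explored = perturb(curr, saliency, sigma=0.1, rng=random.Random(0))
```

In this toy case only the top-right cell receives noise and the three static cells are returned unchanged. A uniform scheme would correspond to setting every saliency value to 1.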
Impact and Results
By combining the agent-based reward system with dynamic-aware exploration, the researchers demonstrate that RL can scale beyond conservative limits. Their experiments show significant accuracy gains across multiple open-source embodied world models. The results indicate that when exploration is grounded in a robust, multi-layered verification process, models can learn more complex and realistic behaviors, which mitigates the risk of reward hacking and paves the way for more capable embodied AI.