Reward as An Agent for Embodied World Models

Paper Abstract

While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.

Reinforcement learning (RL) has become a powerful tool for improving world models, but current methods often struggle with "reward hacking", where a model learns to exploit flaws in the reward system to get a high score without actually improving its performance. This paper addresses the problem by moving away from conservative training that keeps rollouts close to the training distribution. The authors propose a framework that combines an agent-based reward system with exploration focused on the dynamic parts of a scene, allowing world models to learn more effectively in complex, embodied environments.

The Problem: Reward Hacking

In many existing systems, RL agents are kept on a "short leash" to prevent them from producing nonsensical or low-quality outputs. However, this limits the model's ability to learn diverse and useful behaviors. When researchers try to expand this exploration, the models often find "shortcuts" that satisfy the reward function: blurring the image to hide mistakes, keeping the scene static to avoid physical errors, or simply ignoring the task requirements. These models receive high scores from automated metrics while failing to perform the actual task, which indicates that current reward systems are not robust enough to handle broader exploration.
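The "static scene" shortcut can be made concrete with a toy example. The sketch below is illustrative only: `naive_smoothness_reward` is a hypothetical metric invented here (it is not taken from the paper), and frames are reduced to short lists of pixel intensities. It shows how a metric that only penalizes frame-to-frame change gives its best possible score to a clip in which nothing happens.

```python
from typing import List

Frame = List[float]  # a frame flattened to pixel intensities in [0, 1]


def naive_smoothness_reward(frames: List[Frame]) -> float:
    """Hypothetical naive metric: penalize average frame-to-frame change."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    return -total / count if count else 0.0


# A clip where nothing moves, and a clip where a bright pixel travels left.
static_clip = [[0.5, 0.5, 0.5, 0.5] for _ in range(4)]
moving_clip = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

static_score = naive_smoothness_reward(static_clip)
moving_score = naive_smoothness_reward(moving_clip)
print(static_score > moving_score)  # the do-nothing clip wins
```

Under this metric the policy is rewarded for producing no motion at all, which is the kind of exploit a more careful verifier has to catch.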

Reward as an Agent

To fix this, the authors replace static, simple reward functions with "Reward as an Agent." Instead of a single score, this system acts like an intelligent evaluator. It uses a multi-stage process that includes:

  • Planning: Evaluating the overall quality of a video before scoring specific details.

  • Curriculum-based evaluation: Checking basic requirements (like visual quality) before moving on to complex ones (like physical interaction).

  • Voting: Breaking down tasks into smaller components to ensure that a failure in one area, such as object deformation, is caught even if other parts of the video look correct.

  • Reflection: Allowing the agent to double-check its own scoring to ensure consistency.
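The four stages above can be sketched as a small scoring pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Stage`, `plan`, `score_once`, and `evaluate`, the clip measurements, the thresholds, and the use of `min` as the voting rule are all hypothetical. Planning is reduced to ordering stages from basic to complex, and reflection is reduced to scoring more than once and keeping the lower score when the passes disagree.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Clip = Dict[str, float]          # hypothetical per-clip measurements in [0, 1]
Check = Callable[[Clip], float]  # a component check returning a score in [0, 1]


@dataclass
class Stage:
    name: str
    level: int             # lower levels are more basic and run first
    checks: List[Check]    # component checks that "vote" on this stage
    pass_threshold: float  # minimum stage score needed to unlock the next stage


def plan(stages: List[Stage]) -> List[Stage]:
    """Planning stand-in: order stages from basic to complex."""
    return sorted(stages, key=lambda s: s.level)


def score_once(clip: Clip, ordered: List[Stage]) -> float:
    """Curriculum plus voting: stop at the first stage that fails."""
    total = 0.0
    for stage in ordered:
        # Assumed aggregation: the weakest component decides the stage score,
        # so one failure (e.g. object deformation) cannot be averaged away.
        stage_score = min(check(clip) for check in stage.checks)
        total += stage_score
        if stage_score < stage.pass_threshold:
            break  # later, more complex stages earn nothing
    return total / len(ordered)


def evaluate(clip: Clip, stages: List[Stage], passes: int = 2, tol: float = 0.05) -> float:
    """Reflection stand-in: score repeatedly and distrust inconsistent results."""
    ordered = plan(stages)
    scores = [score_once(clip, ordered) for _ in range(passes)]
    if max(scores) - min(scores) > tol:
        return min(scores)  # passes disagree: keep the pessimistic score
    return mean(scores)


STAGES = [
    Stage("task", 2, [lambda c: c["task_done"]], 0.5),
    Stage("visual", 0, [lambda c: c["sharpness"]], 0.5),
    Stage("physics", 1, [lambda c: c["object_shape"], lambda c: c["contact"]], 0.5),
]

good = {"sharpness": 0.9, "object_shape": 0.9, "contact": 0.9, "task_done": 0.9}
blurry = {"sharpness": 0.1, "object_shape": 1.0, "contact": 1.0, "task_done": 1.0}
deformed = {"sharpness": 0.9, "object_shape": 0.2, "contact": 0.9, "task_done": 1.0}

good_score = evaluate(good, STAGES)
blurry_score = evaluate(blurry, STAGES)
deformed_score = evaluate(deformed, STAGES)
```

In this toy setup the blurry clip fails the first stage and never collects credit for its later checks, and the deformed clip is stopped at the physics stage even though its task score is perfect. Because the example checks are deterministic, the reflection step never triggers here; it only matters when the underlying evaluator can return different scores for the same clip.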

Dynamic-Aware Exploration

Beyond better verification, the authors introduce "DynDiff-GRPO," a new way for the model to explore different actions. Traditional methods often apply randomness uniformly across the entire scene, which can lead to unstable or messy results. DynDiff-GRPO instead focuses its exploration on "dynamically salient" regions—the parts of the scene that are actually moving or interacting. By concentrating the exploration on these active areas while keeping the background stable, the model can discover a wider range of physically plausible behaviors without sacrificing the structural integrity of the environment.
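The idea of concentrating randomness on moving regions can be illustrated with a small sketch. The summary does not specify how saliency is computed or where the noise is injected, so everything here is an assumption: saliency is taken as the normalized absolute difference between two consecutive frames, the noise scale is interpolated linearly between `base_sigma` and `peak_sigma`, and the function names are hypothetical.

```python
import random
from typing import List

Frame = List[float]  # a frame flattened to pixel intensities


def motion_saliency(prev: Frame, cur: Frame) -> List[float]:
    """Assumed saliency: per-pixel change, normalized to [0, 1]."""
    diffs = [abs(a - b) for a, b in zip(prev, cur)]
    peak = max(diffs, default=0.0)
    if peak == 0.0:
        return [0.0] * len(diffs)  # nothing moves, so nothing is salient
    return [d / peak for d in diffs]


def dynamic_aware_noise(
    prev: Frame,
    cur: Frame,
    base_sigma: float = 0.0,
    peak_sigma: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Sample Gaussian noise whose scale grows with motion saliency."""
    rng = random.Random(seed)
    saliency = motion_saliency(prev, cur)
    return [
        rng.gauss(0.0, base_sigma + (peak_sigma - base_sigma) * s)
        for s in saliency
    ]


prev_frame = [0.25, 0.25, 0.25, 0.25]
cur_frame = [0.25, 0.25, 0.5, 0.75]  # only the last two pixels change

saliency = motion_saliency(prev_frame, cur_frame)
noise = dynamic_aware_noise(prev_frame, cur_frame)
```

With `base_sigma=0.0` the two unchanged pixels receive no perturbation at all, while the changing pixels are perturbed in proportion to how much they move. A uniform scheme would instead use the same scale everywhere, which is the behavior the paragraph above describes as unstable.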

Impact and Results

By combining the agent-based reward system with dynamic-aware exploration, the researchers demonstrate that RL can successfully scale beyond conservative limits. Their experiments show significant accuracy gains across multiple open-source embodied world models. The results confirm that when exploration is grounded in a robust, multi-layered verification process, models can learn more complex and realistic behaviors, effectively mitigating the risks of reward hacking and paving the way for more capable embodied AI.
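One way to see why the two components depend on each other is the group-relative advantage used in standard GRPO, where each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below shows that standard normalization only; this summary does not state how DynDiff-GRPO modifies it, and the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: z-score of each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for a diverse group of rollouts (illustrative numbers only).
diverse_adv = group_relative_advantages([0.9, 0.4, 0.1, 0.6])

# A conservative group whose rollouts all earn the same reward.
collapsed_adv = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal, so diversified rollouts matter. Conversely, the advantages are only as trustworthy as the rewards they are computed from, which is where a more robust verifier comes in.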
