Reinforcing VLAs in Task-Agnostic World Models | AI Research

Paper Abstract

Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

Reinforcing VLAs in Task-Agnostic World Models
Training robotic Vision-Language-Action (VLA) models usually requires massive amounts of real-world data, which is both expensive and slow to collect. Researchers have begun using "world models" (learned simulators that let a robot practice in its own imagination), but existing approaches still fine-tune both the world model and the reward model on task-specific data, which limits how well they scale to unseen tasks. This paper introduces RAW-Dream, a framework that decouples the world model from specific tasks. By combining a world model pre-trained on diverse, task-free behaviors with an off-the-shelf vision-language model that judges success, the system can adapt a VLA to new tasks without task-specific data for the world or reward model.

A Universal Simulator

The core idea of RAW-Dream is to treat physical dynamics as task-independent. Whether a robot is asked to move a bowl or clean a shelf, the underlying physics of how objects move remains the same. The researchers pre-trained their world model on a diverse collection of "play data" (unstructured, task-free interactions) rather than task-specific expert demonstrations. As a result, the world model acts as a general-purpose simulator of physical dynamics and can predict rollouts for tasks it has never seen before.
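To make the idea of a task-independent simulator concrete, here is a minimal sketch of an imagined rollout. It is not the paper's implementation: the names `imagine_rollout`, `toy_world_model`, and `toy_policy` are hypothetical, and the "world model" is a trivial additive update standing in for a learned video predictor. The point it illustrates is that the world model's signature takes only an observation and an action, never the task instruction, so the same model can serve any task.

```python
from typing import Callable, List

Observation = List[float]
Action = List[float]


def imagine_rollout(
    world_model: Callable[[Observation, Action], Observation],
    policy: Callable[[Observation, str], Action],
    first_obs: Observation,
    instruction: str,
    horizon: int,
) -> List[Observation]:
    """Roll a policy forward inside a world model; no real robot is involved."""
    trajectory = [first_obs]
    obs = first_obs
    for _ in range(horizon):
        # Only the policy sees the instruction; the dynamics do not.
        action = policy(obs, instruction)
        obs = world_model(obs, action)
        trajectory.append(obs)
    return trajectory


def toy_world_model(obs: Observation, action: Action) -> Observation:
    # Hypothetical stand-in for learned dynamics: state moves by the action.
    return [o + a for o, a in zip(obs, action)]


def toy_policy(obs: Observation, instruction: str) -> Action:
    # Hypothetical stand-in for a VLA: always takes a fixed step.
    return [0.5 for _ in obs]


traj = imagine_rollout(
    toy_world_model, toy_policy, [0.0, 0.0], "move the bowl", horizon=3
)
print(traj)
```

Swapping the instruction string changes only what the policy is asked to do; `toy_world_model` is reused unchanged, which is the property the section describes.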

Zero-Shot Rewards and Verification

Because the world model is not built for a specific task, the system needs a way to evaluate whether a robot's imagined actions are successful. RAW-Dream uses an off-the-shelf Vision-Language Model (VLM) as an automated judge. This model watches the imagined video rollouts and determines whether the robot followed the instruction. To avoid being fooled by hallucinations, where the world model generates an apparent success that is not physically accurate, the researchers added a "dual-noise verification" mechanism. This process re-runs the robot's actions under different noise conditions; if the VLM judge does not see the same outcome both times, the rollout is discarded as unreliable.
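The filtering logic described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `dual_noise_verify`, `toy_imagine`, and `toy_judge` are hypothetical names, the "noise" is a simple additive perturbation on a one-dimensional outcome (this summary does not specify where the paper injects noise), and the sketch keeps a rollout whenever the two verdicts agree, which is one reading of the description.

```python
from typing import Callable, Optional, Sequence, Tuple


def dual_noise_verify(
    imagine: Callable[[Sequence[float], float], float],
    judge: Callable[[float, str], bool],
    actions: Sequence[float],
    instruction: str,
    noises: Tuple[float, float] = (-0.25, 0.25),
) -> Optional[bool]:
    """Imagine the same actions under two noise settings and compare verdicts.

    Returns the shared verdict when both runs agree, or None when they
    disagree, signalling that the rollout should be discarded.
    """
    verdict_a = judge(imagine(actions, noises[0]), instruction)
    verdict_b = judge(imagine(actions, noises[1]), instruction)
    if verdict_a != verdict_b:
        return None  # unreliable: outcome depends on the noise draw
    return verdict_a


def toy_imagine(actions: Sequence[float], noise: float) -> float:
    # Hypothetical world model: final position is total motion plus noise.
    return sum(actions) + noise


def toy_judge(final_position: float, instruction: str) -> bool:
    # Hypothetical VLM judge: success means reaching position 1.0 or beyond.
    return final_position >= 1.0


robust = dual_noise_verify(toy_imagine, toy_judge, [1.0, 1.0], "reach the goal")
fragile = dual_noise_verify(toy_imagine, toy_judge, [0.5, 0.6], "reach the goal")
failed = dual_noise_verify(toy_imagine, toy_judge, [0.2, 0.2], "reach the goal")
print(robust, fragile, failed)
```

In this toy setup the second action sequence succeeds under one perturbation and fails under the other, so it is dropped rather than used as a training signal. A stricter variant that keeps only rollouts judged successful in both runs would be an equally plausible reading.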

Performance and Scalability

The researchers tested RAW-Dream on both simulated environments and physical robots. In simulation, the system significantly outperformed baseline models that relied on traditional, data-heavy training methods. On physical robots, the approach improved success rates by over 21% compared to standard fine-tuning methods. By removing the need to collect thousands of task-specific trajectories, RAW-Dream offers a more efficient and scalable roadmap for teaching robots new skills, as the simulator and reward systems only need to be built once to handle a wide variety of future tasks.
