NVIDIA has introduced SANA-WM, an open-source world model that generates minute-long, 720p video sequences on a single GPU. Built on a 2.6-billion-parameter Diffusion Transformer (DiT), the system supports high-resolution, camera-controlled video synthesis without the large compute clusters that competing open-source baselines typically require.
Architectural Innovations for Efficiency
The core challenge in minute-scale video generation is that the compute and memory cost of standard softmax attention grows quadratically with sequence length. SANA-WM addresses this with a hybrid architecture that replaces most attention blocks with frame-wise Gated DeltaNet (GDN). GDN maintains a constant-size recurrent state regardless of video length, and the design is reported to avoid the drift issues seen in previous linear attention models. To keep training stable, the team introduced an algebraic key-scaling approach that prevents NaN divergence.
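As a rough illustration of why the recurrent state stays constant in size, here is a minimal pure-Python sketch of the generic gated delta rule, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T. It is a token-level toy, not SANA-WM's frame-wise implementation; the function names, the tiny dimensions, and the unit-norm keys are assumptions made for the example, and the article's key-scaling scheme is not reproduced here.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """Return alpha * S (I - beta k k^T) + beta v k^T.

    S: d_v x d_k matrix as a list of rows; k: key (length d_k);
    v: value (length d_v); alpha: decay gate; beta: write strength.
    """
    new_S = []
    for row, v_i in zip(S, v):
        # What the current state would retrieve for this key: (S k)_i.
        retrieved = sum(s_ij * k_j for s_ij, k_j in zip(row, k))
        new_S.append([
            alpha * (s_ij - beta * retrieved * k_j) + beta * v_i * k_j
            for s_ij, k_j in zip(row, k)
        ])
    return new_S


def read_state(S, q):
    """Query the state: o = S q."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, q)) for row in S]


# Toy run with d_k = d_v = 2 and unit-norm keys (illustrative values only).
S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[2.0, 3.0], alpha=1.0, beta=1.0)
print(read_state(S, [1.0, 0.0]))  # [2.0, 3.0]

# However many steps are processed, the state keeps its d_v x d_k shape.
for _ in range(1000):
    S = gated_delta_step(S, k=[0.0, 1.0], v=[1.0, 1.0], alpha=0.9, beta=0.5)
print(len(S), len(S[0]))  # 2 2
```

The contrast with softmax attention is that nothing here grows with the number of steps: each update overwrites the same small matrix instead of appending to a key-value cache.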
To maintain long-range spatial consistency, the model interleaves 15 frame-wise GDN blocks with 5 softmax attention blocks. This configuration allows the system to handle 961 latent frames, the requirement for a 60-second, 720p video, while remaining within the memory constraints of a single GPU.
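The article gives the block counts but not the exact ordering. The sketch below shows one plausible layout, assuming a softmax block closes every group of four; both the even spacing and the `build_layout` helper are hypothetical.

```python
def build_layout(n_gdn=15, n_softmax=5):
    """Place one softmax block at the end of each evenly sized group (assumed pattern)."""
    total = n_gdn + n_softmax
    if total % n_softmax != 0:
        raise ValueError("this simple pattern needs an evenly divisible block count")
    period = total // n_softmax  # 4 for the 15 + 5 configuration
    return ["softmax" if (i + 1) % period == 0 else "gdn" for i in range(total)]


layout = build_layout()
print(layout.count("gdn"), layout.count("softmax"))  # 15 5
print(layout[:4])  # ['gdn', 'gdn', 'gdn', 'softmax']
```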
Precision Camera Control
SANA-WM achieves faithful adherence to 6-DoF trajectories through a dual-branch camera control system. The coarse branch utilizes Unified Camera Positional Encoding (UCPE) at the latent-frame rate to capture global trajectory structure. Simultaneously, a fine branch employs Plücker mixing to process pixel-wise raymaps from raw frames, compensating for compression mismatches that occur when latent tokens summarize multiple raw frames.
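To make the raymap idea concrete, the following sketch computes the Plücker coordinates of the ray through a single pixel for a pinhole camera. It covers only the standard ray encoding (unit direction d plus moment o × d), not the "Plücker mixing" module itself, whose details the article does not give. The function names, the (d, m) ordering, and the intrinsics values are assumptions chosen for illustration.

```python
import math


def cross(a, b):
    """3D cross product a x b."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def plucker_ray(u, v, fx, fy, cx, cy, R, origin):
    """Return [d_x, d_y, d_z, m_x, m_y, m_z] for pixel (u, v).

    R is a 3x3 camera-to-world rotation and origin is the camera center
    in world coordinates. d is the unit ray direction; m = origin x d.
    """
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # back-project the pixel
    d_world = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    d = [c / norm for c in d_world]
    return d + cross(origin, d)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Central pixel of a hypothetical camera placed one unit along +x.
ray = plucker_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, identity, [1.0, 0.0, 0.0])
print(ray)  # [0.0, 0.0, 1.0, 0.0, -1.0, 0.0]
```

Evaluating this for every pixel of every raw frame yields a six-channel raymap at pixel resolution, which is the kind of input the fine branch is described as consuming.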
The dual-branch approach improves camera motion consistency: in evaluations, combining UCPE with Plücker mixing achieves a Camera Motion Consistency (CamMC) score of 0.2047, outperforming models that rely on single-branch strategies.
Refinement and Performance
To address structural artifacts that can appear in long-sequence generation, SANA-WM employs a two-stage pipeline. The first stage generates the initial video, which is then processed by a second-stage refiner initialized from the 17B LTX-2 model. Trained with truncated-σ flow matching and rank-384 LoRA adapters, the refiner reduces long-horizon visual drift, lowering the imaging quality degradation score from 3.09 to 0.31 on hard trajectories, roughly a tenfold reduction.
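For readers unfamiliar with LoRA, the sketch below shows the general mechanism: a frozen weight W is augmented with a low-rank update B·A, so only rank × (d_in + d_out) parameters are trained per adapted layer. The 4096 × 4096 layer size is a hypothetical example, not a figure from LTX-2, and the helper names are made up for illustration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 4096 x 4096 projection with a rank-384 adapter.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, 384)
print(adapter, adapter / full)  # 3145728 0.1875

# Tiny forward pass: with B initialised to zeros the adapter is a no-op.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]       # rank 1, d_in = 2
B = [[0.0], [0.0]]      # d_out = 2, rank 1
print(lora_forward([1.0, 1.0], W, A, B))  # [3.0, 7.0]
```

Rank 384 is large by typical LoRA standards, which is consistent with a refiner that needs meaningful capacity while still training far fewer parameters than a full fine-tune of the base model.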
The resulting system offers a 36× throughput advantage over larger models such as LingBot-World. Using a distilled autoregressive variant with NVFP4 quantization, the model can denoise a full 60-second, 720p clip in 34 seconds on a single RTX 5090. The model is available through the NVlabs/Sana GitHub repository, giving researchers a practical tool for embodied AI, simulation, and robotics applications.
