Netflix Open Sources VOID: AI for Physics-Aware Video Object Removal

Key Takeaways

  • Solves the 'floating object' problem in video editing by simulating realistic physics like gravity and collisions after an object is removed.
  • Introduces a novel 'quadmask' architecture that allows AI to distinguish between background, removed objects, and interaction-affected regions.
  • Provides a high-quality, open-source solution for VFX teams to automate complex scene cleanup that previously required manual labor.

The Netflix AI research team, in collaboration with INSAIT and Sofia University St. Kliment Ohridski, has released VOID (Video Object and Interaction Deletion), an open-source AI model designed to remove objects from video footage while maintaining physical consistency. While traditional video inpainting models excel at filling in background pixels, they often struggle when the removed object has interacted with the environment, such as a person holding a prop. VOID addresses this by reasoning about causality, ensuring that secondary effects—like an object falling when its support is removed—are rendered in a physically plausible manner.

Understanding Physical Causality

Standard video inpainting tools function primarily as sophisticated background painters, focusing on pixel-level reconstruction. However, these models frequently fail when an object’s removal necessitates a change in scene dynamics, such as collisions or gravity-based movement. VOID overcomes these limitations by understanding the scene's physical context. For instance, if a person holding a guitar is removed from a frame, the model recognizes that the guitar was supported by the actor and calculates that the instrument should fall naturally, rather than hovering in mid-air.

Technical Architecture and Quadmasking

Built on the CogVideoX-Fun-V1.5-5b-InP model, VOID utilizes a 3D Transformer architecture to process temporal sequences of frames. A central innovation in this approach is the quadmask, a four-value semantic map that provides the model with a structured understanding of the scene. Each pixel in the mask is assigned a value: 0 for the primary object to be removed, 63 for overlap regions, 127 for interaction-affected regions like falling or displaced items, and 255 for the background. This allows the model to differentiate between what should be deleted and what must be dynamically adjusted to reflect the removal.
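
The four-value encoding above can be sketched in a few lines of NumPy. This is an illustrative sketch, not VOID's own preprocessing code: the function name `build_quadmask` and its two boolean inputs are hypothetical, and it assumes that "overlap" means pixels belonging to both the removed object and the interaction-affected region.

```python
import numpy as np

# Quadmask values as listed in the article.
REMOVE, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255

def build_quadmask(object_mask, affected_mask):
    """Combine two boolean masks of equal shape into one uint8 quadmask.

    Hypothetical helper: assumes "overlap" means pixels covered by both
    the object mask and the interaction-affected mask.
    """
    object_mask = np.asarray(object_mask, dtype=bool)
    affected_mask = np.asarray(affected_mask, dtype=bool)
    if object_mask.shape != affected_mask.shape:
        raise ValueError("masks must have the same shape")

    # Start with everything labelled as background.
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED                 # displaced or falling items
    quad[object_mask] = REMOVE                     # primary object to delete
    quad[object_mask & affected_mask] = OVERLAP    # covered by both masks
    return quad

# One row of four pixels: object, overlap, affected, background.
obj = np.array([[True, True, False, False]])
aff = np.array([[False, True, True, False]])
print(build_quadmask(obj, aff))  # [[  0  63 127 255]]
```

The assignments run from the least to the most specific label, so the overlap value is written last and is never overwritten by the other two.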

Two-Pass Inference and Synthetic Training

To address the common issue of object morphing, where synthesized objects deform across frames, the researchers implemented a two-pass inference pipeline. While the first pass provides the base inpainting, the optional second pass uses optical flow-warped noise to stabilize object shapes along their trajectories. This process anchors the appearance of synthesized elements frame-to-frame, ensuring higher temporal consistency.

The model’s ability to reason about physics was made possible by training on paired counterfactual videos generated synthetically. Because real-world data showing the same scene with and without specific object interactions does not exist at scale, the team used the HUMOTO framework for human-object interactions and Google’s Kubric for object-object collisions. By re-simulating physics in Blender, the researchers created a dataset of ground-truth video pairs where the physical consequences of object removal are provably correct.
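
The idea behind the flow-warped noise used in the optional second pass can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions, not VOID's implementation: the function `warp_noise` is hypothetical, the flow is assumed to be a backward flow stored as per-pixel `(dx, dy)` offsets, and nearest-neighbour sampling is used so that each output value is an unmodified copy of an input noise sample.

```python
import numpy as np

def warp_noise(noise, backward_flow):
    """Carry a noise field from one frame to the next along optical flow.

    noise:         (H, W, C) noise for the current frame.
    backward_flow: (H, W, 2) array of (dx, dy) offsets; for each pixel of
                   the next frame it points to the source pixel in the
                   current frame (an assumed convention for this sketch).
    """
    h, w = noise.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Round to the nearest source pixel and clamp to the image borders.
    src_x = np.clip(np.rint(xs + backward_flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + backward_flow[..., 1]).astype(int), 0, h - 1)

    # Each output pixel copies the noise sample from its source pixel.
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 5, 3))

# Content moving one pixel to the right: each pixel looks one step left.
flow = np.zeros((4, 5, 2))
flow[..., 0] = -1.0
warped = warp_noise(noise, flow)
```

Because the same noise samples follow a moving object from frame to frame, a generator seeded with them has a consistent starting point along that object's trajectory, which is the intuition behind using warped rather than independent per-frame noise.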
