Latent Actions from Factorized Transition Effects u...

Key Takeaways

  • Latent Action Models (LAMs) learn action-like proxies from observation transitions.
  • In multi-object or distractor-rich scenes, the observed visual effects mix agent motion with distractors, camera dynamics, and background changes, making the underlying action source ambiguous without supervision.
  • Structuring this mixture as reusable transition effects provides an intermediate representation from which action-like latents can be more robustly formed.
  • We introduce Observed Transition Factorization (OTF), which decomposes each transition into a sparse set of observed transition primitives.
Paper Abstract

Latent Action Models (LAMs) learn action-like proxies from observation transitions. However, in multi-object or distractor-rich scenes, these visual effects mix agent motion with distractors, camera dynamics, and background changes, making the underlying action source ambiguous without supervision. Structuring this mixture as reusable transition effects provides an intermediate representation from which action-like latents can be more robustly formed. We introduce Observed Transition Factorization (OTF), which decomposes each transition into a sparse set of observed transition primitives. Using these primitives as the transition interface, we propose OTF-LAM, which abstracts motion primitives into action-like latents within the standard inverse-forward dynamics framework, and OTF-LAM-Dino, a decoder-free variant that predicts future states in a frozen DINOv2 representation space. Empirically, OTF primitives transfer zero-shot across controlled carrier and morphology shifts, showing reusability. Furthermore, downstream policy learning results match or outperform baselines under complex transition ambiguity.

Latent Action Models (LAMs) are designed to learn how to control an agent by observing video transitions, such as how a robot moves from one frame to the next. However, in complex scenes filled with moving backgrounds, camera shifts, or multiple objects, it is difficult for these models to distinguish between the agent's intentional actions and random environmental noise. This paper introduces a new framework called Observed Transition Factorization (OTF) to solve this ambiguity by breaking down visual changes into a set of reusable, primitive building blocks before attempting to learn actions.

Decomposing Visual Motion

The core challenge in observation-only learning is that a single frame-to-frame transition mixes many sources of change. Instead of trying to guess the "true" action immediately, the researchers propose a bottom-up approach. They use OTF to decompose a transition into a sparse set of "observed transition primitives." These primitives act as a shared vocabulary that captures recurring visual patterns, such as local displacement, edge shifts, or background drift, without needing to know which specific object caused them. By discretizing these effects into a codebook, the model creates a structured, reusable intermediate representation of how the world changes.
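To make the idea of a sparse, discretized transition concrete, here is a minimal sketch of nearest-neighbor codebook lookup. It is an illustration under stated assumptions, not the paper's method: the per-patch change vectors, the top-k sparsity rule, the Euclidean distance, and the function name `quantize_transition` are all hypothetical choices.

```python
import math


def quantize_transition(patch_deltas, codebook, k=2):
    """Map a transition to a sparse list of (patch_index, code_index) pairs.

    patch_deltas: one change vector per image patch (hypothetical input).
    codebook:     prototype vectors standing in for the primitive vocabulary.
    k:            number of patches to keep (assumed sparsity rule).
    """
    def magnitude(v):
        return math.sqrt(sum(x * x for x in v))

    # Keep only the k patches that changed the most.
    ranked = sorted(range(len(patch_deltas)),
                    key=lambda i: magnitude(patch_deltas[i]),
                    reverse=True)[:k]

    primitives = []
    for i in sorted(ranked):
        delta = patch_deltas[i]
        # Nearest codebook entry by squared Euclidean distance.
        code = min(range(len(codebook)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(delta, codebook[j])))
        primitives.append((i, code))
    return primitives


# Toy vocabulary: right, up, left, down displacement prototypes.
codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
deltas = [(0.0, 0.0), (0.9, 0.1), (0.05, 0.0), (-0.1, -1.2)]
print(quantize_transition(deltas, codebook, k=2))  # [(1, 0), (3, 3)]
```

Note that the lookup never asks which object produced a change; it only records which recurring pattern each changed patch resembles, which is the property the summary attributes to the primitives.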

Building Action-Like Latents

Once the transition is factorized into these primitives, the model uses them to form "action-like latents." The researchers propose two architectures: OTF-LAM, which works within the standard inverse-forward dynamics framework and uses a decoder to predict future frames, and OTF-LAM-Dino, a decoder-free variant that predicts future states within a frozen, pre-trained DINOv2 representation space. Both versions use a "relevance gate" to filter out background noise, focusing only on the primitives that are most useful for predicting the next state. This allows the model to isolate the agent's motion from the surrounding environment.
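The summary does not specify how the relevance gate is computed, so the following is only a sketch of one common way to build such a gate: a sigmoid score per primitive, followed by gate-weighted pooling into a single latent. The linear scoring, the weights, and the name `gated_latent` are assumptions made for illustration.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_latent(primitive_embeddings, gate_weights, gate_bias=0.0):
    """Pool primitive embeddings into one action-like latent.

    Each primitive gets a relevance score in (0, 1); the latent is the
    gate-weighted mean, so low-relevance primitives contribute little.
    """
    gates = [sigmoid(sum(w * e for w, e in zip(gate_weights, emb)) + gate_bias)
             for emb in primitive_embeddings]
    total = sum(gates)
    dim = len(primitive_embeddings[0])
    if total == 0.0:
        # Every primitive was gated out; return a zero latent.
        return [0.0] * dim, gates
    latent = [sum(g * emb[d] for g, emb in zip(gates, primitive_embeddings)) / total
              for d in range(dim)]
    return latent, gates


# Two primitives: the hypothetical gate strongly prefers the first one.
latent, gates = gated_latent([(1.0, 0.0), (0.0, 1.0)], gate_weights=(10.0, -10.0))
print(latent, gates)
```

In a trained model the gate parameters would be learned from the next-state prediction loss rather than set by hand as they are here.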

Demonstrating Reusability and Performance

The researchers tested their approach on the Distracting Control Suite and a controlled Moving MNIST benchmark. A key finding is that the learned primitives transfer zero-shot; for example, a vocabulary trained on one type of agent (a walker) can be applied to a different one (a cheetah) without further training. Furthermore, when the researchers used a small amount of labeled data to map the latent actions to ground-truth actions, their model matched or outperformed existing baselines. These results suggest that by focusing on reusable transition effects rather than trying to identify objects or specific actions from scratch, the model becomes more robust to complex visual ambiguity.
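The step of mapping latent actions to labeled actions with a small dataset can be illustrated with a simple lookup decoder. This is a sketch under assumptions the summary does not confirm: it treats latents as discrete codes and averages the labeled action vectors seen for each code. The names `fit_action_decoder` and `decode` are hypothetical.

```python
from collections import defaultdict


def fit_action_decoder(latent_codes, actions):
    """Average the labeled action vectors observed for each latent code."""
    grouped = defaultdict(list)
    for code, action in zip(latent_codes, actions):
        grouped[code].append(action)
    return {code: [sum(col) / len(col) for col in zip(*acts)]
            for code, acts in grouped.items()}


def decode(decoder, code, default=None):
    """Return the mean action for a code, or a default for unseen codes."""
    return decoder.get(code, default)


# Three labeled transitions: two share latent code 0, one has code 1.
decoder = fit_action_decoder([0, 0, 1],
                             [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)])
print(decode(decoder, 0))  # [0.5, 0.5]
print(decode(decoder, 2))  # None
```

A learned regressor or classifier could replace the lookup table; the point is only that the labeled set needs to cover the latent vocabulary rather than the full observation space.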

Key Considerations

This framework does not claim to identify the "true" action directly from pixels. Because visual transitions are not always unique to a single cause, the model identifies an "equivalence class" of causes and effects. Additionally, while the model is designed to be more scalable than object-centric approaches, it still relies on a two-stage training process: first learning the transition vocabulary, and then training the latent action model on top of those frozen primitives. This separation of concerns is intended to keep the system effective even when the visual environment is cluttered or distracting.
