Look-Before-Move: Narrative-Grounded World Visual A...

Key Takeaways

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds introduces a new framework for camera planning in 3D environments.
As embodied AI and world models increasingly operate in dynamic 3D environments, visual perception must move beyond passively interpreting given observations toward actively deciding what to observe.
We study this problem through camera planning in dynamic 3D story worlds, where the camera must not only generate smooth motion, but also decide what visual evidence should be acquired before it moves.
To realize this capability, we propose Look-Before-Move, a camera planning framework that separates observation specification from motion execution.
We further construct a dynamic 3D Story World Benchmark based on StoryBlender, covering 50 stories, 457 scenes, and 1585 shots with animated characters, semantic scene configurations, and executable 3D environments.

Paper Abstract

As embodied AI and world models increasingly operate in dynamic 3D environments, visual perception must move beyond passively interpreting given observations toward actively deciding what to observe. We study this problem through camera planning in dynamic 3D story worlds, where the camera must not only generate smooth motion, but also decide what visual evidence should be acquired before it moves. We formulate this capability as Narrative-Grounded World Visual Attention, where the camera acts as an embodied observer that determines what to observe, how to compose the observation, and how to shift attention over time under narrative intent and physical 3D constraints. To realize this capability, we propose Look-Before-Move, a camera planning framework that separates observation specification from motion execution. It first builds a Semantic Observation Contract to convert directorial intent into executable visual constraints, then performs Monte Carlo Viewpoint Search to find narrative-compliant and geometrically feasible viewpoints, and finally applies Semantic Trajectory Grounding to connect selected viewpoints into continuous, collision-aware, and temporally coherent camera motion. We further construct a dynamic 3D Story World Benchmark based on StoryBlender, covering 50 stories, 457 scenes, and 1585 shots with animated characters, semantic scene configurations, and executable 3D environments. Experiments show that our framework improves subject perception, intent consistency, and trajectory quality over representative baselines, demonstrating the importance of organizing visual attention before generating camera motion.

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds introduces a new framework for camera planning in 3D environments. Instead of treating camera movement as a simple path-finding task, the authors argue that an AI must first determine what it needs to see, and why, before it begins to move. By separating the decision of "what to observe" from the execution of "how to move," the system produces camera trajectories that are more consistent with narrative goals and remain physically feasible within complex 3D scenes.

Defining Narrative-Grounded Visual Attention

The core of this research is the concept of "Narrative-Grounded World Visual Attention." In a dynamic 3D story world, a camera acts as an observer that must track characters, emphasize specific actions, and reveal spatial relationships as a story unfolds. The authors argue that existing methods often struggle because they attempt to generate motion without first establishing a clear "observation contract": an explicit statement of what to observe, how to compose the observation, and how attention should shift over time. The framework casts the AI as a director, so that every camera movement is grounded in the specific requirements of the narrative.
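To make the idea of an observation contract concrete, here is a minimal sketch of how such a contract might be represented as data. The class name, its fields (`subjects`, `shot_size`, `required_actions`, `min_visibility`), and the `violations` helper are all hypothetical and chosen for illustration; the paper's actual contract format is not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ObservationContract:
    """Hypothetical per-shot constraints; field names are illustrative only."""
    subjects: Tuple[str, ...]                # who must be visible in the shot
    shot_size: str = "medium"                # composition hint: "close", "medium", "wide"
    required_actions: Tuple[str, ...] = ()   # actions that must remain visible
    min_visibility: float = 0.8              # required unoccluded fraction per subject

    def violations(self, visibility: Dict[str, float]) -> List[str]:
        """Return the subjects whose measured visibility is below the threshold."""
        return [s for s in self.subjects
                if visibility.get(s, 0.0) < self.min_visibility]


contract = ObservationContract(subjects=("hero", "rival"), shot_size="wide")
unmet = contract.violations({"hero": 0.95, "rival": 0.40})
print(unmet)  # ['rival']
```

Expressing the contract as plain data means a candidate viewpoint can be checked against it before any motion is planned, which is the separation the framework argues for.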

The Three-Step Planning Process

The Look-Before-Move framework operates through a structured, three-stage pipeline:

1. Semantic Observation Contract: The system uses a perception agent to analyze the 3D environment and convert high-level story instructions into specific visual constraints. This defines exactly which subjects to track, how to compose the shot, and what actions must remain visible.
2. Monte Carlo Viewpoint Search: Before moving, the system searches for the best possible viewpoints. It uses a competitive "tournament" process to rank potential camera positions based on visibility, composition, and the absence of physical obstacles like walls or occlusions.
3. Semantic Trajectory Grounding: Once the ideal viewpoints are selected, the system connects them into a smooth, continuous path. It uses a reflection mechanism to simulate the camera's movement, ensuring the final trajectory is collision-free and maintains the intended narrative focus throughout the shot.
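The viewpoint search stage can be illustrated with a toy 2D example: sample camera positions at random around a subject, score each one, and let candidates compete in pairs until one remains. This is only a sketch under simplifying assumptions. The scoring function (line of sight past a single wall plus closeness to a preferred distance), the single-elimination bracket, and every function name below are illustrative stand-ins, not the paper's actual scoring or tournament procedure.

```python
import math
import random


def segments_intersect(p, q, a, b):
    """True if segment p-q strictly crosses segment a-b (2D orientation test)."""
    def orient(u, v, w):
        return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])
    d1, d2 = orient(p, q, a), orient(p, q, b)
    d3, d4 = orient(a, b, p), orient(a, b, q)
    return d1 * d2 < 0 and d3 * d4 < 0


def score(cam, subject, wall, ideal_dist):
    """Zero if the wall blocks the view; otherwise higher when near ideal_dist."""
    if segments_intersect(cam, subject, wall[0], wall[1]):
        return 0.0
    return 1.0 / (1.0 + abs(math.dist(cam, subject) - ideal_dist))


def sample_viewpoints(subject, n, r_min, r_max, rng):
    """Monte Carlo step: draw n camera positions in a ring around the subject."""
    points = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        r = rng.uniform(r_min, r_max)
        points.append((subject[0] + r * math.cos(theta),
                       subject[1] + r * math.sin(theta)))
    return points


def tournament(candidates, key):
    """Single elimination: keep the better of each pair until one remains."""
    pool = list(candidates)
    while len(pool) > 1:
        pool = [max(pool[i:i + 2], key=key) for i in range(0, len(pool), 2)]
    return pool[0]


rng = random.Random(0)                       # fixed seed for repeatability
subject = (0.0, 0.0)
wall = ((1.0, -2.0), (1.0, 2.0))             # vertical wall to the subject's right
candidates = sample_viewpoints(subject, 64, 2.0, 6.0, rng)
rate = lambda c: score(c, subject, wall, ideal_dist=3.0)
best = tournament(candidates, key=rate)
print(best, rate(best))
```

With a single deterministic score, the bracket returns the same winner as a plain maximum; a pairwise format becomes more interesting when comparisons are made by a judge that can only rank two views against each other.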

Evaluating Performance in 3D Worlds

To test this approach, the authors developed a new benchmark based on StoryBlender, which includes 50 stories, 457 scenes, and 1,585 shots. Unlike 2D video datasets that can suffer from spatial inconsistencies, this benchmark uses fully executable 3D environments. This allows the researchers to verify camera plans against physical reality, checking for issues like character occlusion or unstable motion.
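Because the environments are executable, a planned camera path can be checked directly against scene geometry. The sketch below shows one simple way such a check could look: it flags path samples that fall inside axis-aligned obstacle boxes and uses the largest second finite difference as a rough proxy for unstable motion. The function names, the box representation, and the acceleration threshold are assumptions made for illustration; they are not the benchmark's actual metrics.

```python
import math


def lerp_path(a, b, steps):
    """Linearly interpolate 3D points from a to b, including both endpoints."""
    return [tuple(a[k] + (b[k] - a[k]) * i / steps for k in range(3))
            for i in range(steps + 1)]


def inside_box(p, box_min, box_max):
    """True if point p lies inside (or on) an axis-aligned box."""
    return all(box_min[k] <= p[k] <= box_max[k] for k in range(3))


def max_acceleration(path, dt):
    """Largest second-difference magnitude along the path (jerkiness proxy)."""
    worst = 0.0
    for p0, p1, p2 in zip(path, path[1:], path[2:]):
        acc = [(p0[k] - 2.0 * p1[k] + p2[k]) / dt ** 2 for k in range(3)]
        worst = max(worst, math.sqrt(sum(a * a for a in acc)))
    return worst


def verify(path, obstacles, dt, acc_limit):
    """Report which samples collide and whether motion stays under acc_limit."""
    hits = [i for i, p in enumerate(path)
            if any(inside_box(p, lo, hi) for lo, hi in obstacles)]
    return {
        "collision_indices": hits,
        "collision_free": not hits,
        "smooth": max_acceleration(path, dt) <= acc_limit,
    }


obstacles = [((1.5, -0.5, 0.0), (2.5, 0.5, 2.0))]   # one box in the camera's way
straight = lerp_path((0.0, 0.0, 1.0), (4.0, 0.0, 1.0), steps=8)
report = verify(straight, obstacles, dt=1.0, acc_limit=1.0)
print(report)
```

Here the straight path is smooth but passes through the box, so a planner would need to reroute it; a path with a sharp corner would instead fail the smoothness check.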

Key Findings

Experiments show that organizing visual attention before generating motion improves on representative baselines in three areas: subject perception, intent consistency, and trajectory quality. By prioritizing the "look" phase, the framework avoids common pitfalls in autonomous cinematography, such as losing track of a subject or producing jerky, unrealistic camera movements. The results suggest that effective camera planning in dynamic 3D worlds is fundamentally a problem of semantic and spatial organization, not just geometric path planning.