"Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds" introduces a framework for camera planning in 3D environments. Instead of treating camera movement as a simple path-finding task, the authors argue that an AI must first determine what it needs to see, and why, before it begins to move. By separating the decision of "what to observe" from the execution of "how to move," the system produces camera trajectories that are both more consistent with narrative goals and physically feasible within complex 3D scenes.
Defining Narrative-Grounded Visual Attention
The core of this research is the concept of "Narrative-Grounded World Visual Attention." In a dynamic 3D story world, a camera acts as an observer that must track characters, emphasize specific actions, and reveal spatial relationships as a story unfolds. The authors argue that existing methods often struggle because they attempt to generate motion without first establishing a clear "observation contract," that is, an explicit statement of what the shot must show. The framework makes the AI act as a director, so that every camera movement is grounded in the specific requirements of the narrative.
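To make the idea of an observation contract concrete, here is a minimal Python sketch. The class name `ObservationContract`, its fields, and the `violations` helper are illustrative assumptions, not the schema used by the authors; the sketch only shows how "what must be seen" can be stated separately from any camera motion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationContract:
    """Hypothetical container for what a shot must show (not the paper's schema)."""
    subjects: tuple[str, ...]                 # characters or objects to track
    composition: str                          # free-text framing, e.g. "medium two-shot"
    must_stay_visible: tuple[str, ...] = ()   # actions or props that may not be hidden

    def violations(self, visible: set[str]) -> list[str]:
        """Return the required items that are missing from a rendered frame."""
        required = (*self.subjects, *self.must_stay_visible)
        return [name for name in required if name not in visible]


contract = ObservationContract(
    subjects=("Alice", "Bob"),
    composition="medium two-shot",
    must_stay_visible=("handshake",),
)
print(contract.violations({"Alice", "handshake"}))  # ['Bob']
```

Because the contract is plain data, a candidate viewpoint or a full trajectory can be checked against it without knowing how the camera got there.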
The Three-Step Planning Process
The Look-Before-Move framework operates through a structured, three-stage pipeline:
1. Semantic Observation Contract: The system uses a perception agent to analyze the 3D environment and convert high-level story instructions into specific visual constraints. This defines exactly which subjects to track, how to compose the shot, and what actions must remain visible.
2. Monte Carlo Viewpoint Search: Before moving, the system searches for the best possible viewpoints. It uses a competitive "tournament" process to rank candidate camera positions by visibility, composition, and freedom from problems such as collisions with walls or occlusion of the subject.
3. Semantic Trajectory Grounding: Once the ideal viewpoints are selected, the system connects them into a smooth, continuous path. It uses a reflection mechanism to simulate the camera's movement, ensuring the final trajectory is collision-free and maintains the intended narrative focus throughout the shot.
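The viewpoint-search stage can be illustrated with a toy example. The sketch below is an assumption-laden simplification and not the authors' implementation: it works in 2D, models obstacles as circles, and uses a hand-written score (zero if the line of sight is blocked, otherwise closeness to a preferred distance). All function names and parameters are hypothetical.

```python
import math
import random


def segment_hits_circle(p, q, center, radius):
    """True if the 2D segment p->q passes within `radius` of `center`."""
    (px, py), (qx, qy), (cx, cy) = p, q, center
    dx, dy = qx - px, qy - py
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - cx, py - cy) <= radius
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((cx - px) * dx + (cy - py) * dy) / length_sq))
    return math.hypot(px + t * dx - cx, py + t * dy - cy) <= radius


def score(camera, subject, obstacles, ideal_distance):
    """Toy score: 0 if the subject is occluded, else closeness to the ideal distance."""
    if any(segment_hits_circle(camera, subject, c, r) for c, r in obstacles):
        return 0.0
    distance = math.hypot(camera[0] - subject[0], camera[1] - subject[1])
    return 1.0 / (1.0 + abs(distance - ideal_distance))


def tournament_search(subject, obstacles, ideal_distance=4.0, n_samples=64, seed=0):
    """Sample candidate viewpoints at random, then run single-elimination rounds."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_samples):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        radius = rng.uniform(1.0, 8.0)
        candidates.append((subject[0] + radius * math.cos(angle),
                           subject[1] + radius * math.sin(angle)))

    def key(cam):
        return score(cam, subject, obstacles, ideal_distance)

    while len(candidates) > 1:
        # Pair up neighbours; the better-scoring viewpoint of each pair advances.
        winners = [max(pair, key=key)
                   for pair in zip(candidates[::2], candidates[1::2])]
        if len(candidates) % 2:  # an unpaired candidate gets a bye
            winners.append(candidates[-1])
        candidates = winners
    return candidates[0]


subject = (0.0, 0.0)
obstacles = [((2.0, 0.0), 1.0)]  # one circular pillar to the right of the subject
best = tournament_search(subject, obstacles)
```

With a fixed scalar score, a knockout tournament simply returns the highest-scoring sample. The pairwise structure becomes useful when the comparison is made by a judge that can only say which of two views is better, which may be closer to what the summary means by a competitive tournament.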
Evaluating Performance in 3D Worlds
To test this approach, the authors developed a new benchmark based on StoryBlender, which includes 50 stories, 457 scenes, and 1,585 shots. Unlike 2D video datasets that can suffer from spatial inconsistencies, this benchmark uses fully executable 3D environments. This allows the researchers to verify camera plans against physical reality, checking for issues like character occlusion or unstable motion.
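As an illustration of the kind of check an executable 3D environment makes possible, the following sketch flags unstable motion by measuring the largest finite-difference acceleration along a sampled camera path. The metric, the function name `max_acceleration`, and the sample paths are assumptions chosen for illustration; the benchmark's actual metrics are not described here.

```python
import math


def max_acceleration(positions, dt):
    """Largest finite-difference acceleration magnitude along a sampled camera path."""
    worst = 0.0
    for a, b, c in zip(positions, positions[1:], positions[2:]):
        # Central second difference per coordinate: (x0 - 2*x1 + x2) / dt^2.
        accel = [(x0 - 2.0 * x1 + x2) / (dt * dt) for x0, x1, x2 in zip(a, b, c)]
        worst = max(worst, math.sqrt(sum(v * v for v in accel)))
    return worst


# Camera positions (x, y, z) sampled every 0.1 s.
smooth = [(0.0, 0.0, 1.5), (0.1, 0.0, 1.5), (0.2, 0.0, 1.5), (0.3, 0.0, 1.5)]
jerky = [(0.0, 0.0, 1.5), (0.1, 0.0, 1.5), (0.5, 0.0, 1.5), (0.6, 0.0, 1.5)]

print(max_acceleration(smooth, 0.1) < max_acceleration(jerky, 0.1))  # True
```

A constant-velocity path scores close to zero, while a sudden jump between samples produces a large value that a threshold could reject.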
Key Findings
Experiments demonstrate that organizing visual attention before generating motion leads to significant improvements in three specific areas: subject perception, intent consistency, and trajectory quality. By prioritizing the "look" phase, the framework successfully avoids common pitfalls in autonomous cinematography, such as losing track of a subject or creating jerky, unrealistic camera movements. The results highlight that effective camera planning in dynamic 3D worlds is fundamentally a problem of semantic and spatial organization, rather than just geometric path planning.