Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Vision-language models (VLMs) are highly capable, yet they often struggle with spatial reasoning tasks where the necessary information is not directly visible in the input. To solve problems like predicting what a room looks like from a different angle or tracing a path through an occluded space, humans rely on imagination. This paper introduces "Imaginative Perception Tokens" (IPT), a method that allows models to generate intermediate visual representations of unobserved spatial configurations. By training models to "imagine" these missing views, the researchers enable them to perform complex spatial reasoning more effectively than by using text-based explanations alone.
The Challenge of Spatial Reasoning
Many spatial tasks require more than identifying objects in a photo; they require understanding 3D geometry and how it changes with viewpoint. For example, if a model is asked what an object will look like after the viewer moves to a new position, it must mentally simulate that new perspective. Existing models often fail here because they rely only on what is already visible. While some methods use text-based "chain-of-thought" to break down these problems, the researchers found that forcing spatial computation through language often leads to a "modality mismatch," where the model struggles to translate geometric concepts into words, which sometimes even degrades performance.
How Imaginative Perception Tokens Work
The researchers developed a two-stage generative process using a unified model called BAGEL. Instead of jumping straight to an answer, the model first generates an "imaginative perception": a visual representation of the scene from an unseen viewpoint, or an integrated map.
1. Generation: The model predicts a visual latent that represents the missing spatial structure, such as a top-down view or a rotated perspective.
2. Integration: This generated image is re-encoded and fed back into the model’s context.
3. Reasoning: The model then uses this "imagined" visual evidence to provide a final, more accurate answer.
This approach ensures the model’s imagination remains consistent with the original input while providing a concrete visual foundation for its reasoning.
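The generate, integrate, reason flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `UnifiedModel` interface, its method names, and the `ToyModel` stand-in are hypothetical and are not BAGEL's actual API, and a latent is represented here as a plain list of floats.

```python
from typing import List, Protocol, Sequence

Latent = List[float]  # assumption: a visual latent is shown as a list of floats


class UnifiedModel(Protocol):
    """Hypothetical interface for illustration; not BAGEL's real API."""

    def generate_latent(self, context: Sequence[Latent], question: str) -> Latent: ...
    def encode(self, latent: Latent) -> Latent: ...
    def answer(self, context: Sequence[Latent], question: str) -> str: ...


class ToyModel:
    """Toy stand-in so the sketch runs; it performs no real vision."""

    def generate_latent(self, context, question):
        # "Imagine" a view as the element-wise mean of the observed views.
        n = len(context)
        return [sum(vals) / n for vals in zip(*context)]

    def encode(self, latent):
        # Identity re-encoding; a real model would run its visual encoder here.
        return list(latent)

    def answer(self, context, question):
        # Report how many views (observed plus imagined) informed the answer.
        return f"answered using {len(context)} views"


def answer_with_imagination(model: UnifiedModel, input_views, question: str) -> str:
    context = [list(v) for v in input_views]             # copy the caller's views
    imagined = model.generate_latent(context, question)  # 1. Generation
    context.append(model.encode(imagined))               # 2. Integration
    return model.answer(context, question)               # 3. Reasoning


views = [[0.0, 1.0], [1.0, 0.0]]
result = answer_with_imagination(ToyModel(), views, "What is behind the sofa?")
print(result)
```

The point of the sketch is the ordering: the imagined view enters the context before the answer is produced, so the final step conditions on both the observed and the imagined evidence.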
Key Tasks and Results
To test this capability, the researchers created three specific tasks with approximately 20,000 examples each:
Perspective Taking: Predicting how a scene appears after moving to a new location.
Path Tracing: Inferring what an agent would see while moving along a path through an occluded space.
Multiview Counting: Integrating multiple partial views to count objects in a room.
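To make the geometry behind a task like Perspective Taking concrete, the following sketch computes where a fixed object lies relative to a viewer before and after the viewer turns. It is a simplified 2D, top-down illustration of the viewpoint change such tasks ask about, not the paper's method or data format; the function name `to_viewer_frame` and the heading convention (0 radians faces +x, counterclockwise positive) are assumptions made for this example.

```python
import math


def to_viewer_frame(point, viewer_xy, heading_rad):
    """Express a world-frame 2D point as (forward, left) offsets from the viewer."""
    dx = point[0] - viewer_xy[0]
    dy = point[1] - viewer_xy[1]
    # Rotate the world-frame offset into the viewer's own axes.
    forward = dx * math.cos(heading_rad) + dy * math.sin(heading_rad)
    left = -dx * math.sin(heading_rad) + dy * math.cos(heading_rad)
    return forward, left


lamp = (1.0, 1.0)

# Viewer at the origin facing +x: the lamp is ahead and to the left.
before = to_viewer_frame(lamp, (0.0, 0.0), 0.0)

# Same position, but the viewer has turned 90 degrees counterclockwise (facing +y):
# the lamp is still ahead, but now on the right (negative "left" value).
after = to_viewer_frame(lamp, (0.0, 0.0), math.pi / 2)

print(before, after)
```

Even in this reduced setting, the answer depends on a transformation of the scene rather than on anything visible in the original view, which is the kind of computation the tasks above are designed to probe.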
The results show that IPT supervision consistently improves spatial reasoning. On the Multiview Counting task, accuracy improved by 3.4%. Furthermore, the researchers found that combining IPT with label-only data yielded the best performance, outperforming textual chain-of-thought methods. Interestingly, these gains often persisted even when the model did not explicitly generate an image at inference time, suggesting that the training process helps the model build stronger internal spatial representations.
Important Considerations
The study highlights that the quality of imagination and the structure of the task are critical to success. While IPT provides a principled way to supervise spatial reasoning, the researchers noted that the benefits can vary depending on the specific task. The findings suggest that for spatial problems, visual intermediate representations are a more natural and effective "scratchpad" for models than language, as they align better with the underlying geometry of the physical world.