Key Takeaways

  • Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable.
  • Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation.
  • Using the unified VLM BAGEL as the backbone, supervision with Imaginative Perception Tokens (IPT) consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time.
  • On Multiview Counting (MVC), IPT improves accuracy by 3.4%, and on Path Tracing (PT) it achieves competitive performance with strong closed-source models.

Paper Abstract

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Vision-language models (VLMs) are highly capable, yet they often struggle with spatial reasoning tasks where the necessary information is not directly visible in the input. To solve problems like predicting what a room looks like from a different angle or tracing a path through an occluded space, humans rely on imagination. This paper introduces "Imaginative Perception Tokens" (IPT), a method that allows models to generate intermediate visual representations of unobserved spatial configurations. By training models to "imagine" these missing views, the researchers enable them to perform complex spatial reasoning more effectively than by using text-based explanations alone.

The Challenge of Spatial Reasoning

Many spatial tasks require more than just identifying objects in a photo; they require understanding 3D geometry and how it changes. For example, if a model is asked what an object will look like after moving to a new position, it must mentally simulate that new perspective. Existing models often fail here because they rely on what is already visible. While some methods use text-based "chain-of-thought" to break down these problems, the researchers found that forcing spatial computation through language often leads to a "modality mismatch," where the model struggles to translate geometric concepts into words, sometimes even degrading performance.

How Imaginative Perception Tokens Work

The researchers developed a two-stage generative process using a unified model called BAGEL. Instead of jumping straight to an answer, the model first generates an "imaginative perception," a visual representation of the scene from an unseen viewpoint or an integrated map, and then reasons over it. The summary breaks this into three steps:
  1. Generation: The model predicts a visual latent that represents the missing spatial structure, such as a top-down view or a rotated perspective.
  2. Integration: The generated imagination is re-encoded and fed back into the model's context.
  3. Reasoning: The model then uses this "imagined" visual evidence to provide a final, more accurate answer.
This approach ensures the model’s imagination remains consistent with the original input while providing a concrete visual foundation for its reasoning.
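
The control flow of these three steps can be sketched in a few lines of Python. This is only an illustration of the generate, re-encode, then answer loop described above: the `UnifiedVLM` interface, its method names, and the `ToyModel` stand-in are hypothetical and are not BAGEL's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


@dataclass
class Context:
    """Interleaved model context: observed inputs plus anything fed back in."""
    segments: List[Tuple[str, object]] = field(default_factory=list)

    def add(self, kind: str, payload: object) -> None:
        self.segments.append((kind, payload))


class UnifiedVLM(Protocol):
    """Hypothetical interface for a unified model (assumption, not BAGEL's API)."""

    def generate_visual_latent(self, ctx: Context) -> list: ...
    def encode_image(self, latent: list) -> list: ...
    def generate_text(self, ctx: Context) -> str: ...


def answer_with_imagination(model: UnifiedVLM, views: list, question: str) -> str:
    ctx = Context()
    for view in views:
        ctx.add("observed_image", view)
    ctx.add("question", question)

    # 1. Generation: predict a visual latent for the unobserved configuration.
    latent = model.generate_visual_latent(ctx)
    # 2. Integration: re-encode the imagination and append it to the context.
    ctx.add("imagined_image", model.encode_image(latent))
    # 3. Reasoning: answer with the imagined evidence now in context.
    return model.generate_text(ctx)


class ToyModel:
    """Stand-in used only to exercise the control flow; it learns nothing."""

    def generate_visual_latent(self, ctx: Context) -> list:
        return [0.0, 1.0]

    def encode_image(self, latent: list) -> list:
        return [2 * x for x in latent]

    def generate_text(self, ctx: Context) -> str:
        kinds = [kind for kind, _ in ctx.segments]
        return "imagined" if "imagined_image" in kinds else "direct"


result = answer_with_imagination(ToyModel(), ["view_0"], "What is behind the sofa?")
```

The point of the sketch is the ordering: the answer is produced only after the imagined view has been appended to the same context that holds the observed input.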

Key Tasks and Results

To test this capability, the researchers formulated three tasks and constructed datasets of approximately 20,000 examples with ground-truth imaginations and answers:

  • Perspective Taking (PET): Predicting how a scene appears after moving to a new location.
  • Path Tracing (PT): Inferring what an agent would see while moving along a path through an occluded space.
  • Multiview Counting (MVC): Integrating multiple partial views to count objects in a room.

The results show that IPT supervision consistently improves spatial reasoning. On the Multiview Counting task, accuracy improved by 3.4%, and on Path Tracing, IPT achieved competitive performance with strong closed-source models. The researchers also found that combining IPT with label-only supervision yielded additional gains, whereas textual chain-of-thought training could substantially degrade performance. Interestingly, the gains often persisted even when the model did not explicitly generate an image at inference time, suggesting that the training process helps the model build stronger internal spatial representations.
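
To make the compared supervision regimes concrete, the sketch below shows one way the training targets could differ between label-only, textual chain-of-thought, and IPT supervision, and how IPT and label-only examples might be mixed. The `Example` fields, mode names, and the `ipt_fraction` mixing parameter are all assumptions for illustration; the source does not describe how the paper formats or mixes its data.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Target = Tuple[str, str]  # (modality, content)


@dataclass
class Example:
    question: str
    answer: str
    imagination: Optional[str] = None  # placeholder for a ground-truth imagined view
    rationale: Optional[str] = None    # placeholder for a textual chain of thought


def build_targets(ex: Example, mode: str) -> List[Target]:
    """Return the sequence the model is trained to produce (hypothetical format)."""
    if mode == "label_only":
        return [("text", ex.answer)]
    if mode == "text_cot":
        if ex.rationale is None:
            raise ValueError("text_cot needs a rationale")
        return [("text", ex.rationale), ("text", ex.answer)]
    if mode == "ipt":
        if ex.imagination is None:
            raise ValueError("ipt needs a ground-truth imagination")
        # The imagined view is supervised before the final answer.
        return [("image", ex.imagination), ("text", ex.answer)]
    raise ValueError(f"unknown mode: {mode}")


def mix_ipt_and_label_only(
    examples: List[Example], ipt_fraction: float, seed: int = 0
) -> List[List[Target]]:
    """Assign each example to IPT or label-only supervision (assumed scheme)."""
    rng = random.Random(seed)
    mixed = []
    for ex in examples:
        use_ipt = ex.imagination is not None and rng.random() < ipt_fraction
        mixed.append(build_targets(ex, "ipt" if use_ipt else "label_only"))
    return mixed


ex = Example(question="How many chairs are in the room?", answer="3",
             imagination="top_down_view.png")
ipt_targets = build_targets(ex, "ipt")
label_targets = build_targets(ex, "label_only")
```

Under this framing, the only difference between the regimes is what sits between the question and the answer: nothing, text, or an imagined view.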

Important Considerations

The study highlights that the quality of imagination and the structure of the task are critical to success. While IPT provides a principled way to supervise spatial reasoning, the researchers noted that the benefits can vary depending on the specific task. The findings suggest that for spatial problems, visual intermediate representations are a more natural and effective "scratchpad" for models than language, as they align better with the underlying geometry of the physical world.
