
Thinking in Text and Images: Interleaved Vision–Language Reasoning Traces for Long-Horizon Robot Manipulation

Key Takeaways

  • Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded.
  • IVLR generates an explicit full-horizon trace that alternates textual subgoals with visual keyframes, then conditions a closed-loop action decoder on the cached trace, the original instruction, and the current observation.
  • Because standard robot datasets lack such traces, the authors construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model.
  • Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7%; text-only and vision-only traces reach 62.0% and 68.4%, while the full interleaved trace reaches 92.4%.
Paper Abstract

Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision–Language Reasoning (IVLR), a policy framework built around the IVLR-Trace, an explicit intermediate representation that alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation. Because standard robot datasets lack such traces, we construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model. Across simulated benchmarks for long-horizon manipulation and visual distribution shift, IVLR reaches 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX. Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7%; text-only and vision-only traces reach 62.0% and 68.4%, while the full interleaved trace reaches 92.4%. Stress tests with execution perturbations and masked trace content show moderate degradation, suggesting that the trace can tolerate local corruption and moderate execution drift, but remains limited under stale or incorrect global plans.

Thinking in Text and Images: Interleaved Vision–Language Reasoning Traces for Long-Horizon Robot Manipulation
Long-horizon robotic tasks—such as picking up a mug and placing it in a specific spot before interacting with another object—require a robot to maintain a logical sequence of steps while understanding the physical geometry of its environment. Current robot policies often struggle with this because they either hide their planning process in "black box" latent states or rely on only one type of information, such as text-only instructions or visual predictions. This paper introduces Interleaved Vision–Language Reasoning (IVLR), a framework that creates an explicit "storyboard" of the entire task before the robot begins moving. By combining textual subgoals with visual keyframes, the robot gains both a clear causal plan and a spatial anchor for every stage of the operation.

How the Approach Works

The IVLR framework functions by generating a full-horizon "trace" at the start of an episode. This trace is a sequence of pairs, where each pair consists of a text description (the causal role of the step) and a visual keyframe (the expected spatial state of the scene). Because standard robot datasets do not include these types of traces, the researchers developed a pipeline to automatically create "pseudo-traces" from existing demonstration data. They use a visual decomposition tool to break trajectories into stages and a vision-language model to caption each stage. Once the model generates this trace at the start of a task, it caches the information and uses it to guide a closed-loop action decoder, which continuously adjusts the robot's movements based on the current scene and the cached plan.
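
To make the generate-once, act-closed-loop pattern concrete, the sketch below shows one way the data flow could be organized. It is a minimal Python sketch, not the paper's implementation; names such as TraceStep, InterleavedTrace, generate_trace, planner.plan, decoder.act, and the assumed env.step signature are illustrative placeholders.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class TraceStep:
    """One stage of the plan: a textual subgoal paired with its expected keyframe."""
    subgoal_text: str        # causal role of the step, e.g. "grasp the mug handle"
    keyframe: np.ndarray     # predicted image of the scene at the end of the step


@dataclass
class InterleavedTrace:
    """Full-horizon plan: alternating text and image anchors for every stage."""
    steps: List[TraceStep]


def generate_trace(planner, initial_obs: np.ndarray, instruction: str) -> InterleavedTrace:
    """Produce the whole semantic-geometric trace in one pass from the first observation."""
    # `planner.plan` stands in for the multimodal transformer's trace generation.
    raw_steps = planner.plan(initial_obs, instruction)
    return InterleavedTrace(steps=[TraceStep(s["text"], s["image"]) for s in raw_steps])


def run_episode(planner, decoder, env, instruction: str, max_steps: int = 500) -> bool:
    """Generate the trace once, cache it, then run the closed-loop action decoder."""
    obs = env.reset()
    trace = generate_trace(planner, obs, instruction)   # planned once, then cached
    for _ in range(max_steps):
        # The decoder sees the fixed trace, the original instruction, and the fresh observation.
        action = decoder.act(obs, instruction, trace)
        obs, done = env.step(action)                    # assumed (observation, done) return
        if done:
            return True
    return False
```

The key design choice this illustrates is that planning happens exactly once per episode, while action decoding remains closed-loop against the current observation.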

Why Interleaving Matters

A key finding of the research is that both text and visual components are necessary for success. In experiments on the LIBERO-Long benchmark, the researchers compared the full interleaved trace against versions that used only text or only vision. Without any trace, the success rate dropped significantly to 37.7%. Text-only traces reached 62.0%, and vision-only traces reached 68.4%. However, when the two were combined into the full IVLR-Trace, the success rate jumped to 92.4%. This demonstrates that while text provides the necessary causal logic, it lacks the spatial precision of visual keyframes, and while visual predictions provide geometric cues, they can drift without the semantic structure provided by language.
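
The ablation conditions can be pictured as different ways of blanking out one side of the trace before it reaches the decoder. The sketch below, reusing the TraceStep and InterleavedTrace classes from the previous sketch, is a hypothetical way to construct those variants, not the authors' evaluation code.

```python
from typing import Optional

import numpy as np


def ablate_trace(trace: InterleavedTrace, mode: str) -> Optional[InterleavedTrace]:
    """Build an ablation variant of an interleaved trace.

    mode == "full"        -> keep subgoal text and keyframes (interleaved)
    mode == "text_only"   -> blank out the keyframes
    mode == "vision_only" -> blank out the subgoal text
    mode == "none"        -> drop the trace entirely
    """
    if mode == "none":
        return None
    steps = []
    for step in trace.steps:
        text = step.subgoal_text if mode in ("full", "text_only") else ""
        image = step.keyframe if mode in ("full", "vision_only") else np.zeros_like(step.keyframe)
        steps.append(TraceStep(subgoal_text=text, keyframe=image))
    return InterleavedTrace(steps=steps)
```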

Performance and Robustness

The IVLR framework demonstrated strong performance across simulated benchmarks. On the LIBERO suite, it achieved an average success rate of 95.5%, including 92.4% on the long-horizon LIBERO-Long tasks. In the SimplerEnv-WidowX environment, which tests how well a policy handles visual changes such as different lighting or backgrounds, the model achieved a 59.4% success rate, outperforming previous methods. Stress tests showed only moderate degradation under execution perturbations and partial corruption of the trace, suggesting that the robot does not rely on a "perfect" plan but uses the trace as a flexible guide, although performance remains limited when the global plan is stale or incorrect.
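
One way to picture the masked-trace stress test is as a corruption function applied to the cached trace before rollout. The sketch below is an illustrative reconstruction that again reuses the classes from the first sketch; corrupt_trace, the mask_fraction parameter, and the env.task_succeeded check are assumptions, not the paper's actual protocol.

```python
import random

import numpy as np


def corrupt_trace(trace: InterleavedTrace, mask_fraction: float, seed: int = 0) -> InterleavedTrace:
    """Randomly blank out a fraction of trace steps to probe robustness.

    A masked step loses both its subgoal text and its keyframe, so the policy
    must bridge the gap using neighbouring steps and the live observation.
    """
    rng = random.Random(seed)
    steps = []
    for step in trace.steps:
        if rng.random() < mask_fraction:
            steps.append(TraceStep(subgoal_text="", keyframe=np.zeros_like(step.keyframe)))
        else:
            steps.append(step)
    return InterleavedTrace(steps=steps)


def success_rate(decoder, env, instruction: str, trace: InterleavedTrace, episodes: int = 50) -> float:
    """Average success over rollouts that reuse the same (possibly corrupted) trace."""
    successes = 0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = decoder.act(obs, instruction, trace)
            obs, done = env.step(action)
        successes += int(env.task_succeeded())   # hypothetical success check
    return successes / episodes
```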

Current Limitations

While the results are promising, the researchers note that the current approach has specific limitations. The system is designed for static, fully observed environments where the workspace is visible from the start. Additionally, generating the full trace introduces a brief planning latency at the beginning of the task, as the model takes about 10 seconds to create the storyboard before the robot begins its first movement. The authors emphasize that this work is a step toward more explicit, interpretable reasoning in robotics and that further research is needed to move these capabilities from simulation into real-world physical environments.
