SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Key Takeaways

  • SpatialWorld is a new benchmark designed to evaluate how well multimodal AI models can navigate and interact with 3D environments.
  • Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world.
  • However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding.
  • We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks.
  • Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs.
Paper Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
SpatialWorld is a new benchmark designed to evaluate how well multimodal AI models can navigate and interact with 3D environments. While many existing AI tests rely on static images or pre-recorded videos, SpatialWorld requires agents to actively explore, make decisions, and complete tasks in complex, dynamic settings. By using a unified interface across eight different simulation backends, the benchmark provides a rigorous way to test an agent's ability to reason about space and plan actions in real time.

A Unified Approach to Spatial Reasoning

To ensure a fair and consistent evaluation, SpatialWorld moves away from simulator-specific testing. It uses a standardized protocol in which agents receive only first-person, egocentric visual input, mimicking how a human perceives the world. Agents are not given "privileged" information, such as depth maps or global coordinates. Instead, they must rely on their own visual observations to guide their decisions. The benchmark uses a text-based action interface through which models issue actions such as navigating, rotating, and interacting with objects, which makes the agents' decision-making processes more interpretable.
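
To make the idea of a text-based action interface concrete, here is a minimal Python sketch of how a line of model output might be turned into a structured action. The verb names (`move`, `rotate`, `interact`), the `Action` class, and `parse_action` are hypothetical and chosen only for illustration; the summary does not specify SpatialWorld's actual command vocabulary or syntax.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action vocabulary; the benchmark's real command names may differ.
VALID_VERBS = {"move", "rotate", "interact"}


@dataclass(frozen=True)
class Action:
    """A parsed text action: a verb plus an optional free-form argument."""
    verb: str
    argument: Optional[str] = None


def parse_action(text: str) -> Optional[Action]:
    """Parse one line of model output into an Action.

    Returns None when the line is empty or starts with an unknown verb,
    so the caller can decide how to handle malformed output.
    """
    tokens = text.strip().lower().split(maxsplit=1)
    if not tokens or tokens[0] not in VALID_VERBS:
        return None
    argument = tokens[1] if len(tokens) > 1 else None
    return Action(verb=tokens[0], argument=argument)
```

Because every backend would consume the same parsed `Action`, a design along these lines keeps the agent-facing protocol independent of any one simulator, which is the property the summary attributes to the benchmark.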

Diverse Tasks and Environments

The benchmark includes 760 human-annotated tasks that cover a wide range of scenarios, including household routines, work and study activities, travel, and social collaboration. These tasks are distributed across eight different simulation backends, such as AI2-THOR and CARLA. To isolate specific cognitive abilities, the researchers also included "3D games" that remove photorealistic visual distractions, allowing them to test pure geometric and topological reasoning. Each task is validated by humans to ensure that the goals are clear and the success criteria are objective.
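
According to the abstract, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. The Python sketch below shows one way such a task record could be represented. All field names, the dictionary-based state, and the example task are assumptions made for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical state representation: a flat dictionary of named facts.
State = Dict[str, Any]


@dataclass
class Task:
    """Illustrative task record; field names are assumed, not from the paper."""
    task_id: str
    domain: str                        # e.g. "household", "travel"
    backend: str                       # e.g. "AI2-THOR", "CARLA"
    initial_state: State               # human-validated starting configuration
    reference_trajectory: List[str]    # one known-good sequence of text actions
    verifier: Callable[[State], bool]  # checks the terminal state only


def is_success(task: Task, final_state: State) -> bool:
    """Judge an episode by its terminal state, not by the path taken."""
    return task.verifier(final_state)


# A made-up example task for demonstration purposes.
example = Task(
    task_id="demo-001",
    domain="household",
    backend="AI2-THOR",
    initial_state={"mug_location": "counter"},
    reference_trajectory=["move forward", "interact mug", "rotate left", "interact sink"],
    verifier=lambda state: state.get("mug_location") == "sink",
)
```

Verifying only the terminal state means an agent is not penalized for taking a different route than the reference trajectory, as long as the goal condition holds at the end.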

Performance of Current AI Models

The researchers tested 15 advanced multimodal models, including both open-source and proprietary ones. The results show that spatial reasoning remains a significant challenge for even the most capable AI. The strongest model, GPT-5, achieved an average task success rate (TSR) of 17.4%, while the leading open-source model, Qwen-3.5, reached 14.1%. The study also noted a mismatch between success and efficiency; models that managed to complete tasks often did so through redundant exploration rather than streamlined planning.
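
The gap between success and efficiency can be illustrated with a short Python sketch. Task success rate is computed here as the fraction of episodes that succeed. The efficiency measure, a ratio of reference steps to steps actually taken on successful episodes, is an assumption for illustration only; the summary does not state how the paper measures efficiency or how it averages TSR across backends. The `Episode` class and the sample numbers are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """Outcome of one task attempt (hypothetical record layout)."""
    success: bool
    steps_taken: int      # actions the agent actually issued
    reference_steps: int  # length of the task's reference trajectory


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes whose terminal state passed the verifier."""
    if not episodes:
        return 0.0
    return sum(e.success for e in episodes) / len(episodes)


def mean_step_efficiency(episodes: List[Episode]) -> float:
    """Assumed efficiency measure: reference steps / steps taken, capped at 1,
    averaged over successful episodes only."""
    ratios = [
        min(1.0, e.reference_steps / e.steps_taken)
        for e in episodes
        if e.success and e.steps_taken > 0
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Made-up episodes: two successes, one of which used twice the reference steps.
episodes = [
    Episode(success=True, steps_taken=20, reference_steps=10),
    Episode(success=False, steps_taken=50, reference_steps=10),
    Episode(success=True, steps_taken=10, reference_steps=10),
    Episode(success=False, steps_taken=30, reference_steps=5),
]
```

With these sample episodes, an agent that succeeds only after redundant exploration counts fully toward TSR but lowers the efficiency score, which is the kind of mismatch the study describes.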

Key Takeaways for Future Research

The findings highlight that current AI agents struggle with long-horizon planning and active exploration in 3D spaces. Performance varied substantially depending on the domain, with different models excelling in specific areas like digital games versus physical household tasks. By providing a standardized, simulator-agnostic testbed, SpatialWorld aims to help researchers identify and address the specific bottlenecks in spatial intelligence, moving the field closer to creating agents that can reliably operate in the physical world.
