SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
SpatialWorld is a new benchmark designed to evaluate how well multimodal AI models can navigate and interact with 3D environments. While many existing AI tests rely on static images or pre-recorded videos, SpatialWorld requires agents to actively explore, make decisions, and complete tasks in complex, dynamic settings. By exposing a unified interface across eight different simulation environments, the benchmark provides a rigorous way to test an agent's ability to reason about space and plan actions in real time.
A Unified Approach to Spatial Reasoning
To ensure a fair and consistent evaluation, SpatialWorld moves away from simulator-specific testing. It uses a standardized protocol in which agents receive only first-person, egocentric visual input, mimicking how a human perceives the world. Agents are not given "privileged" information, such as depth maps or global coordinates. Instead, they must rely on their own visual observations to guide their decisions. Agents act through a text-based interface, issuing commands to navigate, rotate, and interact with objects, which makes their decision-making processes more interpretable.
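As a rough illustration of what a text-based action interface involves, the sketch below parses a free-text command into a structured action before it would be handed to a simulator. The verb set (`move`, `rotate`, `interact`), the `Action` type, and `parse_action` are hypothetical names chosen for this example; the benchmark's actual command vocabulary and API are not described in this summary and may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action vocabulary, loosely based on the three kinds of
# actions the summary mentions (navigating, rotating, interacting).
VALID_VERBS = {"move", "rotate", "interact"}


@dataclass(frozen=True)
class Action:
    verb: str                       # one of VALID_VERBS
    argument: Optional[str] = None  # e.g. a direction or an object name


def parse_action(text: str) -> Action:
    """Turn a model's text output into a structured Action.

    Raises ValueError for empty input or an unknown verb, so malformed
    model output is rejected instead of silently ignored.
    """
    parts = text.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty action")
    verb = parts[0].lower()
    if verb not in VALID_VERBS:
        raise ValueError(f"unknown action verb: {verb!r}")
    argument = parts[1].strip() if len(parts) > 1 else None
    return Action(verb, argument)
```

Because every backend would receive the same structured `Action`, a thin adapter per simulator is enough to translate it into simulator-specific calls, which is one plausible way to keep the agent-facing protocol identical across environments.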
Diverse Tasks and Environments
The benchmark includes 760 human-annotated tasks that cover a wide range of scenarios, including household routines, work and study activities, travel, and social collaboration. These tasks are distributed across eight different simulation backends, such as AI2-THOR and CARLA. To isolate specific cognitive abilities, the researchers also included "3D games" that remove photorealistic visual distractions, allowing them to test pure geometric and topological reasoning. Each task is validated by humans to ensure that the goals are clear and the success criteria are objective.
Performance of Current AI Models
The researchers tested 15 advanced multimodal models, including both open-source and proprietary versions. The results show that spatial reasoning remains a significant challenge for even the most capable models. The strongest model, GPT-5, achieved an average task success rate (TSR) of 17.4%, while the leading open-source model, Qwen-3.5, reached 14.1%. The study also noted a mismatch between success and efficiency: models that managed to complete tasks often did so through redundant exploration rather than streamlined planning.
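To make the two quantities in this paragraph concrete, here is a minimal sketch that computes a task success rate and a simple efficiency proxy from per-episode logs. The `Episode` record and both function names are assumptions made for illustration, and the summary does not say how the benchmark averages TSR across environments or how it measures efficiency, so the mean step count on successful episodes is only a stand-in.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Episode:
    success: bool  # did the agent meet the task's success criteria?
    steps: int     # number of actions the agent issued


def task_success_rate(episodes: Sequence[Episode]) -> float:
    """Percentage of episodes that ended in success."""
    if not episodes:
        raise ValueError("no episodes to score")
    return 100.0 * sum(e.success for e in episodes) / len(episodes)


def mean_steps_on_success(episodes: Sequence[Episode]) -> Optional[float]:
    """Average action count over successful episodes (None if none succeeded).

    A high value alongside a non-zero success rate would point to
    success reached through redundant exploration.
    """
    successful = [e.steps for e in episodes if e.success]
    if not successful:
        return None
    return sum(successful) / len(successful)
```

Reporting the two numbers side by side separates "did the agent finish" from "how directly did it get there", which is the distinction the study draws between success and efficiency.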
Key Takeaways for Future Research
The findings highlight that current AI agents struggle with long-horizon planning and active exploration in 3D spaces. Performance varied significantly by domain, with different models excelling in different areas, such as digital games versus physical household tasks. By providing a standardized, simulator-agnostic testbed, SpatialWorld aims to help researchers identify and address the specific bottlenecks in spatial intelligence, moving the field closer to agents that can reliably operate in the physical world.