SpatialScore: The First Comprehensive Benchmark for Evaluating Multimodal AI's Spatial Understanding
Multimodal large language models (MLLMs) have made impressive strides in answering questions about images and videos, but one critical capability remains underexplored: spatial reasoning. A new paper titled SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding introduces the most comprehensive benchmark yet for assessing how well AI systems understand 3D space, camera dynamics, and geometric relationships in visual data.
Why Spatial Understanding Matters
While today's MLLMs excel at identifying objects or describing scenes, their ability to reason about depth, distances, camera movements, and spatial transformations lags far behind human capabilities. This gap is particularly problematic for real-world applications like robotics, autonomous navigation, and augmented reality, where precise spatial awareness is essential.
Introducing SpatialScore
The researchers from Shanghai Jiao Tong University and Shanghai AI Laboratory present SpatialScore, a massive benchmark combining:
- VGBench: A new dataset of 6,000 samples testing core visual geometry skills such as camera pose estimation, depth perception, and homography estimation.
- 11 existing datasets: Carefully curated spatial reasoning tasks from sources like MMVP, RealWorldQA, and SpatialSense.
The result is 28,093 diverse samples (one possible record layout is sketched after this list) covering:
- 8 categories of spatial tasks (counting, object localization, 3D relationships, etc.)
- Multiple question formats (judgment, multiple choice, open-ended)
- Various input modalities (single images, multi-frame sequences, videos)
- A specially curated SpatialScore-Hard subset (1,400 particularly challenging examples)
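To make that composition concrete, here is a minimal sketch of how one benchmark record might be represented and scored. The field names and the exact-match scoring are illustrative assumptions, not the released SpatialScore schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpatialSample:
    """One hypothetical SpatialScore record; field names are assumptions."""
    category: str               # e.g. "counting", "camera_pose", "3d_relationship"
    question_type: str          # "judgment" | "multiple_choice" | "open_ended"
    modality: str               # "image" | "multi_frame" | "video"
    question: str
    answer: str
    choices: Optional[List[str]] = None

def is_correct(sample: SpatialSample, prediction: str) -> bool:
    # Naive exact-match scoring; a real harness would normalize open-ended answers.
    return prediction.strip().lower() == sample.answer.strip().lower()

# Usage with a made-up judgment question.
sample = SpatialSample(
    category="3d_relationship",
    question_type="judgment",
    modality="image",
    question="Is the mug closer to the camera than the laptop?",
    answer="yes",
    choices=["yes", "no"],
)
print(is_correct(sample, "Yes"))  # True
```

A benchmark-level or SpatialScore-Hard score would then simply be the mean of such per-sample checks over the relevant subset.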
Key Findings: AI Still Struggles with Space
The paper evaluates 25 MLLMs across different scales (1B to 78B parameters), revealing:
- Size doesn't guarantee spatial smarts: Larger models generally perform better, but even the largest models evaluated (such as InternVL3-78B) reach only 60.28% overall accuracy on SpatialScore.
- Geometry is hard: Tasks that require precise visual geometry (camera parameters, homography matrices) see accuracy drop as low as 20-30% (a short sketch of what a homography involves follows this list).
- Specialized training helps, but isn't enough: Models fine-tuned for spatial tasks (like SpaceLLaVA-13B) often fail to generalize, sometimes performing worse than general-purpose models.
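For readers less familiar with the geometry vocabulary: a homography is the 3x3 matrix that maps points on a planar surface between two camera views, and questions in this category probe whether a model can reason about such transforms. Below is a small OpenCV sketch (not from the paper; the point coordinates are invented) of recovering one from four point correspondences.

```python
import numpy as np
import cv2  # pip install opencv-python

# Four corresponding points on the same planar surface seen from two viewpoints
# (coordinates are invented for illustration).
pts_view1 = np.float32([[10, 10], [200, 12], [205, 150], [12, 148]])
pts_view2 = np.float32([[32, 25], [220, 18], [230, 170], [28, 160]])

# H maps homogeneous points from view 1 to view 2: x2 ~ H @ x1
H, _ = cv2.findHomography(pts_view1, pts_view2)
print(np.round(H, 3))  # 3x3 matrix
```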
SpatialAgent: A Toolbox for Better Spatial Reasoning
To boost performance, the team developed SpatialAgent, a multi-agent system that equips MLLMs with specialized tools for:
- 2D perception (object detection, segmentation)
- Motion analysis (optical flow estimation)
- 3D reasoning (depth estimation, camera pose prediction)
SpatialAgent supports two reasoning paradigms (a minimal code sketch follows the list):
- Plan-Execute: Breaks problems into structured sub-tasks
- ReAct: Iteratively refines understanding through tool interactions
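As a rough illustration of the ReAct paradigm, here is a minimal tool loop in which the model either requests a tool or commits to an answer, with each tool's output fed back as an observation. The tool stubs, the call_mllm interface, and the fabricated depth values are assumptions for illustration, not the released SpatialAgent code.

```python
# Hypothetical ReAct-style tool loop; tool names, stubs, and the call_mllm
# interface are illustrative assumptions, not the released SpatialAgent API.

def detect_objects(image):           # 2D perception stub
    return [{"label": "mug", "box": [40, 60, 120, 180]}]

def estimate_depth(image):           # 3D reasoning stub (fabricated metric depths)
    return {"mug": 0.8, "laptop": 1.3}

TOOLS = {"detect_objects": detect_objects, "estimate_depth": estimate_depth}

def call_mllm(question, history):
    """Stand-in for the backbone MLLM: in a real agent the model is prompted
    to emit either a tool request or a final answer."""
    if not history:
        return {"tool": "estimate_depth"}
    depths = history[-1]
    return {"answer": "yes" if depths["mug"] < depths["laptop"] else "no"}

def react_answer(image, question, max_steps=4):
    history = []
    for _ in range(max_steps):
        step = call_mllm(question, history)
        if "answer" in step:                      # enough evidence gathered
            return step["answer"]
        observation = TOOLS[step["tool"]](image)  # run the requested tool
        history.append(observation)               # feed the observation back
    return "unknown"

print(react_answer(image=None, question="Is the mug closer than the laptop?"))  # yes
```

A Plan-Execute variant would instead have the model emit the whole tool sequence up front and then run it, trading per-step flexibility for fewer model calls.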
Remarkably, SpatialAgent enables smaller models (like Qwen2.5-VL-7B) to outperform much larger standalone models on SpatialScore-Hard, reaching 46.08% accuracy versus 30.57% for GPT-4o.
Why This Matters for Business
As companies increasingly deploy multimodal AI in physical-world applications, understanding these spatial limitations is crucial. The SpatialScore benchmark provides:
- Better evaluation: A rigorous way to assess AI systems for robotics, AR/VR, and autonomous systems
- Clear improvement targets: Identifies specific weaknesses in current models
- A path forward: Demonstrates how tool-augmented systems can enhance spatial reasoning
The researchers have made all code and data publicly available, inviting the community to build on their work. As one of the authors notes: "SpatialScore offers valuable insights and serves as a rigorous benchmark for the next evolution of MLLMs."
For AI teams working on applications requiring spatial understanding, this benchmark represents both a challenge to overcome and a toolkit for measuring progress.