
Key Takeaways

  • This research investigates whether modern Vision-Language Models (VLMs) can serve as reliable tools for interpreting real astronomical observations, introducing AstroVLBench, a benchmark of over 4,100 expert-verified instances across five tasks.
  • Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested.
  • Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge.
  • Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement.
  • These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.
Paper Abstract

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.

A systematic evaluation of vision-language models for observational astronomical reasoning tasks
This research investigates whether modern Vision-Language Models (VLMs)—AI systems capable of processing both images and text—can serve as reliable tools for interpreting complex astronomical data. While these models have shown promise in general scientific tasks, their performance on real-world astronomical observations across different data types, such as light curves and radio images, has remained largely unverified. The authors introduce "AstroVLBench," a new benchmark consisting of over 4,100 expert-verified instances, to test how well six leading AI models can perform across five fundamental astronomical tasks.

Evaluating AI Across the Cosmos

To understand how these models handle the diversity of space data, the researchers tested them on five distinct categories: optical galaxy imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. The study found that while some models, particularly Gemini 3 Pro, showed consistent capability, no single model performed uniformly well across all tasks. Overall, these general-purpose models still struggle to match the precision of domain-specialized astronomical software, especially when tasks require complex numerical reasoning rather than just visual pattern recognition.

The Importance of Physical Grounding

A key finding of the study is that how a model is prompted significantly changes its accuracy. The researchers compared three types of instructions: unguided, phenomenological (describing what to look for), and physical (explaining why specific features matter). They discovered that while phenomenological prompts help the AI focus on the right visual details, physical prompts lead to more balanced and reliable classifications. By grounding the AI in the physical principles behind the data, the models become less biased and more accurate in their decision-making.
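The three prompting regimes can be illustrated with a minimal sketch. The prompt wording below is hypothetical, written only to show the distinction the study draws; the actual AstroVLBench prompts are not reproduced here.

```python
# Hypothetical illustration of the three prompting styles compared in the
# study. The task and wording are invented examples, not the paper's prompts.
TASK = "Classify this galaxy image as 'spiral' or 'elliptical'."

PROMPTS = {
    # Unguided: the bare task, no hints about what to examine.
    "unguided": TASK,
    # Phenomenological: describes WHAT visual features to look for.
    "phenomenological": (
        TASK
        + " Look for winding arm structures and a flattened disk,"
        " versus a smooth, featureless light profile."
    ),
    # Physical: explains WHY those features matter physically.
    "physical": (
        TASK
        + " Spiral arms trace ongoing star formation in a rotating,"
        " gas-rich disk, whereas ellipticals are pressure-supported"
        " systems of old stars and therefore appear smooth."
    ),
}

for style, prompt in PROMPTS.items():
    print(f"[{style}] {prompt}")
```

Per the study's findings, the third style both raises accuracy and reduces class-specific bias relative to the first two.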

Data Representation and Reasoning Quality

The format of the data also plays a major role in performance. The study revealed that when models were given numerical tables instead of rendered plots for one-dimensional measurements, their accuracy improved by up to 13 percentage points. This suggests that visual plots can sometimes obscure the precise data points necessary for scientific analysis. Furthermore, the researchers identified a "right-answer-wrong-reason" phenomenon: models sometimes arrived at the correct classification based on plausible visual cues while providing physically incorrect justifications. This highlights a critical limitation for scientific deployment, as accuracy alone does not guarantee that an AI is using sound scientific logic.
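The table-versus-plot ablation amounts to serializing the one-dimensional measurements as text rather than rendering them as an image. The sketch below shows one way to do this for a light curve; the column names and data values are invented for illustration and are not drawn from the benchmark.

```python
# Sketch (not the paper's code): feeding a 1-D light curve to a VLM as a
# plain-text numerical table instead of a rendered plot image.
# (time in days, apparent magnitude) -- made-up example values
light_curve = [
    (0.0, 18.2), (1.5, 17.9), (3.0, 17.4), (4.5, 17.8), (6.0, 18.3),
]

# Fixed-width columns keep the table easy for the model to parse.
header = f"{'time_days':>10} {'magnitude':>10}"
rows = [f"{t:>10.1f} {m:>10.1f}" for t, m in light_curve]
table = "\n".join([header] + rows)

# The table is embedded directly in the text prompt; no image is sent.
prompt = "Classify the transient shown in this light curve.\n\n" + table
print(prompt)
```

Presenting the raw numbers this way avoids the lossy step of reading values back off a plot, which is consistent with the up-to-13-point accuracy gain the study reports.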

Implications for Future Research

The findings establish the first systematic baseline for using VLMs in observational astronomy. The study identifies specific bottlenecks—such as difficulties with point spread functions, spectral line interpretation, and numerical scaling—that currently prevent these models from being fully trustworthy for scientific discovery. By highlighting where these models fail, the research provides a roadmap for improving AI architectures to better handle the nuances of multi-modal astronomical data.
