Multimodal large language models (MLLMs) are increasingly used for high-stakes tasks, yet they often struggle to know when they are likely to be wrong. This paper introduces SIEVES (Selective Prediction through Visual Evidence Scoring), a framework designed to improve the reliability of these models by enabling them to "abstain" from answering when they are uncertain. By requiring models to provide visual evidence for their answers, SIEVES allows a specialized selector to evaluate the quality of that evidence, significantly improving the system's ability to filter out incorrect responses in real-world, out-of-distribution scenarios.
How SIEVES Works
The core of the SIEVES approach is to move beyond simple confidence scores. When a model answers a visual question, it is prompted to use a "zoom-in" tool to focus on specific regions of an image. This creates a "multimodal chain-of-thought" that links the final answer to specific visual evidence.
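To make this concrete, here is a minimal sketch of how such a grounded trace could be represented as a data structure. The ZoomStep and GroundedTrace types and their field names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ZoomStep:
    """One zoom-in action: the image region inspected plus the model's note about it."""
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) crop the model chose to examine
    rationale: str                   # what the model reports observing in that crop

@dataclass
class GroundedTrace:
    """Observable record of a single visual question-answering episode."""
    question: str
    image_path: str
    steps: List[ZoomStep] = field(default_factory=list)  # the multimodal chain-of-thought
    answer: str = ""
```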
The SIEVES selector then evaluates this process along three distinct axes:
Correctness: Is the final answer accurate?
Localization: Did the model zoom in on the correct part of the image?
Coherence: Does the visual evidence actually support the final answer?
By combining these three signals, the system generates a final confidence score. If this score falls below a user-defined threshold, the model abstains from answering, preventing potentially costly errors.
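A minimal sketch of this selection logic, assuming the selector exposes the three per-axis scores as values in [0, 1]; the equal-weight average and the function name are illustrative stand-ins for whatever aggregation the trained selector actually learns.

```python
def selective_answer(correctness, localization, coherence,
                     threshold=0.7, weights=(1/3, 1/3, 1/3)):
    """Combine the three evidence scores and decide whether to answer or abstain.

    All scores are assumed to lie in [0, 1]; the weighted average here is a
    placeholder for the selector's learned aggregation.
    """
    w_c, w_l, w_h = weights
    confidence = w_c * correctness + w_l * localization + w_h * coherence
    return ("answer", confidence) if confidence >= threshold else ("abstain", confidence)

# Example: a strong answer score with poor localization drags confidence below the threshold.
decision, conf = selective_answer(correctness=0.9, localization=0.3, coherence=0.5)
print(decision, round(conf, 2))  # abstain 0.57
```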
Why This Approach Is Different
Traditional methods for selective prediction often rely on internal model data, such as log-probabilities or hidden activations. This makes them difficult to use with proprietary models like Gemini or o3, where such internal information is hidden.
SIEVES is model-agnostic because it only looks at the "observable" outputs: the question, the image, the reasoning steps, and the final answer. Because the selector is trained to judge the quality of the visual evidence rather than just the final text, it can be applied to any reasoner—even those it was not specifically trained on—without requiring access to the reasoner’s internal weights.
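Because the selector consumes only these observable fields, it can be wrapped around any reasoner through a plain functional interface. The sketch below is a hypothetical illustration of that interface; the type alias and filter_traces helper are assumptions, and nothing model-internal (logits, activations, weights) appears anywhere in the signature.

```python
from typing import Callable, Dict, List

# A selector is just a function of observable inputs:
# (question, image_path, reasoning steps, answer) -> confidence in [0, 1].
Selector = Callable[[str, str, List[dict], str], float]

def filter_traces(traces: List[Dict], selector: Selector, threshold: float = 0.7) -> List[Dict]:
    """Keep only the traces the selector is confident in, regardless of which model produced them."""
    kept = []
    for t in traces:
        conf = selector(t["question"], t["image_path"], t["steps"], t["answer"])
        if conf >= threshold:
            kept.append({**t, "confidence": conf})
    return kept
```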
Performance and Generalization
The researchers tested SIEVES across five challenging benchmarks spanning high-resolution natural images, diagrams, tables, and real-world photos taken by blind users. The results show that SIEVES improves coverage (the fraction of questions the system answers while keeping errors at an acceptable level) by up to three times compared with selection methods that do not use visual grounding.
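For reference, here is a generic sketch of how coverage and selective accuracy are typically computed in this kind of evaluation; the threshold sweep below illustrates the metric itself and is not the paper's evaluation code.

```python
def coverage_and_accuracy(confidences, correct, threshold):
    """Coverage: fraction of questions answered (confidence >= threshold).
    Selective accuracy: accuracy measured only on the answered questions."""
    answered = [c for c, conf in zip(correct, confidences) if conf >= threshold]
    coverage = len(answered) / len(correct)
    accuracy = sum(answered) / len(answered) if answered else 1.0
    return coverage, accuracy

# Example: raising the threshold trades coverage for selective accuracy.
confs   = [0.95, 0.80, 0.60, 0.40, 0.20]
correct = [1,    1,    0,    1,    0]
for t in (0.5, 0.7, 0.9):
    cov, acc = coverage_and_accuracy(confs, correct, t)
    print(f"threshold={t}: coverage={cov:.2f}, selective accuracy={acc:.2f}")
```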
Notably, the system demonstrated strong generalization. A selector trained only on traces from a smaller, open-source model (Pixel-Reasoner) was able to successfully filter the outputs of much larger, proprietary models like o3 and Gemini-3-Pro, providing a boost in reliability without needing any model-specific training or adaptation.
Key Takeaways
The study highlights that forcing a model to "show its work" through visual evidence makes it much easier to predict when that model is likely to fail. By training a compact, efficient selector to verify the quality of this evidence, the researchers have created a practical way to deploy powerful AI models in high-stakes environments where accuracy is critical and mistakes are costly.