SIEVES: Selective Prediction Generalizes through Vi...

Paper Abstract

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.

Multimodal large language models (MLLMs) are increasingly used for high-stakes tasks, yet they often struggle to know when they are likely to be wrong. This paper introduces SIEVES (Selective Prediction through Visual Evidence Scoring), a framework designed to improve the reliability of these models by enabling them to "abstain" from answering when they are uncertain. By requiring models to provide visual evidence for their answers, SIEVES allows a specialized selector to evaluate the quality of that evidence, significantly improving the system's ability to filter out incorrect responses in real-world, out-of-distribution scenarios.

How SIEVES Works

The core of the SIEVES approach is to move beyond simple confidence scores. When a model answers a visual question, it is prompted to use a "zoom-in" tool to focus on specific regions of an image. This creates a "multimodal chain-of-thought" that links the final answer to specific visual evidence.
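The "multimodal chain-of-thought" can be pictured as a small data structure linking zoom-in actions to the final answer. The field names below are illustrative assumptions for this summary, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ZoomStep:
    """One zoom-in action: a crop of the image the reasoner inspected."""
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
    rationale: str                          # why the model zoomed here

@dataclass
class ReasoningTrace:
    """Observable trace a selector could score: question, evidence, answer."""
    question: str
    zoom_steps: list[ZoomStep]  # localized visual evidence
    answer: str

# Example trace for a visual question
trace = ReasoningTrace(
    question="What color is the street sign?",
    zoom_steps=[ZoomStep(box=(0.62, 0.10, 0.78, 0.25),
                         rationale="The sign is in the upper-right region.")],
    answer="Green",
)
```

Crucially, everything in such a trace is observable output, which is what later lets the selector work with proprietary reasoners.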
The SIEVES selector then evaluates this process along three distinct axes:

  • Correctness: Is the final answer accurate?

  • Localization: Did the model zoom in on the correct part of the image?

  • Coherence: Does the visual evidence actually support the final answer?

By combining these three signals, the system generates a final confidence score. If this score falls below a user-defined threshold, the model abstains from answering, preventing potentially costly errors.
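The threshold-based abstention logic can be sketched as follows. The simple averaging of the three signals is an assumption for illustration; the paper's selector learns its own combination:

```python
def confidence(correctness: float, localization: float, coherence: float) -> float:
    """Combine the three selector signals into one score.

    A plain average is a stand-in here; SIEVES trains a selector to
    produce this score rather than using a fixed formula.
    """
    return (correctness + localization + coherence) / 3.0

def answer_or_abstain(scores: tuple[float, float, float],
                      answer: str, threshold: float = 0.7):
    """Return the answer if confidence clears the threshold, else abstain (None)."""
    return answer if confidence(*scores) >= threshold else None

# A well-grounded answer passes; a weakly localized one triggers abstention.
print(answer_or_abstain((0.9, 0.8, 0.85), "Green"))  # → Green
print(answer_or_abstain((0.9, 0.2, 0.3), "Green"))   # → None
```

Raising the threshold trades coverage for lower risk, which is exactly the knob a deployer tunes to a user-defined error tolerance.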

Why This Approach Is Different

Traditional methods for selective prediction often rely on internal model data, such as log-probabilities or hidden activations. This makes them difficult to use with proprietary models like Gemini or o3, where such internal information is hidden.
SIEVES is model-agnostic because it only looks at the "observable" outputs: the question, the image, the reasoning steps, and the final answer. Because the selector is trained to judge the quality of the visual evidence rather than just the final text, it can be applied to any reasoner—even those it was not specifically trained on—without requiring access to the reasoner’s internal weights.

Performance and Generalization

The researchers tested SIEVES across five challenging benchmarks, including high-resolution natural images, diagrams, tables, and real-world photos from blind users. The results show that SIEVES improves "coverage"—the share of questions the system can safely answer—by up to three times compared to methods that do not use visual grounding.
Notably, the system demonstrated strong generalization. A selector trained only on traces from a smaller, open-source model (Pixel-Reasoner) was able to successfully filter the outputs of much larger, proprietary models like o3 and Gemini-3-Pro, providing a boost in reliability without needing any model-specific training or adaptation.
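The coverage metric reported above can be made concrete with a toy threshold sweep (this is an illustrative re-implementation of the coverage/risk trade-off, not the paper's evaluation code): answer the most confident inputs first, and find the largest share that can be answered while selective error stays within a risk budget.

```python
def coverage_at_risk(confidences, correct, max_risk=0.05):
    """Largest coverage whose selective error stays within max_risk.

    Sweeps thresholds from most to least confident; at each cut, the
    system answers everything above the cut and abstains below it.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    best_cov, errors = 0.0, 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        if errors / k <= max_risk:           # risk among answered inputs
            best_cov = k / len(confidences)  # share of inputs answered
    return best_cov

# With zero risk tolerance, only the correct top-confidence answers count.
print(coverage_at_risk([0.9, 0.8, 0.7, 0.6],
                       [True, True, False, True], max_risk=0.0))  # → 0.5
```

A selector that ranks wrong answers below correct ones (as SIEVES aims to do via evidence quality) pushes this achievable coverage up at any fixed risk level.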

Key Takeaways

The study highlights that forcing a model to "show its work" through visual evidence makes it much easier to predict when that model is likely to fail. By training a compact, efficient selector to verify the quality of this evidence, the researchers have created a practical way to deploy powerful AI models in high-stakes environments where accuracy is critical and mistakes are costly.
