Rethinking Vacuity for OOD Detection in Evidential Deep Learning

Paper Abstract

Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ($K$) by the total strength of belief ($S$) of the model's predictions, where $S$ is derived from summing the Dirichlet parameters. As such, UM is sensitive to the cardinality of $K$. In particular, it is unlikely in practice that there is a linear relationship between $K$ and $S$ as $K$ and $S$ increase due to the nature of EDL (suppressing incorrectly assigned evidence). As a result, when comparing In Distribution (ID) and OOD results, it is important that $K_{\mathrm{ID}}$ and $K_{\mathrm{OOD}}$ are equal; something that is not always ensured in practice. We provide an empirical demonstration of how results for AUROC and AUPR can substantially differ when class cardinality between ID and OOD differs by 1, with AUROC differing by as much as 0.318 and AUPR by 0.613 for standard EDL, and AUROC by 0.360 and AUPR by 0.683 for IB-EDL. More concretely, our findings isolate an evaluation artefact: when K differs between ID and OOD, AUROC/AUPR can be artificially inflated without any change in model predictions. We further discuss the evaluation of EDL over causal language models using Multiple-Choice Question-Answer (MCQA) datasets and argue for clearer definitions of ID and OOD in this context. Our primary contribution is an empirical and theoretical demonstration that vacuity-based OOD detection in EDL-fine-tuned LLMs is highly sensitive to uncontrolled differences in evaluated class cardinality.

Rethinking Vacuity for OOD Detection in Evidential Deep Learning
This paper investigates a critical evaluation flaw in Evidential Deep Learning (EDL) when applied to Large Language Models (LLMs). Specifically, it examines how the metric used to measure uncertainty, known as "vacuity" or Uncertainty Mass (UM), is highly sensitive to the number of classes ($K$) over which the model is evaluated. The research demonstrates that when the number of classes differs between In-Distribution (ID) and Out-of-Distribution (OOD) data, performance metrics like AUROC and AUPR can be artificially inflated, creating a false impression of a model's ability to detect OOD inputs.

The Problem with Vacuity

In EDL, vacuity is calculated as $K/S$, where $K$ is the number of classes and $S$ is the model's total strength of belief, obtained by summing the Dirichlet parameters. Because the formula depends directly on $K$, the metric is not invariant to the number of answer options. If a model is evaluated on a four-class dataset for ID and a five-class dataset for OOD, the uncertainty scores shift because of the change in dimensionality, not because the model is actually "uncertain" about the input. The author proves that vacuity stays unchanged under class expansion only if the new class receives one specific amount of evidence, which rarely happens in practice because EDL suppresses incorrectly assigned evidence.
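
The stability condition can be reconstructed from the definition of vacuity alone. The following is a sketch, not the paper's own proof: it assumes that adding a class leaves the existing Dirichlet parameters $\alpha_1, \dots, \alpha_K$ (with $S = \sum_{k=1}^{K} \alpha_k$) unchanged and contributes one new parameter $\alpha_{K+1}$.

$$
\frac{K+1}{S + \alpha_{K+1}} = \frac{K}{S}
\;\Longleftrightarrow\;
S\,(K+1) = K\,(S + \alpha_{K+1})
\;\Longleftrightarrow\;
\alpha_{K+1} = \frac{S}{K}
$$

In words, the new class would need a Dirichlet parameter equal to the average of the existing ones. Under the common parameterisation $\alpha_k = e_k + 1$ (assumed here), a class that receives no evidence has $\alpha_{K+1} = 1$, which is below $S/K$ whenever the model has assigned any evidence at all, so vacuity rises from $K/S$ to $(K+1)/(S+1)$.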

Evaluation Artefacts in LLMs

When applying EDL to LLMs using Multiple-Choice Question-Answer (MCQA) datasets, the "classes" are often placeholders (like A, B, C, D) that change meaning with every question. Because different datasets have different numbers of answer options, researchers often inadvertently compare models across mismatched class cardinalities. The paper shows that this mismatch acts as an evaluation artefact. By isolating $K$ in experiments, the research demonstrates that evaluating the OOD set with one more class than the ID set, without changing the model's actual predictions, can shift AUROC by as much as 0.318 for standard EDL and 0.360 for IB-EDL, and AUPR by as much as 0.613 and 0.683 respectively.
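
A toy simulation can illustrate the mechanism. This is a minimal sketch and not the paper's experimental setup: the evidence model (uniform random evidence, $\alpha_k = e_k + 1$), the sample sizes, and the helper names `vacuity`, `auroc`, and `sample_alphas` are all assumptions made for illustration. ID and OOD outputs are drawn from the same distribution, so vacuity carries no real OOD signal; the OOD outputs are then rescored with a fifth option that receives zero evidence.

```python
import random


def vacuity(alphas):
    """Vacuity (uncertainty mass) K / S for one Dirichlet parameter vector."""
    return len(alphas) / sum(alphas)


def auroc(id_scores, ood_scores):
    """Fraction of (OOD, ID) pairs where the OOD score is higher; ties count half."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def sample_alphas(rng, k=4, max_evidence=2.0):
    # Hypothetical evidence model: alpha_k = evidence_k + 1,
    # with evidence drawn uniformly from [0, max_evidence].
    return [rng.uniform(0.0, max_evidence) + 1.0 for _ in range(k)]


rng = random.Random(0)
id_alphas = [sample_alphas(rng) for _ in range(300)]
ood_alphas = [sample_alphas(rng) for _ in range(300)]  # same distribution as ID

id_scores = [vacuity(a) for a in id_alphas]

# Matched evaluation: both sets scored with K = 4.
matched = auroc(id_scores, [vacuity(a) for a in ood_alphas])

# Mismatched evaluation: the same OOD outputs, plus a fifth option
# that receives zero evidence (alpha = 1), so K_OOD = 5.
mismatched = auroc(id_scores, [vacuity(a + [1.0]) for a in ood_alphas])

print(f"AUROC with K_ID = K_OOD = 4:    {matched:.3f}")
print(f"AUROC with K_ID = 4, K_OOD = 5: {mismatched:.3f}")
```

Because the extra zero-evidence class raises every OOD vacuity score while the ID scores stay fixed, the second AUROC comes out higher than the first even though the evidence assigned to the original four options is identical in both runs.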

Impact on Current Research

The paper highlights that this sensitivity has led to misleading results in existing literature. By reproducing previous work, the author discovered that some reported successes in OOD detection were largely driven by the fact that the OOD datasets were evaluated with more classes than the ID datasets. When the number of classes is corrected to be equal, the performance of these uncertainty metrics often drops significantly. This suggests that current methods for detecting OOD inputs in MCQA-based LLMs may be less effective than previously claimed.

Moving Forward

The research concludes that clear, consistent definitions of ID and OOD are essential when working with LLMs. Because MCQA tasks involve varying structures, researchers must ensure that the effective class cardinality ($K$) remains constant when comparing performance across different datasets. The author argues that future studies must distinguish between true domain shifts and simple task-format shifts to avoid being misled by inflated metrics. Ultimately, the paper serves as a cautionary note for the field, emphasizing that evaluation procedures must be as rigorous as the models themselves.
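
One way to enforce a constant $K$ in an evaluation script is to make the scoring function refuse inputs with an unexpected class count. The helper below is a hypothetical sketch (the name `vacuity_scores` and its signature are assumptions, not part of the paper or of any library); it only guards against the mismatch and does not decide how a dataset with a different number of options should be adapted.

```python
def vacuity_scores(alphas_batch, expected_k):
    """Return K / S for each Dirichlet parameter vector in the batch.

    Raises ValueError if any vector does not have exactly expected_k
    entries, so ID and OOD sets cannot silently be scored with different K.
    """
    scores = []
    for index, alphas in enumerate(alphas_batch):
        if len(alphas) != expected_k:
            raise ValueError(
                f"sample {index} has {len(alphas)} classes, expected {expected_k}"
            )
        scores.append(expected_k / sum(alphas))
    return scores


# Usage: score both sets with the same expected_k before computing AUROC/AUPR.
id_scores = vacuity_scores([[3.0, 1.0, 1.0, 1.0]], expected_k=4)
ood_scores = vacuity_scores([[1.5, 1.5, 1.0, 1.0]], expected_k=4)
```

Passing the same `expected_k` for both sets turns an uncontrolled cardinality difference into an explicit error instead of an inflated metric.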
