Rethinking Vacuity for OOD Detection in Evidential Deep Learning
This paper investigates a critical evaluation flaw in Evidential Deep Learning (EDL) when applied to Large Language Models (LLMs). Specifically, it examines how the metric used to measure uncertainty, known as "vacuity" or Uncertainty Mass (UM), is highly sensitive to the number of classes ($K$) in a dataset. The research demonstrates that when the number of classes differs between In-Distribution (ID) and Out-of-Distribution (OOD) data, performance metrics such as AUROC and AUPR can be artificially inflated, creating a false impression of a model's ability to detect OOD inputs.
The Problem with Vacuity
In EDL, vacuity is calculated as $K/S$, where $K$ is the number of classes and $S$ is the Dirichlet strength, that is, the model's total evidence across all classes plus one unit per class ($S = \sum_{k=1}^{K} (e_k + 1)$ in the standard formulation). Because $K$ appears both in the numerator and inside $S$, the metric is not invariant to changes in the number of answer options. If a model is evaluated on a four-class dataset for ID and a five-class dataset for OOD, the uncertainty scores shift simply because of the change in dimensionality, not because the model is actually more uncertain about the input. The author proves that vacuity remains stable during class expansion only if the new class receives a specific, mathematically precise amount of evidence, which rarely happens in practice.
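The dependence on $K$ can be illustrated with a minimal Python sketch. It assumes the standard EDL parameterisation, in which each class has non-negative evidence $e_k$, Dirichlet parameter $\alpha_k = e_k + 1$, and $S = \sum_k \alpha_k$. Under that assumption, solving $K/S = (K+1)/(S + e_{new} + 1)$ gives $e_{new} = S/K - 1$, the mean evidence of the existing classes. The function names and evidence values below are hypothetical, and the paper's own statement of the condition may be phrased differently.

```python
def vacuity(evidence):
    """Vacuity K/S for a list of non-negative per-class evidence values.

    Assumes alpha_k = e_k + 1 and S = sum(alpha_k) (standard EDL form).
    Illustrative helper, not the paper's code.
    """
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return k / strength


def invariant_new_evidence(evidence):
    """Evidence a newly added class would need for vacuity to stay unchanged.

    Solving K/S = (K+1)/(S + e_new + 1) gives e_new = S/K - 1,
    which equals the mean evidence of the existing classes.
    """
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return strength / k - 1.0


four_class = [9.0, 1.0, 1.0, 1.0]            # hypothetical evidence, K = 4, S = 16
u4 = vacuity(four_class)                      # 4 / 16
u5_zero = vacuity(four_class + [0.0])         # 5 / 17: a zero-evidence option raises vacuity
e_star = invariant_new_evidence(four_class)   # 16 / 4 - 1
u5_star = vacuity(four_class + [e_star])      # 5 / 20: vacuity unchanged

print(u4, u5_zero, e_star, u5_star)
```

In this toy case the fifth option would need exactly as much evidence as the average existing option to leave vacuity at 0.25. An extra option with zero evidence pushes vacuity up even though the evidence for the original four options is untouched.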
Evaluation Artefacts in LLMs
When applying EDL to LLMs using Multiple-Choice Question-Answer (MCQA) datasets, the "classes" are often placeholders (like A, B, C, D) that change meaning with every question. Because different datasets have different numbers of answer options, researchers often inadvertently compare models across mismatched class cardinalities. The paper shows that this mismatch acts as an evaluation artefact. By isolating $K$ in experiments, the research demonstrates that simply increasing the number of classes in the OOD evaluation set—without changing the model's actual predictions—can cause AUROC scores to jump by as much as 0.360.
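The artefact can be reproduced qualitatively with synthetic data. The sketch below is not the paper's experiment: it assumes the standard parameterisation $\alpha_k = e_k + 1$, draws hypothetical ID and OOD evidence from the same gamma distribution (so the simulated model behaves identically on both), and models the extra OOD answer option as a fifth class with zero evidence. AUROC is computed directly as the probability that an OOD vacuity score exceeds an ID vacuity score.

```python
import numpy as np


def vacuity(evidence):
    """Row-wise vacuity K/S, assuming alpha = evidence + 1."""
    evidence = np.asarray(evidence, dtype=float)
    k = evidence.shape[1]
    return k / (evidence + 1.0).sum(axis=1)


def auroc(id_scores, ood_scores):
    """P(OOD score > ID score) over all pairs, with ties counted as one half."""
    neg = np.asarray(id_scores)[:, None]
    pos = np.asarray(ood_scores)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


rng = np.random.default_rng(0)
n = 1000

# Hypothetical evidence: ID and OOD come from the SAME distribution, K = 4.
id_evidence = rng.gamma(shape=2.0, scale=2.0, size=(n, 4))
ood_evidence = rng.gamma(shape=2.0, scale=2.0, size=(n, 4))

# Same OOD evidence, but scored with a fifth zero-evidence option (K = 5).
ood_evidence_k5 = np.hstack([ood_evidence, np.zeros((n, 1))])

auroc_matched = auroc(vacuity(id_evidence), vacuity(ood_evidence))
auroc_mismatched = auroc(vacuity(id_evidence), vacuity(ood_evidence_k5))

print(f"AUROC, K matched (4 vs 4):    {auroc_matched:.3f}")
print(f"AUROC, K mismatched (4 vs 5): {auroc_mismatched:.3f}")
```

Because the two evidence distributions are identical, the matched AUROC should sit near chance. Any gain in the mismatched case comes only from the change in $K$. The size of the gap depends on the assumed evidence distribution and should not be read as an estimate of the 0.360 figure reported in the paper.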
Impact on Current Research
The paper highlights that this sensitivity has led to misleading results in existing literature. By reproducing previous work, the author discovered that some reported successes in OOD detection were largely driven by the fact that the OOD datasets were evaluated with more classes than the ID datasets. When the number of classes is corrected to be equal, the performance of these uncertainty metrics often drops significantly. This suggests that current methods for detecting OOD inputs in MCQA-based LLMs may be less effective than previously claimed.
Moving Forward
The research concludes that clear, consistent definitions of ID and OOD are essential when working with LLMs. Because MCQA tasks involve varying structures, researchers must ensure that the effective class cardinality ($K$) remains constant when comparing performance across different datasets. The author argues that future studies must distinguish between true domain shifts and simple task-format shifts to avoid being misled by inflated metrics. Ultimately, the paper serves as a cautionary note for the field, emphasizing that evaluation procedures must be as rigorous as the models themselves.