Beyond Accuracy: Measuring Logical Compliance of Predictive Models
In high-stakes fields like healthcare, finance, and autonomous systems, machine learning models are often judged solely by their predictive accuracy. However, a model that is highly accurate may still produce results that violate fundamental logical or domain-specific rules. This paper introduces the Rule Violation Score (RVS), a new evaluation metric designed to measure how well a model adheres to logical constraints, independent of its predictive performance. By providing a way to quantify logical consistency, the authors offer a tool for checking that models are not just statistically correct, but also logically sound.
The Problem with Accuracy Alone
Standard evaluation metrics, such as classification accuracy, prediction error, and ranking quality, focus on how closely a model's output matches ground truth data. While these metrics are useful, they do not capture whether a model respects the underlying logic of a domain. For example, a model might predict a result that is statistically likely but logically impossible under established rules, such as assigning one person two different birthplaces. Because current metrics do not account for this, they can mask significant behavioral flaws in models used for critical decision-making.
How the Rule Violation Score Works
The RVS provides a complementary way to assess models by measuring their compliance with a predefined set of logical rules. The framework distinguishes between two types of rules:
Hard Rules: These represent strict constraints that must be followed.
Soft Rules: These represent statistical regularities or tendencies.
The RVS is versatile, as it can be applied to any predictive model that uses a relational vocabulary. For Horn rules, the score can be calculated automatically using SQL queries, making it a practical tool for developers. Beyond evaluating the models themselves, the RVS can also be used to check the logical consistency of training datasets and to identify rules that may be poorly defined.
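To make the SQL-based check concrete, here is a minimal sketch of how violations of a single Horn rule could be counted with a query. Everything in it is an assumption for illustration: the rule capital_of(X, Y) → located_in(X, Y), the `pred` table layout, the function name `violation_rate`, and the choice to report violations as a fraction of body groundings. The paper's actual definition of the RVS may differ.

```python
import sqlite3


def violation_rate(facts):
    """Share of body groundings of a hypothetical Horn rule
    capital_of(X, Y) -> located_in(X, Y) whose head is missing
    from the model's predicted facts.

    facts: iterable of (subject, relation, object) triples.
    """
    conn = sqlite3.connect(":memory:")
    # The primary key plus INSERT OR IGNORE removes duplicate triples.
    conn.execute(
        "CREATE TABLE pred (s TEXT, r TEXT, o TEXT, PRIMARY KEY (s, r, o))"
    )
    conn.executemany("INSERT OR IGNORE INTO pred VALUES (?, ?, ?)", facts)

    # Every predicted capital_of fact is one grounding of the rule body.
    body = conn.execute(
        "SELECT COUNT(*) FROM pred WHERE r = 'capital_of'"
    ).fetchone()[0]

    # A grounding is violated when the matching located_in fact is absent.
    violations = conn.execute(
        """
        SELECT COUNT(*) FROM pred AS b
        WHERE b.r = 'capital_of'
          AND NOT EXISTS (
              SELECT 1 FROM pred AS h
              WHERE h.r = 'located_in' AND h.s = b.s AND h.o = b.o
          )
        """
    ).fetchone()[0]
    conn.close()
    return violations / body if body else 0.0


predicted = [
    ("paris", "capital_of", "france"),
    ("paris", "located_in", "france"),
    ("berlin", "capital_of", "germany"),  # no located_in fact: one violation
]
print(violation_rate(predicted))
```

The same query can be pointed at a table of training facts instead of model predictions, which is one way to read the dataset consistency check mentioned above.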
Insights from Benchmarking
The authors tested the RVS across three different benchmarks, including knowledge graph link prediction and relational regression. They evaluated a variety of model types, such as rule-based, embedding-based, and neuro-symbolic models. The results revealed a critical insight: two models can achieve nearly identical levels of predictive accuracy while showing vastly different levels of logical compliance. This demonstrates that the RVS captures information about model behavior that traditional accuracy-based metrics miss, providing a more comprehensive view of model reliability.
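The gap between accuracy and compliance can be reproduced on a toy example. The sketch below is not taken from the paper's benchmarks: the data, the symmetry rule married_to(X, Y) → married_to(Y, X), and both hypothetical models are invented purely to show how equal accuracy can coexist with very different violation rates.

```python
def accuracy(pred, truth):
    """Fraction of candidate pairs whose predicted label matches the truth."""
    return sum(pred[k] == truth[k] for k in truth) / len(truth)


def symmetry_violation_rate(pred):
    """Share of predicted married_to(X, Y) facts whose mirror
    married_to(Y, X) is not also predicted (illustrative rule)."""
    positives = [pair for pair, label in pred.items() if label]
    violated = [(x, y) for (x, y) in positives if not pred.get((y, x), False)]
    return len(violated) / len(positives) if positives else 0.0


# Toy ground truth: a-b and c-d are married, e-f are not.
truth = {
    ("a", "b"): True, ("b", "a"): True,
    ("c", "d"): True, ("d", "c"): True,
    ("e", "f"): False, ("f", "e"): False,
}

# Hypothetical model A makes two errors, but they are mutually consistent.
model_a = dict(truth)
model_a[("e", "f")] = True
model_a[("f", "e")] = True

# Hypothetical model B also makes two errors, each breaking symmetry.
model_b = dict(truth)
model_b[("b", "a")] = False
model_b[("d", "c")] = False

for name, model in (("A", model_a), ("B", model_b)):
    print(name, accuracy(model, truth), symmetry_violation_rate(model))
```

Both toy models get four of six candidates right, yet model A never breaks the symmetry rule while every positive prediction of model B does.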