Paper Abstract

Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.

Beyond Accuracy: Measuring Logical Compliance of Predictive Models

In high-stakes fields like healthcare, finance, and autonomous systems, machine learning models are evaluated predominantly by their predictive performance. However, a model that is highly accurate may still produce outputs that violate logical or domain-specific rules. This paper introduces the Rule Violation Score (RVS), an evaluation metric that measures how well a model adheres to a given set of logical rules, independently of its predictive accuracy. By quantifying logical consistency, the authors offer a way to check whether a model is not just statistically accurate but also logically compliant.

The Problem with Accuracy Alone

Standard evaluation metrics—such as classification accuracy, prediction error, and ranking quality—focus on how closely a model’s output matches ground truth data. While these metrics are useful, they fail to capture whether a model respects the underlying logic of a domain. For example, a model might predict a result that is statistically likely but logically impossible based on established rules. Because current metrics do not account for this, they can mask significant behavioral flaws in models used for critical decision-making.

How the Rule Violation Score Works

The RVS provides a complementary way to assess models by measuring their compliance with a predefined set of logical rules. The framework distinguishes between two types of rules:

Hard Rules: These represent strict constraints that must be followed.
Soft Rules: These represent statistical regularities or tendencies.

The RVS treats the two kinds of rules differently. It can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary. For Horn rules, the SQL queries that compute the score are generated automatically, which makes the metric practical to apply. Beyond evaluating models, the RVS can also be used to check the logical consistency of training datasets and to help identify poorly defined rules.
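To make the idea of generating SQL from a Horn rule concrete, here is a minimal sketch in Python with SQLite. It is not the paper's implementation: this summary gives neither the RVS formula nor the generated queries. The sketch assumes that facts and predictions live together in a hypothetical triples(s, p, o) table, and that a grounding of the rule body counts as a violation when the head atom is missing. The helper names (horn_rule_queries, violation_rate) and the grandparent rule are illustrative only.

```python
import sqlite3


def horn_rule_queries(body, head):
    """Build SQL for a Horn rule body -> head over a triples(s, p, o) table.

    body is a list of (relation, subject_var, object_var) atoms and head is a
    single such atom. Returns (total_sql, total_params, violated_sql,
    violated_params). This is an assumed encoding, not the paper's.
    """
    bindings = {}  # variable name -> first column that binds it
    tables, conds, params = [], [], []
    for i, (rel, s_var, o_var) in enumerate(body):
        alias = f"t{i}"
        tables.append(f"triples AS {alias}")
        conds.append(f"{alias}.p = ?")
        params.append(rel)
        for var, col in ((s_var, f"{alias}.s"), (o_var, f"{alias}.o")):
            if var in bindings:
                # A repeated variable becomes a join condition.
                conds.append(f"{col} = {bindings[var]}")
            else:
                bindings[var] = col
    from_where = f"FROM {', '.join(tables)} WHERE {' AND '.join(conds)}"
    h_rel, h_s, h_o = head
    total_sql = f"SELECT COUNT(*) {from_where}"
    # A body grounding is violated when no matching head triple exists.
    violated_sql = (
        f"SELECT COUNT(*) {from_where} AND NOT EXISTS ("
        f"SELECT 1 FROM triples AS h WHERE h.p = ? "
        f"AND h.s = {bindings[h_s]} AND h.o = {bindings[h_o]})"
    )
    return total_sql, params, violated_sql, params + [h_rel]


def violation_rate(conn, body, head):
    """Fraction of body groundings whose head atom is missing."""
    total_sql, total_params, viol_sql, viol_params = horn_rule_queries(body, head)
    total = conn.execute(total_sql, total_params).fetchone()[0]
    violated = conn.execute(viol_sql, viol_params).fetchone()[0]
    return violated / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ann", "parent_of", "bob"),
        ("bob", "parent_of", "cat"),
        ("bob", "parent_of", "dan"),
        ("ann", "grandparent_of", "cat"),  # head present for cat, missing for dan
    ],
)

# parent_of(x, y) AND parent_of(y, z) -> grandparent_of(x, z)
body = [("parent_of", "x", "y"), ("parent_of", "y", "z")]
head = ("grandparent_of", "x", "z")
rate = violation_rate(conn, body, head)
print(rate)
```

In this toy data the rule body has two groundings and one of them lacks its head, so the rate is 0.5. How the actual RVS weights hard rules against soft rules is not described in this summary, so the sketch deliberately stops at a plain violation rate.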

Insights from Benchmarking

The authors evaluated the RVS on three benchmarks covering knowledge graph link prediction and relational regression. The models tested included rule-based, embedding-based, and neuro-symbolic predictors. The main finding is that two models with comparable predictive accuracy can show substantially different levels of logical compliance. The RVS therefore exposes differences in model behavior that standard accuracy-based metrics fail to capture.
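The finding can be illustrated with a toy example, which is not drawn from the paper's benchmarks. The data, the two "models", and the hard rule (a person under 18 is never predicted as "retired") are all invented for illustration, and the violation rate below is a simple stand-in for the RVS rather than its actual definition.

```python
# Hypothetical labeled examples; ages and labels are made up.
examples = [
    {"age": 10, "truth": "student"},
    {"age": 15, "truth": "student"},
    {"age": 40, "truth": "employed"},
    {"age": 70, "truth": "retired"},
    {"age": 30, "truth": "employed"},
]

# Each model makes exactly one mistake, but on a different example.
model_a = ["student", "student", "employed", "retired", "student"]
model_b = ["student", "retired", "employed", "retired", "employed"]


def accuracy(predictions, examples):
    """Fraction of predictions that match the ground truth."""
    correct = sum(p == ex["truth"] for p, ex in zip(predictions, examples))
    return correct / len(examples)


def hard_rule_violation_rate(predictions, examples):
    """Share of under-18 examples predicted as 'retired' (assumed hard rule)."""
    applicable = [p for p, ex in zip(predictions, examples) if ex["age"] < 18]
    if not applicable:
        return 0.0
    return sum(p == "retired" for p in applicable) / len(applicable)


for name, preds in (("A", model_a), ("B", model_b)):
    print(name, accuracy(preds, examples), hard_rule_violation_rate(preds, examples))
```

Both models score 0.8 on accuracy, yet model A never breaks the rule while model B breaks it on one of the two examples where it applies. Accuracy alone cannot tell the two apart.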
