SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering
Large language models are powerful tools for complex reasoning, but they often struggle to evaluate their own intermediate steps. In tasks like medical or legal question answering, a model might take a logically flawed or "risky" path but still arrive at the correct final answer by chance. Existing reward models often fail to catch these errors because later good steps can "compensate" for earlier bad ones, leading to unreliable reasoning. This paper introduces the Schema-aware Cumulative Process Reward Model (SCPRM), which addresses this by providing a more rigorous, step-by-step evaluation of reasoning paths over knowledge graphs.
The Problem with Current Reward Models
Traditional process reward models often suffer from a "risk compensation" effect, where a model is rewarded for the final outcome even if the path taken to reach it was logically unsound. In knowledge graph reasoning, where multiple paths may exist between entities, this is particularly dangerous in sensitive fields like law or medicine. Conditional reward models, meanwhile, often suffer from "length bias": they unfairly penalize longer, more complex reasoning paths simply because they involve more steps.
How SCPRM Works
SCPRM addresses these issues by breaking down reasoning evaluation into two distinct components:
Cumulative Past Reward: Instead of evaluating only the final result, this component tracks the safety of every step taken. It sums log-probabilities so that a risky step imposes a persistent penalty that later correct steps cannot "offset."
Schema-aware Future Reward: This module estimates how likely a path is to reach the correct target by comparing the current reasoning path against a "query schema," a logical map extracted from the user's question. By measuring the "schema distance" between the current step and the implicit goal, the model can guide the search process without being biased by the length of the path (a minimal sketch of both components follows this list).
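To make the two components concrete, here is a minimal Python sketch. The per-step safety probabilities, the relation-overlap notion of schema distance, and all function names are illustrative assumptions on our part, not the paper's actual implementation:

```python
import math

def cumulative_past_reward(step_probs):
    """Sum of log step-safety probabilities (illustrative).

    Because log(p) <= 0, a single risky (low-probability) step drags
    the cumulative reward down permanently; later safe steps (p near 1)
    contribute ~0 and cannot offset the penalty.
    """
    return sum(math.log(max(p, 1e-9)) for p in step_probs)

def future_reward(path_relations, query_schema):
    """Toy schema-aware future reward: fraction of the query schema's
    relations the current path already covers (1.0 = fully matched).

    It depends on schema coverage, not raw path length, so longer
    multi-hop paths are not penalized merely for having more steps.
    """
    schema = set(query_schema)
    return len(schema & set(path_relations)) / len(schema)

# A risky early step (p=0.2) dominates the cumulative reward even if
# every later step is nearly certain:
print(cumulative_past_reward([0.95, 0.90, 0.95]))  # ~ -0.21
print(cumulative_past_reward([0.20, 0.99, 0.99]))  # ~ -1.63

# Hypothetical schema for "Which drug treats a disease with symptom X?":
schema = ["treats", "has_symptom"]
print(future_reward(["treats"], schema))                 # 0.5
print(future_reward(["treats", "has_symptom"], schema))  # 1.0
```

Note how the logarithmic formulation realizes the "no compensation" property: a low-probability step contributes a large negative term that no sequence of near-certain later steps can cancel out.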
Integrating with Search
The researchers integrated SCPRM into a Monte Carlo Tree Search (MCTS) framework, creating "SCPRM-MCTS." This allows the model to perform multi-hop reasoning across a knowledge graph by using the reward model to evaluate potential paths in real time. During the search, the model selects the most promising next steps based on the cumulative and future rewards, effectively pruning paths that contain risky or irrelevant information before they can lead to a hallucinated answer.
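For intuition, here is the skeleton of an MCTS loop such a reward model could plug into. This is standard MCTS machinery sketched by us, not the paper's code; the reward passed to backpropagate would be the combination of the cumulative past and schema-aware future rewards sketched above:

```python
import math

class Node:
    """One node per partial reasoning path over the knowledge graph."""
    def __init__(self, path, parent=None):
        self.path = path        # list of (relation, entity) hops so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0        # accumulated SCPRM-style reward

def uct_select(node, c=1.4):
    """Standard UCT selection: balance mean reward against exploration."""
    return max(
        node.children,
        key=lambda ch: (ch.value / ch.visits if ch.visits else float("inf"))
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
    )

def backpropagate(node, reward):
    """Propagate a path's reward back to the root. Because the reward
    already carries the persistent penalty for any risky hop, subtrees
    rooted at risky steps keep low mean values and are rarely selected,
    pruning them before they can yield a hallucinated answer."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```

In a full loop, expansion would enumerate the candidate relations leaving the current entity, and the SCPRM score of each candidate path would stand in for the random rollout of vanilla MCTS.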
Performance and Results
The researchers tested SCPRM-MCTS on medical and legal knowledge graph datasets, as well as the Complex WebQuestions (CWQ) dataset. The results showed that SCPRM-MCTS outperformed strong baselines, achieving an average improvement of 1.18% in Hits@k. These findings suggest that by explicitly penalizing risky steps and aligning reasoning with the logical structure of the query, the model provides a more accurate and reliable way to navigate complex knowledge graphs in high-stakes domains.