From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification
Automated fact verification requires more than a binary "true" or "false" classification. It is a multi-stage process: breaking down a claim, gathering evidence, and synthesizing that evidence into a final verdict. Current systems often handle these stages separately or rely on rigid, pre-set rules, which can lead to disjointed reasoning. This paper introduces ProFact, an agentic framework that treats fact verification as a unified, long-horizon decision-making process, so that every step of verification can be optimized end to end.
A Unified Approach to Verification
ProFact moves away from isolated modules by training a single policy to manage the entire verification trajectory. The system operates in three distinct stages: Question, Search, and Verdict. First, it decomposes a complex claim into specific verification questions. Next, it performs targeted evidence retrieval and generates answers grounded in that evidence. Finally, it uses the accumulated information to predict the veracity of the original claim. Because the stages are trained together, early decisions, such as how a claim is decomposed, are shaped by what the final verdict requires.
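The three-stage trajectory described above can be sketched roughly as follows. This is a minimal illustration, not ProFact's actual implementation: the names `Step`, `Trajectory`, `verify`, `toy_policy`, and `toy_retrieve` are hypothetical, the prompt formats are invented for the example, and the toy policy stands in for a trained language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a policy maps (stage, prompt) to generated text,
# a retriever maps a question to a list of evidence passages.
Policy = Callable[[str, str], str]
Retriever = Callable[[str], List[str]]


@dataclass
class Step:
    stage: str   # "question", "search", or "verdict"
    output: str  # text the policy produced at this step


@dataclass
class Trajectory:
    claim: str
    steps: List[Step] = field(default_factory=list)


def verify(claim: str, policy: Policy, retrieve: Retriever) -> Trajectory:
    """Run one Question -> Search -> Verdict trajectory with a single policy."""
    traj = Trajectory(claim=claim)

    # Question stage: decompose the claim into verification questions,
    # assumed here to come back one per line.
    raw = policy("question", claim)
    questions = [q.strip() for q in raw.splitlines() if q.strip()]
    traj.steps.append(Step("question", raw))

    # Search stage: retrieve evidence per question and answer from it.
    qa_pairs: List[Tuple[str, str]] = []
    for q in questions:
        evidence = " ".join(retrieve(q))
        answer = policy("search", f"Question: {q}\nEvidence: {evidence}")
        traj.steps.append(Step("search", answer))
        qa_pairs.append((q, answer))

    # Verdict stage: predict veracity from the accumulated question-answer pairs.
    summary = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    verdict = policy("verdict", f"Claim: {claim}\n{summary}")
    traj.steps.append(Step("verdict", verdict))
    return traj


# Toy stand-ins so the sketch runs without a model or a search engine.
def toy_policy(stage: str, prompt: str) -> str:
    if stage == "question":
        return "Who made the statement?\nWhen was it made?"
    if stage == "search":
        return "Based on: " + prompt.split("Evidence: ")[1]
    return "Supported"


def toy_retrieve(question: str) -> List[str]:
    return [f"passage about '{question}'"]


if __name__ == "__main__":
    for step in verify("Example claim", toy_policy, toy_retrieve).steps:
        print(step.stage, "->", step.output)
```

Routing every stage through the same `policy` callable mirrors the idea of a single policy covering the whole trajectory, so a reward attached to any step can update the same parameters.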
Learning Through Process-Aware Rewards
A major challenge in training these systems is that the final veracity label is often the only feedback provided, which is too sparse to teach the model how to improve its intermediate steps. To solve this, ProFact introduces "process-aware rewards." Rather than receiving a single signal at the end of the trajectory, the policy receives dense, stage-level feedback throughout. A matching score evaluates the quality of the generated questions and the accuracy of the evidence-grounded answers. This allows the model to learn which specific actions contribute to a correct conclusion, effectively turning a long-horizon task into a series of steps that each carry their own learning signal.
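One way to picture stage-level rewards is the sketch below. It rests on several assumptions that the summary does not specify: token-overlap F1 stands in for the matching score, each reference is matched to its best-scoring generation, and the three stage rewards are combined with hypothetical weights. The function names `token_f1`, `matching_score`, and `process_reward` are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between two strings (a stand-in similarity measure)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def matching_score(generated: List[str], references: List[str]) -> float:
    """Average, over references, of the best similarity to any generated item."""
    if not generated or not references:
        return 0.0
    best = [max(token_f1(g, ref) for g in generated) for ref in references]
    return sum(best) / len(references)


def process_reward(
    gen_questions: List[str],
    ref_questions: List[str],
    gen_answers: List[str],
    ref_answers: List[str],
    pred_label: str,
    gold_label: str,
    weights: Tuple[float, float, float] = (0.3, 0.3, 0.4),  # hypothetical weights
) -> Dict[str, float]:
    """Return one reward per stage plus a weighted total."""
    w_q, w_a, w_v = weights
    r_q = matching_score(gen_questions, ref_questions)  # Question stage
    r_a = matching_score(gen_answers, ref_answers)      # Search stage
    r_v = 1.0 if pred_label == gold_label else 0.0      # Verdict stage
    total = w_q * r_q + w_a * r_a + w_v * r_v
    return {"question": r_q, "answer": r_a, "verdict": r_v, "total": total}


if __name__ == "__main__":
    print(process_reward(
        ["Who made the statement?"], ["Who made the statement?"],
        ["The mayor"], ["The mayor made it"],
        "Refuted", "Supported",
    ))
```

Even when the final label is wrong, the question and answer components stay non-zero here, which is the sense in which intermediate feedback is denser than a verdict-only reward.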
Improved Performance and Efficiency
The researchers evaluated ProFact across several open-source language models using the AVeriTeC benchmark. The results demonstrate that ProFact consistently outperforms existing baseline methods in both final verification accuracy and the quality of the intermediate evidence gathered. By optimizing the entire trajectory rather than individual parts, the system achieves better alignment between evidence seeking and final judgment. The experiments also highlight that the process-aware reward mechanism is critical; removing this intermediate feedback significantly degrades performance, confirming that the model benefits from learning at every stage of the verification process.