Paper Abstract

Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification
Automated fact verification is a complex task that requires more than just a simple "true" or "false" classification. It involves a multi-stage process: breaking down a claim, gathering evidence, and synthesizing that information into a final verdict. Current systems often handle these stages separately or rely on rigid, pre-set rules, which can lead to disjointed reasoning. This paper introduces ProFact, an agentic framework that treats fact verification as a unified, long-horizon decision-making process, allowing the system to optimize every step of the verification journey end-to-end.

A Unified Approach to Verification

ProFact moves away from isolated modules by training a single policy to manage the entire verification trajectory. The system operates in three stages: Question, Search, and Verdict. In the Question stage, it decomposes a complex claim into specific verification questions. In the Search stage, it retrieves targeted evidence and generates answers grounded in that evidence (the abstract lists evidence seeking and answer generation as separate steps). In the Verdict stage, it uses the accumulated information to predict the veracity of the original claim. Because the stages are trained together, early decisions, such as how a claim is decomposed, are shaped by what the final verdict requires.
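
The three stages described above can be sketched as a single rollout loop. This is only an illustrative outline, not the authors' implementation: the names `Trajectory`, `run_verification`, `toy_policy`, and `toy_retrieve` are hypothetical, the policy is stubbed out with canned strings in place of an LLM, and the verdict label is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Everything one verification run produces (hypothetical container)."""
    claim: str
    questions: List[str] = field(default_factory=list)
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)
    verdict: str = ""


def run_verification(
    claim: str,
    policy: Callable[[str, str], str],
    retrieve: Callable[[str], str],
) -> Trajectory:
    """Roll out Question -> Search -> Verdict with one shared policy."""
    traj = Trajectory(claim=claim)

    # Question stage: decompose the claim into verification questions.
    raw = policy("question", claim)
    traj.questions = [q.strip() for q in raw.split("\n") if q.strip()]

    # Search stage: retrieve evidence per question, then answer from it.
    for question in traj.questions:
        evidence = retrieve(question)
        answer = policy("answer", f"{question}\n{evidence}")
        traj.qa_pairs.append((question, answer))

    # Verdict stage: predict veracity from the accumulated Q&A pairs.
    context = "\n".join(f"Q: {q} A: {a}" for q, a in traj.qa_pairs)
    traj.verdict = policy("verdict", f"{claim}\n{context}")
    return traj


def toy_policy(stage: str, prompt: str) -> str:
    """Stand-in for the trained policy; returns canned text per stage."""
    if stage == "question":
        return "Who made the statement?\nWhat do official records say?"
    if stage == "answer":
        # Echo the evidence line so the answer is visibly grounded in it.
        return "Based on: " + prompt.split("\n")[-1]
    return "Supported"  # placeholder verdict label


def toy_retrieve(query: str) -> str:
    """Stand-in for a search backend."""
    return f"evidence snippet for '{query}'"


trajectory = run_verification("Example claim.", toy_policy, toy_retrieve)
print(trajectory.questions)
print(trajectory.verdict)
```

The point of the sketch is that the same `policy` callable is invoked at every stage, so a training signal computed on the whole trajectory can influence how questions are written as well as how the verdict is chosen.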

Learning Through Process-Aware Rewards

A major challenge in training these systems is that the final veracity label is often the only feedback provided, which is too sparse to teach the model how to improve its intermediate steps. To address this, ProFact introduces "process-aware rewards." Instead of waiting until the end, the system receives dense, stage-level feedback throughout the process. It uses a matching score to evaluate the quality of the generated questions and the accuracy of the evidence-grounded answers. This allows the model to learn which specific actions contribute to a correct conclusion, so each step of the long-horizon task receives its own learning signal instead of depending solely on the final label.
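
One way to picture a process-aware reward is as a weighted sum of stage-level scores. The sketch below is an assumption-laden illustration, not ProFact's actual reward: the summary does not specify the matching score, so a token-overlap F1 against reference questions and answers is swapped in, and the function names and weights (`w_question`, `w_answer`, `w_verdict`) are hypothetical.

```python
from collections import Counter
from typing import List


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between two strings (stand-in matching score)."""
    pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def matching_score(preds: List[str], refs: List[str]) -> float:
    """Average, over references, of the best-scoring prediction."""
    if not preds or not refs:
        return 0.0
    return sum(max(token_f1(p, r) for p in preds) for r in refs) / len(refs)


def process_aware_reward(
    pred_questions: List[str],
    ref_questions: List[str],
    pred_answers: List[str],
    ref_answers: List[str],
    pred_verdict: str,
    gold_verdict: str,
    w_question: float = 0.25,  # hypothetical weight
    w_answer: float = 0.25,    # hypothetical weight
    w_verdict: float = 0.5,    # hypothetical weight
) -> float:
    """Combine stage-level scores with the final verdict signal."""
    r_question = matching_score(pred_questions, ref_questions)
    r_answer = matching_score(pred_answers, ref_answers)
    r_verdict = 1.0 if pred_verdict == gold_verdict else 0.0
    return w_question * r_question + w_answer * r_answer + w_verdict * r_verdict


# Good intermediate steps still earn reward when the verdict is wrong.
reward = process_aware_reward(
    ["who said it"], ["who said it"],
    ["the mayor"], ["the mayor"],
    "Refuted", "Supported",
)
print(reward)
```

With a verdict-only reward, the trajectory in the usage example would score zero even though its questions and answers match the references; the stage-level terms are what distinguish it from a trajectory that went wrong everywhere.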

Improved Performance and Efficiency

The researchers evaluated ProFact across several open-source language models using the AVeriTeC benchmark. The results demonstrate that ProFact consistently outperforms existing baseline methods in both final verification accuracy and the quality of the intermediate evidence gathered. By optimizing the entire trajectory rather than individual parts, the system achieves better alignment between evidence seeking and final judgment. The experiments also highlight that the process-aware reward mechanism is critical; removing this intermediate feedback significantly degrades performance, confirming that the model benefits from learning at every stage of the verification process.
