Evidence-State Rewards for Long-Context Reasoning

Paper Abstract

Long-context reasoning requires models to locate, revise, and synthesize evidence distributed across lengthy inputs. Existing long-context RL methods usually reward final answers or static evidence extraction, offering little feedback on how intermediate actions change the model's evidence state. We propose Maven, a reinforcement learning framework with an editable evidence memory. Maven defines an answer-conditioned evidence-state value and rewards action-level state transitions: add actions are credited by marginal gain and hindsight contribution, link actions by evidence synergy, and drop actions by improved answer support after removing misleading evidence. These rewards are assigned to the corresponding action spans in GRPO. Across Llama and Qwen models on LongBench v2, LongReason, and RULER, Maven outperforms outcome-only RL and evidence-identification baselines, producing more sufficient evidence sets and lower distractor retention. Our results show that long-context RL benefits from optimizing stateful evidence navigation rather than one-shot evidence extraction.
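The abstract states that action-level rewards are assigned to the corresponding action spans in GRPO. The sketch below shows one way that could look. It rests on assumptions not confirmed here: the group-normalized outcome advantage is the usual GRPO form, while adding each action's reward to the tokens inside its span, weighted by a hypothetical coefficient `beta`, is an illustrative choice and not necessarily how Maven combines the two signals. The names `group_advantages`, `token_advantages`, and `Span` are invented for this example.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# (start_token, end_token_exclusive, action_reward); a hypothetical layout.
Span = Tuple[int, int, float]


def group_advantages(outcome_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def token_advantages(
    num_tokens: int, outcome_adv: float, spans: List[Span], beta: float = 1.0
) -> List[float]:
    """Broadcast the outcome advantage to every token, then add each
    action's reward only to the tokens that emitted that action.
    The additive combination and `beta` are assumptions of this sketch."""
    adv = [outcome_adv] * num_tokens
    for start, end, reward in spans:
        for t in range(start, end):
            adv[t] += beta * reward
    return adv


# Four rollouts for one prompt: two answered correctly, two did not.
group = group_advantages([1.0, 0.0, 1.0, 0.0])

# One rollout of 6 tokens with a helpful action on tokens 1-2
# and a harmful action on token 4 (outcome advantage set to 0 for clarity).
per_token = token_advantages(6, 0.0, [(1, 3, 0.5), (4, 5, -0.5)])
```

With span-level credit, tokens outside any action span receive only the sequence-level signal, while tokens inside a span are pushed up or down according to what that specific action did to the evidence state.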

Evidence-State Rewards for Long-Context Reasoning introduces Maven (Marginal-Value Evidence Navigation), a reinforcement learning framework designed to improve how large language models (LLMs) reason over very long inputs. Many models can process long inputs, yet they often struggle to synthesize the information in them: they get distracted by irrelevant passages or fail to connect distant pieces of evidence. Maven addresses this by training models to treat reasoning as a dynamic process of building and refining an "evidence memory" rather than performing a one-shot search.

Moving Beyond One-Shot Extraction

Traditional reinforcement learning for long-context tasks often rewards models only for finding the correct final answer or for identifying a static set of evidence. The authors argue that this is insufficient because complex reasoning requires an evolving process. A piece of information might seem relevant initially but later prove to be a distractor, or two pieces of information might only become useful when linked together. Maven shifts the focus from static identification to "stateful navigation," where the model learns to actively manage its evidence.

How Maven Works

Maven provides the model with an editable memory interface that allows it to perform four specific actions:

  • Add: Include a new piece of evidence from the context.

  • Drop: Remove evidence that is redundant or misleading.

  • Link: Explicitly connect two pieces of evidence to show how they jointly support the answer.

  • Answer: Produce the final response once the evidence is sufficient.

Instead of only rewarding the final answer, Maven assigns "process rewards" to each of these actions. For example, an "add" action is rewarded based on whether it provides new, useful information, while a "drop" action is rewarded if it improves the model's ability to predict the correct answer by removing noise. By using a frozen verifier model to calculate these rewards, Maven provides dense, step-by-step feedback that guides the model toward more effective reasoning strategies.
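The memory actions and their transition rewards can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_value` is a hand-written stand-in for the frozen verifier's answer-conditioned evidence-state value, the synergy formula for link actions is one common way to measure interaction between two items, and the hindsight-contribution term for add actions is omitted. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, Set, Tuple

# Maps an evidence set to a scalar "answer support" score.
Value = Callable[[FrozenSet[str]], float]


@dataclass
class EvidenceMemory:
    items: Set[str] = field(default_factory=set)
    links: Set[Tuple[str, str]] = field(default_factory=set)

    def state(self) -> FrozenSet[str]:
        return frozenset(self.items)


def add_reward(mem: EvidenceMemory, item: str, value: Value) -> float:
    """Marginal gain: change in state value caused by adding `item`."""
    before = value(mem.state())
    mem.items.add(item)
    return value(mem.state()) - before


def drop_reward(mem: EvidenceMemory, item: str, value: Value) -> float:
    """Improvement in answer support after removing `item`."""
    before = value(mem.state())
    mem.items.discard(item)
    mem.links = {pair for pair in mem.links if item not in pair}
    return value(mem.state()) - before


def link_reward(mem: EvidenceMemory, a: str, b: str, value: Value) -> float:
    """Synergy (assumed form): joint value beyond the two items alone."""
    mem.links.add((a, b))
    return (value(frozenset({a, b})) - value(frozenset({a}))
            - value(frozenset({b})) + value(frozenset()))


def toy_value(evidence: FrozenSet[str]) -> float:
    """Toy stand-in for the verifier: fact_B only helps next to fact_A."""
    score = 0.0
    if "fact_A" in evidence:
        score += 0.25
    if "fact_A" in evidence and "fact_B" in evidence:
        score += 0.5
    if "distractor" in evidence:
        score -= 0.5
    return score


mem = EvidenceMemory()
rewards = {
    "add fact_A": add_reward(mem, "fact_A", toy_value),
    "add distractor": add_reward(mem, "distractor", toy_value),
    "add fact_B": add_reward(mem, "fact_B", toy_value),
    "link A-B": link_reward(mem, "fact_A", "fact_B", toy_value),
    "drop distractor": drop_reward(mem, "distractor", toy_value),
}
```

In this toy run, adding the distractor is penalized, dropping it later is rewarded, and the link action earns credit only because the two facts are worth more together than apart. A real verifier would replace `toy_value` with a model-based estimate of how well the current evidence supports the reference answer.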

Performance and Results

The researchers tested Maven across several model families, including Llama and Qwen, on benchmarks like LongBench v2, LongReason, and RULER. The results show that Maven outperforms both outcome-only reinforcement learning and evidence-identification baselines. Models trained with Maven demonstrated higher "evidence sufficiency," meaning they were better at capturing the necessary information, and lower "distractor retention," meaning they were better at ignoring irrelevant or misleading content. These gains were particularly pronounced in tasks involving very long contexts, where the ability to revise the evidence set is most critical.
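To make the two evidence-quality measures concrete, the sketch below computes set-based versions of them. These definitions are assumptions made for illustration: the paper may define evidence sufficiency and distractor retention differently (for instance, through a verifier model), and the passage identifiers and function names are hypothetical.

```python
from typing import Set


def evidence_sufficiency(selected: Set[str], required: Set[str]) -> float:
    """Fraction of the required evidence present in the final memory
    (assumed set-overlap definition)."""
    if not required:
        return 1.0
    return len(selected & required) / len(required)


def distractor_retention(selected: Set[str], distractors: Set[str]) -> float:
    """Fraction of the final memory made up of known distractors
    (assumed set-overlap definition)."""
    if not selected:
        return 0.0
    return len(selected & distractors) / len(selected)


# Hypothetical final memory of four passages for one question.
final_memory = {"p3", "p8", "p17", "p42"}
gold_evidence = {"p3", "p17"}
known_distractors = {"p8", "p9"}

sufficiency = evidence_sufficiency(final_memory, gold_evidence)
retention = distractor_retention(final_memory, known_distractors)
```

Here the memory contains both gold passages, so sufficiency is 1.0, while one of its four passages is a known distractor, giving a retention of 0.25. A model that learns to drop `p8` would lower retention without hurting sufficiency.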

Key Takeaways

The study suggests that the primary bottleneck in long-context reasoning is not just the ability to "read" long inputs, but the ability to manage information effectively. By rewarding the intermediate steps of building, linking, and pruning evidence, Maven helps models avoid the "lost in the middle" phenomenon and premature reliance on partial information. The framework demonstrates that optimizing for the process of navigation, rather than just the final outcome, is a more robust way to train models for complex, multi-step reasoning tasks.
