ReSum: Synergizing LLM Reasoning and Summarization... | AI Research

Key Takeaways

  • Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs).
  • However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget.
  • Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory.
  • To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization.
Paper Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning
Large Language Models (LLMs) often struggle with long-horizon reasoning tasks because they tend to "overthink," generating excessively long sequences that lead to memory errors, redundant steps, and the exhaustion of their reasoning budget. While existing methods often rely on external tools or complex systems to manage these long contexts, they can be difficult to implement and may lack transparency. ReSum introduces a new framework that enables LLMs to manage their own reasoning trajectories through self-summarization, allowing the model to compress its history and recover from errors without needing external intervention.

The Power of Self-Summarization

The researchers conducted pilot studies revealing that LLMs naturally exhibit a "self-summarization" behavior when they reach a state of high uncertainty. By analyzing token entropy, they found that the model’s uncertainty drops significantly after it generates a summary. Furthermore, when the researchers manually injected a "summarization" phrase into the middle of a flawed reasoning chain, the model’s ability to reach the correct final answer improved by up to 30%. This suggests that self-summarization is an intrinsic control mechanism that helps the model consolidate its thoughts and correct its course.
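To make the entropy measurement concrete, here is a minimal sketch of how token-level entropy before and after a summary position could be compared. It is an illustration only, not the researchers' analysis code: the function names, the `summary_index` parameter, and the toy probability distributions are all assumptions made for this example.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average entropy over a sequence of next-token distributions."""
    if not distributions:
        return 0.0
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def entropy_drop(distributions, summary_index):
    """Mean entropy before a (hypothetical) summary position minus
    mean entropy from that position onward. Positive means the model
    became more certain after summarizing."""
    before = mean_entropy(distributions[:summary_index])
    after = mean_entropy(distributions[summary_index:])
    return before - after

# Toy data (made up): two uncertain steps, then two confident steps.
toy = [
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.97, 0.01, 0.01, 0.01],
]
drop = entropy_drop(toy, summary_index=2)
print(f"entropy drop after summary position: {drop:.3f} nats")
```

In a real setting the distributions would come from the model's softmax outputs at each generated token, and the summary position would be found by detecting the summarization phrase in the rollout.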

How ReSum Works

ReSum uses a tree-based reinforcement learning approach to teach the model when and how to summarize. It creates "rollout trees" by branching off from a model's initial reasoning path in two ways:

  • Artifact Points: The system injects a summarization phrase into non-summarization positions to test if summarizing at that specific moment improves the outcome.

  • Natural Points: The system identifies where the model spontaneously summarized and masks that phrase to see if the summary was actually helpful.

By comparing these different branches, the model learns to favor paths where summarization leads to a correct final answer. The framework uses a "summarization-aware advantage" calculation, which provides fine-grained feedback to the model, reinforcing effective summarization timing while discouraging unnecessary or unhelpful summaries.
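The two branching rules can be sketched roughly as follows. This is a simplified illustration under several assumptions: a rollout is represented as a list of text steps, `SUMMARY_PHRASE` is a hypothetical trigger phrase, and `contrastive_gain` is a plain reward difference used as a stand-in. The source does not give the formula for the summarization-aware advantage, so this is not that calculation.

```python
import random

SUMMARY_PHRASE = "Let me summarize:"  # hypothetical trigger phrase

def find_natural_points(steps, phrase=SUMMARY_PHRASE):
    """Indices of steps where the model spontaneously began a summary."""
    return [i for i, s in enumerate(steps) if s.startswith(phrase)]

def mask_branch(steps, index):
    """Prefix for a contrastive branch at a natural point: keep everything
    before the summary step. Suppressing the phrase while regenerating is
    left to the sampler, which is outside this sketch."""
    return steps[:index]

def inject_branch(steps, index, phrase=SUMMARY_PHRASE):
    """Prefix for a matched branch at a non-summarization position: keep the
    steps before `index` and force the next step to open with the phrase."""
    return steps[:index] + [phrase]

def build_branch_prefixes(steps, n_inject=1, rng=None):
    """Return (kind, position, prefix) tuples for one initial rollout."""
    rng = rng or random.Random(0)
    natural = find_natural_points(steps)
    branches = [("mask", i, mask_branch(steps, i)) for i in natural]
    candidates = [i for i in range(1, len(steps)) if i not in natural]
    picked = rng.sample(candidates, min(n_inject, len(candidates)))
    for i in sorted(picked):
        branches.append(("inject", i, inject_branch(steps, i)))
    return branches

def contrastive_gain(rewards_with, rewards_without):
    """Mean verifiable reward with the summary minus mean reward without it.
    A stand-in comparison, not the paper's advantage formula."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(rewards_with) - mean(rewards_without)

# Made-up rollout with one spontaneous summary at step index 2.
steps = [
    "Step 1: set up the equation.",
    "Step 2: expand both sides.",
    "Let me summarize: so far x = 2.",
    "Step 3: substitute and finish.",
]
for kind, position, prefix in build_branch_prefixes(steps):
    print(kind, position, len(prefix))
```

Each prefix would then be handed back to the model to continue generating, and the final answers of paired branches scored with the verifiable reward. A positive `contrastive_gain` at a position suggests that summarizing there helped.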

Key Results

The implementation of ReSum demonstrates that models can learn to be more efficient and accurate simultaneously. By internalizing the ability to summarize, the model no longer needs to rely on external modules to manage its context. In experiments, ReSum improved reasoning performance by an average of 4% while reducing the total length of the reasoning chains by 18.6%. This indicates that the model is not only getting the right answers more often but is also doing so more concisely, effectively avoiding the pitfalls of overthinking and context bloat.
