ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning
Large language models (LLMs) often struggle with long-horizon reasoning tasks because they tend to "overthink": they generate excessively long sequences, which leads to memory errors, redundant steps, and an exhausted reasoning budget. Existing methods often rely on external tools or complex systems to manage these long contexts, but such systems can be difficult to implement and may lack transparency. ReSum is a framework that lets LLMs manage their own reasoning trajectories through self-summarization, so the model can compress its history and recover from errors without external intervention.
The Power of Self-Summarization
The researchers conducted pilot studies revealing that LLMs naturally exhibit a "self-summarization" behavior when they reach a state of high uncertainty. By analyzing token entropy, they found that the model’s uncertainty drops significantly after it generates a summary. Furthermore, when the researchers manually injected a "summarization" phrase into the middle of a flawed reasoning chain, the model’s ability to reach the correct final answer improved by up to 30%. This suggests that self-summarization is an intrinsic control mechanism that helps the model consolidate its thoughts and correct its course.
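To make the entropy measurement concrete, here is a minimal sketch of how a drop in token entropy around a summary could be computed. The helper names and the toy probability distributions are hypothetical illustrations, not data or code from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average token entropy over a span of generated tokens."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# Hypothetical next-token distributions over a toy 4-token vocabulary.
# Before the summary the model is unsure; after it, one token dominates.
before_summary = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
after_summary = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]

# A positive value means uncertainty fell after the summary.
entropy_drop = mean_entropy(before_summary) - mean_entropy(after_summary)
```

In practice the distributions would come from the model's softmax output at each generated position; the toy lists above only stand in for that.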
How ReSum Works
ReSum uses a tree-based reinforcement learning approach to teach the model when and how to summarize. It creates "rollout trees" by branching off from a model's initial reasoning path in two ways:
Artifact Points: The system injects a summarization phrase at positions where the model did not summarize on its own, to test whether summarizing at that specific moment improves the outcome.
Natural Points: The system identifies where the model spontaneously summarized and masks that phrase to see if the summary was actually helpful.
By comparing these different branches, the model learns to favor paths where summarization leads to a correct final answer. The framework uses a "summarization-aware advantage" calculation, which provides fine-grained feedback to the model, reinforcing effective summarization timing while discouraging unnecessary or unhelpful summaries.
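The branch comparison can be sketched as follows. This summary does not give the exact form of the summarization-aware advantage, so the formula here (mean reward of branches that summarize at a fork point minus mean reward of branches that do not) is an assumption made for illustration, and the `Branch` type and the reward values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Branch:
    point: int        # token position where the branch forks from the initial path
    kind: str         # "artifact" (phrase injected) or "natural" (phrase masked)
    summarized: bool  # whether this branch summarizes at the fork point
    reward: float     # 1.0 if the final answer is correct, else 0.0

def summarization_advantage(branches):
    """Assumed advantage: per fork point, mean reward with a summary
    minus mean reward without one. Not the paper's exact formula."""
    by_point = {}
    for b in branches:
        groups = by_point.setdefault(b.point, {True: [], False: []})
        groups[b.summarized].append(b.reward)
    return {
        point: mean(groups[True]) - mean(groups[False])
        for point, groups in by_point.items()
        if groups[True] and groups[False]  # need both sides to compare
    }

# Hypothetical rollout tree with one artifact point and one natural point.
branches = [
    Branch(40, "artifact", True, 1.0),
    Branch(40, "artifact", True, 1.0),
    Branch(40, "artifact", False, 0.0),
    Branch(40, "artifact", False, 1.0),
    Branch(120, "natural", True, 0.0),
    Branch(120, "natural", False, 1.0),
]

advantages = summarization_advantage(branches)
```

Under this assumed definition, a positive value at a fork point rewards summarizing there, while a negative value discourages it, which mirrors the reinforce-or-discourage behavior described above.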
Key Results
The implementation of ReSum demonstrates that models can learn to be more efficient and accurate simultaneously. By internalizing the ability to summarize, the model no longer needs to rely on external modules to manage its context. In experiments, ReSum improved reasoning performance by an average of 4% while reducing the total length of the reasoning chains by 18.6%. This indicates that the model is not only getting the right answers more often but is also doing so more concisely, effectively avoiding the pitfalls of overthinking and context bloat.