Large language models (LLMs) are increasingly capable of processing massive amounts of information, such as entire books or long document collections, within a single prompt. However, even when the necessary information is present in the input, these models often struggle to effectively locate and use it to answer questions. The paper "ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning" introduces a new inference-time method called RECONTEXT to bridge this gap. Instead of modifying the model's internal architecture or pruning the input, RECONTEXT acts as a "harness" that dynamically identifies and highlights relevant evidence to help the model generate more accurate, grounded answers.
How RECONTEXT Works
RECONTEXT operates as a training-free, iterative process that runs before the final answer is generated. First, it uses the model’s own internal attention signals to identify which parts of the long context are most relevant to the user's question. These important snippets are then "materialized" into an evidence pool.
The process is recursive: the model reads the original prompt along with the newly created evidence pool, which helps it refine its focus. Over a small number of rounds, the model updates this pool, adding more specific supporting information. Crucially, the original full context remains available to the model at all times; the evidence pool simply serves as a scaffold to emphasize key details and guide the model's reasoning.
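The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: RECONTEXT reads relevance from the model's attention, whereas this sketch swaps in a toy word-overlap score, and the names overlap_score, build_evidence_pool, and assemble_prompt are hypothetical.

```python
import re
from typing import Callable, List


def overlap_score(cue: str, chunk: str) -> float:
    """Toy stand-in for attention-based relevance (an assumption):
    the fraction of the chunk's distinct words that also occur in the cue."""
    cue_words = set(re.findall(r"\w+", cue.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    if not chunk_words:
        return 0.0
    return len(cue_words & chunk_words) / len(chunk_words)


def build_evidence_pool(
    question: str,
    chunks: List[str],
    score: Callable[[str, str], float] = overlap_score,
    rounds: int = 2,
    top_k: int = 1,
) -> List[str]:
    """Grow an evidence pool over a small number of rounds."""
    pool: List[str] = []
    for _ in range(rounds):
        # The cue is the question plus the evidence replayed so far,
        # so later rounds can surface chunks that earlier evidence points to.
        cue = " ".join([question] + pool)
        candidates = [c for c in chunks if c not in pool]
        ranked = sorted(candidates, key=lambda c: score(cue, c), reverse=True)
        pool.extend(ranked[:top_k])
    return pool


def assemble_prompt(question: str, chunks: List[str], pool: List[str]) -> str:
    """Keep the full context and add the evidence pool as a scaffold."""
    context = "\n".join(chunks)
    evidence = "\n".join(f"- {e}" for e in pool)
    return f"Context:\n{context}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"


chunks = [
    "Frank Herbert is the author of Dune",
    "The weather is mild today",
    "Frank Herbert was born in Tacoma",
]
question = "Where was the author of Dune born?"
pool = build_evidence_pool(question, chunks)
prompt = assemble_prompt(question, chunks, pool)
```

In this toy run the first round picks up the chunk naming the author, and replaying it in the cue lets the second round reach the birthplace chunk. Nothing is pruned: assemble_prompt still includes every chunk, with the pool added alongside it.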
Theoretical Foundation
The researchers explain the effectiveness of RECONTEXT through the lens of associative memory. In this framework, the long context is viewed as a "memory store," the question acts as a "retrieval cue," and the model's attention mechanism functions as a way to associate the cue with specific traces of information. By replaying these selected traces—the evidence pool—before the final generation, the method effectively "reactivates" the most important memories. The authors provide a mathematical proof showing that this recursive replay process consistently moves the model's internal representation closer to the correct answer.
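A small numeric toy can make the associative-memory picture concrete. It is not the authors' proof or formalism: the two-dimensional trace vectors, the softmax retrieval rule, the additive replay step, and the sharpness value beta are all assumptions chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_weights(cue: Vector, traces: List[Vector], beta: float = 2.0) -> List[float]:
    """Softmax over cue-trace similarities; beta is an assumed sharpness."""
    logits = [beta * dot(cue, t) for t in traces]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def replay(cue: Vector, traces: List[Vector], beta: float = 2.0) -> Vector:
    """Add the attention-weighted readout of the memory back onto the cue."""
    weights = attention_weights(cue, traces, beta)
    readout = [sum(w * t[i] for w, t in zip(weights, traces)) for i in range(len(cue))]
    return [c + r for c, r in zip(cue, readout)]


# Hypothetical memory store: trace 0 is the one the question needs.
traces = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cue = [0.6, 0.4]

before = attention_weights(cue, traces)
after = attention_weights(replay(cue, traces), traces)
```

In this example the weight on trace 0 is larger after one replay step than before it, which is the "reactivation" effect in miniature. The example only illustrates the intuition for one hand-picked configuration and says nothing about the general case the paper analyzes.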
Performance and Results
The researchers tested RECONTEXT across eight different long-context datasets, each with a 128K-token context length, using three different LLM backbones: Qwen3-4B, Qwen3-8B, and Llama3-8B. The results showed that RECONTEXT consistently outperformed standard prompting and other existing methods. On average, the method improved accuracy by 24.6% compared to the baseline. It achieved the best average rank across all three model families, demonstrating that explicitly replaying evidence is a highly effective way to improve reasoning without requiring additional training or complex external memory systems.
Key Takeaways
RECONTEXT offers a lightweight, flexible solution for long-context tasks. Because it requires neither training nor changes to the model's architecture, it can be applied to existing LLMs as an inference-time wrapper. By treating the model's internal relevance signals as a guide for evidence selection, it helps overcome the common failure mode where models "forget" or ignore relevant information buried deep within long inputs. The results suggest that organizing and emphasizing information already present in the input can be enough to significantly improve a model's long-context reasoning.