Uncertainty-Aware Hybrid Retrieval for Long-Document RAG addresses a fundamental challenge in Retrieval-Augmented Generation (RAG): the trade-off between retrieval granularity and context quality. When retrieval units are too large, they often include irrelevant information that distracts the language model. When they are too small, they may lack the context the model needs to interpret the retrieved evidence. This paper introduces a framework that dynamically balances these needs without requiring additional training or modifications to the underlying language model.
The Granularity Trade-off
The core problem is that neither coarse nor fine-grained retrieval is universally effective. Coarse units preserve document-level context but introduce noise, which can lead to the "lost-in-the-middle" phenomenon where models struggle to use evidence buried in long prompts. Conversely, fine-grained units are compact and precise but are often difficult to retrieve reliably because they may lack the semantic or lexical cues required to match a specific query. The authors propose that an effective system should be able to leverage both, using fine-grained units for precision and broader units for coherence.
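The summary does not say how the multi-granularity units are produced. Purely as an illustrative assumption, the sketch below builds a two-level hierarchy from fixed word windows: coarse parent chunks that preserve context, and fine child chunks for precise matching, where each child records the parent it came from. `Chunk`, `build_two_level_chunks`, and the default window sizes are hypothetical names and values, not the paper's.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    level: str                       # "coarse" (parent) or "fine" (child)
    parent_id: Optional[str] = None  # set only for fine-grained chunks


def build_two_level_chunks(doc_id: str, text: str,
                           coarse_words: int = 200,
                           fine_words: int = 50) -> List[Chunk]:
    """Split a document into coarse parent chunks, then split each parent
    into fine child chunks that remember which parent they came from.
    Fixed word windows are an illustrative assumption."""
    words = text.split()
    chunks: List[Chunk] = []
    for p, start in enumerate(range(0, len(words), coarse_words)):
        parent_words = words[start:start + coarse_words]
        parent_id = f"{doc_id}:p{p}"
        chunks.append(Chunk(parent_id, " ".join(parent_words), "coarse"))
        # Children tile the parent without overlap, so joining them restores it.
        for c, fstart in enumerate(range(0, len(parent_words), fine_words)):
            child_words = parent_words[fstart:fstart + fine_words]
            chunks.append(Chunk(f"{parent_id}:c{c}", " ".join(child_words),
                                "fine", parent_id))
    return chunks
```

A production system would more likely split on sentence or paragraph boundaries; fixed windows simply keep the sketch short while still recording the child-to-parent links that later steps can rely on.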
How the Framework Works
The proposed solution, UMG-RAG, treats chunk granularity as a query-specific reliability estimation problem. It uses existing dense and sparse retrievers as "experts" to retrieve candidates across multiple granularities. Instead of using fixed weights to combine these results, the system calculates the "sharpness" of each expert's score distribution.
A peaked distribution, where a few candidates score far above the rest, indicates that the expert is confident in its results; a flat distribution signals high uncertainty. The framework quantifies this with the entropy of each distribution: low entropy corresponds to a sharp, confident expert, so the most reliable retrieval sources receive the highest importance for each specific query.
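The entropy-based weighting described above can be sketched as follows. The summary does not give the paper's exact formula, so several choices here are assumptions: each expert's scores are turned into a distribution with a softmax (controlled by a hypothetical `temperature` parameter), sharpness is measured as one minus the entropy divided by its maximum value (the log of the candidate count), and the resulting weights are renormalized to sum to 1. The sketch also assumes that all experts score a shared pool of candidate ids and that their raw scores have already been put on comparable scales, since BM25 and cosine scores, for example, differ widely in range. All function names are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw retrieval scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]: 0 = one-hot (sharp), 1 = uniform (flat)."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def expert_distributions(expert_scores, temperature=1.0):
    """Map each expert's {candidate_id: score} to {candidate_id: probability}.
    Assumes scores are already on comparable scales across experts."""
    return {
        name: dict(zip(scores, softmax(list(scores.values()), temperature)))
        for name, scores in expert_scores.items()
    }


def reliability_weights(distributions):
    """Give low-entropy (confident) experts larger weights that sum to 1.
    '1 - normalized entropy' is an illustrative choice, not the paper's formula."""
    raw = {
        name: max(0.0, 1.0 - normalized_entropy(list(dist.values())))
        for name, dist in distributions.items()
    }
    total = sum(raw.values())
    if total <= 1e-12:  # every expert is maximally uncertain: use equal weights
        return {name: 1.0 / len(raw) for name in raw}
    return {name: w / total for name, w in raw.items()}


def fuse(expert_scores, temperature=1.0):
    """Entropy-weighted fusion over a shared candidate pool, best candidate first."""
    dists = expert_distributions(expert_scores, temperature)
    weights = reliability_weights(dists)
    fused = {}
    for name, dist in dists.items():
        for cand, p in dist.items():
            fused[cand] = fused.get(cand, 0.0) + weights[name] * p
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

For instance, an expert scoring `{"a": 10, "b": 0, "c": 0, "d": 0}` receives nearly all of the weight when paired with one scoring `{"a": 1, "b": 1, "c": 1, "d": 1.1}`, and when every expert is perfectly flat the sketch falls back to equal weights. Because the weights are recomputed per query, the same expert can dominate one query and be nearly ignored on the next.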
Parent Promotion for Coherence
To further improve performance, the authors introduce UMGP-RAG, a variant that incorporates "parent promotion." In this approach, the system uses fine-grained chunks to locate relevant information precisely, then passes the broader "parent" chunk that contains them to the generator so that the context remains coherent. A deduplication step removes redundant or overlapping chunks, keeping the final context provided to the generator both compact and informative.
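Parent promotion with deduplication can be sketched as below. The function takes a ranked list of fine-grained child ids (for example, the output of a fusion step) together with a child-to-parent map. The summary does not specify how the paper decides that chunks are "redundant or overlapping," so the sketch assumes two rules. First, a parent is emitted at most once, even if several of its children rank highly. Second, a parent whose word-level Jaccard similarity with an already-selected parent reaches a hypothetical `overlap_threshold` is dropped. `promote_and_dedup`, `max_parents`, and the input dictionaries are illustrative names.

```python
def _jaccard(a, b):
    """Word-set Jaccard similarity, used here as a simple overlap measure."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def promote_and_dedup(ranked_children, parent_of, parent_text,
                      max_parents=3, overlap_threshold=0.8):
    """Map ranked child chunks to their parents, in order of each parent's
    best-ranked child, skipping repeated and near-duplicate parents.

    ranked_children: child chunk ids, best first.
    parent_of: dict child id -> parent id (hypothetical index structure).
    parent_text: dict parent id -> parent chunk text.
    Returns a list of (parent_id, parent_text) pairs for the generator.
    """
    selected_ids, selected_texts = [], []
    for child in ranked_children:
        pid = parent_of[child]
        if pid in selected_ids:
            continue  # several children share this parent: keep it once
        text = parent_text[pid]
        if any(_jaccard(text, t) >= overlap_threshold for t in selected_texts):
            continue  # near-duplicate of a parent already in the context
        selected_ids.append(pid)
        selected_texts.append(text)
        if len(selected_ids) == max_parents:
            break
    return list(zip(selected_ids, selected_texts))
```

Ordering parents by their best-ranked child keeps the precision of fine-grained retrieval in the ranking, while the generator still receives the broader surrounding context.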
Performance and Flexibility
Experiments on question-answering benchmarks demonstrate that this uncertainty-aware fusion and parent promotion strategy improves generation quality across various retrievers and language models. Because the framework is training-free and plug-and-play, it can be integrated into existing RAG pipelines without the need for complex model retraining or the construction of additional data structures like knowledge graphs. This makes it a lightweight, adaptable solution for improving long-document RAG performance.