Alibaba’s Tongyi Lab has introduced VimRAG, a new multimodal Retrieval-Augmented Generation (RAG) framework designed to overcome the limitations of standard agentic models when processing massive visual contexts. While traditional RAG techniques excel with text, they often struggle with the high token density and semantic sparsity of images and videos. VimRAG addresses these challenges by replacing linear interaction histories with a dynamic memory graph, allowing agents to navigate complex visual data without succumbing to the state blindness or repetitive search patterns common in current systems.
A Multimodal Memory Graph Architecture
VimRAG moves away from the flat, growing interaction histories used by standard ReAct agents. Instead, it models the reasoning process as a dynamic directed acyclic graph. Each node in this graph encodes a sub-query, a textual summary, and a multimodal episodic memory bank containing visual tokens. At each step the agent samples one of three action types (exploratory retrieval, multimodal perception, or terminal projection), which lets the framework distill raw observations into concise summaries while maintaining a structured record of the agent's reasoning path. For video data, the framework leverages the temporal grounding capabilities of Qwen3-VL to extract keyframes aligned with specific timestamps.
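To make the node structure concrete, the following is a minimal sketch of such a memory graph, not VimRAG's actual implementation. The class names (MemoryNode, MemoryGraph, Action), the fields beyond the three the article lists, and the rule that a new node may only point at existing nodes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # The three action types named in the article.
    EXPLORATORY_RETRIEVAL = "exploratory_retrieval"
    MULTIMODAL_PERCEPTION = "multimodal_perception"
    TERMINAL_PROJECTION = "terminal_projection"


@dataclass
class MemoryNode:
    node_id: int
    sub_query: str
    action: Action  # assumption: each node remembers the action that created it
    summary: str = ""
    # Placeholder for the multimodal episodic memory bank (visual tokens).
    visual_tokens: list = field(default_factory=list)
    parents: tuple = ()


class MemoryGraph:
    """Hypothetical container: nodes only link to earlier nodes, so no cycles."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, sub_query, action, parents=()):
        for parent in parents:
            if parent not in self.nodes:
                raise KeyError(f"unknown parent node: {parent}")
        node_id = len(self.nodes)
        self.nodes[node_id] = MemoryNode(
            node_id, sub_query, action, parents=tuple(parents)
        )
        return node_id

    def ancestors(self, node_id):
        # Collect every node reachable through parent links.
        seen = set()
        stack = list(self.nodes[node_id].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.nodes[current].parents)
        # Ids grow with insertion order, so sorting gives a topological order.
        return sorted(seen)
```

Because a node can only reference nodes that already exist, the graph stays acyclic by construction, and the ancestors of the final node give the reasoning path that led to the answer.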
Optimized Visual Memory and Policy Training
To manage the heavy computational load of visual data, VimRAG utilizes Graph-Modulated Visual Memory Encoding. This component treats token assignment as a resource allocation problem, calculating the intrinsic energy of visual items based on semantic priority, structural relevance within the graph, and temporal decay. This ensures that high-resolution tokens are dynamically allocated to the most critical evidence. Furthermore, the framework employs Graph-Guided Policy Optimization (GGPO) to refine training. By applying gradient masks to the graph, VimRAG prevents the model from incorrectly reinforcing redundant retrieval steps in successful trajectories or penalizing valuable retrieval steps in failed ones.
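The two ideas in this section can be sketched in a few lines. The article names the three factors behind the energy score but not how they are combined, so the formula below (a sum scaled by exponential decay), the proportional budget split, and the way steps are flagged as useful are all assumptions; the function names are hypothetical and this is not the published method.

```python
import math


def energy(semantic, structural, age, decay=0.5):
    # Assumed form: semantic priority plus structural relevance,
    # discounted exponentially by the item's age in steps.
    return (semantic + structural) * math.exp(-decay * age)


def allocate_tokens(energies, budget):
    # Split an integer token budget in proportion to non-negative energies.
    total = sum(energies)
    if total <= 0:
        return [0] * len(energies)
    shares = [budget * e / total for e in energies]
    alloc = [int(s) for s in shares]
    # Hand leftover tokens to the items with the largest fractional share.
    leftover = budget - sum(alloc)
    order = sorted(
        range(len(shares)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def gradient_mask(step_is_useful, success):
    # Assumed masking rule in the spirit of GGPO: a step keeps its gradient
    # (1.0) only when the trajectory outcome is a fair signal for it.
    if success:
        # Do not reward redundant steps in a successful trajectory.
        return [1.0 if useful else 0.0 for useful in step_is_useful]
    # Do not penalize valuable steps in a failed trajectory.
    return [0.0 if useful else 1.0 for useful in step_is_useful]
```

In this sketch, an item with three times the energy of another receives three times the tokens, and the mask would be multiplied into the per-step loss before backpropagation. How VimRAG decides which steps count as useful from the graph is not described in the article.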
Performance and Benchmarking
VimRAG was evaluated across nine benchmarks, including HotpotQA, SQuAD, WebQA, and the newly constructed XVBench, which focuses on cross-video understanding. Testing used a unified corpus of approximately 200k interleaved multimodal items. With the Qwen3-VL-8B-Instruct model, VimRAG achieved an overall score of 50.1, a 6.5-point gain over the prior best baseline of 43.6. The 4B backbone showed a similar improvement, scoring 45.2 against the baseline's 40.6, a gain of 4.6 points. Despite the addition of a dedicated perception step, VimRAG reduced total trajectory length by preventing the repetitive re-reading of data that often plagues linear RAG methods.
