Paper Abstract

Multimodal large language models must continually adapt to evolving tasks and domains, yet standard continual learning metrics mainly measure whether old answers remain correct, leaving the stability of multimodal grounding largely unexamined. We study this overlooked failure mode and ask whether a continually adapted MLLM can preserve not only what it answers, but also how it uses visual, textual, OCR, chart, and document evidence. We identify "hidden evidence-use forgetting," where answer accuracy is retained while the model silently shifts toward different or less grounded evidence channels, and propose RCL, a replay-free reliance-constrained continual learning framework. RCL freezes the previous checkpoint as a behavioral reference, estimates teacher and student evidence-reliance profiles through counterfactual channel interventions, and jointly optimizes task learning, prediction preservation, and reliance preservation without adding inference-time cost. Across CoIN, COAST, MCITlib, and an evidence-sensitive multimodal stream, RCL consistently improves final performance and reduces forgetting over replay-free, PEFT, routing, and memory-assisted baselines, while substantially lowering modality reliance drift, dominant evidence flips, and hidden forgetting rates. These results suggest that robust continual multimodal learning requires preserving the evidence path behind correct answers, not merely the answers themselves.

Hidden Forgetting in Continual Multimodal Learning: When Accuracy Survives but Grounding Fails
Multimodal Large Language Models (MLLMs) are increasingly used for complex tasks like document understanding and visual reasoning. Because these models must keep adapting to new tasks and domains, they are often trained through "continual learning." However, standard continual learning metrics mainly measure whether old answers remain correct. This paper identifies a blind spot in that approach: a model can still give the right answer while changing how it arrives at that answer. The authors call this "hidden evidence-use forgetting," where answer accuracy is retained while the model silently shifts toward different or less grounded evidence channels.

The Problem of Hidden Forgetting

Standard evaluation metrics check only whether a model's final output remains accurate after it learns new tasks. The authors argue that this creates a false sense of stability. An MLLM can retain the correct answer but stop "looking" at the relevant parts of an image or document, relying instead on superficial patterns or language biases. This is dangerous for safety-sensitive applications, because the model is no longer "grounded" in the evidence it was given. According to the summary, even models trained to preserve their previous outputs can still exhibit this hidden drift in how they process information.
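To make the failure mode concrete, here is a minimal Python sketch of how hidden forgetting could be counted once per-example reliance profiles are available. It is an illustration under stated assumptions, not the paper's metric definitions: the `Record` type, the use of total variation distance for drift, the `drift_threshold` value, and the example numbers are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    """One evaluation example, measured before and after continual adaptation."""
    correct_before: bool
    correct_after: bool
    reliance_before: Dict[str, float]  # channel -> share of reliance (sums to 1)
    reliance_after: Dict[str, float]


def reliance_drift(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two reliance profiles (assumed drift measure)."""
    channels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in channels)


def dominant_channel(profile: Dict[str, float]) -> str:
    """Channel the model leans on most."""
    return max(profile, key=profile.get)


def hidden_forgetting_rate(records: List[Record], drift_threshold: float = 0.3) -> float:
    """Fraction of still-correct examples whose evidence use changed.

    An example counts as hidden forgetting when the answer stays correct but
    the dominant channel flips or the drift exceeds the (hypothetical) threshold.
    """
    retained = [r for r in records if r.correct_before and r.correct_after]
    if not retained:
        return 0.0
    hidden = [
        r for r in retained
        if dominant_channel(r.reliance_before) != dominant_channel(r.reliance_after)
        or reliance_drift(r.reliance_before, r.reliance_after) > drift_threshold
    ]
    return len(hidden) / len(retained)


# Made-up example data, for illustration only.
records = [
    # Still correct, evidence use stable.
    Record(True, True, {"image": 0.7, "ocr": 0.3}, {"image": 0.65, "ocr": 0.35}),
    # Still correct, but the dominant channel flipped from image to OCR.
    Record(True, True, {"image": 0.7, "ocr": 0.3}, {"image": 0.2, "ocr": 0.8}),
    # Visible forgetting: the answer itself became wrong, so it is excluded here.
    Record(True, False, {"image": 0.7, "ocr": 0.3}, {"image": 0.5, "ocr": 0.5}),
]
rate = hidden_forgetting_rate(records)
```

In this toy data, a plain accuracy check would flag only the third example, while the second one passes unnoticed even though the model's dominant evidence channel has changed.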

How RCL Works

To address this, the authors introduce a framework called Reliance-Constrained Continual Learning (RCL). Instead of only pushing the model to keep the same answers, RCL pushes it to keep the same "evidence path." It freezes the previous checkpoint and treats it as a teacher, or behavioral reference. During training, it uses a technique called counterfactual channel intervention: measuring how the model's prediction changes when a specific evidence channel (such as the image, a chart, or OCR text) is intervened on, for example by hiding it. From these measurements RCL estimates an evidence-reliance profile for both the "teacher" and the "student," and it jointly optimizes three objectives: learning the new task, preserving the teacher's predictions, and preserving the teacher's reliance profile. The method is replay-free, so it does not need to store old data, and because the constraint applies only during training it adds no inference-time cost.
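The idea of comparing teacher and student reliance can be sketched in a few lines of Python. This is a toy illustration under assumptions, not the authors' implementation: the masking-by-removal intervention, the normalized probability-drop profile, the KL penalty, the `lam_pred` and `lam_rel` weights, and the toy scorers are all hypothetical choices made here for clarity.

```python
import math
from typing import Callable, Dict

Evidence = Dict[str, str]             # channel name -> evidence content
Scorer = Callable[[Evidence], float]  # probability assigned to the correct answer


def reliance_profile(score: Scorer, evidence: Evidence) -> Dict[str, float]:
    """Normalized drop in correct-answer probability when each channel is removed."""
    base = score(evidence)
    drops = {}
    for channel in evidence:
        masked = {k: v for k, v in evidence.items() if k != channel}
        drops[channel] = max(base - score(masked), 0.0)
    total = sum(drops.values())
    if total == 0.0:  # no channel matters: fall back to a uniform profile
        return {c: 1.0 / len(drops) for c in drops}
    return {c: d / total for c, d in drops.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-8) -> float:
    """KL(p || q) over channels, with a small epsilon for numerical safety."""
    return sum(p[c] * math.log((p[c] + eps) / (q[c] + eps)) for c in p)


def rcl_style_loss(task_loss: float, pred_kl: float,
                   teacher_profile: Dict[str, float],
                   student_profile: Dict[str, float],
                   lam_pred: float = 1.0, lam_rel: float = 1.0) -> float:
    """Task loss + prediction-preservation term + reliance-preservation term."""
    reliance_term = kl_divergence(teacher_profile, student_profile)
    return task_loss + lam_pred * pred_kl + lam_rel * reliance_term


def make_toy_scorer(weights: Dict[str, float], floor: float = 0.2) -> Scorer:
    """Hypothetical stand-in for a model: each present channel adds confidence."""
    def score(evidence: Evidence) -> float:
        return floor + sum(w for c, w in weights.items() if c in evidence)
    return score


# Teacher leans on the image; the drifted student leans on OCR text instead.
teacher = make_toy_scorer({"image": 0.6, "ocr": 0.1})
student = make_toy_scorer({"image": 0.1, "ocr": 0.6})
evidence = {"image": "<pixels>", "ocr": "Total: 42"}

teacher_profile = reliance_profile(teacher, evidence)
student_profile = reliance_profile(student, evidence)
loss = rcl_style_loss(task_loss=0.5, pred_kl=0.02,
                      teacher_profile=teacher_profile,
                      student_profile=student_profile)
```

Both toy scorers assign the same probability to the correct answer when all evidence is present, so an output-only check sees no difference between them. The reliance term is what separates them, which is the intuition behind constraining reliance in addition to predictions. In a real training loop the profiles would need to be computed from differentiable model outputs so the penalty can be backpropagated; the floats here only show the arithmetic.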

Key Results

The researchers tested RCL on CoIN, COAST, MCITlib, and an evidence-sensitive multimodal stream. According to the abstract, RCL consistently improved final performance and reduced forgetting compared with replay-free, parameter-efficient fine-tuning (PEFT), routing, and memory-assisted baselines. By explicitly constraining the model's reliance profile, RCL also substantially lowered modality reliance drift, dominant evidence flips, and hidden forgetting rates. The study suggests that for MLLMs to be robust under continual adaptation, they must be held accountable not just for what they answer, but for the evidence they use to reach that answer.

Why This Matters

This research highlights that accuracy is not a complete measure of intelligence or reliability in multimodal systems. If a model is correct for the wrong reasons, it is prone to failure when faced with new, unseen scenarios. By shifting the focus from simple answer-matching to evidence-based stability, the authors provide a new way to build models that are not only smarter but more transparent and trustworthy in their reasoning processes.
