InduceKV: Fixed-Footprint Continual Adaptation of Multimodal LLMs via Inducing KV Memories
Multimodal large language models (MLLMs) are increasingly used for diverse tasks, but teaching them new skills over time is difficult. Typically, this requires updating the model's internal parameters, which can lead to "catastrophic forgetting" of previous knowledge or require an ever-growing amount of storage. This paper introduces InduceKV, a method that allows MLLMs to adapt to new tasks without changing their core parameters. Instead, it stores task-specific information in a compact, external memory that the model can access during its normal operation.
How InduceKV Works
Rather than retraining the model, InduceKV treats continual learning as a memory management problem. When a new task arrives, the method extracts "key-value" (KV) payloads, which are compressed representations of the task's information, and stores them as memory entries.
To stay within a fixed memory budget, the method uses a "bilevel selection" process that balances three competing goals: performing well on the current task, retaining accuracy on historical tasks (measured on a small set of "anchor" data), and keeping the stored memories diverse. A regularizer penalizes similar or repetitive entries, so the budget is not spent on redundant information.
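The trade-off among the three goals can be illustrated with a small greedy selector. This is only a sketch under stated assumptions, not the paper's bilevel procedure: the field names (`key`, `task_gain`, `anchor_gain`), the weights `lam_anchor` and `lam_div`, and the use of cosine similarity as the redundancy measure are all hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_entries(candidates, budget, lam_anchor=1.0, lam_div=1.0):
    """Greedily pick up to `budget` candidate indices.

    Each candidate is a dict with hypothetical fields:
      "key":         list[float], vector used to measure redundancy
      "task_gain":   float, estimated benefit on the current task
      "anchor_gain": float, estimated benefit on historical anchor data
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < budget:
        def score(i):
            c = candidates[i]
            # Redundancy: similarity to the closest entry already chosen.
            redundancy = max(
                (cosine(c["key"], candidates[j]["key"]) for j in selected),
                default=0.0,
            )
            return (c["task_gain"]
                    + lam_anchor * c["anchor_gain"]
                    - lam_div * redundancy)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy candidates: 0 and 1 are near-duplicates, 2 is distinct but weaker.
candidates = [
    {"key": [1.0, 0.0], "task_gain": 1.0, "anchor_gain": 0.5},
    {"key": [1.0, 0.01], "task_gain": 0.95, "anchor_gain": 0.5},
    {"key": [0.0, 1.0], "task_gain": 0.6, "anchor_gain": 0.4},
]
chosen = select_entries(candidates, budget=2)
```

With the diversity penalty switched on, the second slot goes to the distinct candidate rather than the near-duplicate; setting `lam_div=0.0` would instead keep both near-duplicates.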
Integration with the Model
InduceKV does not require the model to learn new behaviors through gradient updates. Instead, it acts as a retrieval-based system. When the model processes a new input, it uses a lightweight calibration interface to retrieve the most relevant memory entries. These entries are then injected directly into the model's self-attention mechanism, the same pathway the model uses to read its own internal KV cache. This allows the model to "read" the relevant task-specific knowledge during generation, adapting its output without modifying its underlying weights.
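A toy, single-head version of this retrieve-then-inject flow might look like the following. It is a sketch under assumptions rather than the authors' implementation: the entry layout (`retrieval_key`, `keys`, `values`), dot-product retrieval, and prepending memory keys and values to the model's own are illustrative choices, and a real model would do this per layer and per head on tensors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_entries(query_key, memory, k):
    """Return the k entries whose retrieval key best matches query_key."""
    ranked = sorted(memory,
                    key=lambda e: dot(query_key, e["retrieval_key"]),
                    reverse=True)
    return ranked[:k]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * dot(q, k) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(dim)]

def attend_with_memory(q, own_keys, own_values, entries):
    """Attend over retrieved memory KV pairs plus the model's own KV pairs."""
    mem_keys = [k for e in entries for k in e["keys"]]
    mem_values = [v for e in entries for v in e["values"]]
    return attend(q, mem_keys + own_keys, mem_values + own_values)

# Hypothetical two-entry memory and a one-token "own" cache.
memory = [
    {"name": "a", "retrieval_key": [1.0, 0.0],
     "keys": [[1.0, 0.0]], "values": [[10.0, 0.0]]},
    {"name": "b", "retrieval_key": [0.0, 1.0],
     "keys": [[0.0, 1.0]], "values": [[0.0, 10.0]]},
]
own_keys, own_values = [[0.0, 1.0]], [[1.0, 1.0]]
q = [1.0, 0.0]

hits = top_k_entries([0.9, 0.1], memory, k=1)
plain = attend(q, own_keys, own_values)
adapted = attend_with_memory(q, own_keys, own_values, hits)
```

Because the memory enters only as extra keys and values, no weights change: with no retrieved entries, `attend_with_memory` reduces to ordinary attention over the model's own cache.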
Performance and Results
The researchers tested InduceKV across several challenging scenarios, including task-incremental instruction tuning, continual visual question answering (VQA), and domain-incremental adaptation. In these tests, InduceKV consistently outperformed existing methods like Parameter-Efficient Fine-Tuning (PEFT), Mixture-of-Experts (MoE), and standard replay-based approaches. The authors also conducted diagnostic tests to confirm that these performance gains were due to the effectiveness of the memory-induction strategy rather than simply having a larger model or using more compute power.
Key Considerations
The primary advantage of InduceKV is its "fixed-footprint" nature: memory usage remains constant regardless of how many tasks the model learns. It does introduce a small amount of overhead (an extra pass to compute retrieval keys and slightly larger attention matrices during the prefill stage), but it avoids the complexities of parameter-space updates. This makes it a scalable option for deploying MLLMs in environments where compute resources and storage are strictly limited.
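The prefill overhead can be estimated with simple arithmetic. The sketch below assumes the injected memory occupies a fixed number of key/value slots placed alongside the input tokens, and it ignores causal masking. The function names and the example sizes (1024 input tokens, 64 memory slots) are hypothetical and are not figures from the paper.

```python
def prefill_score_shape(num_input_tokens, num_memory_slots):
    """Shape of one head's attention score matrix during prefill.

    Rows are the input tokens (queries); columns are the injected
    memory slots plus the input tokens themselves (keys).
    """
    return (num_input_tokens, num_memory_slots + num_input_tokens)

def relative_overhead(num_input_tokens, num_memory_slots):
    """Extra score entries caused by memory, relative to no memory.

    Causal masking is ignored for simplicity.
    """
    base = num_input_tokens * num_input_tokens
    rows, cols = prefill_score_shape(num_input_tokens, num_memory_slots)
    return (rows * cols - base) / base

shape = prefill_score_shape(1024, 64)
overhead = relative_overhead(1024, 64)
```

Under these assumptions the relative overhead is the ratio of memory slots to input tokens, so it depends on the memory budget and not on the number of tasks learned.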