A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets
Linear-attention and state-space models have become popular because they can process long sequences of text with constant memory usage. However, they achieve this efficiency by compressing all information into a single, fixed-size recurrent state. This process is inherently lossy; as the model encounters new information, it overwrites older facts, leading to poor performance on tasks that require precise, one-shot recall of specific details. This paper introduces HOLA (Hippocampal Linear Attention), a system that adds a "hippocampal" exact memory to the "neocortex" of the recurrent state, allowing the model to store and retrieve important information without relying solely on lossy compression.
The Semiparametric Approach
HOLA draws on Complementary Learning Systems (CLS) theory, which holds that biological brains use two distinct memory systems: one for slow, generalizable learning and another for fast, exact, one-shot memory. HOLA treats the recurrent state as a parametric estimator that models the general structure of the data, while a bounded, exact key-value (KV) cache acts as a non-parametric correction. The model keeps the efficiency of linear attention while gaining sharp, accurate retrieval of specific tokens that the recurrent state might otherwise discard.
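The contrast between the two memories can be illustrated with a toy example. The sketch below is not HOLA's implementation: it assumes a plain delta-rule state with write strength 1, a two-dimensional key space, and a softmax cache read with a hypothetical `sharpness` parameter. The function names (`delta_write`, `cache_read`) are made up for illustration. Writing three key-value pairs into a 2x2 state forces the third write to disturb the first, while the exact cache still returns the first value almost unchanged.

```python
import math

def matvec(S, x):
    """Read from the recurrent state: S @ x."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def delta_write(S, k, v, beta=1.0):
    """Delta-rule write: S <- S + beta * (v - S k) k^T. Returns the residual."""
    pred = matvec(S, k)
    resid = [vi - pi for vi, pi in zip(v, pred)]
    for i, r in enumerate(resid):
        for j, kj in enumerate(k):
            S[i][j] += beta * r * kj
    return resid

def cache_read(cache, q, sharpness=50.0):
    """Softmax read over exact (key, value) pairs; `sharpness` is an assumed knob."""
    scores = [sharpness * sum(a * b for a, b in zip(k, q)) for k, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(cache[0][1])
    return [sum(w * v[d] for w, (_, v) in zip(weights, cache)) / total
            for d in range(dim)]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Three unit-norm keys in a 2-dimensional key space: they cannot all be orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]),
         ([0.0, 1.0], [0.0, 1.0]),
         ([0.6, 0.8], [5.0, 5.0])]

S = [[0.0, 0.0], [0.0, 0.0]]   # fixed-size recurrent state
cache = []                     # exact memory (tiny here, so nothing is evicted)
for k, v in pairs:
    delta_write(S, k, v)
    cache.append((k, v))

q, target = pairs[0]           # query for the first fact written
state_error = max_abs_error(matvec(S, q), target)
cache_error = max_abs_error(cache_read(cache, q), target)
print(f"state error: {state_error:.3f}, cache error: {cache_error:.2e}")
```

In this toy setting the state read for the first key is off by more than 1 in each coordinate, while the cache read is accurate to many decimal places. A real system would keep the cache bounded and decide which pairs to admit.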
Selecting What to Remember
A major challenge in creating a bounded memory is deciding which information to keep. While many existing models simply store the most recent tokens, this approach fails when important information is far back in the text. HOLA uses an "intrinsic surprise" signal already generated by the model's delta-rule update. By calculating the product of the write strength and the prediction residual, the model identifies tokens that it struggled to predict and that caused a significant change to the recurrent state. These "surprising" tokens are prioritized for the exact cache, ensuring that the model retains the most critical information regardless of how long ago it appeared.
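A minimal sketch of surprise-based selection follows, under stated assumptions: the score is taken to be the write strength times the Euclidean norm of the delta-rule residual, and the bounded cache evicts its lowest-scoring entry using a min-heap. The summary does not specify the exact score, the eviction rule, or any of the names used here (`delta_step`, `SurpriseCache`), so all of these are illustrative choices.

```python
import heapq
import math

def delta_step(S, k, v, beta):
    """Apply one delta-rule write and return the assumed surprise score,
    beta * ||v - S k||, computed from the residual before the update."""
    pred = [sum(s * x for s, x in zip(row, k)) for row in S]
    resid = [vi - pi for vi, pi in zip(v, pred)]
    for i, r in enumerate(resid):
        for j, kj in enumerate(k):
            S[i][j] += beta * r * kj
    return beta * math.sqrt(sum(r * r for r in resid))

class SurpriseCache:
    """Bounded exact memory that keeps the highest-surprise tokens seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, position, key, value)

    def offer(self, score, pos, k, v):
        item = (score, pos, k, v)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict the least surprising entry

    def positions(self):
        return sorted(pos for _, pos, _, _ in self._heap)

# A rare fact at position 1, followed by repeats of an already-seen token.
stream = [
    ([1.0, 0.0], [1.0, 1.0], 0.5),
    ([0.0, 1.0], [3.0, -2.0], 1.0),   # the "needle"
] + [([1.0, 0.0], [1.0, 1.0], 0.5)] * 4

S = [[0.0, 0.0], [0.0, 0.0]]
cache = SurpriseCache(capacity=2)
for pos, (k, v, beta) in enumerate(stream):
    cache.offer(delta_step(S, k, v, beta), pos, k, v)

kept = cache.positions()
recency = list(range(len(stream)))[-2:]   # what a most-recent-tokens cache would hold
print("surprise cache:", kept, "recency cache:", recency)
```

Here the repeated token becomes less surprising with every write, so the early needle stays in the surprise cache even though a recency window of the same size would have dropped it.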
Sharp Retrieval via Decoupled Normalization
Storing exact copies of tokens is not enough if the model reads them the same way it reads its compressed state, which often results in "soft" averaging that blurs the retrieved information. HOLA solves this by using a decoupled RMSNorm-gamma process specifically for the cache read path. This technique allows the model to perform sharp, near-argmax retrieval from the cache while keeping the recurrent state's own computation stable. This separation is key to the model's performance, as it prevents the cache from degenerating into another lossy summary.
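One way to read "decoupled RMSNorm-gamma" is that queries and keys on the cache path are RMS-normalized and then scaled by a gain `gamma` that belongs to the cache read alone, so it can grow large without affecting the recurrent path. The sketch below illustrates that reading only. The function names, the scalar `gamma`, and the example values are assumptions, and the actual parameterization in HOLA may differ.

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide a vector by its root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [xi / rms for xi in x]

def cache_weights(q, keys, gamma):
    """Softmax weights over cached keys, with a gain `gamma` applied only
    on the cache read path (assumed to be separate from the recurrent path)."""
    qn = [gamma * x for x in rms_norm(q)]
    scores = [sum(a * b for a, b in zip(qn, rms_norm(k))) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0, 0.0, 0.0],   # the entry the query is really about
        [0.0, 1.0, 0.0, 0.0],
        [0.7, 0.7, 0.0, 0.0]]   # a partially similar distractor
q = [1.0, 0.1, 0.0, 0.0]

soft = cache_weights(q, keys, gamma=1.0)    # small gain: blurred read
sharp = cache_weights(q, keys, gamma=8.0)   # large gain: near-argmax read
print("gamma=1:", [round(w, 3) for w in soft])
print("gamma=8:", [round(w, 3) for w in sharp])
```

With the small gain, the distractor receives a substantial share of the weight and the read is an average of several values. With the larger gain, nearly all of the weight lands on the best-matching key, which is the behavior the section describes.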
Performance and Robustness
At the 340M-parameter scale, HOLA outperforms standard linear-attention models. It lowers Wikitext perplexity from 27.32 to 22.92, surpassing even a full-attention Transformer++ model. HOLA is also robust to context length: while other models struggle with "needle-in-a-haystack" retrieval as contexts grow, HOLA remains effective out to 32k tokens, 16 times its training length (which puts the training length at roughly 2k tokens). Despite these gains, the system remains efficient, with the exact memory adding only a small overhead to the model's total memory usage.