A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets

Key Takeaways

  • Linear-attention and state-space models process long sequences with constant memory, but their fixed-size recurrent state is lossy: new information overwrites older facts and exact recall degrades.
  • Inspired by Complementary Learning Systems, we give linear attention a hippocampal complement.
  • At 340M parameters trained on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92 (-16.1%), below a full-attention Transformer++ (26.88), and improves LAMBADA perplexity from 30.95 to 30.26.
  • HOLA also achieves the best linear in-context retrieval and remains much more robust than GDN or a matched HOLA+recency cache on RULER needle-in-a-haystack recall out to 32k tokens (16x its training length).
Paper Abstract

Linear-attention and state-space language models compress the prefix into a fixed-size recurrent state, yielding O(1) memory at the cost of a lossy exact memory: when many key-value associations compete, earlier facts are overwritten and needle recall degrades. Inspired by Complementary Learning Systems, we give linear attention a hippocampal complement. HOLA (Hippocampal Linear Attention) keeps the usual delta-rule state as a compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory: the state models linearly compressible structure, while the cache stores associations that should not be forced through that state. The cache writes without a learned eviction module, keeping tokens with large beta * ||e||, the prediction residual actually committed to the state; a decoupled RMSNorm-gamma cache read then turns these exact KV pairs into sharp retrieval rather than soft averaging. At 340M parameters trained on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92 (-16.1%), below a full-attention Transformer++ (26.88), and improves LAMBADA perplexity from 30.95 to 30.26. It also achieves the best linear in-context retrieval and remains much more robust than GDN or a matched HOLA+recency cache on RULER needle-in-a-haystack recall out to 32k tokens (16x its training length).

A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets
Linear-attention and state-space models have become popular because they can process long sequences of text with constant memory. They achieve this efficiency by compressing the entire prefix into a single, fixed-size recurrent state. This compression is inherently lossy: when many key-value associations compete for the same state, new writes overwrite older facts, and the model performs poorly on tasks that require precise, one-shot recall of specific details. This paper introduces HOLA (Hippocampal Linear Attention), which pairs the recurrent state, playing the role of the "neocortex", with a "hippocampal" exact memory, so the model can store and retrieve important information without relying solely on lossy compression.
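
The overwriting effect can be reproduced in a few lines. The sketch below is a toy illustration, not the paper's model: it assumes a single 32x32 state matrix, random unit-norm keys and values, and a delta-rule write with write strength 1. The names `delta_write` and `first_pair_recall_error` are made up for this example.

```python
import math
import random


def unit_vector(rng, d):
    """Random direction on the unit sphere in d dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def read(S, k):
    """Read from the state: returns S @ k for a state stored as a list of rows."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def delta_write(S, k, v, beta=1.0):
    """Delta-rule write, in place: S <- S + beta * (v - S k) k^T."""
    pred = read(S, k)
    err = [vi - pi for vi, pi in zip(v, pred)]
    for row, e in zip(S, err):
        for j, kj in enumerate(k):
            row[j] += beta * e * kj
    return err


def first_pair_recall_error(n_pairs, d=32, seed=0):
    """Write n_pairs associations, then measure how far the read for the
    FIRST key has drifted from the first value (values have norm 1)."""
    rng = random.Random(seed)
    S = [[0.0] * d for _ in range(d)]
    pairs = [(unit_vector(rng, d), unit_vector(rng, d)) for _ in range(n_pairs)]
    for k, v in pairs:
        delta_write(S, k, v)
    k0, v0 = pairs[0]
    out = read(S, k0)
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(out, v0)))


if __name__ == "__main__":
    for n in (1, 4, 32, 256):
        print(n, first_pair_recall_error(n))
```

Under these assumptions the first pair is read back essentially exactly when it is the only thing stored, and its recall error is larger after 256 writes through the same state than after 4. The state size never changes, which is the trade-off the paragraph above describes.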

The Semiparametric Approach

HOLA builds on Complementary Learning Systems (CLS) theory, which holds that biological brains use two distinct memory systems: one for slow, generalizable learning and another for fast, exact, one-shot memory. HOLA treats the recurrent state as a parametric estimator that models the general, linearly compressible structure of the data, while a bounded, exact key-value (KV) cache acts as a non-parametric correction for associations that should not be forced through that state. Together they form what the paper calls a semiparametric test-time memory. This lets the model keep the efficiency of linear attention while gaining sharp, accurate retrieval of specific tokens that the recurrent state might otherwise discard.

Selecting What to Remember

A major challenge in building a bounded memory is deciding which information to keep. Many existing models simply keep the most recent tokens, which fails when the important information lies far back in the text. HOLA instead uses an "intrinsic surprise" signal that the model's delta-rule update already produces, so no learned eviction module is needed. The score is the product of the write strength (beta) and the norm of the prediction residual (||e||), which is the residual actually committed to the state. A large score marks a token that the model struggled to predict and that caused a significant change to the recurrent state. These "surprising" tokens are prioritized for the exact cache, so the model retains the most critical information regardless of how long ago it appeared.
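
A minimal sketch of surprise-based selection follows. It is an illustration under stated assumptions rather than the paper's implementation: the `SurpriseCache` class, the min-heap eviction policy, the `delta_step` helper, and the toy stream of basis-vector keys are all hypothetical. Only the score, beta * ||e||, comes from the description above.

```python
import heapq
import math


def matvec(S, k):
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def delta_step(S, k, v, beta):
    """One delta-rule write, in place. Returns the surprise score beta * ||e||."""
    pred = matvec(S, k)
    err = [vi - pi for vi, pi in zip(v, pred)]
    for row, e in zip(S, err):
        for j, kj in enumerate(k):
            row[j] += beta * e * kj
    return beta * math.sqrt(sum(e * e for e in err))


class SurpriseCache:
    """Bounded exact KV store that keeps the highest-scoring entries (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, position, key, value)

    def offer(self, score, position, key, value):
        item = (score, position, key, value)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            # Evict the least surprising entry currently held.
            heapq.heapreplace(self._heap, item)

    def positions(self):
        return sorted(p for _, p, _, _ in self._heap)


def basis(i, d=4, scale=1.0):
    return [scale if j == i else 0.0 for j in range(d)]


# Toy stream: one association repeats, two rare ones appear at positions 3 and 7.
d = 4
S = [[0.0] * d for _ in range(d)]
cache = SurpriseCache(capacity=2)
stream = [(basis(0), basis(0))] * 10
stream[3] = (basis(1), basis(1, scale=2.0))
stream[7] = (basis(2), basis(2, scale=3.0))

for t, (k, v) in enumerate(stream):
    score = delta_step(S, k, v, beta=1.0)
    cache.offer(score, t, k, v)

kept = cache.positions()                      # chosen by surprise
recency = list(range(len(stream)))[-2:]       # what a recency cache of size 2 holds
print(kept, recency)
```

In this toy stream the repeated association is absorbed by the state after its first write, so its later occurrences score zero and the two rare associations end up in the cache. A recency cache of the same size would hold only the last two positions, both of which are repeats.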

Sharp Retrieval via Decoupled Normalization

Simply storing exact copies of tokens is not enough if the model reads them the same way it reads its compressed state, because that read tends toward "soft" averaging that blurs the retrieved information. HOLA addresses this with a decoupled RMSNorm-gamma step used only on the cache read path. This lets the model perform sharp, near-argmax retrieval from the cache while keeping the recurrent state's internal math stable. The separation is key to the model's performance, as it prevents the cache from degenerating into another lossy summary.
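
The summary does not spell out the exact form of the RMSNorm-gamma read, so the sketch below rests on an assumption: that the cache path RMS-normalizes queries and keys and applies its own gain `gamma`, which then acts like an inverse temperature on the softmax over cached keys. The function names and the toy vectors are hypothetical. The point is only to show how a gain that belongs to the cache path alone can move a read from soft averaging toward near-argmax retrieval.

```python
import math


def rms_norm(x, gamma=1.0, eps=1e-6):
    """Scale x to unit root-mean-square, then multiply by the gain gamma."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [gamma * xi / rms for xi in x]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cache_read(q, keys, values, gamma):
    """Attention over cached exact KV pairs.

    gamma is applied only here, on the cache path (the assumed meaning of
    'decoupled'); a separate state read would not see it.
    """
    d = len(q)
    qn = rms_norm(q, gamma)
    scores = [
        sum(a * b for a, b in zip(qn, rms_norm(k))) / math.sqrt(d) for k in keys
    ]
    weights = softmax(scores)
    out = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, out


q = [1.0, 0.0, 0.0, 0.0]
keys = [
    [1.0, 0.0, 0.0, 0.0],  # exact match for q
    [1.0, 1.0, 0.0, 0.0],  # partial match
    [0.0, 1.0, 0.0, 0.0],  # unrelated
]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

soft_w, soft_out = cache_read(q, keys, values, gamma=1.0)
sharp_w, sharp_out = cache_read(q, keys, values, gamma=16.0)
print(soft_w, sharp_w)
```

With the small gain, the read spreads weight across all three entries and returns a blend of their values. With the large gain, almost all weight lands on the matching key and the output is close to its stored value.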

Performance and Robustness

At 340M parameters, trained on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92 (-16.1%), below the 26.88 of a full-attention Transformer++ model, and improves LAMBADA perplexity from 30.95 to 30.26. It also achieves the best in-context retrieval among the linear models compared. On RULER needle-in-a-haystack recall, HOLA remains much more robust than GDN or a matched HOLA+recency cache as context grows, out to 32k tokens, which is 16 times its training length. Despite these gains, the system remains efficient: the exact memory is bounded, so it adds only a small overhead to the model's total memory usage.
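
The headline figures can be checked with simple arithmetic. The snippet below uses only the numbers quoted in the abstract. Reading "32k" as 32 x 1024 tokens is an assumption, and the helper name `relative_change` is made up for this example.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means lower)."""
    return 100.0 * (after - before) / before


wikitext = relative_change(27.32, 22.92)        # baseline -> HOLA
vs_transformer = relative_change(26.88, 22.92)  # Transformer++ -> HOLA
lambada = relative_change(30.95, 30.26)         # baseline -> HOLA

# Assumption: "32k" means 32 * 1024 tokens.
train_len = (32 * 1024) // 16

print(f"Wikitext: {wikitext:.1f}%")
print(f"vs Transformer++: {vs_transformer:.1f}%")
print(f"LAMBADA: {lambada:.1f}%")
print(f"Implied training length: {train_len} tokens")
```

The Wikitext change works out to about -16.1%, matching the reported figure. By the same calculation, HOLA sits about 14.7% below Transformer++ on Wikitext, and the LAMBADA gain is a more modest 2.2% or so. Under the stated reading of "32k", the implied training length is 2048 tokens.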
