Gated DeltaNet-2: Decoupling Erase and Write in Lin...

Paper Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at this https URL.

Gated DeltaNet-2 is a recurrent linear-attention layer designed to improve how large language models manage their internal memory. Standard Transformers keep a cache that grows with every token, whereas linear attention models compress the past into a fixed-size "state". That gives linear-time sequence mixing and constant-memory decoding, but it creates a bottleneck: the model struggles to update its memory without overwriting or "scrambling" associations it has already stored. Gated DeltaNet-2 addresses this by letting the model control independently how much old content it erases and how much new content it commits.
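To make the fixed-size state concrete, here is a minimal NumPy sketch of plain, ungated linear attention decoded one token at a time. It is an illustration only, not the paper's layer: the function name `linear_attention_decode`, the (d_k, d_v) state layout, and the lack of normalization or gating are all assumptions made for brevity.

```python
import numpy as np

def linear_attention_decode(qs, ks, vs):
    """Un-normalized, un-gated linear attention, one token at a time."""
    d_k, d_v = ks.shape[1], vs.shape[1]
    state = np.zeros((d_k, d_v))  # fixed size, independent of sequence length
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        state = state + np.outer(k, v)  # write: add the association k -> v
        outputs.append(state.T @ q)     # read: query the compressed memory
    return np.stack(outputs), state

rng = np.random.default_rng(0)
T, d_k, d_v = 16, 4, 3
qs = rng.normal(size=(T, d_k))
ks = rng.normal(size=(T, d_k))
vs = rng.normal(size=(T, d_v))

outputs, state = linear_attention_decode(qs, ks, vs)

# The same outputs from the quadratic, causally masked form (no softmax).
quadratic = np.tril(qs @ ks.T) @ vs
```

The state stays at shape (d_k, d_v) no matter how many tokens are processed, which is where the constant decoding memory comes from. Because every write is purely additive, nothing is ever removed, which is the limitation that delta-rule updates target.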

The Problem with Tied Memory Updates

In earlier models such as Gated DeltaNet and Kimi Delta Attention (KDA), erasing old memories and writing new ones were tied together by a single scalar gate. The model had to use the same value to decide both how much old content to remove and how much new content to add. Erasing and writing are different operations (one acts on the key side of the memory, the other on the value side), so this "scalar tie" restricts how precisely the memory can be edited and can cause interference between competing pieces of information.
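The scalar tie can be seen in a small sketch of a single delta-rule step in the style of Gated DeltaNet. The function name `tied_delta_step`, the (d_k, d_v) state layout, and the toy vectors are assumptions for illustration; only the general shape of the update (decay, subtract the current read, write the new value, all scaled by one `beta`) follows the description above.

```python
import numpy as np

def tied_delta_step(state, k, v, beta, alpha=1.0):
    """One scalar-gated delta-rule step on a (d_k, d_v) state.

    alpha: scalar decay (forgetting); beta: the single edit strength.
    """
    decayed = alpha * state
    old_read = decayed.T @ k  # value currently stored under key k
    # The same beta scales both the erased read and the written value.
    return decayed - beta * np.outer(k, old_read) + beta * np.outer(k, v)

d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])  # unit-norm key
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([-1.0, 0.5, 4.0])

# Store v_old under k, then edit the same key with beta = 0.5.
state = tied_delta_step(np.zeros((d_k, d_v)), k, v_old, beta=1.0)
state = tied_delta_step(state, k, v_new, beta=0.5)
read_back = state.T @ k  # equals 0.5 * v_old + 0.5 * v_new
```

With one scalar, erasing half of the old value forces the model to write exactly half of the new one. It cannot, for example, fully erase the old association while committing only some channels of the new value.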

Decoupling Erase and Write

Gated DeltaNet-2 introduces a more flexible update called the "Gated Delta Rule-2." Instead of a single scalar, it uses two separate channel-wise gates: an erase gate b_t and a write gate w_t. The erase gate controls how much of a stale association is cleared on the key side, while the write gate controls how much of the new value is committed in each value channel. When both gates collapse to the same scalar, the rule reduces to KDA; when the decay also collapses to a scalar, it reduces to Gated DeltaNet. Separating these roles is intended to keep the internal state cleaner and more accurate over long sequences.
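The following NumPy sketch shows one way a decoupled step could look. The exact placement of the gates in the paper's equations is not given in this summary, so the form below is an assumption: it was chosen only because it erases on the key side, writes on the value side, and collapses to a scalar-gated delta rule when both gates equal the same scalar. The function names, the (d_k, d_v) layout, and the toy values are hypothetical.

```python
import numpy as np

def decoupled_delta_step(state, k, v, erase, write, decay):
    """One hypothetical decoupled edit on a (d_k, d_v) state.

    decay: (d_k,) channel-wise decay
    erase: (d_k,) channel-wise erase gate (b_t in the summary's notation)
    write: (d_v,) channel-wise write gate (w_t)
    """
    decayed = decay[:, None] * state
    old_read = decayed.T @ k                          # value stored under k
    erased = decayed - np.outer(erase * k, old_read)  # key-side erase
    return erased + np.outer(k, write * v)            # value-side write

def tied_delta_step(state, k, v, beta, alpha):
    """Scalar-gated reference step used for the reduction check."""
    decayed = alpha * state
    return decayed + beta * np.outer(k, v - decayed.T @ k)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
state0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)

# Reduction check: equal scalar gates and scalar decay recover the tied rule.
beta, alpha = 0.7, 0.9
collapsed = decoupled_delta_step(
    state0, k, v,
    erase=np.full(d_k, beta), write=np.full(d_v, beta), decay=np.full(d_k, alpha),
)
tied = tied_delta_step(state0, k, v, beta, alpha)

# Decoupled edit: fully erase the old value, commit only channels 0 and 2.
key = np.array([1.0, 0.0, 0.0, 0.0])
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([-1.0, 0.5, 4.0])
write_gate = np.array([1.0, 0.0, 1.0])
stored = np.outer(key, v_old)
edited = decoupled_delta_step(
    stored, key, v_new,
    erase=np.ones(d_k), write=write_gate, decay=np.ones(d_k),
)
read_back = edited.T @ key  # equals write_gate * v_new
```

In the second example the old value is removed completely while only two channels of the new value are committed, an edit that a single shared scalar cannot express.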

Stronger Performance in Long-Context Tasks

At the 1.3B-parameter scale, trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieved the strongest overall results among the compared Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage was most pronounced on the long-context RULER needle-in-a-haystack benchmarks, particularly in the evaluated multi-key retrieval setting, where the model has to distinguish between many competing associations stored in its fixed-size memory. It remained strong in both recurrent and hybrid settings. These results suggest that decoupling the memory edit reduces the interference that affects fixed-state recurrent models.

Efficient Training and Implementation

Despite the added complexity of two separate gates, Gated DeltaNet-2 remains efficient to train. The researchers derive a fast-weight view of the update, a chunkwise WY algorithm that processes the tokens within each chunk in parallel, and a gate-aware backward pass. Channel-wise decay is absorbed into asymmetric erase factors, which keeps the structure compact, close to existing delta-rule kernels, and well suited to modern hardware accelerators. According to the paper, this preserves efficient parallel training alongside the gains in accuracy and memory management.
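To illustrate the chunkwise idea, the sketch below processes the plain delta rule (scalar `beta`, no decay, no decoupled gates) one chunk at a time and checks it against the token-by-token recurrence. It is deliberately simpler than the paper's algorithm and does not reproduce it: the function names are hypothetical, and a general dense solve stands in for the triangular WY machinery a real kernel would use.

```python
import numpy as np

def delta_rule_sequential(Q, K, V, beta, S0):
    """Reference: token-by-token delta rule on a (d_k, d_v) state."""
    S = S0.copy()
    outs = []
    for q, k, v, b in zip(Q, K, V, beta):
        S = S + b * np.outer(k, v - S.T @ k)  # erase old read, write new value
        outs.append(S.T @ q)
    return np.stack(outs), S

def delta_rule_chunk(Q, K, V, beta, S0):
    """Process one chunk with matrix operations instead of a token loop."""
    C = K.shape[0]
    # Strictly lower-triangular key overlaps: how earlier writes in the
    # chunk change what later tokens read back.
    A = np.tril(K @ K.T, k=-1)
    # Effective per-token writes U solve (I + diag(beta) A) U = diag(beta)(V - K S0).
    lhs = np.eye(C) + beta[:, None] * A
    rhs = beta[:, None] * (V - K @ S0)
    U = np.linalg.solve(lhs, rhs)
    # Outputs: contribution of the carried-in state plus causal in-chunk writes.
    O = Q @ S0 + np.tril(Q @ K.T) @ U
    return O, S0 + K.T @ U

def delta_rule_chunkwise(Q, K, V, beta, chunk_size):
    """Carry the state across chunks; each chunk is handled in parallel form."""
    S = np.zeros((K.shape[1], V.shape[1]))
    outs = []
    for start in range(0, K.shape[0], chunk_size):
        sl = slice(start, start + chunk_size)
        O, S = delta_rule_chunk(Q[sl], K[sl], V[sl], beta[sl], S)
        outs.append(O)
    return np.concatenate(outs), S

rng = np.random.default_rng(0)
T, d_k, d_v = 32, 8, 4
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys
V = rng.normal(size=(T, d_v))
beta = rng.uniform(0.1, 0.9, size=T)

O_seq, S_seq = delta_rule_sequential(Q, K, V, beta, np.zeros((d_k, d_v)))
O_chk, S_chk = delta_rule_chunkwise(Q, K, V, beta, chunk_size=8)
```

Only the loop over chunks is sequential; the work inside each chunk is a handful of matrix products and one small linear solve, which is the kind of structure that suits accelerators. Extending this to channel-wise decay and separate erase and write gates is what the paper's chunkwise derivation addresses.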