Where does Absolute Position come from in decoder-only Transformers?
This paper investigates a puzzle: why do Transformers trained with Rotary Position Embeddings (RoPE) still "know" the absolute position of tokens in a sequence? Under RoPE, the query-key score depends on position only through the relative offset between two tokens, so the attention scores alone should not distinguish a token at position 50 from the same token at position 100. Empirically, however, these models do distinguish absolute positions. The authors trace this "leakage" of positional information to two specific architectural features rather than to the RoPE mechanism itself.
The Two Sources of Positional Leakage
The researchers identify two primary components that introduce absolute position into the model's attention patterns. The first is the causal mask, which is used to prevent the model from "looking ahead" at future tokens. Because the softmax denominator in the attention mechanism sums over all preceding tokens, the number of items being summed changes depending on the current position. This creates a mathematical dependency on the absolute position of the query.
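The denominator argument can be illustrated with a minimal sketch. It is not the authors' code: the scoring function `rel_logit` is a hypothetical stand-in for a relative-only score (it sees only the offset `i - j`, never `i` or `j` themselves), and the helper names are invented for illustration. Even so, the weight a query places on its own position changes with `i`, because the causal mask changes how many terms enter the softmax sum.

```python
import math

def rel_logit(distance):
    # Hypothetical relative-only score: depends on the offset i - j alone.
    return -0.5 * distance

def causal_attention_row(i, score):
    """Softmax weights of query i over the keys 0..i allowed by the causal mask."""
    logits = [score(i - j) for j in range(i + 1)]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)                        # denominator has i + 1 terms
    return [e / z for e in exps]

def self_weight(i):
    # Weight that query i assigns to its own position (offset 0).
    return causal_attention_row(i, rel_logit)[-1]

weights = [self_weight(i) for i in range(5)]
for i, w in enumerate(weights):
    print(i, round(w, 4))
```

In this toy setup the self-weight starts at 1 for the first query and shrinks as more keys become visible, so the same relative offset receives a different weight at different absolute positions. With a decaying score like the one assumed here, the effect is strongest for early queries and flattens out later.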
The second source is the residual stream. In a causal Transformer, the token at position 0 can attend only to itself, at every layer. This creates a "closed dynamical system": the activation at position 0 is a deterministic function of the initial embedding and is unaffected by any later token. This information then flows through the network and is read by downstream "sink-reading" heads, which pass a fingerprint of the starting token forward.
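The "closed system" at position 0 can be checked on a toy model. The sketch below is an illustration under strong assumptions, not the paper's setup: a single attention head with no learned weights (queries, keys, and values are the inputs themselves), a plain residual connection, and no normalization or MLP. The function names and example vectors are hypothetical.

```python
import math

def causal_self_attention(xs):
    """Toy causal attention layer with a residual connection (q = k = v = x)."""
    out = []
    for i, q in enumerate(xs):
        # Query i may only see keys 0..i.
        logits = [sum(a * b for a, b in zip(q, xs[j])) for j in range(i + 1)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        mixed = [sum(exps[j] / z * xs[j][d] for j in range(i + 1))
                 for d in range(len(q))]
        out.append([q[d] + mixed[d] for d in range(len(q))])  # residual add
    return out

def run_layers(xs, n_layers=3):
    for _ in range(n_layers):
        xs = causal_self_attention(xs)
    return xs

bos = [1.0, -0.5]                      # shared starting token
seq_a = [bos, [0.3, 0.2], [0.9, -0.1]]
seq_b = [bos, [-0.7, 0.4], [0.0, 1.2]]  # same start, different continuation

out_a = run_layers(seq_a)
out_b = run_layers(seq_b)
print(out_a[0], out_b[0])  # position 0 is identical in both runs
```

Position 0 ends up with the same activation in both runs because nothing after it can influence it, while later positions differ. Any head that reads position 0 therefore receives a signal determined entirely by the starting token.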
The Role of Attention Sinks
The study clarifies the function of "attention sinks": tokens at the start of a sequence that attract a disproportionate share of attention. Contrary to the idea that these sinks act as information aggregators, the authors find that they are "content-free stabilizers." They are anchored to specific tokens (such as the auto-prepended BOS token) rather than to specific positions: when the researchers moved the BOS token to a different position, the sink moved with it. These sinks serve as a reference frame, passing a deterministic signal about the starting token to the rest of the model.
Architectural Differences and Impact
The authors analyzed three models (Llama, Qwen, and Mistral) and found that while the leakage mechanism is consistent, its impact varies with the architecture:
NTK scaling tends to suppress the residual-stream component of the leakage.
Sliding-window attention allows the positional signal to accumulate more significantly as the model processes deeper layers.
Standard RoPE models fall somewhere in between these two behaviors.
The researchers also performed a "bidirectional ablation," which removes the causal mask. This intervention significantly reduced the positional leakage, confirming that the causal mask is a major contributor. Furthermore, they found that replacing the BOS token before the forward pass removed 40% of the residual-stream component at early queries, indicating that the identity of the starting token is a key driver of this positional awareness.