
Paper Abstract

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention, the activation at position 0 attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position 0, constant across inputs when that token is the auto-prepended BOS and varying with it otherwise.

Where does Absolute Position come from in decoder-only Transformers?
This paper investigates a persistent puzzle: why do Transformers trained with Rotary Position Embeddings (RoPE) still "know" the absolute position of tokens in a sequence? RoPE is designed to encode only the relative distance between tokens in the query-key inner product, so in principle the attention score between two tokens should be the same whether they sit at positions 40 and 50 or at positions 90 and 100. However, empirical evidence shows that these models do distinguish absolute positions. The authors trace this "leakage" of positional information to two specific architectural features rather than to the RoPE mechanism itself.
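The relative-offset property of RoPE can be checked numerically. The following is a minimal sketch, not the paper's code: the `rope` helper, the 8-dimensional toy vectors, and the base of 10000 are assumptions chosen for illustration, using the common formulation in which consecutive pairs of dimensions are rotated by position-dependent angles.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive 2-D pairs of x by angles pos * theta_k (illustrative helper)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)  # toy query and key

def logit(q_pos, k_pos):
    """Pre-softmax attention score between a rotated query and a rotated key."""
    return float(rope(q, q_pos) @ rope(k, k_pos))

# Same offset (10) at different absolute positions: the scores agree.
same_offset = np.isclose(logit(50, 40), logit(100, 90))
# A different offset (5) will, in general, give a different score.
diff_offset = np.isclose(logit(50, 40), logit(50, 45))
print(same_offset, diff_offset)
```

Because the score depends only on the offset, any absolute-position signal in the attention pattern has to enter somewhere other than this inner product.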

The Two Sources of Positional Leakage

The researchers identify two primary components that introduce absolute position into the model's attention patterns. The first is the causal mask, which prevents the model from "looking ahead" at future tokens. Because the softmax denominator for each query sums over the query's own token and all tokens before it, the number of terms in the sum grows with the query's position. This makes the attention weights depend on the absolute position of the query, even when the underlying scores depend only on relative offsets.
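A toy calculation makes the denominator effect concrete. This is a sketch under simplifying assumptions, not the authors' experiment: the `causal_softmax` helper is hypothetical, and every score is set to the same constant so that the scores trivially depend only on the relative offset.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to keys j <= i (illustrative helper)."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

n = 6
scores = np.zeros((n, n))        # identical score for every (query, key) pair
attn = causal_softmax(scores)

# Weight each query places on its own token (offset 0).
# Query i normalizes over i + 1 keys, so this weight equals 1 / (i + 1).
self_weight = np.diag(attn)
print(self_weight)
```

The score at offset 0 is identical for every query, yet the resulting attention weight shrinks as the query moves later in the sequence, so the weight alone reveals where the query sits.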
The second source is the residual stream. In a causal Transformer, the token at position 0 attends only to itself. This creates a "closed dynamical system" where the activation at position 0 becomes a deterministic function of the initial embedding. This information then flows through the network and is read by downstream "sink-reading" heads, which pass a fingerprint of the starting token forward.
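The "closed dynamical system" at position 0 can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the architecture studied in the paper: it uses a single attention head with random weights shared across layers, and it omits RoPE, MLP blocks, and normalization. Those omitted parts act on each position separately, so they would not let later tokens influence position 0 either.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy weights

def causal_layer(x):
    """One toy causal self-attention layer with a residual connection."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    n = x.shape[0]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return x + w @ v

def run(x, depth=3):
    """Apply the toy layer `depth` times (weights shared for brevity)."""
    for _ in range(depth):
        x = causal_layer(x)
    return x

first = rng.standard_normal(d)  # stand-in for the embedding at position 0
seq_a = np.vstack([first, rng.standard_normal((5, d))])
seq_b = np.vstack([first, rng.standard_normal((9, d))])  # different content and length

# Position 0 follows the same trajectory no matter what comes after it.
same_trajectory = np.allclose(run(seq_a)[0], run(seq_b)[0])
print(same_trajectory)
```

In this toy setting the activation at position 0 is determined entirely by the first embedding, which is the property that lets later heads read it as a fixed reference signal.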

The Role of Attention Sinks

The study clarifies the function of "attention sinks"—the tokens at the start of a sequence that attract a disproportionate amount of attention. Contrary to the idea that these sinks act as information aggregators, the authors find they are actually "content-free stabilizers." They are anchored to specific tokens (like the auto-prepended BOS token) rather than specific positions. When the researchers moved the BOS token to a different position, the sink moved with it. These sinks serve as a reference frame, passing a deterministic signal about the starting token to the rest of the model.

Architectural Differences and Impact

The authors analyzed three different models—Llama, Qwen, and Mistral—and found that while the leakage mechanism is consistent, its impact varies based on the architecture:

  • NTK scaling tends to suppress the residual-stream component of the leakage.

  • Sliding-window attention allows the positional signal to accumulate more significantly as the model processes deeper layers.

  • Standard RoPE models fall somewhere in between these two behaviors.

The researchers also performed a "bidirectional ablation," which removes the causal mask. This intervention significantly reduced the positional leakage, supporting the claim that the causal mask is a major contributor. Furthermore, they found that replacing the BOS embedding before the forward pass removed 40% of the residual-stream component at early queries, indicating that the identity of the starting token is a key driver of this positional awareness.
