Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling
This paper proposes a new way to think about how Transformer models understand the relationship between items in a sequence. Traditionally, models use Rotary Position Embeddings (RoPE) to track the order of items using fixed, hand-crafted rotation frequencies. The authors argue that this rotation space is actually an untapped, high-capacity dimension of the attention mechanism. By treating it as a learnable, signal-conditioned space rather than a fixed grid, the model can better capture complex, real-world dynamics such as time and context.
The "Imaginary" Dimension of Attention
The authors draw an analogy to complex numbers in algebra. In a standard Transformer, the token embedding acts as the "real" component, defining what a token means. The authors propose that the rotation manifold should act as the "imaginary" component, defining how a token relates to others across time and context. By moving away from simple, fixed ordinal indices (like "first," "second," "third"), the model can instead use continuous signals—such as exact timestamps—to determine how items in a sequence interact.
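To make the complex-number analogy concrete, here is a minimal, illustrative sketch of standard RoPE seen in this light: consecutive feature pairs are treated as complex numbers and rotated by angles derived from the position index. The helper name `rotate_pairs`, the dimensions, and the use of PyTorch are our own assumptions rather than the paper's code; the point is simply that the same rotation could just as easily be driven by a continuous timestamp.

```python
import torch

def rotate_pairs(x, angles):
    # Treat consecutive feature pairs (x1, x2) as complex numbers x1 + i*x2
    # and rotate each pair by its angle theta: multiply by exp(i*theta).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

# Standard RoPE: the angle for each pair is the ordinal index times a fixed
# frequency. The paper's reframing: the same rotation could instead be
# conditioned on a continuous signal such as a timestamp.
d, seq_len = 64, 8
freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))  # fixed RoPE frequencies
positions = torch.arange(seq_len).unsqueeze(-1)       # ordinal indices 0..7
q = torch.randn(seq_len, d)                           # query features ("real" content)
q_rotated = rotate_pairs(q, positions * freqs)        # relational ("imaginary") rotation
```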
How SIREN-RoPE Works
The researchers introduce a method called SIREN-RoPE to implement this idea. It uses a dual-branch neural network to process timestamps (a code sketch follows the list):
* A Periodic Branch: uses a Sinusoidal Representation Network (SIREN) to automatically discover cyclical patterns, such as daily or weekly user habits.
* An Aperiodic Branch: uses a standard feed-forward network to capture non-cyclical trends, such as the natural decay of interest over time.
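Here is a rough sketch of what such a dual-branch phase network might look like, assuming a scalar normalized timestamp as input and one rotation angle per feature pair as output. The layer widths, the sine frequency scale `w0`, and all names are hypothetical illustrations, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DualBranchPhase(nn.Module):
    """Hypothetical dual-branch phase network: maps a scalar timestamp
    to one rotation angle per feature pair."""
    def __init__(self, n_pairs, hidden=64, w0=30.0):
        super().__init__()
        self.w0 = w0  # sine frequency scale, a common SIREN default
        # Periodic branch: sine-activated layer (SIREN-style) for cyclical patterns.
        self.periodic_in = nn.Linear(1, hidden)
        self.periodic_out = nn.Linear(hidden, n_pairs)
        # Aperiodic branch: ordinary MLP for non-cyclical trends (e.g. recency decay).
        self.aperiodic = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, n_pairs))

    def forward(self, t):
        # t: (..., 1) normalized timestamps
        periodic = self.periodic_out(torch.sin(self.w0 * self.periodic_in(t)))
        return periodic + self.aperiodic(t)  # one angle per feature pair

phase_net = DualBranchPhase(n_pairs=32)
angles = phase_net(torch.rand(8, 1))  # (8, 32): timestamp-conditioned rotation angles
```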
The two branches' outputs are then blended with the traditional ordinal position through a learnable gate. This allows the model to decide, through training, how much to rely on the exact time of an interaction versus its position in the list.
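A minimal, self-contained sketch of the gating idea, under the assumption that the gate is a learnable per-pair weight in [0, 1] that interpolates between timestamp-derived angles and the classic ordinal RoPE angles; the exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn

def rotate_pairs(x, angles):
    # Rotate consecutive feature pairs by the given angles (same trick as the RoPE sketch above).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * angles.cos() - x2 * angles.sin(),
                        x1 * angles.sin() + x2 * angles.cos()], dim=-1).flatten(-2)

seq_len, d = 8, 64
n_pairs = d // 2

# Learnable per-pair gate in [0, 1]: 0 -> rely on ordinal position, 1 -> rely on time.
gate = torch.sigmoid(nn.Parameter(torch.zeros(n_pairs)))

# Classic ordinal RoPE angles.
freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))
ordinal_angles = torch.arange(seq_len).unsqueeze(-1) * freqs     # (8, 32)

# Stand-in for the dual-branch phase network's output so this snippet runs alone.
temporal_angles = torch.randn(seq_len, n_pairs)                  # (8, 32)

mixed_angles = gate * temporal_angles + (1 - gate) * ordinal_angles
q_rotated = rotate_pairs(torch.randn(seq_len, d), mixed_angles)  # apply to queries (and keys)
```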
Performance and Efficiency
The team tested this approach on a production-scale news feed dataset from a major social network. They found that SIREN-RoPE consistently outperformed standard models across multiple engagement tasks, including likes and sustained dwell time. Notably, the researchers found that when they tried to feed temporal data into the standard "semantic" embedding space, it often interfered with the model's performance. However, when that same data was routed through the "rotation" dimension, it provided a clear, independent boost to accuracy.
A New Perspective on Architecture
The authors suggest that the rotation space should no longer be viewed as a solved detail of positional encoding. Instead, it is a flexible axis that can be conditioned on various types of metadata. Because the SIREN-RoPE approach adds only about 0.2% more parameters and has a negligible impact on computational speed, it offers a highly efficient way to make Transformer models more context-aware. The authors invite the community to explore this "hidden dimension" to improve how models handle sequences in fields ranging from recommendation systems to natural language processing.
