Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling
This paper proposes a new way to think about how Transformer models understand the relationship between items in a sequence. Traditionally, models use Rotary Position Embeddings (RoPE) to track the order of items using fixed, hand-crafted rotation frequencies. The authors argue that this rotation space is actually an untapped, high-capacity dimension of the attention mechanism. By treating it as a learnable, signal-conditioned space rather than a fixed grid, the model can better capture complex, real-world dynamics such as time and context.
The "Imaginary" Dimension of Attention
The authors draw an analogy to complex numbers in algebra. In a standard Transformer, the token embedding acts as the "real" component, defining what a token means. The authors propose that the rotation manifold should act as the "imaginary" component, defining how a token relates to others across time and context. By moving away from simple, fixed ordinal indices (like "first," "second," "third"), the model can instead use continuous signals—such as exact timestamps—to determine how items in a sequence interact.
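To make the complex-number analogy concrete, here is a minimal, illustrative sketch of standard RoPE seen in this light: consecutive feature pairs are treated as complex numbers and rotated by angles derived from the position index. The helper name `rotate_pairs`, the dimensions, and the use of PyTorch are our own assumptions rather than the paper's code; the point is simply that the same rotation could just as easily be driven by a continuous timestamp.

```python
import torch

def rotate_pairs(x, angles):
    # Treat consecutive feature pairs (x1, x2) as complex numbers x1 + i*x2
    # and rotate each pair by its angle theta: multiply by exp(i*theta).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

# Standard RoPE: the angle for each pair is the ordinal index times a fixed
# frequency. The paper's reframing: the same rotation could instead be
# conditioned on a continuous signal such as a timestamp.
d, seq_len = 64, 8
freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))  # fixed RoPE frequencies
positions = torch.arange(seq_len).unsqueeze(-1)       # ordinal indices 0..7
q = torch.randn(seq_len, d)                           # query features ("real" content)
q_rotated = rotate_pairs(q, positions * freqs)        # relational ("imaginary") rotation
```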
How SIREN-RoPE Works
The researchers introduce a method called SIREN-RoPE to implement this idea. It uses a dual-branch neural network to process timestamps (a code sketch follows the list):
* A Periodic Branch: uses a Sinusoidal Representation Network (SIREN) to automatically discover cyclical patterns, such as daily or weekly user habits.
* An Aperiodic Branch: uses a standard feed-forward network to capture non-cyclical trends, such as the natural decay of interest over time.
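Here is a rough sketch of what such a dual-branch phase network might look like, assuming a scalar normalized timestamp as input and one rotation angle per feature pair as output. The layer widths, the sine frequency scale `w0`, and all names are hypothetical illustrations, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DualBranchPhase(nn.Module):
    """Hypothetical dual-branch phase network: maps a scalar timestamp
    to one rotation angle per feature pair."""
    def __init__(self, n_pairs, hidden=64, w0=30.0):
        super().__init__()
        self.w0 = w0  # sine frequency scale, a common SIREN default
        # Periodic branch: sine-activated layer (SIREN-style) for cyclical patterns.
        self.periodic_in = nn.Linear(1, hidden)
        self.periodic_out = nn.Linear(hidden, n_pairs)
        # Aperiodic branch: ordinary MLP for non-cyclical trends (e.g. recency decay).
        self.aperiodic = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, n_pairs))

    def forward(self, t):
        # t: (..., 1) normalized timestamps
        periodic = self.periodic_out(torch.sin(self.w0 * self.periodic_in(t)))
        return periodic + self.aperiodic(t)  # one angle per feature pair

phase_net = DualBranchPhase(n_pairs=32)
angles = phase_net(torch.rand(8, 1))  # (8, 32): timestamp-conditioned rotation angles
```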
The two branches' outputs are then blended with the traditional ordinal position through a learnable gate. This allows the model to decide, through training, how much to rely on the exact time of an interaction versus its position in the list.
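A minimal, self-contained sketch of the gating idea, under the assumption that the gate is a learnable per-pair weight in [0, 1] that interpolates between timestamp-derived angles and the classic ordinal RoPE angles; the exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn

def rotate_pairs(x, angles):
    # Rotate consecutive feature pairs by the given angles (same trick as the RoPE sketch above).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * angles.cos() - x2 * angles.sin(),
                        x1 * angles.sin() + x2 * angles.cos()], dim=-1).flatten(-2)

seq_len, d = 8, 64
n_pairs = d // 2

# Learnable per-pair gate in [0, 1]: 0 -> rely on ordinal position, 1 -> rely on time.
gate = torch.sigmoid(nn.Parameter(torch.zeros(n_pairs)))

# Classic ordinal RoPE angles.
freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))
ordinal_angles = torch.arange(seq_len).unsqueeze(-1) * freqs     # (8, 32)

# Stand-in for the dual-branch phase network's output so this snippet runs alone.
temporal_angles = torch.randn(seq_len, n_pairs)                  # (8, 32)

mixed_angles = gate * temporal_angles + (1 - gate) * ordinal_angles
q_rotated = rotate_pairs(torch.randn(seq_len, d), mixed_angles)  # apply to queries (and keys)
```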
Performance and Efficiency
The team tested this approach on a production-scale news feed dataset from a major social network. They found that SIREN-RoPE consistently outperformed standard models across multiple engagement tasks, including likes and sustained dwell time. Notably, the researchers found that when they tried to feed temporal data into the standard "semantic" embedding space, it often interfered with the model's performance. However, when that same data was routed through the "rotation" dimension, it provided a clear, independent boost to accuracy.
A New Perspective on Architecture
The authors suggest that the rotation space should no longer be viewed as a solved detail of positional encoding. Instead, it is a flexible axis that can be conditioned on various types of metadata. Because the SIREN-RoPE approach adds only about 0.2% more parameters and has a negligible impact on computational speed, it offers a highly efficient way to make Transformer models more context-aware. The authors invite the community to explore this "hidden dimension" to improve how models handle sequences in fields ranging from recommendation systems to natural language processing.
