Action-Aware Generative Sequence Modeling for Short Video Recommendation
Modern recommendation systems often struggle with short videos because they treat each video as a single, holistic unit. In reality, a video is composed of many different segments, and a user’s interest can shift significantly from one moment to the next. For example, a user might enjoy a specific highlight in a video but be indifferent to the rest. This paper introduces the Action-Aware Generative Sequence Network (A2Gen), a new modeling paradigm that treats user consumption as a temporal process. By analyzing the timing of specific actions—such as likes, comments, or follows—the model identifies which parts of a video truly resonate with a user, allowing for more accurate and personalized recommendations.
Capturing Fine-Grained User Intent
The core insight of this research is that user actions are not random; they are tied to specific moments in a video. Statistical analysis shows that user actions often cluster around video highlights. By chaining these actions into a time-ordered sequence, the model can distinguish between different user attitudes. For instance, a user who follows an author before liking a video shows a different intent than one who likes the video first. A2Gen captures these nuances by incorporating action timing into the modeling process, effectively decomposing the viewing experience into a sequence of meaningful signals rather than a single binary "like" or "dislike."
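To make the idea concrete, here is a minimal sketch of chaining in-video actions into a time-ordered sequence. The data layout and field names are illustrative assumptions, not A2Gen's actual schema:

```python
# Hypothetical sketch: chain a user's in-video actions into a
# time-ordered sequence. Field names are illustrative, not A2Gen's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str   # e.g. "like", "comment", "follow"
    t: float    # seconds into the video when the action occurred

def action_sequence(actions):
    """Return action kinds ordered by when they happened in the video."""
    return [a.kind for a in sorted(actions, key=lambda a: a.t)]

# Two users perform the same actions, but in a different order:
user_a = [Action("like", 42.0), Action("follow", 12.5)]
user_b = [Action("follow", 55.0), Action("like", 8.0)]

print(action_sequence(user_a))  # ['follow', 'like'] — followed before liking
print(action_sequence(user_b))  # ['like', 'follow'] — liked first
```

Ordering by timestamp is what lets the model treat these two users differently, even though the unordered set of actions is identical.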
How A2Gen Works
A2Gen utilizes three primary components to process and predict user behavior:
Context-aware Attention Module (CAM): This module processes sequences by integrating item-specific contextual features. Unlike standard attention mechanisms, it computes the similarity between actions conditioned on the specific video content, making the model more sensitive to the material actually being watched.
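One simple way to condition attention on item context is to gate the keys with a content embedding before computing similarity. The sketch below illustrates that idea only; the function names, shapes, and the element-wise gating are assumptions, and the paper's CAM is more elaborate:

```python
# Minimal sketch of context-aware attention: scores are modulated by an
# item-context vector. Shapes and the gating scheme are assumptions.
import numpy as np

def context_aware_attention(q, k, v, ctx):
    """Scaled dot-product attention whose keys are gated by item context.

    q, k, v: (seq_len, d) query/key/value matrices for the action sequence.
    ctx:     (d,) embedding of the video's content features.
    """
    k_ctx = k * ctx  # inject content features so similarity depends on the video
    scores = q @ k_ctx.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
ctx = rng.normal(size=(8,))
out = context_aware_attention(q, k, v, ctx)
print(out.shape)  # (4, 8)
```

With a different `ctx`, the same action sequence produces different attention weights, which is the property the CAM description calls for.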
Hierarchical Sequence Encoder (HSE): This component learns from a user’s historical actions, extracting long-term patterns and habits to better inform current predictions.
Action-seq Autoregressive Generator (AAG): This module acts as the engine for prediction. It generates the user’s future action sequence step-by-step, predicting both the type of action (e.g., a "like") and the exact time it is likely to occur. By using an autoregressive approach, the model uses the context of previous actions to improve the accuracy of subsequent ones.
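The autoregressive loop itself can be sketched in a few lines. The predictor below is a deterministic stand-in stub, not A2Gen's learned network; only the control flow, generating one (action, time) pair per step conditioned on the history, reflects the description above:

```python
# Toy sketch of autoregressive action-sequence generation: each step
# predicts the next (action, time) pair conditioned on all previous steps.
# `toy_next_step` is a stand-in stub, not A2Gen's actual model.

ACTIONS = ["like", "comment", "follow", "stop"]

def toy_next_step(history):
    """Stand-in for a learned predictor: returns (action, timestamp)."""
    step = len(history)
    if step >= 3:
        return ("stop", None)
    last_t = history[-1][1] if history else 0.0
    return (ACTIONS[step], last_t + 10.0)  # deterministic placeholder

def generate(max_steps=10):
    history = []
    for _ in range(max_steps):
        action, t = toy_next_step(history)  # condition on prior actions
        if action == "stop":
            break
        history.append((action, t))
    return history

print(generate())  # [('like', 10.0), ('comment', 20.0), ('follow', 30.0)]
```

In the real model, the stub would be replaced by a network that scores action types and timestamps; the key point is that each prediction feeds back into the context for the next one.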
Real-World Impact and Results
The researchers evaluated A2Gen both offline, on datasets from Kuaishou and Tmall, and through large-scale online A/B testing on the Kuaishou platform. The results demonstrate that by leveraging the temporal structure of user actions, the model significantly outperforms traditional recommendation methods. In live production serving over 400 million users daily, the model achieved a 0.34% increase in watch time, an 8.1% improvement in interaction rates, and a 0.162% boost in user retention. These findings confirm that moving beyond holistic video modeling toward fine-grained, sequence-based generation is a highly effective strategy for modern content platforms.