
Paper Abstract

With the rapid development of the Internet, users have increasingly higher expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, face limitations in accurately capturing such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates through statistical analysis and examination of action patterns that the timing of user actions can represent diverse intentions. Based on this insight, we propose a novel modeling paradigm: the Action-Aware Generative Sequence Network (A2Gen), which refines user actions along the temporal dimension and chains them into sequences for unified processing and prediction. First, we introduce the Context-aware Attention Module (CAM) to model action sequences enriched with item-specific contextual features. Building upon this, we develop the Hierarchical Sequence Encoder (HSE) to learn temporal action patterns from users' historical actions. Finally, by leveraging CAM, we design a module for action sequence generation: the Action-seq Autoregressive Generator (AAG). Extensive offline experiments on the Kuaishou dataset and the Tmall public dataset demonstrate the superiority of our proposed model. Furthermore, through large-scale online A/B testing deployed on Kuaishou's platform, our model achieves significant improvements over baseline methods in multi-task prediction by leveraging sequential information. Specifically, it yields increases of 0.34% in user watch time, 8.1% in interaction rate, and 0.162% in overall user retention (LifeTime-7), leading to successful deployment across all traffic, serving over 400 million users every day.

Action-Aware Generative Sequence Modeling for Short Video Recommendation
Modern recommendation systems often struggle with short videos because they treat each video as a single, holistic unit. In reality, a video is composed of many different segments, and a user’s interest can shift significantly from one moment to the next. For example, a user might enjoy a specific highlight in a video but be indifferent to the rest. This paper introduces the Action-Aware Generative Sequence Network (A2Gen), a new modeling paradigm that treats user consumption as a temporal process. By analyzing the timing of specific actions—such as likes, comments, or follows—the model identifies which parts of a video truly resonate with a user, allowing for more accurate and personalized recommendations.

Capturing Fine-Grained User Intent

The core insight of this research is that user actions are not random; they are tied to specific moments in a video. Statistical analysis shows that user actions often cluster around video highlights. By chaining these actions into a time-ordered sequence, the model can distinguish between different user attitudes. For instance, a user who follows an author before liking a video shows a different intent than one who likes the video first. A2Gen captures these nuances by incorporating action timing into the modeling process, effectively decomposing the viewing experience into a sequence of meaningful signals rather than a single binary "like" or "dislike."
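To make the chaining step concrete, here is a minimal sketch of turning raw timestamped actions into the kind of time-ordered sequence described above. The `Action` record and its field names are illustrative assumptions, not the paper's actual feature schema:

```python
from dataclasses import dataclass

# Hypothetical representation of a user's actions within one video view.
@dataclass
class Action:
    kind: str   # e.g. "like", "comment", "follow"
    t: float    # seconds into the video when the action occurred

def to_sequence(actions):
    """Chain raw actions into a time-ordered sequence for sequence modeling."""
    return sorted(actions, key=lambda a: a.t)

raw = [Action("like", 12.4), Action("follow", 3.1), Action("comment", 18.0)]
seq = to_sequence(raw)
print([a.kind for a in seq])  # → ['follow', 'like', 'comment']
```

Ordering matters here: a "follow" that precedes a "like" is a different signal than the reverse, which is exactly the distinction a binary label would erase.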

How A2Gen Works

A2Gen utilizes three primary components to process and predict user behavior:

  • Context-aware Attention Module (CAM): This module processes sequences by integrating item-specific contextual features. Unlike standard attention mechanisms, it ensures that the similarity between actions is calculated based on the specific content of the video, making the model more sensitive to the actual material being watched.

  • Hierarchical Sequence Encoder (HSE): This component learns from a user’s historical actions, extracting long-term patterns and habits to better inform current predictions.

  • Action-seq Autoregressive Generator (AAG): This module acts as the engine for prediction. It generates the user’s future action sequence step-by-step, predicting both the type of action (e.g., a "like") and the exact time it is likely to occur. By using an autoregressive approach, the model uses the context of previous actions to improve the accuracy of subsequent ones.
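As a rough illustration of the autoregressive idea behind the AAG, the sketch below rolls out an action sequence one step at a time, conditioning each next action on the previous one. The hand-written transition table is a stand-in for a learned model, and timing prediction is omitted for brevity; none of this reflects the paper's actual architecture:

```python
import random

# Illustrative next-action distributions conditioned on the previous action.
TRANSITIONS = {
    "<bos>":   [("play", 0.9), ("<eos>", 0.1)],
    "play":    [("like", 0.5), ("comment", 0.2), ("<eos>", 0.3)],
    "like":    [("follow", 0.4), ("comment", 0.2), ("<eos>", 0.4)],
    "comment": [("follow", 0.3), ("<eos>", 0.7)],
    "follow":  [("<eos>", 1.0)],
}

def sample_next(prev, rng):
    """Sample the next action conditioned on the previous one."""
    r, acc = rng.random(), 0.0
    for action, p in TRANSITIONS[prev]:
        acc += p
        if r < acc:
            return action
    return "<eos>"

def generate(rng, max_len=6):
    """Autoregressively roll out an action sequence until <eos>."""
    seq, prev = [], "<bos>"
    for _ in range(max_len):
        prev = sample_next(prev, rng)
        if prev == "<eos>":
            break
        seq.append(prev)
    return seq

print(generate(random.Random(0)))
```

The key property shown is that each step's prediction feeds back in as context for the next, which is what lets earlier actions sharpen the accuracy of later ones.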

Real-World Impact and Results

The researchers evaluated A2Gen both offline, using datasets from Kuaishou and Tmall, and through large-scale online A/B testing on the Kuaishou platform. The results demonstrate that by leveraging the temporal structure of user actions, the model significantly outperforms traditional recommendation methods. In live production serving over 400 million users daily, the model achieved a 0.34% increase in watch time, an 8.1% improvement in interaction rate, and a 0.162% boost in user retention (LifeTime-7). These findings confirm that moving beyond holistic video modeling toward fine-grained, sequence-based generation is a highly effective strategy for modern content platforms.
