The paper "TailorMind: Towards Preference-Aligned Multimodal Content Generation" addresses the challenge of creating personalized content for users without relying on existing item pools or waiting for new user-generated content (UGC) to be published. Current generative models can create images or videos, but they often struggle to translate a user's behavioral history into content tailored to that user's tastes. The research introduces a framework that links collaborative preference modeling with controllable multimodal generation, so that synthesized content is both high quality and aligned with individual user preferences.
Bridging Preferences and Generation
The core of TailorMind is its ability to turn sparse, noisy user interaction histories into actionable, natural-language profiles. It uses hypergraph collaborative filtering to enrich a user's history by identifying connections between their interests and broader community trends. Once an initial profile is created, the system uses a "textual gradient descent" process. By treating the user profile as a piece of text, the system iteratively refines it based on how well it predicts the user's actual interactions, effectively "training" the profile to be more accurate through feedback.
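The refinement loop described above can be illustrated with a toy sketch. It is not the paper's implementation: in TailorMind the critique and the rewrite would presumably be produced by a language model, whereas here both are replaced by simple keyword rules so the example runs on its own. The names `Interaction`, `predict`, `critique`, `apply_feedback`, and `refine` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    tags: frozenset  # tags describing the item (assumed representation)
    liked: bool      # whether the user engaged positively with it


def predict(profile: str, item_tags: frozenset) -> bool:
    """Toy predictor: the user likes an item if the profile mentions any of its tags."""
    return bool(set(profile.lower().split()) & item_tags)


def accuracy(profile: str, history: list) -> float:
    """Fraction of past interactions the profile predicts correctly."""
    return sum(predict(profile, x.tags) == x.liked for x in history) / len(history)


def critique(profile: str, history: list) -> list:
    """Stand-in for the textual 'gradient': edit suggestions derived from prediction errors."""
    words = set(profile.lower().split())
    feedback = []
    for x in history:
        if predict(profile, x.tags) == x.liked:
            continue  # correct prediction, nothing to fix
        if x.liked:
            # Missed a liked item: suggest adding one of its tags.
            feedback.append(("add", sorted(x.tags)[0]))
        else:
            # Wrongly predicted a disliked item: suggest dropping the matching words.
            feedback.extend(("drop", t) for t in sorted(words & x.tags))
    return feedback


def apply_feedback(profile: str, feedback: list) -> str:
    """Stand-in for the rewrite step: apply the suggested edits to the profile text."""
    words = profile.lower().split()
    for action, tag in feedback:
        if action == "add" and tag not in words:
            words.append(tag)
        elif action == "drop" and tag in words:
            words.remove(tag)
    return " ".join(words)


def refine(profile: str, history: list, steps: int = 5) -> str:
    """Iteratively rewrite the profile, keeping a rewrite only if accuracy improves."""
    best = profile
    for _ in range(steps):
        feedback = critique(best, history)
        if not feedback:
            break
        candidate = apply_feedback(best, feedback)
        if accuracy(candidate, history) <= accuracy(best, history):
            break
        best = candidate
    return best


history = [
    Interaction(frozenset({"hiking", "travel"}), True),
    Interaction(frozenset({"gaming"}), False),
    Interaction(frozenset({"travel", "food"}), True),
]
refined = refine("gaming", history)
print(refined, accuracy(refined, history))
```

The key design point the sketch tries to capture is that the "parameters" being optimized are the profile text itself, and the update signal is feedback about mispredicted interactions rather than a numeric gradient.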
Ensuring Style and Cohesion
To prevent the generated content from drifting away from the user's intended style, TailorMind employs two key safeguards. First, it uses retrieval-augmented style control, which pulls examples from the user's own past interactions to ground the new content in authentic, familiar patterns. Second, it uses a cross-modal cohesion reflection mechanism. This acts as a quality check, monitoring the consistency between the generated text and visual elements (like images or videos) to ensure they remain semantically aligned and do not suffer from "semantic drift," where the output loses its original focus.
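As a rough illustration of these two safeguards, the sketch below retrieves the user's most similar past posts as style references and then checks whether generated text and a description of the generated visual still agree. Everything here is an assumption made for the example: a real system would likely use learned text and image embeddings, while this sketch uses bag-of-words vectors, a caption standing in for the visual, and an arbitrary threshold of 0.3. The function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_style_examples(query: str, past_posts: list, k: int = 2) -> list:
    """Return the k past posts most similar to the generation request."""
    q = embed(query)
    return sorted(past_posts, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def is_cohesive(generated_text: str, visual_caption: str, threshold: float = 0.3) -> bool:
    """Flag semantic drift: text and visual description should stay similar."""
    return cosine(embed(generated_text), embed(visual_caption)) >= threshold


past_posts = [
    "cozy autumn hiking trail photos",
    "budget ramen recipe",
    "autumn trail running gear",
]
print(retrieve_style_examples("autumn trail ideas", past_posts))
print(is_cohesive("autumn trail at sunrise", "a trail in autumn"))
print(is_cohesive("autumn trail at sunrise", "bowl of ramen"))
```

In a pipeline along these lines, the retrieved examples would be passed to the generator as style references, and a failed cohesion check would trigger regeneration or revision of the mismatched modality.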
Benchmarking Performance
To evaluate these capabilities, the researchers developed TailorBench, a new benchmark derived from real-world data from platforms such as Rednote, Bilibili, and Hupu. The framework is evaluated across five dimensions: coherence, novelty, aesthetic quality, hallucination (whether the output avoids fabricated information), and profiling accuracy. Experimental results show that TailorMind outperforms representative generation baselines, achieving higher aesthetic quality and novelty while maintaining strong cross-modal coherence. The system also showed significant gains in reranking performance, which suggests that synthesized content can be more relevant to users than content retrieved from existing items.
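The summary does not say which metric is used to measure reranking performance. As an assumption for illustration only, the sketch below computes NDCG@k, a common ranking metric that rewards placing relevant items near the top of a list. The relevance labels in the example are made up.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list, k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for two ranked lists (1 = relevant to the user).
relevant_first = [1, 0, 0]
relevant_last = [0, 0, 1]
print(ndcg_at_k(relevant_first, 3), ndcg_at_k(relevant_last, 3))
```

Under a metric like this, a generated item that matches the user's preferences would raise the score when it is ranked above less relevant retrieved items.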