Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Paper Abstract

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.

Utility-Aware Multimodal Contrastive Learning for Product Image Generation
This paper addresses a critical gap in current generative AI: while models are good at creating images that match text descriptions, they are not designed to optimize for business outcomes like sales or bookings. The authors introduce a "Utility-Aware" framework that integrates consumer demand data directly into the image generation process. By shifting how models learn to represent images and text, this approach aims to produce content that is both semantically accurate and designed to drive marketplace performance.

Bridging AI and Marketplace Success

Existing generative models prioritize aesthetic appeal and semantic alignment, often resulting in images that look "over-stylized" or unrealistic. In online marketplaces like Amazon or Airbnb, however, the most effective images are those that build consumer trust and influence purchasing decisions. The authors argue that current AI systems fail to distinguish between images that are merely eye-catching and those that actually drive conversions. Their framework addresses this by treating image generation as a utility-maximization problem, in which generation is guided by real-world demand patterns rather than by visual appeal alone.

How the Utility-Aware Framework Works

The core of the research is a new "Utility-Aware InfoNCE loss" function. Standard multimodal contrastive models use an InfoNCE-style contrastive loss to pull matching image-text pairs together in a shared representation space and push mismatched pairs apart. The authors modify this loss by adding a "utility" component: a score derived from demand-driven visual attributes like uniqueness, brightness, or professional quality. Trained with this combined objective, the model learns to prioritize visual features that have been statistically linked to higher demand. The component is modular, meaning it can be embedded into existing generative models to steer them toward more commercially effective outputs without requiring the user to engage in complex or repetitive prompt engineering.
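
As a rough illustration of the idea in this section, the following Python sketch computes a standard image-to-text InfoNCE loss and a utility-weighted variant of it. The weighting scheme (normalizing per-pair utility scores and using them as loss weights) is an assumption made for illustration, not the authors' published formulation. The function names, the temperature default, and the plain-list embeddings are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_terms(image_embs, text_embs, temperature=0.1):
    """Per-pair image-to-text InfoNCE terms.

    Pair i is the positive for image i; every other text in the batch
    acts as a negative. Each term is -log softmax(logits)[i].
    """
    terms = []
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        terms.append(log_denom - logits[i])
    return terms


def info_nce(image_embs, text_embs, temperature=0.1):
    """Standard InfoNCE: the unweighted mean of the per-pair terms."""
    terms = info_nce_terms(image_embs, text_embs, temperature)
    return sum(terms) / len(terms)


def utility_weighted_info_nce(image_embs, text_embs, utilities, temperature=0.1):
    """HYPOTHETICAL utility-aware variant (an assumption, not the paper's loss).

    Each pair's term is weighted by its normalized utility score, so pairs
    associated with higher demand contribute more to the objective.
    """
    if any(u < 0 for u in utilities) or sum(utilities) <= 0:
        raise ValueError("utilities must be non-negative with a positive sum")
    terms = info_nce_terms(image_embs, text_embs, temperature)
    total = sum(utilities)
    return sum((u / total) * t for u, t in zip(utilities, terms))
```

With equal utility scores the weighted variant reduces to the standard loss, which makes it easy to see the utility term as an add-on rather than a replacement. Other designs, such as adding the utility score to the positive-pair logit, would be equally plausible readings of "utility-aware".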

Proven Results in Real-World Applications

The researchers tested their framework on both Amazon and Airbnb, comparing it against state-of-the-art models like Stable Diffusion and Flux. The results showed that the Utility-Aware Generator consistently outperformed these baselines in both image generation and editing tasks. Notably, the model preserved an "inverse U-shaped" demand pattern for attributes such as aesthetics and uniqueness: demand rises with these attributes only up to a point, so the generated images were distinctive and appealing without becoming so over-polished that they lost their realism. Human-subject experiments confirmed these findings, with participants showing a higher preference for purchasing or booking products represented by the utility-aware images.
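
To make the "inverse U-shaped" pattern concrete, here is a minimal Python sketch of a concave quadratic demand curve over a single visual attribute. The functional form and the coefficients are hypothetical choices for illustration only; the paper's estimated demand model is not reproduced here.

```python
def demand(attribute, linear=2.0, quadratic=1.0):
    """Hypothetical demand curve: linear * a - quadratic * a**2."""
    return linear * attribute - quadratic * attribute ** 2


def peak_attribute(linear=2.0, quadratic=1.0):
    """Attribute level where the curve peaks.

    Setting the derivative (linear - 2 * quadratic * a) to zero gives
    a = linear / (2 * quadratic).
    """
    if quadratic <= 0:
        raise ValueError("an inverse U-shape needs a positive quadratic coefficient")
    return linear / (2.0 * quadratic)
```

Under these assumed coefficients, demand increases as the attribute moves toward the peak and falls off symmetrically beyond it, which is the behavior a generator would need to respect to avoid over-polishing an image.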

Managerial Implications

This framework offers a scalable way for sellers and platforms to improve their visual content. Instead of relying on expensive professional photography or trial-and-error with AI prompts, businesses can use this tool to generate images that are aligned with observed consumer demand. For platforms, the technology provides a systematic way to identify low-quality listings, suggest improvements, and conduct counterfactual testing to see how different visual presentations might affect sales before they are published.
