Generative Retrieval via Diffusion Transformer with Metric-Ordered Sequence Training and Hybrid-Policy Preference Optimization
This paper addresses a common challenge in large-scale retrieval systems: finding items that satisfy a specific attribute (such as safety or quality) while remaining consistent with a fine-grained "pattern" or style defined by a seed set of items. Standard retrieval methods often struggle with this balance; they either drift toward unrelated items that happen to share the target attribute or stay too close to the seed items, failing to find higher-quality examples. The authors introduce a framework called MO-DiT+HPPO, which uses continuous generative retrieval to synthesize a query embedding that moves toward higher attribute density while preserving the original pattern.
The Challenge of Pattern-Preserving Retrieval
In many production environments, users provide a small "seed set" of items to define a specific intent or style. The goal is to retrieve more items that match this style while also meeting a target attribute. The authors note that this creates a tension: simply averaging the seed embeddings keeps the pattern but results in low-quality attribute matches, while optimizing only for the attribute leads to "pattern drift," where the system retrieves relevant items that look nothing like the seeds. The researchers formalize this as "pattern-preserving attribute retrieval" and define a primary metric, Joint@K, to measure success in both areas simultaneously.
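The summary does not spell out how Joint@K is computed. One plausible reading, offered here only as an assumption, is the fraction of the top-K retrieved items that both carry the target attribute and belong to the seed pattern. The function name `joint_at_k` and the boolean label arrays below are hypothetical and are not taken from the paper.

```python
import numpy as np

def joint_at_k(retrieved_ids, has_attribute, in_pattern, k):
    """Assumed Joint@K: share of the top-k results that satisfy both conditions.

    retrieved_ids: item indices ordered by retrieval rank (best first).
    has_attribute: boolean array, True where an item has the target attribute.
    in_pattern:    boolean array, True where an item belongs to the seed pattern.
    """
    top = np.asarray(retrieved_ids)[:k]
    if top.size == 0:
        return 0.0
    # An item counts only if it meets both criteria at once.
    joint = has_attribute[top] & in_pattern[top]
    return float(joint.mean())

# Toy labels for five items (illustrative values, not data from the paper).
has_attribute = np.array([True, True, False, True, False])
in_pattern = np.array([True, False, True, True, False])
ranked = [0, 1, 2, 3, 4]

score = joint_at_k(ranked, has_attribute, in_pattern, k=4)
attribute_only = float(has_attribute[ranked[:4]].mean())
pattern_only = float(in_pattern[ranked[:4]].mean())
```

In this toy case three of the four top results have the attribute and three match the pattern, yet only two do both, which is why a joint metric is stricter than either criterion measured alone.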
A Staged Training Framework
The MO-DiT+HPPO framework uses a four-stage pipeline to train a diffusion transformer to generate effective query embeddings:
1. Raw-Sequence Pretraining: The model is first trained on large-scale data to learn a general prior for continuous retrieval.
2. Metric-Ordered Continuation Pretraining: The researchers use a lightweight predictor to rank items within latent pattern clusters by their predicted attribute density. Training on sequences that move from low-density to high-density items teaches the model the "direction" of improvement across different domains.
3. Tail-Centroid Fine-Tuning: The model is fine-tuned to map a sequence of seed items to the "centroid" (the average) of a high-performing tail of items, which helps the model focus on high-quality results without being overly sensitive to a single noisy example.
4. Hybrid-Policy Preference Optimization (HPPO): This final stage aligns the model with the true online objective. It uses a "hybrid-policy" candidate pool, which combines deterministic constructions with the model's own generated samples, and applies a Pareto filter. This filter ensures that updates proceed only if they improve the attribute metric without degrading pattern purity.
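The data preparation behind stages 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `tail_size` parameter, and the toy arrays are hypothetical, and the predicted densities stand in for the output of the lightweight predictor.

```python
import numpy as np

def metric_ordered_sequences(cluster_ids, predicted_density):
    """Group items by pattern cluster, then sort each group from low to high predicted density."""
    sequences = {}
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        order = np.argsort(predicted_density[members], kind="stable")
        sequences[int(c)] = members[order]
    return sequences

def tail_centroid(embeddings, ordered_ids, tail_size):
    """Mean embedding of the last `tail_size` items of an ascending sequence."""
    tail = ordered_ids[-tail_size:]
    return embeddings[tail].mean(axis=0)

# Toy data: five items in two pattern clusters (illustrative values only).
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
cluster_ids = np.array([0, 0, 0, 1, 1])
predicted_density = np.array([0.9, 0.1, 0.5, 0.2, 0.7])

seqs = metric_ordered_sequences(cluster_ids, predicted_density)
# Fine-tuning target for cluster 0: the average of its two highest-density items.
target = tail_centroid(embeddings, seqs[0], tail_size=2)
```

Averaging over a tail rather than taking the single top item is what makes the target less sensitive to one noisy example, as the description of stage 3 notes.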
Results and Performance
The researchers evaluated their framework across four large-scale attribute domains using strict item- and pattern-holdout protocols. The results showed that the metric-ordered training significantly improved the primary intersection metric (Joint@K) compared to a strong baseline. The addition of HPPO provided further gains, with the Pareto filter proving critical: it allowed the model to push the "attribute–pattern frontier" outward, meaning the system could achieve higher attribute density without sacrificing the consistency of the retrieved patterns.
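The acceptance rule behind the Pareto filter can be illustrated with a small sketch. It assumes each candidate query has already been scored on two axes and is compared against a reference query; the `Candidate` class, the `purity_tolerance` parameter, and all scores are hypothetical, and the paper's exact criterion may differ.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    attribute_score: float  # e.g. attribute density of the retrieved top-K
    pattern_purity: float   # e.g. share of the top-K that stays in the seed pattern

def pareto_filter(candidates, reference, purity_tolerance=0.0):
    """Keep candidates that beat the reference on attribute score without losing pattern purity."""
    return [
        c for c in candidates
        if c.attribute_score > reference.attribute_score
        and c.pattern_purity >= reference.pattern_purity - purity_tolerance
    ]

# Illustrative scores only; none of these numbers come from the paper.
reference = Candidate("seed_mean", 0.40, 0.90)
pool = [
    Candidate("sample_a", 0.55, 0.92),  # better attribute score, purity kept: accepted
    Candidate("sample_b", 0.70, 0.60),  # higher attribute score but pattern drift: rejected
    Candidate("sample_c", 0.35, 0.95),  # purity kept but no attribute gain: rejected
]
accepted = pareto_filter(pool, reference)
```

Only candidates that pass this check would contribute to a preference update, which is how the filter keeps attribute gains from being bought with pattern drift.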
Key Takeaways
The approach depends on keeping the training signal separate from the evaluation signal. The lightweight predictor is used only to order training sequences; all final metrics are calculated using real, online top-K vector retrieval. By combining continuous generation, which lets the model synthesize a query that need not coincide with any single item, with a filter that prevents pattern drift, the framework navigates the trade-off between attribute-seeking and pattern preservation.
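To make the idea of a synthesized query concrete, the sketch below runs top-K retrieval with a query vector that matches no stored item. It uses brute-force cosine similarity as a stand-in for the online vector retrieval; the function name `top_k_cosine` and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def top_k_cosine(query, item_embeddings, k):
    """Return indices of the k items most similar to the query by cosine similarity."""
    items = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = items @ q
    # Negate the scores so that argsort yields descending similarity.
    return np.argsort(-scores, kind="stable")[:k]

items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([0.9, 0.8])  # synthesized vector; it equals no row of `items`
top = top_k_cosine(query, items, k=2)
```

A production system would typically replace the brute-force scan with an approximate nearest-neighbor index, but the interface stays the same: a continuous query vector goes in and ranked item indices come out.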