AI Research

Domain-Specific Data Synthesis for LLMs via Minimal...

Paper Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

Large Language Models (LLMs) are highly effective when fine-tuned on domain-specific data, but gathering high-quality data for specialized fields is difficult. Current methods often require experts to write explicit, natural language instructions to guide data generation. This paper introduces DOMINO, a framework that shifts from this "deductive" approach to an "inductive" one. Instead of requiring formal definitions, DOMINO learns the characteristics of a target domain by observing a small set of reference examples, allowing it to synthesize new, diverse data without manual prompt engineering.

Learning from Examples, Not Instructions

Traditional data synthesis relies on human-written prompts to define a domain. This breaks down in real-world scenarios where domain knowledge is implicit, such as proprietary business logic, emerging scientific fields, or niche cultural trends that are difficult to articulate in words. DOMINO addresses this by treating domain adaptation like human learning: it observes a handful of canonical examples and induces the underlying rules and patterns of the domain. This makes it possible to synthesize data at scale for domains that lack formal documentation.

The Power of Minimal Sufficient Representation

A common pitfall when learning from a small set of examples is overfitting, where a model simply memorizes the specific details of the training data rather than learning the broader domain principles. To solve this, DOMINO uses a two-part strategy:

Prompt Tuning: It uses soft tokens to capture the general domain characteristics from reference samples.
Contrastive Disentanglement: It separates these shared domain patterns from the unique, "noisy" details of individual samples.

By optimizing for both reconstruction fidelity and information minimality, the framework ensures the model focuses on the "first principles" of the domain. This prevents the model from mistaking incidental stylistic choices or specific facts for the core rules of the domain.
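
To make the contrastive idea concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not DOMINO's actual objective, which this summary does not spell out. The function names, the toy 2-D vectors, and the temperature value are all assumptions chosen for illustration. The loss is small when an anchor representation sits close to its positive (shared domain pattern) and far from its negatives, and large otherwise.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log(exp(s_pos/t) / (exp(s_pos/t) + sum exp(s_neg/t))).

    Illustrative only; the temperature and the choice of cosine similarity
    are assumptions, not details taken from the paper.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical toy vectors: a domain-level code, one representation that
# shares the domain pattern, and two that do not.
domain_code = [1.0, 0.0]
in_domain = [0.9, 0.1]
off_domain = [[0.0, 1.0], [-1.0, 0.2]]

aligned_loss = info_nce(domain_code, in_domain, off_domain)
misaligned_loss = info_nce(domain_code, off_domain[0], [in_domain])
print(aligned_loss < misaligned_loss)
```

In a real prompt-tuning setup the vectors would come from a language model and the gradient of such a loss would update only the soft tokens; that wiring is omitted here.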

Expanding Data Diversity

Theoretically, the authors prove that DOMINO’s approach expands the "support" of the synthetic data distribution. In simpler terms, because the model is forced to discard sample-specific noise, it is less likely to produce repetitive or near-identical copies of the reference data. This leads to a more diverse set of synthetic samples that better cover the potential range of the target domain.
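
One simple way to check for the near-copying described above is to measure how many synthetic samples overlap heavily with a reference sample. The sketch below is a hypothetical diagnostic, not a metric reported in the paper: it uses Jaccard similarity over word trigrams, and the function names, the trigram size, and the 0.8 threshold are assumptions.

```python
def ngrams(text, n=3):
    """Set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; empty inputs are treated as non-matching."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def near_duplicate_rate(synthetic, references, n=3, threshold=0.8):
    """Fraction of synthetic samples whose n-gram overlap with any
    reference sample reaches the threshold."""
    if not synthetic:
        raise ValueError("synthetic must contain at least one sample")
    ref_grams = [ngrams(r, n) for r in references]
    duplicates = 0
    for sample in synthetic:
        grams = ngrams(sample, n)
        if any(jaccard(grams, ref) >= threshold for ref in ref_grams):
            duplicates += 1
    return duplicates / len(synthetic)


# Hypothetical toy data: one verbatim copy and one genuinely new sample.
references = ["def add(a, b): return a + b"]
synthetic = [
    "def add(a, b): return a + b",
    "def mul(x, y): return x * y",
]
rate = near_duplicate_rate(synthetic, references)
print(rate)
```

A lower rate suggests the generator is producing samples beyond the reference set, though it says nothing on its own about whether those samples stay in-domain.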

Performance on Coding Benchmarks

The researchers tested DOMINO on challenging coding benchmarks where domain definitions are typically implicit. Fine-tuning on data synthesized by DOMINO improved Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones. The authors present these results as evidence that the framework is an effective and robust approach to domain adaptation, bridging the gap between limited reference data and the need for high-quality, diverse training sets.
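
For readers unfamiliar with the metric, Pass@1 is the k = 1 case of pass@k, the probability that at least one of k sampled solutions to a problem passes its unit tests. The sketch below uses the commonly cited unbiased estimator introduced with the HumanEval benchmark; this summary does not say how the paper itself computes the score, so treat the function names and the toy counts as assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled, c: number that passed, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k=1):
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must contain at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: two problems, 10 samples each, 3 and 5 passing.
score = mean_pass_at_k([(10, 3), (10, 5)], k=1)
print(score)
```

With k = 1 the estimator reduces to c / n per problem, so Pass@1 is the average fraction of sampled solutions that pass.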
