Large Language Models (LLMs) are highly effective when fine-tuned on domain-specific data, but gathering high-quality data for specialized fields is difficult. Current methods often require experts to write explicit, natural language instructions to guide data generation. This paper introduces DOMINO, a framework that shifts from this "deductive" approach to an "inductive" one. Instead of requiring formal definitions, DOMINO learns the characteristics of a target domain by observing a small set of reference examples, allowing it to synthesize new, diverse data without manual prompt engineering.
Learning from Examples, Not Instructions
Traditional data synthesis relies on human-written prompts to define a domain. This breaks down in real-world scenarios where domain knowledge is implicit, such as proprietary business logic, emerging scientific fields, or niche cultural trends that are difficult to articulate in words. DOMINO addresses this by treating domain adaptation like human learning: it observes a handful of canonical examples and induces the underlying rules and patterns of the domain. By doing so, it enables the creation of virtually unlimited synthetic data for domains that lack formal documentation.
The Power of Minimal Sufficient Representation
A common pitfall when learning from a small set of examples is overfitting, where a model simply memorizes the specific details of the training data rather than learning the broader domain principles. To solve this, DOMINO uses a two-part strategy:
Prompt Tuning: It uses soft tokens to capture the general domain characteristics from reference samples.
Contrastive Disentanglement: It separates these shared domain patterns from the unique, "noisy" details of individual samples.
By optimizing for both reconstruction fidelity and information minimality, the framework ensures the model focuses on the "first principles" of the domain. This prevents the model from mistaking incidental stylistic choices or specific facts for the core rules of the domain.
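As a rough illustration of how these two pressures might be combined, the sketch below pairs a generic InfoNCE-style contrastive term with a weighted sum of a reconstruction loss and a minimality penalty. The summary does not give the actual loss, so the function names (`info_nce`, `combined_objective`), the `temperature` and `beta` parameters, and the choice of InfoNCE itself are assumptions made for illustration, not DOMINO's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (an assumed stand-in, not the paper's loss).

    The loss is small when the anchor is closer to the positive than to
    every negative, and large otherwise.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def combined_objective(reconstruction_loss, minimality_penalty, beta=0.5):
    """Hypothetical trade-off: fit the references, but penalize extra information.

    `beta` is an assumed weighting; a larger value pushes the learned
    representation to drop more sample-specific detail.
    """
    return reconstruction_loss + beta * minimality_penalty
```

In this toy setup, a shared domain embedding would serve as the positive for each sample and sample-specific embeddings as the negatives, so lowering the contrastive term pulls the representation toward what the references have in common.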
Expanding Data Diversity
Theoretically, the authors prove that DOMINO’s approach expands the "support" of the synthetic data distribution. In simpler terms, because the model is forced to discard sample-specific noise, it is less likely to produce repetitive or near-identical copies of the reference data. This leads to a more diverse set of synthetic samples that better cover the potential range of the target domain.
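One simple way to check for repetitive or near-identical copies in practice is to measure n-gram overlap between each synthetic sample and the reference set. The sketch below uses Jaccard similarity over whitespace-token trigrams; the metric, the `threshold` of 0.8, and the helper names are illustrative assumptions, not the evaluation used by the authors.

```python
def ngrams(text, n=3):
    """Set of whitespace-token n-grams in a string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Both texts are shorter than n tokens; treat them as not comparable.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def near_duplicate_rate(synthetic, references, threshold=0.8, n=3):
    """Fraction of synthetic samples that closely match any reference sample."""
    if not synthetic:
        return 0.0
    flagged = sum(
        1
        for sample in synthetic
        if any(jaccard(sample, ref, n) >= threshold for ref in references)
    )
    return flagged / len(synthetic)
```

A lower near-duplicate rate at a fixed threshold is consistent with broader coverage of the domain, although it does not by itself show that the new samples are correct or on-domain.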
Proven Performance in Coding
The researchers tested DOMINO on challenging coding benchmarks where domain definitions are typically implicit. Fine-tuning models on data synthesized by DOMINO improved Pass@1 accuracy by up to 4.63% compared to strong, instruction-tuned baselines. These results suggest that the framework is a scalable approach to domain adaptation, bridging the gap between limited reference data and the need for high-quality, diverse training sets.
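For readers unfamiliar with the metric, Pass@1 is the k = 1 case of pass@k: the probability that at least one of k sampled solutions passes all unit tests. The sketch below implements the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), for a single problem with n samples of which c are correct. Whether the authors computed Pass@1 exactly this way is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that pass all tests
    k: number of attempts allowed (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # which makes the estimate exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over a benchmark; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the fraction of sampled solutions that pass, averaged over the problems in the benchmark.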