Key Takeaways

  • Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training.
  • Effective data selection for mid-training requires both scalability and source-adaptive semantic criteria.
  • Existing model-based methods scale well, but provide only implicit quality signals.
  • Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats.
Paper Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
Modern Large Language Models (LLMs) often undergo a "mid-training" phase, a stage in which models are trained on large, diverse datasets to sharpen specific skills like coding or reasoning before final fine-tuning. Because these datasets are highly heterogeneous, mixing everything from raw web documents to complex agent trajectories, it is difficult to select the best data for training. Existing methods either use scalable model-based filters whose quality signals are only implicit, or rely on fixed rubrics that do not account for the different requirements of different data types. MIRA (Mid-training Rubric Anchoring) addresses this by automatically discovering what makes data "high quality" for specific groups of sources and using that knowledge to filter the entire corpus efficiently.

Discovering Quality Through Rubrics

Instead of applying a single, one-size-fits-all quality score to every piece of data, MIRA first organizes the training data into source groups based on content. For each group, the system uses a highly capable "frontier teacher" model to analyze a small sample of data. The teacher is asked to describe freely what constitutes quality for that specific group, such as clear reasoning in a math problem or correct syntax in a code snippet. MIRA then clusters these observations into a set of "anchor rubrics." This allows the system to define quality based on the actual characteristics of the data rather than relying on human-defined rules that might not apply to every source.
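The clustering step can be illustrated with a small sketch. The summary does not say how MIRA clusters the teacher's observations, so the greedy token-overlap grouping below is a stand-in, and every name in it (`cluster_observations`, the `threshold` value, the sample observations) is hypothetical.

```python
def tokens(text):
    """Lowercased content words; words of three letters or fewer are dropped."""
    return {w.strip(".,;:").lower() for w in text.split() if len(w.strip(".,;:")) > 3}


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_observations(observations, threshold=0.3):
    """Greedily group free-form teacher observations by token overlap.

    Assumption: this is a simple stand-in for whatever clustering the
    paper uses. Each resulting cluster plays the role of one anchor rubric.
    """
    clusters = []
    for obs in observations:
        obs_tokens = tokens(obs)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(obs_tokens, cluster["tokens"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"tokens": set(obs_tokens), "members": [obs]})
        else:
            best["members"].append(obs)
            best["tokens"] |= obs_tokens
    return clusters


# Hypothetical teacher observations for one source group.
observations = [
    "code compiles without syntax errors",
    "no syntax errors when code compiles",
    "explanation states the reasoning clearly",
    "reasoning in explanation is stated clearly",
]
anchor_rubrics = cluster_observations(observations)
for rubric in anchor_rubrics:
    print(rubric["members"])
```

In practice an embedding-based clustering would likely be more robust than word overlap; the point here is only that free-form observations collapse into a small set of rubric dimensions per source group.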

Scaling Up with Student Scorers

Because using a frontier teacher model to evaluate tens of millions of records is computationally expensive, MIRA uses a distillation process. Once the anchor rubrics are established, the system uses them to generate structured labels for a larger sample of data. These labels are then used to train "student scorers"—lightweight, specialized models that can quickly evaluate the entire corpus. By training a separate student for each source group, MIRA ensures that the scoring remains accurate and tailored to the specific capability being taught, while remaining fast enough to handle massive datasets.
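A minimal sketch of the distillation idea follows, assuming each record has already been turned into a numeric feature vector and the teacher has assigned it a scalar rubric score. The student architecture is not described in this summary, so a tiny linear model fit by gradient descent stands in for it; `fit_linear`, `train_students`, `score`, the features, and the labels are all hypothetical.

```python
def fit_linear(X, y, lr=0.5, epochs=500):
    """Fit y ~ b + w.x by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_students(teacher_labels):
    """Train one student per source group.

    teacher_labels: {group_name: [(features, teacher_score), ...]}
    Assumption: a linear model stands in for the real student scorer.
    """
    students = {}
    for group, rows in teacher_labels.items():
        X = [features for features, _ in rows]
        y = [teacher_score for _, teacher_score in rows]
        students[group] = fit_linear(X, y)
    return students


def score(student, features):
    """Apply a trained student to one record's features."""
    w, b = student
    return b + sum(wj * xj for wj, xj in zip(w, features))


# Hypothetical teacher-labeled sample: features are [has_tests, has_docstring].
teacher_labels = {
    "code": [([0, 0], 1.0), ([1, 0], 2.0), ([0, 1], 3.0), ([1, 1], 4.0)],
}
students = train_students(teacher_labels)
print(score(students["code"], [1, 1]))
```

Keeping one student per source group means each scorer only has to learn the rubric dimensions relevant to its own data, which is the property the paragraph above describes.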

Calibrated Filtering

Even with specialized student scorers, some data sources or specific quality dimensions may be harder to predict than others. To prevent unreliable signals from skewing the training data, MIRA applies a "reliability mask." It compares the student’s scores against the teacher’s judgments on a validation set; if a specific dimension is consistently unreliable for a certain source, the system masks it out during the final aggregation. Finally, MIRA uses source-aware retention thresholds to select the best data. This ensures that the model maintains a balanced, high-quality mixture of data, preventing the system from accidentally discarding important, lower-scoring sources that are still vital for overall performance.
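The masking and retention steps can be sketched as follows. The agreement measure, the cutoff, the aggregation rule, and the retention fraction are not given in this summary, so Pearson correlation, a 0.5 cutoff, a plain mean, and a per-source `keep_fraction` are all assumptions, as are the function names and sample data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def reliability_mask(student_val, teacher_val, min_corr=0.5):
    """Mark each rubric dimension as reliable (True) or masked out (False).

    Both arguments map dimension -> scores on the same validation records
    for one source. Assumption: agreement is measured by Pearson correlation.
    """
    return {
        dim: pearson(student_val[dim], teacher_val[dim]) >= min_corr
        for dim in teacher_val
    }


def aggregate(dim_scores, mask):
    """Mean of the student scores over unmasked dimensions only."""
    kept = [s for dim, s in dim_scores.items() if mask.get(dim, False)]
    return sum(kept) / len(kept) if kept else 0.0


def select(records, mask, keep_fraction):
    """Keep the top fraction of one source's records by masked aggregate."""
    ranked = sorted(records, key=lambda r: aggregate(r[1], mask), reverse=True)
    k = max(1, round(len(ranked) * keep_fraction))
    return [record_id for record_id, _ in ranked[:k]]


# Hypothetical validation scores for one source.
student_val = {"correctness": [1, 2, 3, 4], "style": [4, 1, 3, 2]}
teacher_val = {"correctness": [1, 2, 3, 5], "style": [1, 2, 3, 4]}
mask = reliability_mask(student_val, teacher_val)

# Hypothetical student scores on that source's records.
records = [
    ("a", {"correctness": 1, "style": 5}),
    ("b", {"correctness": 4, "style": 1}),
    ("c", {"correctness": 3, "style": 1}),
    ("d", {"correctness": 2, "style": 5}),
]
selected = select(records, mask, keep_fraction=0.5)
print(mask, selected)
```

Calling `select` once per source, each with its own `keep_fraction`, is one way to realize source-aware retention: a source whose scores run low overall still keeps its best records instead of being wiped out by a single global cutoff.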

Performance and Impact

In experiments on code-oriented mid-training with 21 sources and 5 source groups, MIRA's source-aware filtering matched the performance of the full, unfiltered 50-billion-token corpus while using only half the data (25 billion tokens). Across nine coding benchmarks, MIRA also outperformed the other data selection baselines, which suggests that a source-adaptive approach to data quality can be more effective than global, generic filters.
