Hugging Face has officially released ml-intern, an open-source AI agent designed to autonomously manage the end-to-end post-training workflow for large language models. Built on the company’s smolagents framework, the tool is engineered to replicate the daily research loop of an ML engineer, handling complex tasks ranging from literature reviews and dataset discovery to script execution and iterative model evaluation.
Automating the Research Lifecycle
The agent functions as a continuous, autonomous loop that mirrors the methodology of human machine learning researchers. It begins by browsing arXiv and Hugging Face Papers to analyze methodology sections and citation graphs, identifying relevant techniques and datasets. Once it locates datasets on the Hugging Face Hub, the agent inspects their quality and reformats them for specific training requirements.
When local compute resources are unavailable, ml-intern can autonomously launch jobs via Hugging Face Jobs. The system relies on Trackio, a Hub-native experiment tracker, to monitor progress. After each training run, the agent evaluates the output, diagnoses potential failures—such as reward collapse in reinforcement learning pipelines—and iterates on the process until benchmark performance improves.
Performance and Benchmarking
In evaluations using PostTrainBench, a benchmark developed by researchers at the University of Tübingen and the Max Planck Institute, ml-intern demonstrated significant efficiency. The benchmark tests an agent's ability to post-train a base model within a 10-hour window on a single H100 GPU. In a launch demo, the agent improved the Qwen3-1.7B base model’s GPQA score from approximately 10% to 32% in under 10 hours, reaching the 27.5% mark in just over three hours.
These results position the agent as a highly data-efficient tool. The performance of ml-intern surpassed that of Claude Code, which holds a 22.99% benchmark on the same task. By extracting 32% performance from a 1.7B parameter model, the agent demonstrates capabilities that often exceed the manual efforts of researchers working within similar time constraints.
Advanced Technical Strategies
Beyond standard fine-tuning, ml-intern utilizes sophisticated technical strategies to optimize model performance. In a healthcare-domain test, the agent identified insufficient data quality and autonomously wrote a script to generate synthetic training examples, focusing on edge cases such as multilingual emergency responses and medical hedging language. It then upsampled this data to improve performance on HealthBench.
The agent also demonstrated the ability to implement Group Relative Policy Optimization (GRPO) in a math-domain test. By utilizing GRPO, which offers lower memory overhead than standard PPO, the agent launched training on A100 GPUs, monitored reward curves, and performed ablations to isolate the most effective components before finalizing the model checkpoint.

Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!