WinDOM: Self-Family Distillation for Small-Model GUI Grounding
Small GUI-grounding agents (around 2 billion parameters) are essential for running AI directly on devices, but they struggle with two major hurdles: the high cost of human-annotated training data and the lack of a standardized way to combine supervised fine-tuning with reinforcement learning. This paper introduces a new pipeline to address these issues, focusing on improving the performance of small models rather than simply increasing their size.
Creating Data Without Human Effort
The researchers developed WinDOM, a dataset of more than 54,000 records for training GUI agents. Instead of relying on expensive human annotators or noisy optical character recognition (OCR) tools, the team used a browser-based reimplementation of Windows 11. By running this environment headlessly, they could read precise bounding boxes directly from the page's Document Object Model (DOM). Because the labels come from the DOM rather than from a recognition step, the training data is accurate, and the whole dataset can be regenerated from scratch from a single seed, which makes the process highly reproducible.
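To make the DOM-to-label step concrete, here is a minimal sketch in Python. It is not the authors' generator: the record layout, the instruction template, the filtering rules, and the names `Record`, `rect_to_bbox`, and `build_records` are all assumptions made for illustration. The only external fact it relies on is that a browser reports an element's rectangle as `x`, `y`, `width`, and `height` in CSS pixels.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical training record: an instruction plus its target box."""
    instruction: str
    bbox: tuple  # (x1, y1, x2, y2), normalized to [0, 1]


def rect_to_bbox(rect, viewport_w, viewport_h):
    """Convert a DOMRect-style dict (x, y, width, height in CSS pixels)
    into a normalized (x1, y1, x2, y2) box, or None if it is unusable."""
    # Skip zero-area elements (assumed filtering rule).
    if rect["width"] <= 0 or rect["height"] <= 0:
        return None
    x1, y1 = rect["x"], rect["y"]
    x2, y2 = x1 + rect["width"], y1 + rect["height"]
    # Skip anything not fully inside the viewport (assumed filtering rule).
    if x1 < 0 or y1 < 0 or x2 > viewport_w or y2 > viewport_h:
        return None
    return (x1 / viewport_w, y1 / viewport_h, x2 / viewport_w, y2 / viewport_h)


def build_records(elements, viewport_w, viewport_h, seed, limit):
    """Select up to `limit` usable elements, reproducibly from one seed."""
    rng = random.Random(seed)  # local RNG, so no hidden global state
    candidates = []
    for el in elements:
        bbox = rect_to_bbox(el["rect"], viewport_w, viewport_h)
        if bbox is not None:
            # Assumed instruction template, for illustration only.
            candidates.append(Record(f"Click {el['label']}", bbox))
    rng.shuffle(candidates)
    return candidates[:limit]
```

Calling `build_records` twice with the same element list and the same seed returns the same records in the same order, which is the property that lets a dataset of this kind be rebuilt from a single seed.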
A Unified Approach to Training
To improve how these models learn, the authors introduced Self-Family Distillation (SFD). This method uses a "cold-start" phase in which a student model learns from a teacher. The teacher can be either a larger, frozen model from the same family or an exponential moving average (EMA) of the student itself. In both cases, the teacher is given a "hint" (the ground-truth bounding box) so that its demonstrations can guide the student. If the teacher’s output misses the target, the example is rejected, so the student learns only from high-quality, successful demonstrations.
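The two mechanisms described here, rejection of failed teacher outputs and an EMA teacher, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the `teacher(instruction, hint_bbox)` callable, the `"click"` output field, the choice to leave the hint out of the student's example, and the decay value are all hypothetical.

```python
def click_in_box(click, bbox):
    """Deterministic geometric check: does (x, y) land inside the box?"""
    x, y = click
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def collect_demonstrations(examples, teacher):
    """Keep only teacher outputs whose click lands inside the target box.

    `teacher` is a hypothetical callable (instruction, hint_bbox) -> dict
    with a "click" entry; the real interface is not given in the source.
    """
    kept = []
    for ex in examples:
        # The teacher sees the ground-truth box as a hint.
        output = teacher(ex["instruction"], ex["bbox"])
        if click_in_box(output["click"], ex["bbox"]):
            # Assumption: the student example omits the hint.
            kept.append({"instruction": ex["instruction"], "target": output})
        # Otherwise the example is rejected and never reaches the student.
    return kept


def ema_update(teacher_params, student_params, decay=0.99):
    """Standard EMA step: teacher <- decay * teacher + (1 - decay) * student.

    Parameters are plain name -> float dicts here; decay=0.99 is an
    illustrative default, not a value taken from the paper.
    """
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }
```

In the same-size mode, `ema_update` would be applied to the teacher after student updates, so the teacher trails the student as a smoothed copy instead of being a separate, larger model.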
Boosting Performance with RL
After the cold-start phase, the models undergo reinforcement learning using Group Relative Policy Optimization (GRPO). The researchers discovered that the "saturation depth" (how long the model is trained during the cold-start phase) is a critical hyperparameter. They found that initializing reinforcement learning from an "under-saturated" checkpoint, taken before the model has fully converged, leads to better results than starting from a fully trained one. This "Early-init RL" approach significantly improved performance over the base model on out-of-distribution benchmarks such as ScreenSpot-Pro and OSWorld-G.
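The core of GRPO is that each sampled response is scored relative to the other samples drawn for the same prompt, so no learned value model is needed. The sketch below shows only that normalization step, paired with a binary point-in-box reward. The reward form, the use of population standard deviation, and the `eps` constant are assumptions for illustration; the summary does not state the paper's exact reward or normalization.

```python
import statistics


def hit_reward(click, bbox):
    """Assumed binary reward: 1.0 if the click lands in the box, else 0.0."""
    x, y = click
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    When every sample gets the same reward, all advantages are 0, so that
    group contributes no learning signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of four sampled clicks where two hit the target yields advantages close to +1 for the hits and close to -1 for the misses. A group where all samples hit (or all miss) yields all zeros, which suggests one reason a less converged starting checkpoint could leave more signal for RL, though the source does not give this explanation itself.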
Key Takeaways
The study demonstrates that small models can achieve high-quality GUI grounding without needing massive external datasets or complex, learned reward models. By using deterministic, geometric checks (like verifying if a click lands inside a box) instead of LLM-based judges, the entire training pipeline remains cheap to audit and reproduce. Furthermore, the "same-size" EMA mode of distillation proved to be nearly as effective as using a larger teacher, offering a powerful way to improve performance without requiring an additional, larger model.