Paper Abstract

Small ($\sim$2B) GUI-grounding agents are attractive for on-device deployment, accessibility tooling, and low-cost iteration, but at this scale they face two open recipe questions: how to obtain bounding-box training data without expensive human annotation, and how to combine supervised fine-tuning with reinforcement learning. We address both, with the explicit goal of pushing small-model performance rather than scaling up. WinDOM is a $54{,}425$-record grounding corpus harvested by driving an open-source Windows 11 web reimplementation under headless Playwright, with bounding boxes read directly off the DOM and no OCR or human annotation. Self-Family Distillation (SFD) is a single rejection-sampling cold-start parameterised only by the teacher choice: either an EMA of the student (no external model) or a frozen larger same-family teacher. We then treat the saturation depth of the SFD cold-start as an explicit GRPO hyperparameter. On a Qwen3.5-2B student, the under-saturated cold-start is a better GRPO initialiser than the converged one: SFD-4B with Early-init RL gains $+5.4$ OOD-mean ($+3.5$ ScreenSpot-Pro, $+7.0$ OSWorld-G, $+5.8$ ScreenSpot-V2) over the base. The same-size EMA mode lands within roughly one OOD-mean point of the cross-size $4$B variant ($65.2$ vs $66.3$) without an external teacher.

WinDOM: Self-Family Distillation for Small-Model GUI Grounding
Small GUI-grounding agents (around 2 billion parameters) are attractive for on-device deployment, accessibility tooling, and low-cost iteration, but at this scale two recipe questions remain open: how to obtain bounding-box training data without expensive human annotation, and how to combine supervised fine-tuning with reinforcement learning. This paper introduces a pipeline that addresses both, with the explicit goal of improving small-model performance rather than scaling up.

Creating Data Without Human Effort

The researchers developed WinDOM, a grounding corpus of 54,425 records for training GUI agents. Instead of relying on expensive human annotators or noisy OCR (optical character recognition) tools, the team drove an open-source, browser-based reimplementation of Windows 11 under headless Playwright and read precise bounding boxes directly off the page's Document Object Model (DOM). Because the boxes come from the rendered layout rather than from a recognition step, the training data is accurate, and it can be regenerated from scratch using a single seed, making the process highly reproducible.
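The idea of reading boxes off the DOM can be illustrated with a minimal sketch. This is not the authors' harvesting code: the `GroundingRecord` type, the `to_record` and `harvest` helpers, the CSS selector, the viewport size, and the screenshot path are all assumptions made for illustration. Only the Playwright calls themselves (`sync_playwright`, `query_selector_all`, `bounding_box`, `get_attribute`, `inner_text`, `screenshot`) are real library functions.

```python
from dataclasses import dataclass


@dataclass
class GroundingRecord:
    """Hypothetical record type: a text label plus its on-screen box."""
    label: str
    bbox_xyxy: tuple  # (x1, y1, x2, y2) in viewport pixels


def to_record(label, box):
    """Convert a Playwright-style box dict into a record, or None if unusable.

    `box` has the shape returned by ElementHandle.bounding_box():
    {"x": ..., "y": ..., "width": ..., "height": ...}, or None when the
    element is not visible.
    """
    if box is None or box["width"] <= 0 or box["height"] <= 0:
        return None
    label = (label or "").strip()
    if not label:
        return None
    x1, y1 = box["x"], box["y"]
    return GroundingRecord(label, (x1, y1, x1 + box["width"], y1 + box["height"]))


def harvest(url, selector="button, a, [role=button]", viewport=(1280, 720),
            screenshot_path="screen.png"):
    """Collect labelled boxes from one page state (illustrative only)."""
    # Imported lazily so to_record() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    records = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            viewport={"width": viewport[0], "height": viewport[1]}
        )
        page.goto(url)
        for el in page.query_selector_all(selector):
            label = el.get_attribute("aria-label") or el.inner_text()
            rec = to_record(label, el.bounding_box())
            if rec is not None:
                records.append(rec)
        # The screenshot is what a grounding model would see alongside the boxes.
        page.screenshot(path=screenshot_path)
        browser.close()
    return records
```

Because the boxes come straight from the layout engine, no OCR or manual labelling step is involved; a real pipeline would additionally need to drive the UI through many states, which this sketch does not attempt.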

A Unified Approach to Training

To improve how these models learn, the authors introduced Self-Family Distillation (SFD), a single rejection-sampling "cold-start" phase in which a student model learns from a teacher. The only design choice is the teacher: either a larger, frozen model from the same family or an exponential moving average (EMA) of the student itself, which requires no external model. In both cases, the teacher is given a "hint" (the ground-truth bounding box) to guide the student. If the teacher’s output misses the target, the example is rejected, so the student learns only from successful demonstrations.
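The filtering step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes the teacher outputs a click point, that "missing the target" means the point falls outside the ground-truth box, and that model parameters can be represented as a flat dict of floats. The names `point_in_box`, `ema_update`, `rejection_sample`, and the `teacher_predict` callable are hypothetical.

```python
def point_in_box(point, box):
    """Return True if (x, y) lies inside the (x1, y1, x2, y2) box, edges included."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def ema_update(teacher, student, decay=0.99):
    """EMA teacher: teacher <- decay * teacher + (1 - decay) * student.

    Parameters are modelled as a dict of floats for illustration; a real
    model would apply the same rule tensor by tensor.
    """
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def rejection_sample(examples, teacher_predict):
    """Keep only demonstrations where the hinted teacher hits the target.

    `teacher_predict(instruction, hint=bbox)` is a hypothetical callable that
    returns a predicted click point. Rejected examples are simply dropped.
    """
    kept = []
    for ex in examples:
        pred = teacher_predict(ex["instruction"], hint=ex["bbox"])
        if point_in_box(pred, ex["bbox"]):
            kept.append({"instruction": ex["instruction"], "target": pred})
    return kept
```

In the EMA mode, `teacher_predict` would be backed by weights maintained with `ema_update`; in the larger-teacher mode, it would be backed by the frozen same-family model. The filter itself is the same in both cases.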

Boosting Performance with RL

After the cold-start phase, the models undergo reinforcement learning using Group Relative Policy Optimization (GRPO). The researchers treat the "saturation depth" of the cold-start (how long the model is trained during that phase) as an explicit GRPO hyperparameter. On a Qwen3.5-2B student, they found that initializing reinforcement learning from an "under-saturated" checkpoint (before the model has fully converged) leads to better results than starting from the converged one. With the 4B teacher, this "Early-init RL" approach gains +5.4 points of out-of-distribution (OOD) mean over the base model: +3.5 on ScreenSpot-Pro, +7.0 on OSWorld-G, and +5.8 on ScreenSpot-V2.
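For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalisation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, and the summary does not state which variant the paper uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise one prompt's rewards to zero mean and (roughly) unit scale.

    `rewards` holds one scalar per sampled response in the group, for example
    1.0 if a predicted click lands inside the target box and 0.0 otherwise.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled clicks for one instruction: two hits, two misses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

One consequence of this normalisation is that a group whose samples all receive the same reward yields all-zero advantages and so contributes no learning signal. This may be related to why a converged cold-start is a weaker initialiser, but that reading is a guess on our part; the summary reports the effect without explaining it.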

Key Takeaways

The study demonstrates that small models can achieve high-quality GUI grounding without needing massive external datasets or complex, learned reward models. By using deterministic, geometric checks (like verifying if a click lands inside a box) instead of LLM-based judges, the entire training pipeline remains cheap to audit and reproduce. Furthermore, the "same-size" EMA mode of distillation lands within roughly one OOD-mean point of the variant with the larger 4B teacher (65.2 vs 66.3), offering a way to improve performance without requiring an additional, larger model.
