"Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding" introduces a new framework, GUI-SD, designed to improve how autonomous agents interact with computer interfaces. GUI grounding—the ability to map natural language instructions to specific visual coordinates on a screen—is a fundamental skill for AI agents. While current methods often rely on reinforcement learning techniques like GRPO, these approaches are computationally expensive because they require multiple attempts (rollouts) per training example, and they often fail to provide useful feedback on difficult tasks. GUI-SD addresses these issues with a self-distillation approach that provides dense, step-by-step guidance from a single attempt, making training both faster and more accurate.
The Problem with Current Methods
Existing reinforcement learning methods for GUI grounding often fail when a task is difficult, as they may receive zero feedback across all attempts, leaving the model with no way to improve. Other approaches, such as "naive" self-distillation, attempt to solve this by having the model act as both a teacher and a student. However, this often leads to two major failures: "Distillation-to-SFT Collapse," where the teacher becomes too rigid and essentially forces the student to memorize answers rather than learn, and "Indiscriminate Optimization," where the model treats all parts of a coordinate equally, failing to prioritize the most important digits that determine a click's accuracy.
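The zero-feedback failure mode can be illustrated with a minimal sketch of group-relative advantage estimation in the style of GRPO (this is a simplified toy, not the paper's implementation): when every rollout in a group fails, all advantages are zero and the policy gradient vanishes.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-relative advantage (simplified): normalize each rollout's
    # reward against the mean and std of its group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Easy task: mixed outcomes yield nonzero advantages, a usable signal.
print(grpo_advantages([1, 0, 1, 0]))

# Hard task: every rollout fails, so every advantage is exactly zero
# and the model receives no gradient to learn from.
print(grpo_advantages([0, 0, 0, 0]))
```

This is why a dense, per-token teaching signal (as in self-distillation) can help precisely where sparse reward-based methods stall.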
How GUI-SD Works
GUI-SD solves these problems through two primary innovations. First, it uses "Visual Privileged Guidance." Instead of just giving the teacher the exact text coordinates of a target, the system provides a visual hint—a bounding box around the target and a Gaussian soft mask that highlights the area of interest. This allows the teacher to guide the student toward the correct location without simply handing over the answer, which keeps the learning process flexible and effective.
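A Gaussian soft mask of the kind described here can be sketched as follows. This is a hypothetical illustration (the function name, `sigma_scale` parameter, and exact parameterization are assumptions, not the paper's code): the mask peaks at the center of the target's bounding box and decays smoothly, giving the teacher a spatial hint rather than the literal answer.

```python
import numpy as np

def gaussian_soft_mask(height, width, box, sigma_scale=0.5):
    # box = (x0, y0, x1, y1) in pixel coordinates.
    # Build a soft mask that is highest at the box center and
    # spreads proportionally to the box size.
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1.0)
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)

mask = gaussian_soft_mask(64, 64, box=(20, 24, 32, 36))
# Values lie in (0, 1]; the peak sits at the box center (26, 30).
```

Such a mask could be overlaid on (or concatenated with) the screenshot given to the teacher, highlighting where to look without spelling out the coordinate string.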
Second, it employs "Entropy-Guided Optimization." This technique recognizes that not all coordinate digits are equally important. For example, an error in the hundreds digit of a coordinate is much more significant than an error in the units digit. GUI-SD assigns higher importance to these significant digits and uses the teacher's own confidence levels to filter out unreliable signals. By focusing on the most impactful and certain information, the model learns more efficiently.
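The digit-weighting idea can be made concrete with a small sketch (names and the entropy threshold are illustrative assumptions, not the paper's exact formulation): each digit of a coordinate is weighted by its positional magnitude, and digits where the teacher's distribution is high-entropy are skipped as unreliable.

```python
import math

def digit_weights(coord_str, base=10.0):
    # Weight each digit by positional magnitude: an error in the
    # hundreds place moves a click 100x farther than one in the
    # units place, so it deserves proportionally more attention.
    n = len(coord_str)
    raw = [base ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def entropy(probs):
    # Shannon entropy of a teacher distribution over digit tokens.
    return -sum(p * math.log(p) for p in probs if p > 0)

def weighted_distill_loss(student_logp, teacher_probs, coord_str,
                          entropy_cap=1.0):
    # Per-digit cross-entropy of student against teacher, scaled by
    # digit significance; high-entropy (uncertain) teacher digits
    # are filtered out entirely.
    weights = digit_weights(coord_str)
    loss = 0.0
    for w, lp, tp in zip(weights, student_logp, teacher_probs):
        if entropy(tp) > entropy_cap:  # unreliable teacher signal
            continue
        loss += w * -sum(t * l for t, l in zip(tp, lp))
    return loss

w = digit_weights("347")
# The hundreds digit dominates: weights are roughly [0.90, 0.09, 0.009].
```

The key design choice this sketch captures is that supervision strength tracks both *impact* (positional weight) and *reliability* (teacher confidence), rather than treating all coordinate tokens uniformly.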
Performance and Efficiency
Extensive testing across six major GUI grounding benchmarks shows that GUI-SD consistently outperforms both standard reinforcement learning methods and naive self-distillation. The framework achieves higher accuracy in complex tasks, such as identifying small targets on high-resolution screens. Furthermore, because GUI-SD requires only a single rollout rather than multiple, it is significantly more efficient, training approximately four times faster per epoch than traditional GRPO-based methods. These results suggest that providing dense, entropy-aware supervision is a highly effective way to train agents for precise, real-world computer interaction.