GUI agents—AI models designed to control computers by interacting with screens—have made significant progress in clicking on buttons and icons. However, these agents often struggle with "drag" actions, such as highlighting text, resizing windows, or moving sliders. These interactions are essential for many daily computer tasks, yet current datasets for training these models are too small to teach them how to perform these movements accurately.
Introducing DragOn
The researchers introduce DragOn, a new benchmark and training dataset specifically designed to improve how AI models handle drag-based interactions. The dataset covers four common GUI domains: text highlighting, spreadsheet cell selection, element resizing, and slider manipulation. It is significantly larger than previous collections, providing 286,000 training screenshots and 3.5 million individual tasks, along with a 2,000-example evaluation suite to test model performance.
Rendering-as-Supervision
To create such a large dataset efficiently, the authors developed a technique called "rendering-as-supervision." Instead of manually labeling screenshots, they extract ground-truth data directly from the software's own geometry. For example, when creating text-highlighting tasks, they use PDF coordinate data; for spreadsheets, they use a "probe" method that detects cell positions by temporarily changing their colors. This automated approach allows for high-precision, pixel-accurate labels at a fraction of the cost and effort required by manual annotation or OCR-based methods.
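The spreadsheet probe idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' pipeline: `render_sheet` is a hypothetical stand-in for a real spreadsheet renderer, the fixed cell size is invented, and images are plain nested lists of grayscale values. The sketch renders the sheet twice, once unchanged and once with a single cell recolored, then takes the bounding box of the pixels that changed as that cell's location.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]            # grayscale pixels, row-major
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), inclusive

CELL_W, CELL_H = 6, 3              # toy cell size in pixels (assumption)


def render_sheet(rows: int, cols: int,
                 probe: Optional[Tuple[int, int]] = None) -> Image:
    """Hypothetical renderer: white cells, with one optional probe cell painted black."""
    img = [[255] * (cols * CELL_W) for _ in range(rows * CELL_H)]
    if probe is not None:
        r, c = probe
        for y in range(r * CELL_H, (r + 1) * CELL_H):
            for x in range(c * CELL_W, (c + 1) * CELL_W):
                img[y][x] = 0
    return img


def diff_bbox(base: Image, probed: Image) -> Optional[Box]:
    """Return the bounding box of pixels that differ between two renders."""
    xs: List[int] = []
    ys: List[int] = []
    for y, (row_a, row_b) in enumerate(zip(base, probed)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # the probe changed nothing visible
    return (min(xs), min(ys), max(xs), max(ys))


def locate_cell(rows: int, cols: int, cell: Tuple[int, int]) -> Optional[Box]:
    """Locate a cell by recoloring it and diffing against the baseline render."""
    base = render_sheet(rows, cols)
    probed = render_sheet(rows, cols, probe=cell)
    return diff_bbox(base, probed)


if __name__ == "__main__":
    print(locate_cell(4, 5, (1, 2)))
```

Because the label comes from the rendered pixels themselves, it does not depend on OCR or a human annotator, which is the property the summary attributes to rendering-as-supervision.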
Performance and Results
The researchers tested several leading proprietary and open-weight models on the DragOn benchmark. Most frontier models score below 30% accuracy, indicating that drag grounding remains difficult for current vision-language models. To demonstrate the value of their data, the team fine-tuned a Qwen VLM on the DragOn dataset. This fine-tuned model outperformed all other tested frontier models, suggesting that specialized training data is a key missing piece for enabling more capable, agentic computer use.
Key Considerations
While DragOn provides a major step forward, the researchers note that drag actions involve different levels of complexity. Some tasks, like text highlighting, are direction-agnostic, while others, like resizing an element, require precise, ordered movements. Furthermore, some actions have "degrees of freedom"—meaning there are multiple valid ways to complete a drag—which the team addressed by defining canonical targets for their training and evaluation. By isolating these specific subtasks, the benchmark helps researchers identify exactly where and why GUI agents fail in real-world environments.
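The distinction between direction-agnostic and ordered drags can be made concrete with a small scoring sketch. This is an illustration under stated assumptions, not the benchmark's actual metric: the function name `drag_matches`, the Euclidean pixel tolerance, and the representation of a drag as a (start, end) pair compared against a single canonical target are all hypothetical.

```python
from math import hypot
from typing import Tuple

Point = Tuple[float, float]
Drag = Tuple[Point, Point]  # (start, end) in screen pixels


def _close(p: Point, q: Point, tol: float) -> bool:
    """True if two points lie within `tol` pixels of each other."""
    return hypot(p[0] - q[0], p[1] - q[1]) <= tol


def drag_matches(pred: Drag, target: Drag,
                 tol: float = 5.0, ordered: bool = True) -> bool:
    """Compare a predicted drag against a canonical target drag.

    ordered=True  : start must match start and end must match end
                    (e.g. resizing an element).
    ordered=False : the reversed drag is also accepted
                    (e.g. highlighting a span of text).
    The tolerance value is an assumption chosen for illustration.
    """
    forward = _close(pred[0], target[0], tol) and _close(pred[1], target[1], tol)
    if ordered:
        return forward
    backward = _close(pred[0], target[1], tol) and _close(pred[1], target[0], tol)
    return forward or backward


if __name__ == "__main__":
    target = ((10.0, 10.0), (100.0, 10.0))
    reversed_pred = ((101.0, 11.0), (9.0, 10.0))
    print(drag_matches(reversed_pred, target, ordered=False))
    print(drag_matches(reversed_pred, target, ordered=True))
```

Scoring against one canonical target keeps the check simple, at the cost of rejecting other drags that a real application might also accept.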