DragOn: A Benchmark and Dataset for Drag-Based GUI...

Key Takeaways

  • GUI agents—AI models designed to control computers by interacting with screens—have made significant progress in clicking on buttons and icons.
  • GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks.
  • While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions.
  • We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation.
Paper Abstract

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

GUI agents—AI models designed to control computers by interacting with screens—have made significant progress in clicking on buttons and icons. However, these agents often struggle with "drag" actions, such as highlighting text, resizing windows, or moving sliders. These interactions are essential for many daily computer tasks, yet current datasets for training these models are too small to teach them how to perform these movements accurately.

Introducing DragOn

The researchers introduce DragOn, a new benchmark and training dataset specifically designed to improve how AI models handle drag-based interactions. The dataset covers four common GUI domains: text highlighting, spreadsheet cell selection, element resizing, and slider manipulation. It is significantly larger than previous collections, providing 286,000 training screenshots and 3.5 million individual tasks, along with a 2,000-example evaluation suite to test model performance.

Rendering-as-Supervision

To create such a large dataset efficiently, the authors developed a technique called "rendering-as-supervision." Instead of manually labeling screenshots, they extract ground-truth data directly from the software's own geometry. For example, when creating text-highlighting tasks, they use PDF coordinate data; for spreadsheets, they use a "probe" method that detects cell positions by temporarily changing their colors. This automated approach allows for high-precision, pixel-accurate labels at a fraction of the cost and effort required by manual annotation or OCR-based methods.
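
The probe idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes only that a cell is rendered twice, once normally and once with a temporary probe color, and that the cell's pixel bounds are the bounding box of the pixels that changed. The `render` function is a toy stand-in for a real spreadsheet renderer, and all names here are hypothetical.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def changed_bbox(before: Image, after: Image) -> Optional[Box]:
    """Return the inclusive bounding box of pixels that differ, or None."""
    xs, ys = [], []
    for y, (row_before, row_after) in enumerate(zip(before, after)):
        for x, (p_before, p_after) in enumerate(zip(row_before, row_after)):
            if p_before != p_after:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def render(width: int, height: int, probe_rect: Optional[Box] = None,
           probe_color: Pixel = (255, 0, 255)) -> Image:
    """Toy renderer (assumption): white canvas with one optionally recolored cell."""
    img = [[(255, 255, 255) for _ in range(width)] for _ in range(height)]
    if probe_rect is not None:
        left, top, right, bottom = probe_rect
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                img[y][x] = probe_color
    return img


# Render once without the probe and once with the target cell recolored.
before = render(12, 8)
after = render(12, 8, probe_rect=(3, 2, 7, 4))

# The differing region recovers the cell's pixel bounds without any OCR.
bbox = changed_bbox(before, after)
print(bbox)
```

In a real pipeline the two images would be actual screenshots, and the recovered box would presumably be used to derive the start and end points of a cell-selection drag.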

Performance and Results

The researchers tested several leading proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models on the DragOn benchmark. Most frontier models score below 30% accuracy, which indicates that drag grounding remains difficult for modern vision-language models. To demonstrate the value of their data, the team fine-tuned a Qwen VLM on the DragOn training set. The fine-tuned model outperformed every frontier model tested, suggesting that specialized training data is a key missing piece for more capable agentic computer use.

Key Considerations

While DragOn provides a major step forward, the researchers note that drag actions involve different levels of complexity. Some tasks, like text highlighting, are direction-agnostic, while others, like resizing an element, require precise, ordered movements. Furthermore, some actions have "degrees of freedom"—meaning there are multiple valid ways to complete a drag—which the team addressed by defining canonical targets for their training and evaluation. By isolating these specific subtasks, the benchmark helps researchers identify exactly where and why GUI agents fail in real-world environments.
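
The distinction between direction-agnostic and ordered drags can be made concrete with a short sketch. The summary does not describe the benchmark's scoring rule, so everything here is an assumption for illustration: the per-coordinate pixel tolerance `tol`, the `ordered` flag, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Tuple

Point = Tuple[float, float]
Drag = Tuple[Point, Point]  # (start, end) in screen pixels


def _close(p: Point, q: Point, tol: float) -> bool:
    """True if p lies within tol pixels of q on both axes."""
    return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol


def drag_matches(pred: Drag, target: Drag, tol: float = 5.0,
                 ordered: bool = True) -> bool:
    """Compare a predicted drag to a canonical target (assumed rule).

    ordered=True  : start must match start and end must match end
                    (e.g. resizing an element).
    ordered=False : swapped endpoints are also accepted
                    (e.g. highlighting the same text span backwards).
    """
    forward = _close(pred[0], target[0], tol) and _close(pred[1], target[1], tol)
    if ordered:
        return forward
    backward = _close(pred[0], target[1], tol) and _close(pred[1], target[0], tol)
    return forward or backward


target = ((100.0, 200.0), (300.0, 200.0))
reversed_pred = ((301.0, 199.0), (98.0, 202.0))  # same span, dragged right to left

agnostic_ok = drag_matches(reversed_pred, target, ordered=False)
ordered_ok = drag_matches(reversed_pred, target, ordered=True)
print(agnostic_ok, ordered_ok)
```

Under these assumptions the reversed drag is accepted for a direction-agnostic task and rejected for an ordered one. That is the kind of difference a per-subtask benchmark can expose.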
