Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields introduces a new benchmark designed to test how well AI agents can perform complex, multi-step tasks within professional software. While many existing AI benchmarks focus on simple, short-horizon actions like web browsing, this project shifts the focus to "workflows": structured, goal-directed sequences of actions that mimic the work performed by professionals in specialized fields.
Bridging the Gap in AI Evaluation
Current AI benchmarks often rely on general-purpose software or short-horizon tasks, which fail to capture the reality of professional work. To address this, the researchers created a benchmark consisting of 338 tasks across 6 professional domains, such as data analysis, engineering, and finance. These tasks are designed to be realistic, requiring domain-specific knowledge and the use of specialized software. Each task is set up in a fully configured virtual machine, ensuring that the agent must navigate the software interface just as a human would, without shortcuts or external hints.
How the Benchmark Works
The construction of Workflow-GYM relies on a rigorous, multi-stage pipeline. First, domain experts contribute tasks based on their actual daily work to ensure authenticity. These tasks are then filtered to ensure they are complex (requiring at least 30 steps), domain-specific, and verifiable. The team uses a modular environment setup where professional software is pre-installed in virtual machines, allowing for consistent and repeatable testing. Finally, each task is validated through expert execution, automated quality checks, and preliminary agent testing to ensure that the instructions are clear and the goals are achievable.
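The filtering stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `CandidateTask` record, its field names, and the helper functions are hypothetical, and only the 30-step threshold and the three criteria (complexity, domain specificity, verifiability) come from the summary.

```python
from dataclasses import dataclass
from typing import List

# Minimum step count stated in the summary; everything else here is assumed.
MIN_STEPS = 30


@dataclass
class CandidateTask:
    """Hypothetical record for a task contributed by a domain expert."""
    task_id: str
    domain: str
    expert_step_count: int          # steps in the expert's reference run
    requires_domain_software: bool  # proxy for "domain-specific"
    has_verifier: bool              # proxy for "verifiable"


def passes_filter(task: CandidateTask) -> bool:
    """Keep a task only if it meets all three criteria."""
    return (
        task.expert_step_count >= MIN_STEPS
        and task.requires_domain_software
        and task.has_verifier
    )


def filter_tasks(tasks: List[CandidateTask]) -> List[CandidateTask]:
    """Return the candidate tasks that survive filtering, in input order."""
    return [t for t in tasks if passes_filter(t)]
```

In this sketch each criterion is reduced to a single field for brevity; in practice, judging domain specificity and verifiability would likely involve expert review rather than a boolean flag.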
Key Findings and Performance
The researchers tested several state-of-the-art AI models on these tasks and found that professional, long-horizon workflows remain a significant challenge. Even the top-performing models achieved only slightly above a 30% success rate, a stark contrast to the higher success rates these models often see on simpler, general-purpose benchmarks. The results also show a clear trend: as the number of steps required to complete a task increases, the success rate of the agents drops significantly.
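One way to examine the step-count trend is to group task outcomes into step-count buckets and compute a success rate per bucket. The sketch below assumes results are available as `(step_count, succeeded)` pairs; that input format, the function name, and the bucket size are assumptions for illustration and do not reflect the paper's evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_step_bucket(
    results: Iterable[Tuple[int, bool]],
    bucket_size: int = 20,
) -> Dict[Tuple[int, int], float]:
    """Group (step_count, succeeded) pairs into fixed-width step buckets.

    Returns a mapping from (low, high) inclusive step ranges to the
    fraction of tasks in that range that succeeded.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for steps, succeeded in results:
        low = (steps // bucket_size) * bucket_size
        totals[low] += 1
        wins[low] += int(succeeded)
    return {
        (low, low + bucket_size - 1): wins[low] / totals[low]
        for low in sorted(totals)
    }
```

A falling success rate across successive buckets would correspond to the trend the summary reports.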
Common Failure Modes
The study highlights several recurring reasons why current agents struggle with these professional tasks. Agents frequently exhibit "objective drift," where they lose sight of the original goal, or suffer from "workflow stage omission," where they skip necessary steps. Additionally, agents often struggle with error propagation, where a single mistake early in the process derails the rest of the workflow, and they show an insufficient understanding of how to operate specialized software. These findings suggest that the next generation of GUI-agent research must focus on improving long-range planning, memory, and the ability to handle complex, multi-step interactions.
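A simplified back-of-the-envelope model shows why error propagation hits long workflows so hard. The model is an illustrative assumption, not an analysis from the paper: suppose each step succeeds independently with probability $p$ and any single failure is unrecoverable. A task of $n$ steps then succeeds with probability

$$
P_{\text{success}}(n) = p^{\,n}.
$$

With a hypothetical per-step reliability of $p = 0.97$ (a value chosen purely for illustration),

$$
0.97^{30} \approx 0.40, \qquad 0.97^{60} \approx 0.16,
$$

so doubling the workflow length more than halves the chance of finishing. Real agents can sometimes recover from mistakes, and step outcomes are not independent, so this is only a rough intuition for the observed trend.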