Workflow-GYM: Towards Long-Horizon Evaluation of Co... | AI Research

Paper Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields introduces a new benchmark designed to test how well AI agents can perform complex, multi-step tasks within professional software. While many existing AI benchmarks focus on simple, short-term actions like web browsing, this project shifts the focus to "workflows"—structured, goal-directed sequences of actions that mimic the work performed by professionals in specialized fields.

Bridging the Gap in AI Evaluation

Current AI benchmarks often rely on general-purpose software or short-horizon tasks, which fail to capture the reality of professional work. To address this, the researchers created a benchmark consisting of 338 tasks across 6 professional domains, such as data analysis, engineering, and finance. These tasks are designed to be realistic, requiring domain-specific knowledge and the use of specialized software. Each task is set up in a fully configured virtual machine, ensuring that the agent must navigate the software interface just as a human would, without shortcuts or external hints.

How the Benchmark Works

The construction of Workflow-GYM relies on a rigorous, multi-stage pipeline. First, domain experts contribute tasks based on their actual daily work to ensure authenticity. These tasks are then filtered to ensure they are complex (requiring at least 30 steps), domain-specific, and verifiable. The team uses a modular environment setup where professional software is pre-installed in virtual machines, allowing for consistent and repeatable testing. Finally, each task is validated through expert execution, automated quality checks, and preliminary agent testing to ensure that the instructions are clear and the goals are achievable.
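The filtering stage described above can be sketched in a few lines of Python. This is only an illustration of the three stated criteria (at least 30 steps, domain-specific, verifiable). The `CandidateTask` fields, the helper names, and the sample tasks are all hypothetical, since the summary does not describe the benchmark's actual task schema or tooling.

```python
from dataclasses import dataclass

MIN_STEPS = 30  # complexity threshold stated in the summary


@dataclass
class CandidateTask:
    # Hypothetical schema: the real task format is not given in the summary.
    task_id: str
    domain: str
    expert_steps: int            # steps an expert needed to finish the task
    needs_domain_software: bool  # task cannot be done in general-purpose apps
    has_verifier: bool           # the final state can be checked automatically


def passes_filter(task: CandidateTask) -> bool:
    """Keep a task only if it is complex, domain-specific, and verifiable."""
    return (
        task.expert_steps >= MIN_STEPS
        and task.needs_domain_software
        and task.has_verifier
    )


def filter_tasks(tasks):
    """Return the candidate tasks that satisfy all three criteria."""
    return [t for t in tasks if passes_filter(t)]


# Made-up candidates, for illustration only.
candidates = [
    CandidateTask("fin-001", "finance", 42, True, True),
    CandidateTask("fin-002", "finance", 12, True, True),       # too short
    CandidateTask("eng-001", "engineering", 55, True, False),  # not verifiable
    CandidateTask("gen-001", "office", 40, False, True),       # not domain-specific
]

kept = filter_tasks(candidates)
print([t.task_id for t in kept])
```

In practice the later validation stages (expert execution, automated quality checks, preliminary agent runs) would act as further filters on `kept`; they are not modeled here.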

Key Findings and Performance

The researchers tested several state-of-the-art AI models on these tasks and found that professional, long-horizon workflows remain a significant challenge. Even the top-performing models achieved only slightly above a 30% success rate, a stark contrast to the higher success rates these models often see on simpler, general-purpose benchmarks. The results also show a clear trend: as the number of steps required to complete a task increases, the success rate of the agents drops significantly.
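One way to examine the step-count trend is to group task results by the number of steps required and compute a success rate per group. The sketch below assumes each result is a `(steps_required, succeeded)` pair; the function name, the bucket size, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def success_rate_by_bucket(results, bucket_size=20):
    """Group (steps_required, succeeded) pairs into step-count buckets.

    Returns a dict that maps each bucket's starting step count to the
    fraction of tasks in that bucket that succeeded.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [successes, attempts]
    for steps, succeeded in results:
        start = (steps // bucket_size) * bucket_size
        totals[start][0] += int(succeeded)
        totals[start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}


# Toy data, invented for illustration; not results from the benchmark.
toy_results = [
    (32, True), (35, False), (38, True),
    (45, False), (52, True), (58, False),
    (61, False), (70, False),
]

rates = success_rate_by_bucket(toy_results)
for start, rate in rates.items():
    print(f"{start}-{start + 19} steps: {rate:.2f}")
```

Fixed-width buckets keep the sketch simple; with real data, quantile-based buckets may be preferable so that each group holds a comparable number of tasks.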

Common Failure Modes

The study highlights several recurring reasons why current agents struggle with these professional tasks. Agents frequently exhibit "objective drift," where they lose sight of the original goal, or suffer from "workflow stage omission," where they skip necessary steps. Additionally, agents often struggle with error propagation—where a single mistake early in the process ruins the entire workflow—and demonstrate an insufficient understanding of how to operate specialized software. These findings suggest that the next generation of GUI-agent research must focus on improving long-range planning, memory, and the ability to handle complex, multi-step interactions.
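A simple back-of-the-envelope model helps show why error propagation hurts long workflows so much. It is not from the paper: it assumes every step succeeds independently with the same probability and that a single failed step ruins the task, with no recovery. The function name and the 0.96 per-step value are chosen purely for illustration.

```python
def workflow_success_probability(per_step_success: float, num_steps: int) -> float:
    """Probability of finishing all steps under an independence assumption.

    Assumes (hypothetically) that each step succeeds with the same
    probability and that one failed step fails the whole workflow.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# Illustrative per-step reliability; not a measured value.
for n in (10, 30, 60):
    print(n, round(workflow_success_probability(0.96, n), 3))
```

Under these assumptions, end-to-end success decays geometrically with workflow length, which is consistent in spirit with the reported drop on longer tasks. Real agents can sometimes detect and repair mistakes, so this should be read as intuition rather than as a fit to the benchmark's results.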
