ToolCUA: Towards Optimal GUI-Tool Path Orchestratio...

Key Takeaways

  • Computer Use Agents (CUAs) are designed to automate desktop tasks by interacting with graphical user interfaces (GUIs) through clicks and typing.
  • ToolCUA is an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm.
  • Tool-Bootstrapped GUI reinforcement finetuning (RFT) combines warmup supervised fine-tuning (SFT) with single-turn RL to improve decisions at critical GUI-Tool switching points.
  • ToolCUA is then optimized with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths.
  • Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale.

Paper Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. The work is open-sourced.

Computer Use Agents (CUAs) are designed to automate desktop tasks by interacting with graphical user interfaces (GUIs) through clicks and typing. While these agents can also use high-level tools (like API-based file operations), they often struggle to decide when to use a tool versus when to stick to standard GUI actions. This confusion leads to inefficient, brittle, or failed task execution. The paper introduces ToolCUA, an end-to-end agent trained to master this "hybrid action space" by learning how to orchestrate GUI actions and tool calls for optimal performance.

Scaling Data Without Manual Effort

A major hurdle in training these agents is the lack of high-quality examples showing how to switch between GUI actions and tool calls. Collecting real tool trajectories is costly and brittle. To address this, the researchers developed an "Interleaved GUI-Tool Trajectory Scaling Pipeline." This system takes existing, static GUI-only trajectories and synthesizes a grounded tool library from the actions already present in those recordings. By converting GUI-only sequences into hybrid trajectories, in which some steps are replaced by tool calls, the team created a large, diverse training dataset without manual engineering or real tool-trajectory collection.
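The conversion idea above can be illustrated with a small sketch. It is not the paper's pipeline: the names `GuiAction`, `ToolCall`, `ToolSpec`, and `to_hybrid` are hypothetical, and exact pattern matching stands in for the model-based tool synthesis the summary describes. It only shows how a run of GUI steps might collapse into a single tool call.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    kind: str    # e.g. "click" or "type"
    target: str  # hypothetical element label, or the text being typed


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class ToolSpec:
    """Hypothetical library entry: a tool and the GUI steps it stands in for."""
    name: str
    pattern: list[GuiAction]


Step = Union[GuiAction, ToolCall]


def to_hybrid(trajectory: list[GuiAction], library: list[ToolSpec]) -> list[Step]:
    """Replace every GUI sub-sequence that matches a tool pattern with a tool call."""
    hybrid: list[Step] = []
    i = 0
    while i < len(trajectory):
        for tool in library:
            n = len(tool.pattern)
            # Assumption: an exact match of the step sequence triggers the swap.
            if n and trajectory[i:i + n] == tool.pattern:
                hybrid.append(ToolCall(tool.name))
                i += n
                break
        else:
            # No tool covers this position, so the GUI step is kept as-is.
            hybrid.append(trajectory[i])
            i += 1
    return hybrid


gui_only = [
    GuiAction("click", "File menu"),
    GuiAction("click", "Save As"),
    GuiAction("type", "report.pdf"),
    GuiAction("click", "Save"),
    GuiAction("click", "Close"),
]
library = [ToolSpec("save_file_as", pattern=gui_only[:4])]
hybrid = to_hybrid(gui_only, library)
```

In this toy example the four save-related GUI steps become one `save_file_as` call, while the final click stays a GUI action, giving an interleaved trajectory of two steps.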

Training for Better Decision-Making

Once the data was prepared, the team trained the agent in two stages. First, they performed "Tool-Bootstrapped GUI Reinforcement Finetuning" (RFT): a warmup round of supervised fine-tuning (SFT) teaches the agent the basics of tool usage, and single-turn reinforcement learning then sharpens its decisions at critical "switching points," where it must choose between a GUI action and a tool call. Second, they employed "Online Agentic Reinforcement Learning" in a high-fidelity GUI-Tool environment. During this phase, the agent is guided by a "Tool-Efficient Path Reward," which encourages two things: using tools where they are appropriate for the task, and completing the task along a shorter execution path.
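The source does not give the formula for the Tool-Efficient Path Reward, so the following is only a sketch of how a reward with those two ingredients could be shaped. The `Episode` fields, the weights `w_tool` and `w_len`, and the rule that failed episodes earn nothing are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool            # did the agent complete the task?
    steps: int               # total GUI actions plus tool calls
    tool_calls: int          # number of tool calls made
    appropriate_calls: int   # tool calls judged appropriate (hypothetical label)


def path_reward(ep: Episode, max_steps: int,
                w_tool: float = 0.2, w_len: float = 0.2) -> float:
    """Hypothetical reward: task success plus bonuses for apt tool use and brevity."""
    if not ep.success:
        # Assumption: no shaping bonus unless the task itself succeeds.
        return 0.0
    # Fraction of tool calls that were appropriate (0 when no tool was used).
    tool_term = ep.appropriate_calls / ep.tool_calls if ep.tool_calls else 0.0
    # Shorter paths score higher; clamp so over-long episodes do not go negative.
    length_term = max(0.0, 1.0 - ep.steps / max_steps)
    return 1.0 + w_tool * tool_term + w_len * length_term


short_tool_path = Episode(success=True, steps=4, tool_calls=2, appropriate_calls=2)
long_gui_path = Episode(success=True, steps=16, tool_calls=0, appropriate_calls=0)
failed_path = Episode(success=False, steps=3, tool_calls=1, appropriate_calls=1)

r_short = path_reward(short_tool_path, max_steps=20)
r_long = path_reward(long_gui_path, max_steps=20)
r_failed = path_reward(failed_path, max_steps=20)
```

Under these assumed weights, both successful episodes earn the base reward, but the short path with appropriate tool calls scores higher than the long GUI-only path, which is the ordering such a reward is meant to induce.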

Achieving State-of-the-Art Results

The effectiveness of ToolCUA was tested on OSWorld-MCP, a benchmark for evaluating computer-use agents. ToolCUA achieved an accuracy of 46.85%, a relative improvement of approximately 66% over the baseline model and a new state of the art among models of comparable scale. It also improved by 3.9% over GUI-only settings, which the authors take as evidence of effective GUI-Tool orchestration. The results also showed that the agent could generalize its skills to unseen applications and platforms, such as different Linux tasks and Windows desktop apps.
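To make the "relative improvement" figure concrete, the short calculation below works backward from the two reported numbers. The baseline accuracy is not stated in this summary, so the value computed here is only what the reported figures imply, and it is approximate because the 66% is itself rounded. The function names are illustrative.

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, as a fraction of `base`."""
    return (new - base) / base


def implied_baseline(new: float, rel_gain: float) -> float:
    """Baseline that would yield `new` after a relative gain of `rel_gain`."""
    return new / (1.0 + rel_gain)


reported_accuracy = 46.85  # percent, from the abstract
reported_rel_gain = 0.66   # "approximately 66%", from the abstract

# Not a reported number: the baseline accuracy implied by the two figures above.
baseline_estimate = implied_baseline(reported_accuracy, reported_rel_gain)
round_trip = relative_improvement(reported_accuracy, baseline_estimate)
```

With these inputs the implied baseline is roughly 28%, so the reported gain corresponds to about 18 to 19 percentage points in absolute terms.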

Key Takeaways

The research suggests that the primary challenge for modern digital agents is not just the ability to use tools, but the ability to know when to use them. By moving away from simple step-by-step imitation and toward trajectory-level optimization, ToolCUA demonstrates that agents can learn to replace long, error-prone sequences of GUI clicks with precise, efficient tool calls. This approach provides a scalable path forward for creating more capable and reliable digital assistants for real-world desktop environments.
