Key Takeaways

  • Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.
  • Computer Use Agents (CUAs) are AI models designed to perform complex, multi-step tasks on computers by interacting with graphical user interfaces (GUIs).

Paper Abstract

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

A History-Aware Visually Grounded Critic for Computer Use Agents
Computer Use Agents (CUAs) are AI models designed to perform complex, multi-step tasks on computers by interacting with graphical user interfaces (GUIs). While these agents are powerful, they often struggle with long-horizon tasks, frequently making mistakes like clicking the wrong buttons or forgetting what they have already accomplished. The paper introduces HiViG, a framework designed to act as a "critic" that evaluates an agent's proposed actions before they are executed, preventing errors and keeping the agent on track.

The Problem with Current Agents

Existing critics for these agents fall short in two ways. First, they are "short-sighted": they focus on the immediate decision and lose track of earlier actions, so the agent struggles to maintain a coherent plan over a long task. Second, they lack "visual grounding." Many current systems judge the agent's own textual description of its intent rather than checking the actual screen. As a result, an agent can believe it is clicking the correct button while its coordinates actually land on empty space or on a different UI element entirely, and the critic does not catch the mismatch.

How HiViG Works

HiViG functions as a specialized, multimodal critic that integrates into the agent's decision-making loop. It provides two primary forms of support:

  • Macro-Action History: Instead of just remembering a long list of individual clicks, the system summarizes past interactions into "macro-achievements." This allows the agent to track its global progress toward a goal, helping it avoid redundant or circular decision-making.

  • Visually Grounded Critique: Before an action is executed, the critic verifies the raw pixel coordinates against the current screenshot. It uses a visual marker to see exactly where the agent intends to click. If the action is flawed, the critic identifies the specific error—such as a visual hallucination or a procedural mistake—and provides corrective feedback to the agent before the action takes place.
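
The two forms of support above can be sketched as a small critic-gated decision loop. This is a minimal illustration, not the paper's implementation: the real critic is a trained multimodal model, while `toy_critic` here is a hypothetical rule-based stand-in, the "screenshot" is a toy grid of element names instead of pixels, and every name (`Action`, `Critique`, `element_under_marker`, `critic_gated_step`) is invented for this example. It also records one achievement per approved click, whereas HiViG summarizes many low-level actions into each macro-achievement.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Toy stand-in for a screenshot: each cell names the UI element drawn there,
# and "" means empty space. A real critic would look at pixels instead.
Screenshot = List[List[str]]


@dataclass
class Action:
    target: str  # element the policy believes it is clicking
    x: int       # raw column coordinate proposed by the policy
    y: int       # raw row coordinate proposed by the policy


@dataclass
class Critique:
    approved: bool
    feedback: str = ""
    achievement: Optional[str] = None  # history entry to record if approved


def element_under_marker(screen: Screenshot, action: Action) -> Optional[str]:
    """Return what actually sits at the proposed coordinates (None if off-screen)."""
    if 0 <= action.y < len(screen) and 0 <= action.x < len(screen[action.y]):
        return screen[action.y][action.x]
    return None


def toy_critic(screen: Screenshot, action: Action, history: List[str]) -> Critique:
    """Hypothetical rule-based stand-in for the trained multimodal critic."""
    seen = element_under_marker(screen, action)
    if seen is None:
        return Critique(False, "Coordinates fall outside the screenshot.")
    if seen != action.target:
        what = seen or "empty space"
        return Critique(False, f"Marker lands on {what}, not {action.target}.")
    done = f"clicked {action.target}"
    if done in history:  # history check guards against circular behavior
        return Critique(False, f"Already done: {done}. Move to the next sub-goal.")
    return Critique(True, achievement=done)


def critic_gated_step(
    policy: Callable[[List[str], str], Action],
    screen: Screenshot,
    history: List[str],
    max_retries: int = 2,
) -> Optional[Action]:
    """Ask the policy for an action and release it only once the critic approves."""
    feedback = ""
    for _ in range(max_retries + 1):
        action = policy(history, feedback)
        verdict = toy_critic(screen, action, history)
        if verdict.approved:
            history.append(verdict.achievement)
            return action  # the caller would execute this action
        feedback = verdict.feedback  # corrective feedback for the next proposal
    return None  # nothing safe to execute


# Usage: the policy's first proposal misses the button, the second is corrected.
screen = [["", "Cancel", "Save"], ["", "", ""]]


def policy(history: List[str], feedback: str) -> Action:
    return Action("Save", x=2, y=0) if feedback else Action("Save", x=0, y=0)


history: List[str] = []
action = critic_gated_step(policy, screen, history)
```

In this toy run the first proposal points at empty space, so it is rejected before execution and the feedback steers the policy to the right coordinates. Calling `critic_gated_step` a second time with the same history returns `None`, because the recorded achievement marks the click as already done.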

Performance and Results

The researchers tested HiViG across three different computing environments: web (WebArenaLitev2), mobile (AndroidLab), and desktop (WindowsAgentArena). The framework was applied to both open-source models (Qwen3-VL-32B) and frontier closed-source models (Gemini-3-Flash).
HiViG consistently outperformed existing scalar and verbal critics. Compared with the strongest baseline, it raised the average success rate by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash. These gains suggest that proactive, visually grounded feedback helps agents complete complex tasks more reliably than critics that only return a scalar score or lack historical context.

Why This Matters

By intercepting errors before they occur, HiViG addresses the fact that many GUI actions are irreversible. Because the framework relies on raw visual input rather than platform-specific data like DOM trees or accessibility trees, it demonstrates strong cross-platform generalization. This makes it a versatile tool for improving the reliability of AI agents across the diverse range of interfaces found on modern computers and mobile devices.
