A History-Aware Visually Grounded Critic for Computer Use Agents
Computer Use Agents (CUAs) are AI models designed to perform complex, multi-step tasks on computers by interacting with graphical user interfaces (GUIs). While these agents are powerful, they often struggle with long-horizon tasks, frequently making mistakes like clicking the wrong buttons or forgetting what they have already accomplished. The paper introduces HiViG, a framework designed to act as a "critic" that evaluates an agent's proposed actions before they are executed, preventing errors and keeping the agent on track.
The Problem with Current Agents
Existing AI agents often fail for two reasons. First, they are "short-sighted": they lose track of previous steps and struggle to maintain a coherent plan over long tasks. Second, they lack "visual grounding." Many current systems rely on the agent's own textual description of its intent rather than checking the actual screen. As a result, an agent may believe it is clicking the correct button when it is actually clicking empty space or a different UI element entirely.
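The grounding failure described above can be made concrete with a toy hit test. This is only an illustrative sketch, not HiViG's method: HiViG works from raw screenshots, whereas this example assumes a hypothetical list of named bounding boxes (the `Element` class and the `grounding_error` function are invented for illustration). It shows how the same stated intent ("click Save") can map to a correct click, a click on the wrong element, or a click on empty space.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical UI element described by a pixel bounding box."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # Right and bottom edges are exclusive.
        return self.left <= x < self.right and self.top <= y < self.bottom


def element_at(elements: list[Element], x: int, y: int) -> Optional[Element]:
    """Return the first element whose box contains (x, y), or None for empty space."""
    for element in elements:
        if element.contains(x, y):
            return element
    return None


def grounding_error(
    elements: list[Element], intended: str, x: int, y: int
) -> Optional[str]:
    """Compare what the agent says it is clicking with what the click actually hits.

    Returns None when intent and target agree, otherwise a short description
    of the mismatch.
    """
    hit = element_at(elements, x, y)
    if hit is None:
        return f"click at ({x}, {y}) lands on empty space, not '{intended}'"
    if hit.name != intended:
        return f"click at ({x}, {y}) lands on '{hit.name}', not '{intended}'"
    return None


if __name__ == "__main__":
    screen = [
        Element("Save", 10, 10, 110, 40),
        Element("Cancel", 120, 10, 220, 40),
    ]
    for x, y in [(50, 20), (150, 20), (300, 300)]:
        print(grounding_error(screen, "Save", x, y))
```

A system that trusts only the agent's textual intent would treat all three clicks as "click Save"; only a check against what is actually at the coordinates tells them apart.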
How HiViG Works
HiViG functions as a specialized, multimodal critic that integrates into the agent's decision-making loop. It provides two primary forms of support:
Macro-Action History: Instead of just remembering a long list of individual clicks, the system summarizes past interactions into "macro-achievements." This allows the agent to track its global progress toward a goal, helping it avoid redundant or circular decision-making.
Visually Grounded Critique: Before an action is executed, the critic verifies the raw pixel coordinates against the current screenshot. It uses a visual marker to see exactly where the agent intends to click. If the action is flawed, the critic identifies the specific error (such as a visual hallucination or a procedural mistake) and returns corrective feedback so the agent can revise the action before it takes place.
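The two components above can be sketched as a critic-gated decision loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `Action`, `Verdict`, `MacroHistory`, and `critic_gated_step`, the retry limit, and the error-type strings are all hypothetical. The agent and the critic are passed in as plain callables; in a real system both would be multimodal models, and the critic would inspect the screenshot with the marker drawn at the proposed coordinates.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    kind: str    # e.g. "click"
    x: int       # raw pixel coordinates proposed by the agent
    y: int
    intent: str  # the agent's own description of what it is doing


@dataclass
class Verdict:
    approved: bool
    error_type: Optional[str] = None  # e.g. "visual_hallucination", "procedural"
    feedback: str = ""


@dataclass
class MacroHistory:
    """Summarized progress ("macro-achievements") instead of a raw click log."""
    achievements: list[str] = field(default_factory=list)

    def record(self, achievement: str) -> None:
        # Skip consecutive duplicates so the summary stays compact.
        if not self.achievements or self.achievements[-1] != achievement:
            self.achievements.append(achievement)

    def summary(self) -> str:
        return "; ".join(self.achievements) or "nothing yet"


def critic_gated_step(
    propose: Callable[[str, str], Action],
    critique: Callable[[Action, str], Verdict],
    history: MacroHistory,
    max_revisions: int = 2,
) -> Optional[Action]:
    """Return the first action the critic approves, or None if none is approved.

    Every proposal is critiqued before it can be returned, so no action
    reaches the environment unchecked.
    """
    feedback = ""
    for _ in range(max_revisions + 1):
        action = propose(history.summary(), feedback)
        verdict = critique(action, history.summary())
        if verdict.approved:
            return action
        # Pass the critic's diagnosis back to the agent for the next attempt.
        feedback = f"{verdict.error_type}: {verdict.feedback}"
    return None
```

Returning `None` when no proposal is approved leaves the fallback policy to the caller (for example, stopping the episode or asking for help), which seems the safer default given that many GUI actions cannot be undone.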
Performance and Results
The researchers tested HiViG across three different computing environments: web (WebArenaLitev2), mobile (AndroidLab), and desktop (WindowsAgentArena). The framework was applied to both open-source models (Qwen3-VL-32B) and frontier closed-source models (Gemini-3-Flash).
The results showed that HiViG consistently outperformed existing scalar and verbal critics. For example, it improved the success rate of the Gemini-3-Flash model by 9.0% on average across all tested platforms. These gains suggest that by providing proactive, visually grounded feedback, the framework helps agents navigate complex tasks more reliably than standard methods that only offer simple scores or lack historical context.
Why This Matters
Many GUI actions are irreversible, so a critic that intercepts errors before they are executed is more useful than one that only detects them afterward. Because the framework relies on raw visual input rather than platform-specific data such as DOM trees or accessibility trees, it demonstrates strong cross-platform generalization. This makes it a versatile tool for improving the reliability of AI agents across the diverse range of interfaces found on modern computers and mobile devices.