Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User’s Digital World

Yusong Lin 1,2,†, Xinyuan Liang 2,3,†, Haiyang Wang 2,🖂, Qipeng Gu 2, Siqi Cheng 2, Jiangui Chen 2, Shuzhe Wu 2, Feiyang Pan 2, Lue Fan 4, Sanyuan Zhao 1,🖂, Dandan Tu 2,🖂

1 Beijing Institute of Technology
2 Huawei Technologies Co., Ltd
3 Peking University
4 Institute of Automation, Chinese Academy of Sciences
🖂 Corresponding authors
† Intern at Huawei

Code: github.com/LiberCoders/Claw-Anything
Dataset: LiberCoders/Claw-Anything
{linyusong4, [email protected]}

Scaling Agent Context: See Anything, then Do Anything.

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user’s digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance.
Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model’s pass@1 by 23.7 points, demonstrating its utility as scalable data infrastructure.

Figure 1: Overview of Claw-Anything and its empirical value. Left: Claw-Anything gives an always-on personal assistant broader access to the user’s digital world, spanning services, devices, and long-horizon event streams, thereby expanding the range of tasks it can complete. Right: Enabled by our data pipeline, our model achieves the best pass@1 among open-weight models. The yellow region represents closed-source models, and the horizontal axis does not correspond to model sizes.

1 Introduction

Recent agent systems, such as the OpenClaw series [19, 8, 16] and Hermes Agent [17], are moving beyond one-shot task solving toward always-on personal assistance. Deployed within users’ digital environments and equipped with long-term memory and background execution, these systems are expected to provide continuous, context-sensitive support over time. Yet user intent and activity are inherently distributed across heterogeneous digital artifacts, including historical events, backend services, and multiple devices. Effective assistance therefore requires broad access to the user’s digital world, so that an agent can both perceive relevant state and act on it in a closed loop.

Figure 2: Three dimensions along which Claw-Anything expands agent context. Left: Long-horizon event streams provide a more complete view of the user’s digital activity and support inference over evolving context. Middle: Access to multiple backend services enables cross-service coordination within a unified workflow. Right: Access across devices allows the agent to integrate distributed information and actions, broadening the range of tasks it can complete.
Motivated by this shift, we argue that the effectiveness of personal assistants depends fundamentally on their operational scope: the set of digital states they can observe and the actions they can execute. As shown in Figure 1 , expanding this scope enlarges both the task space an agent can address and the context over which it can reason, enabling coordination across otherwise disconnected parts of the user’s digital world. Similar patterns appear in other areas of AI: coding agents require access to the full codebase and executable environment to resolve realistic bugs [ 10 , 33 , 26 ] , while autonomous vehicles depend on broad sensor coverage for safe operation [ 24 ] . Consistent with this trend, recent systems increasingly expose richer digital interfaces to agents. Open-source projects such as CLI-Anything [ 7 ] and Gym-Anything [ 1 ] , as well as commercial platforms such as Google Workspace [ 6 ] and Feishu [ 12 ] , provide unified interfaces or programmable endpoints, making diverse software systems accessible to agents. These developments indicate that widening an agent’s operational scope is critical for enabling it to perform complex tasks across the real-world digital environment. However, current evaluation paradigms remain poorly aligned with this objective. Existing benchmarks [ 31 , 4 , 11 , 5 , 21 ] typically expose only narrow, static slices of user state, omitting long-horizon activity, cross-service dependencies, and interaction across devices. As a result, they provide limited evidence about how agents perform when operating in richer, more realistic digital environments. To address this gap, we introduce Claw-Anything, a benchmark for evaluating personal-assistant agents under substantially broader access to the user’s digital world. 
As illustrated in Figure 2 , Claw-Anything expands agent context along three dimensions: i) long-horizon event streams that connect past and present through months of fine-grained activity records; ii) diverse, interdependent backend services spanning the principal digital spaces users inhabit; and iii) multiple devices with heterogeneous interfaces, including both GUI and CLI interaction. In this setting, the agent must integrate fragmented information and coordinate actions across time, services, and devices. The expanded context scope also enables evaluation of proactive assistance [ 25 , 27 ] , requiring the agent to anticipate user needs and provide timely recommendations from context rather than merely react to explicit requests. Constructing such environments at scale is challenging: it requires modeling extended time horizons, numerous services, and multiple devices while preserving realism and cross-component consistency. We therefore develop an automated pipeline that jointly synthesizes digital worlds and tasks. Starting from a minimal persona seed, an LLM-based simulator incrementally expands the user’s digital world through multi-round event injection. At each step, it samples everyday events from a seed pool and updates both persistent world state and dynamic service traces, including sources such as email, calendars, and social platforms. Over time, the event history accumulates, the persona becomes more fully specified, and the environment acquires richer states and realistic noise, including irrelevant or contradictory events. Given the resulting digital world, the next event is instantiated as a persona-grounded task with an executable verifier, casting evaluation as completing the next step in an evolving digital life. Using this pipeline, we construct 200 human-verified evaluation tasks and 2,000 training environments, enabling Claw-Anything to function both as a benchmark and as scalable data infrastructure. 
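The multi-round event injection described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `World` dataclass, `fake_simulator`, the seed pools, and the `noise_ratio` default are illustrative stand-ins rather than names from the released code, and the LLM-based simulator is replaced by a deterministic stub. The sketch shows only the shape of the loop, under those assumptions.

```python
import random
from dataclasses import dataclass, field


@dataclass
class World:
    """Hypothetical container for the evolving digital world."""
    persona: dict
    fixtures: dict = field(default_factory=dict)   # persistent per-service state
    event_log: list = field(default_factory=list)  # long-horizon activity stream


def fake_simulator(event: str, world: World, round_idx: int) -> dict:
    """Stand-in for the LLM-based simulator: returns state updates for one event."""
    service = event.split(":")[0]
    return {
        "fixtures": {service: f"state after round {round_idx}"},
        "log": [f"round {round_idx}: {event}"],
        "persona": {"rounds_seen": round_idx},
    }


def simulate(seed_persona, task_seeds, noise_seeds, rounds, snapshot_rounds,
             noise_ratio=0.5, rng=None):
    """Run multi-round event injection and snapshot the world at chosen rounds."""
    rng = rng or random.Random(0)
    world = World(persona=dict(seed_persona))
    snapshots = {}
    for r in range(1, rounds + 1):
        # Sample either a noise event or a task-relevant event.
        pool = noise_seeds if rng.random() < noise_ratio else task_seeds
        event = rng.choice(pool)
        # Apply the simulator's updates to fixtures, log, and persona.
        delta = fake_simulator(event, world, r)
        world.fixtures.update(delta["fixtures"])
        world.event_log.extend(delta["log"])
        world.persona.update(delta["persona"])
        if r in snapshot_rounds:
            # A task query and verifier would be generated from this snapshot.
            snapshots[r] = {
                "persona": dict(world.persona),
                "fixtures": dict(world.fixtures),
                "event_log": list(world.event_log),
            }
    return world, snapshots


world, snapshots = simulate(
    seed_persona={"name": "example user"},
    task_seeds=["calendar:schedule dentist visit", "email:reply to landlord"],
    noise_seeds=["social:friend posts photo", "email:newsletter arrives"],
    rounds=6,
    snapshot_rounds={3, 6},
)
print(len(world.event_log), sorted(snapshots))
```

Each snapshot copies the state accumulated so far, so a task derived at round 3 sees only the first three rounds of history, while the simulation continues to extend the same world afterwards.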
Table 1: Comparison of representative digital-agent benchmarks and Claw-Anything across three context-scaling dimensions (event streams, device interfaces, and services), plus proactivity. “Event Stream” denotes records of user activity in the digital environment; “Device Interfaces” the interaction surfaces in each task; “Services” the average number of services used per task, with the maximum in parentheses; “Context Length” the length in words of textualized static states and dynamic event streams; “Proactive” whether a task rewards action before an explicit user request; and “Instances” the number of evaluation and training instances.

| Benchmark | Event Stream | Device Interfaces | Services, avg. (max.) | Proactive | Context Length (words) | Instances (Eval) | Instances (Train) |
|---|---|---|---|---|---|---|---|
| ClawBench [31] | ✗ | CLI | 1.6 (5) | ✗ | 2.2k | 313 | 0 |
| WildClawBench [4] | ✗ | CLI | 0.5 (3) | ✗ | 2.6k | 60 | 0 |
| PinchBench [11] | ✗ | CLI | 0.1 (3) | ✗ | 1.7k | 53 | 0 |
| ClawMark [5] | ✗ | CLI | 3.9 (5) | ✗ | 2.0k | 100 | 0 |
| QwenClawBench [21] | ✗ | CLI | 0.3 (6) | ✗ | 12.1k | 100 | 0 |
| Claw-Eval [29] | ✗ | CLI | 1.3 (6) | ✗ | 5.3k | 300 | 0 |
| Claw-Anything (ours) | ✓ | CLI + GUI | 10.1 (18) | ✓ | 191.7k | 200 | 2000 |

Experiments reveal a substantial gap between current capabilities and the demands of full-access personal assistance. On Claw-Anything, GPT-5.5 achieves only 34.5% pass@1, substantially below performance reported on prior benchmarks. Several models that perform strongly on existing benchmarks also fail on ours, suggesting that Claw-Anything exposes failure modes underrepresented in prior evaluations and that current models remain unreliable even when given broader access to the user’s digital world. Moreover, fine-tuning Qwen3.5-27B on 1,500 successful trajectories generated from the aforementioned training environments yields a 23.7-point improvement in pass@1, indicating that Claw-Anything serves not only as a challenging benchmark but also as a practical source of scalable supervision.

In summary, our contributions are fourfold.
1) We identify the alignment between agent access and the user’s digital world as a central challenge for personal-assistant agents, encompassing long-horizon event streams, interconnected services, and multi-device interaction.
2) We develop an automated pipeline for jointly simulating digital worlds and synthesizing tasks at scale, and use it to construct Claw-Anything, a benchmark of 200 human-verified task environments that expands agent context jointly along these dimensions while evaluating proactivity as a distinct capability, as shown in Table 1.
3) Through evaluation on Claw-Anything, we show that even GPT-5.5 attains only 34.5% pass@1.
4) The same pipeline also yields 2,000 training environments, and fine-tuning Qwen3.5-27B on successful trajectories derived from them improves success by about 23.7%, establishing Claw-Anything not only as a benchmark but also as a scalable data-generation pipeline. 2 Related Work Benchmarks for Personal Assistant. As claw-style agents have rapidly gained momentum, a growing family of benchmarks has emerged to measure their capabilities. ClawBench [ 31 ] broadens coverage across a large set of standardized digital tasks, WildClawBench [ 4 ] moves evaluation into more realistic open environments, PinchBench [ 11 ] centers on practical personal-productivity scenarios, ClawMark [ 5 ] studies longer-horizon professional workflows, QwenClawBench [ 21 ] emphasizes execution in realistic user-distributed CLI tasks, and Claw-Eval [ 29 ] advances evaluation methodology through rubric-based assessment for open-ended trajectories. Collectively, these benchmarks have advanced the study of planning, tool use, and grounded interaction for digital agents. Yet they still largely cast the agent as a solver of localized tasks rather than an always-on assistant embedded in the user’s broader digital world. Most remain confined to isolated, short-horizon, and relatively clean settings, offering limited traction on reasoning over noisy event streams, coordinating across devices and backend systems, or acting from accumulated personal context. To address this gap, Claw-Anything evaluates how agents perform when asked to operate over a much broader slice of the user’s digital world, including long-horizon activity streams, interconnected systems, heterogeneous devices, and proactive opportunities. Scaling Agentic Training Environment. 
In software-agent research, prior work on scalable environments has mainly followed two directions: code-centric scenarios [10, 32], such as SWE-smith [28] and SWE-Gym [20]; and terminal-centric scenarios [26], such as CLI-Gym [13] and TermiGen [34]. Together, these works suggest that scalable environments matter not only for evaluation, but also for broader agent development. This paradigm, however, remains underexplored in personal-assistant settings, where verifiable environments often depend on manual construction, limiting both realism and scalability. In this paper, we fill this gap by combining a realistic setting across services, time, and devices with a multi-round automated pipeline that jointly simulates personas, histories, and cross-service states. The resulting framework enables controlled variation in task difficulty and environmental complexity, providing a practical basis for scalable evaluation and development of personal-assistant agents.

3 Methodology

Claw-Anything is a benchmark for evaluating whether an agent can complete both reactive and proactive personal-assistant tasks when endowed with broad access to a user’s digital world. Each task is grounded in a coherent persona and embedded in an environment spanning three contextual dimensions: long-horizon history, diverse backend services, and coordinated interactions across multiple devices with heterogeneous interfaces (e.g., GUI and CLI). Within this setting, the agent must isolate task-relevant signals from substantial background noise and execute the required actions.

Figure 3: Claw-Anything environment and automated data pipeline. Left: The environment comprises connected devices with system event streams and multiple services with persistent states and service-specific histories. Right: From a persona-grounded initial state, the pipeline iteratively samples task or noise templates and uses an LLM-based simulator to adapt events and update the world state.
A final simulation generates the task query, reference solution, and grader; automatic filtering then yields task instances, with optional human verification for benchmark cases.

3.1 Task Formulation

As illustrated in the left panel of Figure 3, Claw-Anything first places the agent in a digital environment with access to as much of the user’s digital world as possible, then formulates both reactive and proactive personal-assistant queries in this environment, and finally evaluates task completion with an executable verifier over the resulting interaction trace and task outcome.

Context-rich digital environment. We instantiate each task in a context-rich, realistic, and noisy digital environment. Formally, each environment is defined as $\mathcal{E}=(\mathcal{P},\mathcal{D},\mathcal{F},\mathcal{L})$, where $\mathcal{P}$ denotes a user persona specifying the user’s profile and preferences; $\mathcal{D}$ denotes a set of devices with heterogeneous interfaces, including CLI-based computers and GUI-based mobile phones; $\mathcal{F}$ denotes a fixture bank of persistent states across more than forty backend services spanning lifestyle, work, and related domains; and $\mathcal{L}$ denotes a long-horizon activity stream covering over three months of system-level and service-specific logs. We further populate these environments with irrelevant events, services, and state to better approximate real-world settings, requiring agents to reason over large-scale context and complete tasks in a closed loop.

Queries across time, services, and devices. Each query is written in naturalistic and sometimes underspecified language, reflecting how users communicate in real personal-assistant settings. Solving these queries requires the agent to identify task-relevant signals in the event stream and integrate information across services and devices, including CLI-based Linux Docker environments and GUI-based Android Docker environments.
Beyond explicit requests, we also incorporate the heartbeat-style mechanism of OpenClaw, in which the agent periodically monitors the user’s digital environment and produces contextually grounded recommendations without direct prompting.

Outcome-oriented evaluation for multi-path tasks. Our evaluation builds on the rubric-based framework of Claw-Eval [29], combining rule-based checks with LLM judgments to produce both a soft score and a binary pass/fail label. Because many tasks admit multiple valid solution paths, we assign greater weight to the final outcome and correspondingly less to intermediate actions. This modification retains the strengths of rubric-based evaluation while better reflecting the open-ended nature of personal-assistant tasks.

Algorithm 1 Automated task generation pipeline.
Input: seed persona $\mathcal{P}_{0}$; task-seed pool $\mathcal{S}$; noise-event pool $\mathcal{N}$; rollout horizon $R$; snapshot rounds $\mathcal{I}_{\mathrm{task}}$.
Initialize: fixture state $\mathcal{F}\leftarrow\emptyset$, event log $\mathcal{L}\leftarrow\emptyset$, persona state $\mathcal{P}\leftarrow\mathcal{P}_{0}$, and task set $\mathcal{T}\leftarrow\emptyset$.
for $r=1,\dots,R$ do
  1. $e\leftarrow\mathrm{Sample}(\mathcal{S},\mathcal{N},\mathrm{noise\_ratio})$  ▷ Sample a task or noise event
  2. $\tilde{e}\leftarrow\mathrm{AdaptToEnv}(e,\mathcal{P},\mathcal{F},\mathcal{L})$  ▷ Ground it in the current environment
  3. Use an LLM to generate updates $\Delta\mathcal{F},\Delta\mathcal{L},\Delta\mathcal{P}$ from $\tilde{e}$
  4. Update the environment: $\mathcal{F}\leftarrow\mathcal{F}\cup\Delta\mathcal{F}$, $\mathcal{L}\leftarrow\mathcal{L}\cup\Delta\mathcal{L}$, $\mathcal{P}\leftarrow\mathcal{P}\cup\Delta\mathcal{P}$
  5. if $r\in\mathcal{I}_{\mathrm{task}}$ then
       $X_{r}\leftarrow\mathrm{Snapshot}(\mathcal{F},\mathcal{L},\mathcal{P},r)$  ▷ Snapshot the current environment
       $Q_{r}\leftarrow\mathrm{GenTaskQuery}(X_{r})$  ▷ Generate a task query
       $V_{r},A_{\mathrm{ref},r}\leftarrow\mathrm{GenVerifier}(Q_{r},X_{r})$  ▷ Generate the verifier and reference answer
       $\tau_{r}\leftarrow\mathrm{AutoFilter}(X_{r},Q_{r},V_{r},A_{\mathrm{ref},r})$  ▷ Filter the task instance
       if $\tau_{r}\neq\varnothing$ then $\mathcal{T}\leftarrow\mathcal{T}\cup\{\tau_{r}\}$
Output: Task set $\mathcal{T}$. May undergo human verification for benchmark cases.

3.2 Construction Pipeline

Manually constructing a context-rich digital world together with its associated tasks is prohibitively expensive and difficult to scale. We therefore generate both evaluation and training data with an automatic pipeline, illustrated in Algorithm 1 and Figure 3, that incrementally builds an evolving user environment, extracts tasks from intermediate states, and removes low-quality instances.

Stage I: Iterative digital environment synthesis. We first construct an evolving digital environment through an iterative generation loop. At each round, the pipeline samples either a task template or a noise template from a predefined seed pool and conditions the LLM on the current persona and world state to generate the corresponding fixtures, event logs, and persona updates. Over multiple rounds, an initially sparse persona is transformed into a temporally coherent environment with accumulated event streams and richer cross-component dependencies, providing the substrate for subsequent task construction.

Stage II: Task and verifier generation. We then derive tasks from designated rounds of the simulation.
For each selected round, the pipeline captures the corresponding environment state and prompts the LLM on it to generate three coupled artifacts: a user query, an executable verifier, and a reference solution. Each task is thereby grounded in a specific temporal slice of the same evolving digital world, rather than synthesized from an isolated static state.

Stage III: Automatic filtering. Because the pipeline depends on LLM generation, automated quality control is necessary. We therefore combine rule-based checks with LLM-based filtering to remove invalid instances before human review. Rule-based checks target surface inconsistencies, such as references to nonexistent tools or services. LLM-based filtering then evaluates higher-level validity by using the environment state and reference solution to determine whether a task is solvable and whether its verifier is logically consistent with the specification.

Stage IV: Human verification with execution support. Finally, we perform human verification supplemented by execution-based validation. A strong agent is given the reference solution and asked to execute the task in the environment with the verifier. Successful execution indicates that the task admits at least one valid solution consistent with the intended logic, enabling human reviewers to focus on assessing the consistency among the query, environment, and verifier. Instances that fail execution are escalated for manual review to determine whether they should be revised or discarded.

3.3 Claw-Anything

| Category | Metric | Claw-Eval | Claw-Anything (Eval) | Claw-Anything (Train) |
|---|---|---|---|---|
| Size | Instances | 300 | 200 | 2000 |
| Context Text | Words of fixture | 5.3k | 108.0k | 97.3k |
| Context Text | Words of log | 0 | 83.7k | 65.7k |
| Services | Task-involved | 1.3 | 10.1 | 9.2 |
| Services | Env-support | 19 | 35 | 35 |
| Devices | Support type | CLI | CLI + GUI | CLI + GUI |

Figure 4: Benchmark statistics of Claw-Anything. Left: Comparison with Claw-Eval in size, context length, services per task, and supported devices. Right: Category distribution of evaluation instances.

Benchmark Statistics. As shown in Figure 4, the full pipeline, including fourth-stage human verification, yields an evaluation set of 200 tasks, comprising 150 CLI-only tasks and 50 CLI+GUI tasks across 9 major categories. Compared with Claw-Eval, Claw-Anything provides a substantially richer perceptual context, with much longer temporal horizons, broader service coverage, denser cross-service dependencies, and task environments that require coordination across multiple devices.

Trajectory Collection with Claw-Anything. For training trajectory collection, we execute the first three stages of the automated pipeline to generate 2,000 task environments. To prevent contamination of the evaluation set, these environments are drawn from a persona pool fully disjoint from the evaluation personas. We then collect 1,500 successful trajectories from these environments for the subsequent post-training of Qwen3.5-27B.

Table 2: Main results on the Claw-Anything benchmark. We evaluate both state-of-the-art open- and closed-source models under a unified OpenHarness framework for fair comparison. The best result in each column is shown in bold.

| Model | Params | Score | Pass@1 | Pass@3 | Pass^3 | Tokens (I / O) |
|---|---|---|---|---|---|---|
| Open-Source | | | | | | |
| Qwen3.5-27B [22] | 27B | 0.50 | 9.8 | 19.0 | 2.0 | 83.8M / 0.9M |
| MiniMax-M2.7 [14] | 229B | 0.52 | 13.5 | 28.5 | 3.5 | 79.0M / 1.1M |
| Qwen3.6-27B [23] | 27B | 0.58 | 22.5 | 42.0 | 6.0 | 99.4M / 2.0M |
| Kimi-K2.6 [15] | 1.1T | 0.57 | 22.8 | 44.0 | 6.5 | 178.1M / 2.3M |
| GLM-5.1 [30] | 754B | 0.59 | 31.7 | 47.0 | 17.0 | 125.0M / 2.2M |
| Claw-Anything-Qwen3.5-27B (ours) | 27B | 0.61 | 33.5 | 52.0 | 15.5 | 117.8M / 1.1M |
| Gain over Qwen3.5-27B | - | +0.11 | +23.7 | +33.0 | +13.5 | - |
| Closed-Source | | | | | | |
| Claude Sonnet 4.5 [2] | - | 0.59 | 28.0 | 45.0 | 12.0 | 149.0M / 1.5M |
| Claude Opus 4.7 [3] | - | 0.62 | 31.8 | 48.0 | 13.5 | 123.5M / 1.5M |
| GPT-5.5 [18] | - | 0.65 | 34.5 | 53.5 | 20.0 | 77.7M / 0.9M |

4 Experiment

4.1 Main Results of Claw-Anything

Frontier baselines. We benchmark a broad set of frontier LLMs, covering open-source families such as the Qwen series [22, 23], MiniMax 2.7 [14], GLM 5.1 [30], and Kimi 2.6 [15], as well as closed-source models including Claude Opus 4.7 [3] and GPT-5.5 [18]. All models are evaluated under OpenHarness [9], a widely adopted ultra-lightweight agent scaffold for personal agents implemented in pure Python. Following Claw-Eval, we use Claude Sonnet 4.5 as the judge model and report Pass@1, Pass@3, and Pass^3 as the primary metrics, where Pass^3 requires success in all three independent runs. We further use the continuous execution score and token consumption as complementary indicators of solution quality. Table 2 summarizes the results. Even the strongest closed-source model reaches only 20.0% on Pass^3, which suggests that bringing the agent’s perceptual scope closer to that of the user materially increases benchmark difficulty, because success now depends on both accurate understanding of the user’s digital environment and correct action grounded in that context.

Improvement from collected training trajectories. We further assess whether the automated pipeline serves not only as an evaluation infrastructure but also as a source of effective training data.
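To make the three success metrics in Table 2 concrete, the sketch below computes them from per-task run outcomes. Only the Pass^3 rule (success in all three independent runs) is stated in the text. Treating Pass@1 as the success rate averaged over individual runs, and Pass@3 as at least one success among the three runs, are assumptions, and the function name and input layout are hypothetical.

```python
def pass_metrics(runs_by_task):
    """Compute Pass@1, Pass@3, and Pass^3 (in percent) from three runs per task.

    runs_by_task maps a task id to a list of three booleans, one per run.
    """
    n_tasks = len(runs_by_task)
    n_runs = sum(len(runs) for runs in runs_by_task.values())
    # Assumed: Pass@1 is the success rate averaged over all individual runs.
    pass_at_1 = 100.0 * sum(sum(runs) for runs in runs_by_task.values()) / n_runs
    # Assumed: Pass@3 counts a task as solved if any of its runs succeeds.
    pass_at_3 = 100.0 * sum(any(runs) for runs in runs_by_task.values()) / n_tasks
    # Stated in the text: Pass^3 requires success in all three runs.
    pass_hat_3 = 100.0 * sum(all(runs) for runs in runs_by_task.values()) / n_tasks
    return pass_at_1, pass_at_3, pass_hat_3


example = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}
p1, p3, ph3 = pass_metrics(example)
print(f"Pass@1={p1:.1f} Pass@3={p3:.1f} Pass^3={ph3:.1f}")
```

Under these definitions Pass^3 ≤ Pass@1 ≤ Pass@3 holds for any set of runs, which is consistent with the ordering of every row in Table 2.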
Specifically, we construct 2,000 training tasks, collect 1,500 successful trajectories, and use them to fine-tune Qwen3.5-27B for 10 epochs. The resulting model improves over its base model by 23.7 points on pass@1 (from 9.8% to 33.5%), outperforms all other open-source baselines on Claw-Anything, and reduces the gap to closed-source models. Figure 6 further shows that performance increases steadily with the number of collected training trajectories. Together, these results indicate that data produced by our pipeline is effective for post-training and yields substantial gains on this benchmark.

4.2 Ablation Study

We conduct ablations on the key design choices of Claw-Anything, including scaling context in Section 4.2.1, the data pipeline in Section 4.2.2, and the evaluation setting in Section 4.2.3. Due to space constraints, additional experimental details are provided in the appendix.

4.2.1 Scaling Context

This section ablates whether expanding the agent’s operational scope unlocks previously infeasible tasks, and whether larger context constitutes a fundamental bottleneck for current agents.

Long-horizon event streams. We ablate both the availability of event streams and the length of history exposed to the agent. As shown in Table 3, success rates drop substantially when event streams are removed, because many of these tasks inherently depend on information contained in the event history rather than in the static service fixtures alone. This finding supports our central claim that event streams enlarge the set of solvable tasks by extending the agent’s operational scope toward that of the user. Figure 5 further shows that, even when event streams are available, performance degrades as the history grows longer, suggesting that current models still struggle to effectively leverage long-horizon context despite having a broader field of view.

Cross-backend services. We ablate multi-service coordination by masking the tools required for tasks that span multiple backend services.
As shown in Table 3, success rates collapse to nearly zero once these tools are removed, indicating that many tasks intrinsically require the agent to retrieve information and execute actions across services rather than within a single isolated backend. This result underscores the importance of granting personal-assistant agents access to a digital ecosystem. Figure 5 further shows that, even when all relevant tools are available, performance declines as the number of involved services increases. This trend suggests that cross-service coordination remains a major challenge for current models and a key target for future improvement.

CLI–GUI collaboration. We further ablate cross-interface coordination by removing GUI access and restricting the agent to CLI-only execution. As shown in Table 3, tasks that intrinsically require CLI–GUI collaboration become nearly unsolvable in this setting, whereas restoring joint CLI+GUI access makes them tractable again. At the same time, Figure 5 shows that even with both interfaces available, performance on CLI–GUI collaborative tasks remains substantially below that on pure CLI tasks. Taken together, these results show that connecting CLI and GUI extends the boundary of solvable tasks for agents, while robust coordination across heterogeneous interaction modalities remains a major challenge for current agent systems.

Figure 5 (panels: (a) Long-horizon event stream; (b) Multiple backend services; (c) Cross-device (CLI + GUI)): Ablation of contextual scale, showing the effects of event-stream volume and the number of services on average score, as well as the effect of GUI access on Pass@1.

Table 3: Effects of access to event streams, cross-service environments, and cross-device interaction on benchmark performance, together with a comparison between proactive and reactive tasks. All results are reported in Pass@1.
| Factor | w/ | w/o |
|---|---|---|
| Event Stream | 21.0 | 0.0 |
| Cross-services | 24.0 | 0.0 |
| Cross-devices | 16.0 | 2.0 |

| Task Type | Reactive | Proactive |
|---|---|---|
| Pass@1 | 25.9 | 6.7 |

Figure 6: Trajectory scaling.

Figure 7 (panels: (a) Ratio of noise rounds; (b) Simulation rounds; (c) Fixture-level conflicts): Ablation of the automatic data-generation pipeline, showing the effects of the noise-round ratio, the number of simulation rounds, and the number of fixture-level conflicts.

Table 4: Skill-loading ablation. We compare full and lazy loading across models. Under lazy loading, the agent must select tools autonomously, making the setting much more challenging. All results are reported in Pass@1.

| Model | Full | Lazy |
|---|---|---|
| Minimax 2.7 | 22.7 | 10.0 |
| Qwen3.6-27B | 24.7 | 23.7 |
| GLM-5 | 29.3 | 14.0 |
| Claude Sonnet 4.6 | 43.0 | 26.7 |

Figure 8: Visualization of failure modes.

4.2.2 Data pipeline

Noise injection ratio. Our generation pipeline injects a controllable amount of background noise into the user’s digita