MobileGym: A Verifiable and Highly Parallel Simulat... | AI Research

Key Takeaways

  • We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends.
  • The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start.
Paper Abstract

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io

What it covers

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Dingbang Wu 1,*, Rui Hao 1,*, Haiyang Wang 2, Shuzhe Wu, Han Xiao 3, Zhenghong Li 1, Bojiang Zhou 1, Zheng Ju 1, Zichen Liu 1, Lue Fan 1,†,‡, Zhaoxiang Zhang 1,†

1 Institute of Automation, Chinese Academy of Sciences; 2 Peking University; 3 The Chinese University of Hong Kong

[email protected], [email protected]

* Equal contribution. † Corresponding authors. ‡ Project lead. Project page: https://mobilegym.github.io
1 Introduction

Figure 1: Example screens from MobileGym. Annotated launcher and messaging screens showing MobileGym's configurable and sandboxed capabilities.

Figure 2: End-to-end workflow of MobileGym. A structured state supports task instantiation, parallel rollout forking, and state-diff verification. The resulting judgments are then converted into benchmark metrics and RL rewards.

Mobile GUI agents have advanced rapidly in operating smartphones from screenshots and natural-language instructions Qin et al. (2025); Liu et al. (2024); Venus-Team et al. (2026); Xu et al. (2026); Xiao et al. (2025), yet current evaluation and training environments remain divided by a basic trade-off. Emulator-based environments such as AndroidWorld and AndroidLab Rawles et al. (2025); Xu et al. (2025) offer repeatable evaluation but mainly cover system utilities and simple open-source apps, and scaling to online training requires many heavyweight emulator instances. Real-device benchmarks such as MobileBench-OL Wu et al. (2026a) reach everyday apps, but live accounts, backend state, app-version drift, real-world consequences, and the cost of maintaining many devices and accounts make episodes difficult to control, reproduce, and parallelize.

Neither route provides the combination needed for progress. First, environments need verifiable outcome signals, so benchmark verdicts and RL rewards are deterministic and grounded in actual task state rather than unreliable VLM judgments. Second, they need scalable online training: online RL has become an important capability driver for GUI agents Venus-Team et al. (2026); Wang et al. (2025); Zhang et al. (2025), while offline trajectories struggle to cover dynamic GUI variations Zhou et al. (2025).

The barriers are inherent to how everyday apps work.
Everyday-app state is unreadable: internal state such as balances and orders is difficult to inspect through adb and accessibility trees, while VLM judges are intrinsically unreliable and further constrained by discrete screenshots that provide only partial evidence. It is unwritable: reproducible evaluation and online RL require resetting to known initial conditions, yet task-relevant state is split across proprietary storage, caches, and remote services, making desired states difficult to configure or restore. It is unforkable: large-scale online training benefits from parallel rollouts, and group-based methods such as GRPO further require multiple rollouts from identical initial states, yet live apps provide neither cheap replication nor state forking. Finally, many actions are irreversible, risking real messages, real transfers, or permanent account changes. These constraints make everyday apps structurally resistant to reproducible experimentation, even though they are natural targets for mobile-agent research.

Scalability poses a further challenge: even for the apps emulators do support, each instance requires gigabytes of RAM, making large-scale parallel rollouts impractical on commodity hardware, let alone for everyday apps that are resource-intensive or restrict emulator execution. Yet GUI agents observe only screenshots and act through discrete actions, so a lightweight simulator with fully programmable state can be sufficient: it only needs interaction fidelity, producing realistic screens in response to agent actions.

We introduce MobileGym, a browser-hosted Android-like simulation environment built on this principle. App data, OS state, and device context are represented as structured JSON, and the same mechanism makes state readable for deterministic outcome checking, writable for configuration and reset, forkable for parallel rollouts, and fully sandboxed for high-consequence actions.
Agents observe only screenshots, while researchers retain full programmatic control. Each browser instance uses roughly 400 MB of RAM and cold-starts in about 3 s, enabling hundreds of parallel instances on a single server. For query tasks, a structured AnswerSheet protocol replaces brittle free-text matching with typed, GUI-submitted fields. Figure 1 shows example simulated screens, and Figure 2 shows the end-to-end pipeline. Our main contributions are:

• The MobileGym platform (§ 3): a lightweight, browser-hosted Android-like simulation environment, including 12 everyday apps covering the major categories of daily mobile use and 16 system apps. Its modular app architecture and declarative task framework support easy extension, and a single machine can host hundreds of parallel instances.

• Programmable state and verification mechanisms (§ 3.2, § 4.2): full-environment state represented as structured JSON that supports deterministic judging, snapshot-based rollout forking, side-effect detection, and a typed AnswerSheet protocol that avoids free-text matching failures.

• MobileGym-Bench (§ 4): 416 parameterized task templates (256 test + 160 train) covering major categories of everyday mobile use, with deterministic judges, empirically calibrated difficulty strata, and diagnostic metrics.

• Empirical validation (§ 5): benchmark results across 9 agents (9.4%–58.8% SR), a GRPO training study gaining +12.8 pt on the 256-task test set, a real-device study retaining 95.1% of the simulated gain on a real-device signal subset, and a VLM-judge audit showing 10.2% misjudgment.

2 Related Work

Property | AndroidWorld | AndroidLab | MobileWorld | MobileBench-OL | MobileGym
Runtime | Emulator | Emulator | Emulator | Real device | Browser
App types | System + open-source | System + open-source | System + surrogates | Real everyday apps | Simulated everyday + system
Everyday app coverage | ✗ | ✗ | Surrogates | ✓ (real) | ✓ (simulated)
Apps / task units | 20 / 116 templates | 9 / 138 instances | 20 / 201 tasks | 80 / 1080 instances | 28 / 416 templates
Verification | adb programmatic | UI-tree + LLM | SQL + hooks | XPath rules | State-based programmatic
Full environment state comparison | ✗ | ✗ | ✗ | ✗ | ✓
Snapshot & restore | App-data snapshot | AVD snapshot | AVD snapshot | ✗ | JSON (ms-level)
State customization | Limited | Limited | Partial (SQL) | ✗ | Full (JSON)
Online RL-ready | Limited | Limited | Limited | ✗ | ✓
Memory per instance | ~4.5 GB | ~6 GB | ≥ 4.5 GB | N/A | ~400 MB
Disk footprint | ~20 GB | ~9 GB | ≥ 20 GB | N/A | ~50 MB
Cold start | ~78 s | - | - | N/A | ~3 s
Parameterized tasks | ✓ | ✗ | ✗ | ✗ | ✓
Sim-to-Real validation | N/A | N/A | N/A | N/A | Validated

Table 1: Comparison of mobile GUI agent benchmarks and infrastructures. Task-unit labels follow each benchmark's native counting unit. AndroidLab additionally releases 10.5k offline SFT trajectories, not counted here. "Validated" denotes the real-device transfer study in § 5.2, where 95.1% of the simulation-side training gain on the 59-task signal subset is retained. Resource details are in Appendix M.

Real-device and emulator route.
Existing mobile GUI agent environments run tasks on a heavyweight Android emulator or physical device and judge them externally, either through programmatic queries to interfaces such as adb, accessibility trees, UI-tree, or XPath rules, or through VLM-based screenshot judges. On system utilities and open-source apps, deterministic verification is feasible: AndroidWorld Rawles et al. (2025) judges 116 emulator tasks through adb, AndroidLab Xu et al. (2025) adds UI-tree matching with an LLM verifier for query-answer subtasks, and MobileWorld Kong et al. (2025) queries backend databases directly. A3 Chai et al. (2025) targets 20 mainstream Google Play apps via Appium and adopts MLLM-as-judge to handle their dynamic content, trading determinism for coverage. MobileBench-OL Wu et al. (2026a) runs 1080 tasks across 80 Chinese-language everyday apps on physical phones, the closest prior attempt at real everyday-app evaluation. Its XPath rules are brittle to unexpected popups and to minor app or backend updates, and the physical-device setup does not support parallel rollouts. All inherit the constraints discussed in § 1. Table 1 compares representative environments.

Other mobile GUI benchmarks. SPA-Bench Chen et al. (2025), Mobile-Bench Deng et al. (2024), ProBench Yang et al. (2025a), MVISU-Bench Huang et al. (2025), UI-NEXUS Guo et al. (2025), and ColorBench Song et al. (2025) contribute task suites along axes orthogonal to environment infrastructure, and inform MobileGym-Bench's taxonomy design (§ 4.1).

Synthesis and trajectory-replay environments. GUI-Genesis Cao et al. (2026) reconstructs real apps as lightweight web environments from interaction trajectories with code-native rewards, but each environment is tied to a single trajectory, which limits RL training that requires diverse initial states. UISim Xiang et al. (2025) and ViMo Luo et al. (2025a) adopt image-generation approaches.
However, visual prediction errors can accumulate over long horizons, making these environments less suitable for RL with deterministic state transitions. OpenApps Ullrich et al. (2025) focuses on reliability measurement with 6 FastHTML applications and shares the lightweight design philosophy of MobileGym, while pursuing a different goal.

Verifiable environments in other domains. Beyond mobile, verifiable interactive environments have been built in the web domain (WebShop Yao et al. (2022), WebArena Zhou et al. (2024), VisualWebArena Koh et al. (2024), WebGym Bai et al. (2026), AutoWebWorld Wu et al. (2026b)), the desktop OS domain (OSWorld Xie et al. (2024), macOSWorld Yang et al. (2025b)), and over simulated Python APIs (AppWorld Trivedi et al. (2024)).

RL-based GUI agent training. DigiRL Bai et al. (2024) demonstrates a substantial advantage of online RL over SFT for device control. UI-TARS-2 Wang et al. (2025) deploys thousands of VMs to enable large-scale RL rollouts. UI-Venus-1.5 Venus-Team et al. (2026) introduces full-trajectory online RL with model fusion and achieves a state-of-the-art 77.6% on AndroidWorld. GUI-Owl-1.5 Xu et al. (2026) proposes the MRPO algorithm to address conflicts in multi-platform RL training. MobileGUI-RL Shi et al. (2025), Mobile-R1 Gu et al. (2025), UI-R1 Lu et al. (2026), and GUI-R1 Luo et al. (2025b) explore curriculum-style and R1-style training.

3 The MobileGym Platform

Figure 3: System capabilities and state model of MobileGym. App views are produced by composing read-mostly World Data, a per-environment Runtime Overlay, and the OS Runtime. The resulting structured environment state supports snapshot/reset/fork and deterministic state-diff judging.

MobileGym is a browser-hosted Android-like simulation environment. Its app data, OS settings, and device properties are represented as explicit structured state, which the benchmark layer can configure, reset, snapshot, fork, and compare (Figure 3).
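To make "configure, reset, snapshot, fork, and compare" concrete, here is a minimal sketch of those operations over a JSON-serializable state. The `Env` class, its method names, the `diff` helper, and the example state keys are all hypothetical illustrations and are not taken from MobileGym's code.

```python
import copy
import json


class Env:
    """Hypothetical environment whose whole state is a JSON-serializable dict."""

    def __init__(self, state):
        self.state = copy.deepcopy(state)

    def snapshot(self):
        # Serialize the full state to a string that can be stored or shipped.
        return json.dumps(self.state, sort_keys=True)

    @classmethod
    def restore(cls, snap):
        # Rebuild an independent environment from a snapshot string.
        return cls(json.loads(snap))

    def fork(self, n):
        # Create n independent copies that all start from the identical state.
        snap = self.snapshot()
        return [Env.restore(snap) for _ in range(n)]


def diff(before, after, path=""):
    """Return {path: (old, new)} for every leaf that differs between two states."""
    changes = {}
    if isinstance(before, dict) and isinstance(after, dict):
        for key in sorted(set(before) | set(after)):
            changes.update(diff(before.get(key), after.get(key), f"{path}/{key}"))
    elif before != after:
        changes[path] = (before, after)
    return changes


initial = {"messages": {"sent": []}, "settings": {"wifi": True}}
root = Env(initial)
forks = root.fork(3)

# Mutating one fork leaves the root and the sibling forks untouched.
forks[0].state["settings"]["wifi"] = False
forks[0].state["messages"]["sent"].append("hi")

changed = diff(root.state, forks[0].state)

# Assumed task goal: turn Wi-Fi off. Anything else that changed is off-target.
expected = {"/settings/wifi"}
unexpected = sorted(set(changed) - expected)
print(unexpected)
```

Because every fork is rebuilt from the same serialized snapshot, group-based methods get identical initial conditions for each rollout, and the same `diff` can compare an episode's initial and terminal states.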
3.1 System Design

Interaction fidelity target. MobileGym does not aim to reproduce real everyday app backends or pixel-level Android internals. Its target is the interaction surface available to GUI agents: visual screens, touch and typing responses, navigation, cross-app handoffs, and task-relevant state transitions. As summarized in Figure 3, this requires Android-like runtime mechanisms such as task stacks, keyboard, notification, and permission flows, shared resources, intent routing, content sharing, and back-key dispatch. These mechanisms are implemented in the browser over structured local state, making the same interaction semantics readable, writable, and forkable for evaluation and RL. Implementation details are in Appendix A.

Layered state model. The environment separates large, mostly read-only world data, compact per-environment runtime state, and OS runtime state. World data contains public entities such as posts and products, while runtime state contains data that can be changed by the agent, such as the current user's profile or app settings. Agent operations write only to runtime state, and views are produced by overlaying this layer on the read-only world data. Only runtime state is exposed for configuration, reset, judging, and comparison, keeping snapshots small and stable while preserving all agent-induced changes for full-environment state comparison.

Declarative navigation specification. The UI navigation of every app is modeled as a declarative finite-state machine, built at development time into a per-app specification file. The same file drives runtime navigation and static analysis, including task-trajectory enumeration, and auto-generation of new tasks. The formal definition and guard syntax are provided in Appendix B.

Interface and extensibility.
The Benchmark layer maps agent outputs to a unified 17-action abstraction (Appendix C), executes actions through Playwright with coordinates normalized to [0, 1000], and returns only screenshots. On the app side, MobileGym provides a repeatable module architecture that separates UI pages, app-local runtime state, declarative navigation, replaceable default data, and world data, allowing new apps and features to reuse the shared OS lifecycle, reset, snapshot, rollout, and judging interfaces (Appendix A).

3.2 State Programmability

Verifiable outcome signals. Task success is judged by programmatic state verification: each task has a deterministic judge that inspects environment state. This provides deterministic, fine-grained outcome signals without unreliable VLM judgments.

State serialization and multi-instance replication. The full environment state can be serialized as structured JSON and restored on demand, enabling exact reset and forking from any snapshot, supporting RL methods such as GRPO. For irreversible operations (transfers, deactivation, deletions, etc.), the consequence-free simulator allows full restoration after each trajectory.

Full environment state comparison. The fully structured state enables full-environment state comparison between an episode's initial and terminal states, reporting any mutation outside the task's expected outcome as an unexpected side effect. For personal mobile agents, this distinction is critical: an agent may complete the requested goal while, for example, sending an unintended message. This mechanism defines the Unexpected Side Effects metric (§ 4.3). Existing programmatic mobile benchmarks do not provide this environment-wide signal, and VLM judges can only approximate it from screenshots without deterministic guarantees.

4 The MobileGym-Bench

MobileGym-Bench is a suite of 416 parameterized task templates (256 test + 160 train, strictly disjoint) built on top of the MobileGym platform.
It covers major categories of everyday mobile use. Detailed information about the 28 apps and representative task examples is listed in Appendix D.

4.1 Task Taxonomy

Prior task taxonomies often couple unrelated dimensions, such as mixing app count with subtask count Deng et al. (2024). We factor the task space along four orthogonal axes:

• Scope: how many apps a task involves. S1: single-app; S2: two-app; S3: three or more.

• Objective: what the task asks for. Operate: state-changing actions; query: information retrieval; hybrid: both.

• Composition: how subtasks are structured. Atomic: a single action; sequential: an ordered chain; transfer: cross-app handoff; deep-dive: multi-step drill-down.

• Difficulty: how hard the task is for current models. L1: easy; L2: moderate; L3: hard; L4: very hard. Calibrated post-hoc using eight reference models, details in § 4.4.

Each task is additionally annotated with 1–4 capability tags from a 13-tag vocabulary. The full taxonomy and tag definitions are provided in Appendix E.

4.2 Task Design

Two design choices shape the task suite: parameterized instantiation for diversity, and AnswerSheet fields for query-task judging reliability.

Parameterized task instantiation. The 416 entries in MobileGym-Bench are templates, not fixed instances. Each template is instantiated at runtime through three sources of variation: (i) instruction variation, where semantically equivalent goal phrasings are sampled; (ii) parameter sampling, where slot values are drawn from curated sets, numeric ranges, or the current environment state; and (iii) environment configuration, where app state such as contacts or order history is set through shared base data or per-task injections before rollout. Together, these variations reduce memorization of fixed instances and expand task diversity without requiring each instance to be authored separately. Across finite parameter ranges, they yield over 27,000 distinct task instances, not counting templates with continuous ranges that contribute unbounded additional instances.

The AnswerSheet protocol. Existing mobile benchmarks often judge free-text query answers with string-similarity or substring heuristics Rawles et al. (2025); Kong et al. (2025), which can reject equivalent phrasings or falsely accept answers that leak reasoning text containing the gold answer. MobileGym instead moves answer submission into the GUI: query tasks end with the agent filling an AnswerSheet form whose fields declare types and show format hints (Figure 4).
This preserves a natural form-filling interaction for GUI-specialized agents, while the submitted typed state is checked by type-specific matchers such as exact text, numeric tolerance, format, or choice checks. Details are in Appendix F.

Figure 4: AnswerSheet protocol. Free-text heuristics can reject equivalent answers or accept leaked reasoning that contains the gold answer. AnswerSheet instead uses GUI form filling and type-specific checks over typed fields.

4.3 Evaluation Protocol

We report success, progress, termination, and side-effect diagnostics under fixed step budgets.

Metrics. Success Rate (SR), the fraction of tasks judged successful, is the primary metric. Diagnostics include Progress Rate (PR), the fraction of subtasks passed; False Complete (FC), episodes where the agent declares completion without success; Unexpected Side Effects (USE), episodes with unexpected state changes; and Overdue Termination (OT), episodes where the agent reaches the goal but continues until truncation.

Execution setup. The simulator is reset before each task, and agents observe only screenshots. Each task is assigned one of four step budgets (15, 30, 45, or 60), manually verified to comfortably exceed its optimal completion length. Tasks with AnswerSheet receive an additional 15-step budget.

4.4 Model-Calibrated Difficulty Strata

Motivated by benchmark-curation precedents such as BBH, which identifies hard tasks using prior model and human-rater performance Suzgun et al. (2023), four difficulty levels are assigned by post-hoc empirical calibration. We evaluate eight reference models (Gemini 3.1 Pro, Doubao-Seed-2.0-Pro, Qwen3.6-Plus, AutoGLM-Phone-9B, UI-TARS-1.5-8B, UI-Venus-1.5-8B, GUI-Owl-1.5-8B-Think, Step-GUI-4B) on the test set and stratify tasks by mean SR and PR: L1 (SR ≥ 75%, PR ≥ 75%, n = 20), L2 (remaining tasks with SR ≥ 25%, PR ≥ 50%, n = 73), L3 (remaining tasks with SR > 0, PR ≥ 25%, n = 83), and L4 (otherwise, n = 80). These are diagnostic strata rather than intrinsic labels, and the calibration excludes Qwen3-VL-4B-Instruct and its fine-tuned variants used in § 5.2. A reference-model robustness check is reported in Appendix I.

5 Experiments

We evaluate 9 agents on MobileGym-Bench (Table 2). Open-source models use 4 trials with re-sampled parameters; proprietary models use one due to API cost, with one additional run for Gemini 3.1 Pro, the strongest model, to estimate variation.

5.1 Benchmark Results

Column groups: Overall (%) = SR, PR; Difficulty SR (%) = L1–L4; Diagnostics (%) = FC, OT, USE.

Model | SR | PR | L1 (20) | L2 (73) | L3 (83) | L4 (80) | FC | OT | USE
Proprietary models
Gemini 3.1 Pro | 58.8 ± 1.4 | 72.1 | 97.5 | 83.6 | 63.3 | 21.9 | 34.0 | 0.2 | 5.5
Doubao-Seed-2.0-Pro | 52.0 † | 63.6 | 100.0 | 93.2 | 48.2 | 6.2 | 33.6 | 0.4 | 4.7
Qwen3.6-Plus | 45.7 † | 59.2 | 100.0 | 78.1 | 44.6 | 3.8 | 34.0 | 0.4 | 14.5
Open-source GUI-specialized models
AutoGLM-Phone-9B | 20.0 ± 1.3 | 35.3 | 86.2 | 33.6 | 9.6 | 1.9 | 39.6 | 0.6 | 12.6
UI-TARS-1.5-8B | 13.8 ± 1.7 | 26.3 | 77.5 | 21.9 | 3.0 | 1.6 | 38.6 | 0.2 | 11.0
UI-Venus-1.5-8B | 15.4 ± 2.4 | 28.3 | 85.0 | 21.9 | 6.0 | 1.9 | 22.9 | 0.5 | 7.7
GUI-Owl-1.5-8B-Think | 15.1 ± 0.9 | 28.8 | 76.2 | 26.0 | 4.2 | 1.2 | 30.4 | 0.9 | 14.1
Step-GUI-4B | 12.9 ± 1.1 | 25.7 | 83.8 | 17.8 | 2.4 | 1.6 | 37.0 | 0.8 | 7.6
Open-source generalist models
Qwen3-VL-4B-Instruct | 9.4 ± 0.6 | 20.1 | 71.2 | 12.3 | 0.6 | 0.3 | 15.9 | 0.4 | 10.0

Table 2: Main results on the MobileGym-Bench test set (256 tasks). Overall reports Success Rate (SR) and Progress Rate (PR); Difficulty SR reports SR within calibrated difficulty strata L1–L4, with task counts in parentheses; Diagnostics report False Complete (FC), Overdue Termination (OT), and Unexpected Side Effects (USE). ± denotes standard deviation across trials; † marks single-run results.
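The L1–L4 columns above come from the threshold cascade in § 4.4, which can be sketched as follows. The thresholds are the ones stated in the text; the function names and the per-task numbers are hypothetical, and the paper's actual calibration code may differ.

```python
from statistics import mean


def difficulty_stratum(mean_sr, mean_pr):
    """Map a task's mean SR and PR (both in percent) to a stratum label.

    Checks run in order, so each rule applies only to tasks not claimed
    by an earlier one ("remaining tasks" in the text).
    """
    if mean_sr >= 75 and mean_pr >= 75:
        return "L1"
    if mean_sr >= 25 and mean_pr >= 50:
        return "L2"
    if mean_sr > 0 and mean_pr >= 25:
        return "L3"
    return "L4"


def stratify(results):
    """results: {task_id: [(sr, pr), ...]} with one pair per reference model."""
    strata = {}
    for task_id, pairs in results.items():
        sr = mean(p[0] for p in pairs)
        pr = mean(p[1] for p in pairs)
        strata[task_id] = difficulty_stratum(sr, pr)
    return strata


# Made-up results for three reference models (the paper uses eight).
example = {
    "task_a": [(100, 100), (100, 100), (50, 75)],
    "task_b": [(50, 75), (25, 50), (0, 25)],
    "task_c": [(25, 50), (0, 25), (0, 25)],
    "task_d": [(0, 25), (0, 0), (0, 0)],
}
strata = stratify(example)
print(strata)
```

Both conditions must hold at each level, so a task with high SR but low PR drops to a harder stratum, and a task that no reference model ever solves lands in L4 regardless of partial progress.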
See § 4.3 for metrics and Appendix H for details.

Two observations stand out from Table 2. Additional experimental results are in Appendix H.

Difficulty stratification. SR decreases monotonically from L1 to L4 for all 9 models, while overall SR spans 9.4%–58.8%, giving a 6× performance range without top saturation or bottom floor effects. L1 already separates proprietary and open-source agents, and L4 acts as the frontier discriminator: only Gemini 3.1 Pro retains meaningful performance at 21.9%, while all other proprietary models reach at most 6.2% and all open-source GUI specialists at most 1.9%.

Unexpected side effects. USE captures unintended agent operations that modify state unrelated to the task. It does not simply decrease with model capability: across the 9 models it ranges from 4.7% to 14.5%, and even open-source GUI specialists with similar SRs (12.9–15.4%) differ nearly 2× in USE (7.6–14.1%). This diagnostic is enabled by MobileGym's full-environment state comparison. Screenshot or UI-tree judges cannot reliably expose off-target changes hidden in app-internal or backend state.

5.2 Sim-to-Real Transfer

We view this real-device experiment as an existence proof that training in MobileGym can produce behavior that survives real-device execution, not as a comprehensive sim-to-real study. We fine-tune Qwen3-VL-4B-Instruct with GRPO Shao et al. (2024) on MobileGym's 160-task train set for 10 steps, using a single node with 3 RTX Pro 6000s and 96 parallel environment instances. Key hyperparameters are lr = 10^-6, group size k = 8, batch size bs = 12, KL 0.01, and DAPO-style Yu et al. (2025) asymmetric clip-higher (0.2/0.28). The reward is a PR-shaped dense signal, with multiplicative penalties for AnswerSheet error, side effects, false completion, and overdue/post-success abort. Details are provided in Appendix G.

Training gains on the simulation side.
Training raises overall SR from 9.4% to 22.2% (+12.8 pt) on the 256-task MobileGym-Bench test set. Broken down by difficulty, SR changes from 71.2% to 92.5% on L1, 12.3% to 37.7% on L2, 0.6% to 11.7% on L3, and 0.3% to 1.2% on L4. The lift is largest on L2 and nearly flat on L4, suggesting that training is most effective where the base model already exhibits moderate capability, while the hardest tasks remain capacity-limited. The trained 4B model surpasses the 9B AutoGLM-Phone-9B on L1–L3, while both remain near zero on L4.

Real-device evaluation design. We evaluate on a Redmi Note 12 Turbo (1080 × 2400). We stratify the 256-task test set by the base/trained models' pass counts over four simulator rollouts: Uplift (base ≤ 1, trained ≥ 3; 26 tasks), Stable-pass (both ≥ 3; 21 tasks), Mid (all remaining cases; 20 tasks), Regression (base ≥ 3, trained ≤ 1; 0 tasks), and Stable-fail (both ≤ 1; 189 tasks). The three signal buckets (Uplift, Mid, Stable-pass) contain 67 tasks, of which 59 can be safely and equivalently run on the real device after excluding 8 tasks involving unreproducible account state or irreversible operations. Running all 189 Stable-fail tasks serially on a single real device would be costly and would require manual state restoration, while these tasks, by definition, exhibit no simulation-side training gain to transfer. We therefore randomly sample 15 as a negative-control check. The real-device setting differs from simulation in UI details, app data, real-app variability, and task entities such as contacts or POIs. Details are in Appendix H.5.

Figure 5: Sim-to-Real transfer of GRPO training gains. Per-bucket Success Rate on the 59-task signal-bucket subset and the overall Signal Total. In the legend, Sim/Real denotes the evaluation environment and Base/Trained denotes before/after GRPO. Sim columns are 4-seed averages; Real columns are pass@1 and all manually audited (Appendix J).

Results.
On the 59-task signal-bucket subset, training raises the real-device pass rate from 32.2% to 72.9% (+40.7 pt), closely matching the simulation-side increase from 33.9% to 76.7% (+42.8 pt). This corresponds to 95.1% retained gain. The absolute sim–real gaps are small for both the base model (1.7 pt) and the trained model (3.8 pt) (Figure 5). The 15 randomly sampled Stable-fail tasks yield 0/15 success for both models on the real device, consistent with simulation. Because the real-device environment and task entities differ from the simulator, the lift more plausibly reflects transferable policies than memorization.

Trajectory-length check. Successful trajectories on operate tasks have similar lengths in simulation and on device: 5.00 vs. 6.03 steps for the base model and 10.08 vs. 12.20 for the trained model. We exclude query/hybrid tasks because AnswerSheet adds simulator-only steps. Details are in Appendix H.6.

Failure-recovery example. In a real-device Reddit post-creation task, the selected community required a flair before the Post button could be enabled. The base model looped on the disabled button until the step budget was exhausted, while the trained model identified the missing flair requirement and succeeded in 22 steps (Appendix K).

VLM judge error analysis. Manual review of all 118 signal-bucket real-device trajectories identifies 12 errors for Qwen3.6-Plus (5/59 base, 7/59 trained; 10.2% overall). Re-judging the same trajectories with GPT-5.4 OpenAI (2026) also yields 12/118 errors, although on a partially different subset. These results suggest that VLM-judge errors are not specific to a single judge model, while programmatic state verification avoids this failure mode. Detailed counts are in Appendix J.

5.3 Efficiency Analysis

MobileGym runs in the browser with roughly 1/10 the memory and under 1/100 the disk footprint of emulator-based setups (Table 1).
In our measurements, 256 parallel instances on one server used <10% CPU and ~100 GB RAM, completing a full 256-task benchmark evaluation in about 6 minutes. By contrast, MAI-UI Zhou et al. (2025) reports requiring 10 bare-metal cloud servers (960 vCPUs, 3,840 GB RAM total) to reach 512 parallel Android-emulator instances for online RL. This single-node scalability makes concurrent online RL (§ 5.2) feasible without dedicated cluster infrastructure. Appendix N quantifies the API cost of using VLM
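The headline figures in § 5.2 and § 5.3 can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative, and treating "roughly 400 MB" as exactly 0.4 GB is an assumption made for the estimate.

```python
# Pass rates (percent) on the 59-task signal-bucket subset, as quoted in § 5.2.
sim_base, sim_trained = 33.9, 76.7
real_base, real_trained = 32.2, 72.9

sim_gain = sim_trained - sim_base      # simulation-side gain, in points
real_gain = real_trained - real_base   # real-device gain, in points
retained = real_gain / sim_gain        # share of the simulated gain that transfers

# Memory estimate for the 256-instance run in § 5.3.
instances = 256
mem_per_instance_gb = 0.4              # assumption: "roughly 400 MB" taken as 0.4 GB
total_mem_gb = instances * mem_per_instance_gb

# VLM-judge audit: 12 misjudged trajectories out of 118.
judge_error_rate = 12 / 118

print(f"sim gain {sim_gain:.1f} pt, real gain {real_gain:.1f} pt")
print(f"retained gain {retained:.1%}")
print(f"estimated memory {total_mem_gb:.1f} GB for {instances} instances")
print(f"judge error rate {judge_error_rate:.1%}")
```

The retained-gain ratio and the judge error rate reproduce the reported 95.1% and 10.2%, and the memory estimate is consistent with the measured figure of about 100 GB.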
