ClawGym: A Scalable Framework for Building Effectiv...

Key Takeaways

  • ClawGym is a scalable framework that supports the full development lifecycle of "Claw-style" personal AI agents, which operate directly within a user's computer environment, managing local files, executing tools, and navigating persistent workspace states.
  • ClawGym-SynData contributes 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification.
  • ClawGym-Agents are trained with supervised fine-tuning on black-box rollout trajectories and further improved with a lightweight reinforcement learning pipeline.
  • ClawGym-Bench provides 200 instances calibrated through automated filtering and human-LLM review for reliable, diagnostic evaluation.
  • Relevant resources will soon be released at this https URL.
Paper Abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at this https URL.

ClawGym is a scalable framework designed to support the full development lifecycle of "Claw-style" personal AI agents. These agents are unique because they operate directly within a user's computer environment, managing local files, executing tools, and navigating persistent workspace states to complete multi-step tasks. Because these agents must handle ambiguous instructions and complex, real-world digital workflows, they require specialized training and evaluation that standard text-based AI benchmarks cannot provide. ClawGym addresses this by providing a systematic way to synthesize training data, train capable agent models, and rigorously evaluate their performance.
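
To make this concrete, the interaction pattern can be sketched as a simple loop over a persistent workspace. The paper does not publish the agent interface, so the Workspace class, the action format, and model.next_action below are illustrative assumptions rather than the actual OpenClaw API.

```python
# A minimal sketch of a Claw-style agent loop (illustrative only: the
# Workspace class, the action format, and model.next_action are assumed
# names, not the published OpenClaw interface).
import subprocess
from pathlib import Path

class Workspace:
    """A persistent local workspace the agent reads and writes across steps."""
    def __init__(self, root: str):
        self.root = Path(root)

    def read_file(self, rel_path: str) -> str:
        return (self.root / rel_path).read_text()

    def write_file(self, rel_path: str, content: str) -> None:
        dest = self.root / rel_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(content)

    def run_tool(self, cmd: list[str]) -> str:
        # Execute a tool inside the workspace and return its combined output.
        result = subprocess.run(cmd, cwd=self.root, capture_output=True, text=True)
        return result.stdout + result.stderr

def run_agent(model, task: str, ws: Workspace, max_steps: int = 20) -> list[dict]:
    """Multi-step loop: the model proposes an action, the workspace executes
    it, and the observation feeds back into the model's next decision."""
    trajectory, observation = [], f"Task: {task}"
    for _ in range(max_steps):
        action = model.next_action(observation)  # e.g. {"tool": "run", "args": [...]}
        if action["tool"] == "done":
            break
        observation = ws.run_tool(action["args"])
        trajectory.append({"action": action, "observation": observation})
    return trajectory
```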

Generating Scalable Training Data

To overcome the lack of high-quality data for environment-grounded agents, the framework introduces ClawGym-SynData, a collection of 13.5K tasks. The team uses a dual-route synthesis strategy to ensure both variety and realism. The "top-down" approach starts with user personas and scenarios to create tasks that reflect realistic daily needs. The "bottom-up" approach focuses on technical capabilities, composing tasks from specific, executable OpenClaw skills. Each task is paired with a mock workspace and a hybrid verification system—using both code-based checks and qualitative rubrics—to ensure the agent’s actions are objectively correct and aligned with user intent.
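
As a concrete illustration, a synthesized task and its hybrid verifier might look like the following. The schema, field names, hybrid_verify, and the judge interface are assumptions for exposition; the released ClawGym-SynData format may differ.

```python
# An illustrative sketch of a synthesized task with hybrid verification.
# ClawGym-SynData's schema has not been released, so the field names, the
# check function, and the judge interface below are assumptions.
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class ClawTask:
    intent: str                         # persona-driven user request (top-down route)
    skills: list[str]                   # executable OpenClaw skills (bottom-up route)
    workspace_fixture: dict[str, str]   # mock files seeded before the rollout
    code_check: Callable[[Path], bool]  # objective, executable verifier
    rubric: list[str]                   # qualitative criteria scored by an LLM judge

def hybrid_verify(task: ClawTask, workspace: Path, judge) -> bool:
    """The code check must pass AND the judge must confirm every rubric
    criterion, so the result is both objectively correct and intent-aligned."""
    if not task.code_check(workspace):
        return False
    return all(judge.score(criterion, workspace) for criterion in task.rubric)

# Example task: add a date prefix to every note in the notes/ folder.
task = ClawTask(
    intent="Organize my meeting notes by date",
    skills=["fs.list", "fs.rename"],
    workspace_fixture={"notes/standup.txt": "...", "notes/retro.txt": "..."},
    code_check=lambda ws: all(p.name[:4].isdigit() for p in (ws / "notes").glob("*.txt")),
    rubric=["Original note contents are preserved", "Only .txt files were renamed"],
)
```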

Training Capable Agents

Using the synthesized dataset, the researchers developed a family of models called ClawGym-Agents. These models are trained through supervised fine-tuning on high-fidelity interaction trajectories collected from black-box rollouts within the OpenClaw environment. To further improve performance, the framework includes a lightweight reinforcement learning pipeline that parallelizes task execution across multiple sandboxes. This allows the agents to learn from their own successes and failures in a controlled, efficient manner, leading to more reliable performance in complex, multi-step workflows.
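
The parallel-rollout idea can be sketched as follows, with placeholder interfaces for the agent runner, verifier, and policy update, since the paper's pipeline API is not described here. Each task runs in its own throwaway sandbox, so concurrent rollouts cannot interfere with one another's files.

```python
# Sketch of the lightweight RL pipeline's parallel rollouts. The sandbox
# seeding, agent runner, verifier, and policy update are placeholders,
# not the paper's actual API.
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def rollout_in_sandbox(task, policy, run_agent, verify):
    """Execute one task in an isolated copy of its mock workspace and score
    the outcome with the hybrid verifier as a sparse reward."""
    sandbox = Path(tempfile.mkdtemp(prefix="clawgym_"))
    try:
        for rel_path, content in task.workspace_fixture.items():
            dest = sandbox / rel_path
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(content)                    # seed the mock workspace
        trajectory = run_agent(policy, task.intent, sandbox)
        reward = 1.0 if verify(task, sandbox) else 0.0  # code checks + rubric judge
        return trajectory, reward
    finally:
        shutil.rmtree(sandbox, ignore_errors=True)

def rl_step(tasks, policy, run_agent, verify, update_policy, n_workers=16):
    # Fan rollouts out across per-task sandboxes, then apply one policy update
    # (e.g. a policy-gradient step) on the collected (trajectory, reward) pairs.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda t: rollout_in_sandbox(t, policy, run_agent, verify), tasks))
    update_policy(policy, results)
```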

Evaluating Performance

Reliable evaluation is a core component of the framework, addressed by ClawGym-Bench. This benchmark consists of 200 carefully selected instances that were filtered through both automated difficulty calibration and human-LLM review to ensure they are challenging and representative. Testing shows that models trained with the ClawGym framework achieve significant performance gains. For example, smaller models like Qwen3-8B saw improvements of over 38% on existing benchmarks and over 43% on the new ClawGym-Bench. Larger models also demonstrated clear progress, confirming that the framework effectively bridges the gap between general language modeling and practical, environment-grounded digital execution.
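
Under the same placeholder interfaces, scoring a model on a benchmark of this kind reduces to running each instance in a fresh sandbox and reporting the pass rate, roughly:

```python
# A sketch of benchmark scoring in the same spirit: run every instance in a
# fresh sandbox and report the fraction that pass hybrid verification. This
# reuses the rollout sketch above and is not the released evaluation harness.
def evaluate(bench_tasks, policy, run_agent, verify) -> float:
    rewards = [rollout_in_sandbox(t, policy, run_agent, verify)[1] for t in bench_tasks]
    return sum(rewards) / len(rewards)  # task success rate across the instances
```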
