ClawGym is a scalable framework designed to support the full development lifecycle of "Claw-style" personal AI agents. These agents operate directly within a user's computer environment, managing local files, executing tools, and navigating persistent workspace state to complete multi-step tasks. Because they must handle ambiguous instructions and complex, real-world digital workflows, they require specialized, environment-grounded training and evaluation, which standard text-based benchmarks do not support. ClawGym addresses this by providing a systematic way to synthesize training data, train capable agent models, and rigorously evaluate their performance.
Generating Scalable Training Data
To overcome the scarcity of high-quality data for environment-grounded agents, the framework introduces ClawGym-SynData, a collection of 13.5K tasks. The team uses a dual-route synthesis strategy to ensure both variety and realism. The "top-down" route starts from user personas and scenarios to create tasks that reflect realistic daily needs, while the "bottom-up" route composes tasks from specific, executable OpenClaw skills to cover concrete technical capabilities. Each task is paired with a mock workspace and a hybrid verification system that combines code-based checks with qualitative rubrics, so that an agent's actions can be verified as both objectively correct and aligned with user intent.
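To make the hybrid verification idea concrete, here is a minimal Python sketch. Every name in it (Task, verify, the judge_score callback, the 0.8 alignment threshold, and the example file-archiving task) is an illustrative assumption, not the actual ClawGym-SynData API.

```python
# Hedged sketch of hybrid verification: a deterministic code check
# plus a qualitative rubric scored by an LLM judge. All interfaces
# here are hypothetical, not the published ClawGym API.
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class Task:
    instruction: str                    # natural-language task description
    workspace: Path                     # mock workspace the agent operates in
    code_check: Callable[[Path], bool]  # executable, objective verifier
    rubric: list[str]                   # qualitative criteria for an LLM judge

def verify(task: Task, judge_score: Callable[[Path, list[str]], float]) -> bool:
    """A task passes only if the code-based check succeeds AND a
    rubric-based judge rates the outcome as aligned with user intent."""
    objective_ok = task.code_check(task.workspace)
    alignment = judge_score(task.workspace, task.rubric)  # e.g. in [0, 1]
    return objective_ok and alignment >= 0.8  # threshold is an assumption

# Example: a file-organization task with a simple programmatic check.
task = Task(
    instruction="Archive all .log files older than 30 days into logs/archive/",
    workspace=Path("/tmp/mock_ws"),
    code_check=lambda ws: (ws / "logs" / "archive").exists(),
    rubric=[
        "No unrelated files were moved or deleted",
        "Final directory layout matches the user's request",
    ],
)
```

Pairing the two signals is the key design point: the code check catches agents that merely claim success, while the rubric catches actions that technically pass the check but violate the user's intent.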
Training Capable Agents
Using the synthesized dataset, the researchers developed a family of models called ClawGym-Agents. These models are trained through supervised fine-tuning on high-fidelity interaction trajectories collected from black-box rollouts within the OpenClaw environment. To further improve performance, the framework includes a lightweight reinforcement learning pipeline that parallelizes task execution across multiple sandboxes. This allows the agents to learn from their own successes and failures in a controlled, efficient manner, leading to more reliable performance in complex, multi-step workflows.
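The summary does not specify the pipeline's internals, but the parallel-sandbox rollout pattern it describes can be sketched as below. The sandbox stub, the placeholder verifier, and the policy interface are hypothetical stand-ins: a real pipeline would clone the task workspace, run the agent's full tool loop, and apply a policy-gradient update on the collected rewards.

```python
# Minimal sketch of parallel rollouts across isolated sandboxes,
# under assumed interfaces; not the actual ClawGym RL pipeline.
import random
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

@contextmanager
def sandbox(task):
    """Stand-in for an isolated, disposable copy of the task workspace."""
    yield {"task": task, "steps": []}  # real version would clone files, tools

def rollout(task, policy):
    """Run one task to completion in its own sandbox and score the outcome."""
    with sandbox(task) as sb:
        # The agent would iteratively call tools here; we fake a single step.
        sb["steps"].append(policy(task))
        reward = 1.0 if random.random() < 0.5 else 0.0  # placeholder verifier
    return sb["steps"], reward

def rl_step(tasks, policy, n_workers=8):
    """Collect rollouts in parallel across sandboxes, then update the policy."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        batch = list(pool.map(lambda t: rollout(t, policy), tasks))
    # A real pipeline would run a policy-gradient update on `batch`;
    # here we just report the mean reward.
    return sum(r for _, r in batch) / len(batch)

# Usage: rl_step(tasks, policy=lambda t: "noop")
```

Running each task in its own disposable sandbox is what makes the loop both safe (failures cannot corrupt a shared workspace) and efficient (rollouts proceed concurrently).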
Evaluating Performance
Reliable evaluation is a core component of the framework, addressed by ClawGym-Bench. This benchmark consists of 200 carefully selected instances that were filtered through both automated difficulty calibration and human-LLM review to ensure they are challenging and representative. Testing shows that models trained with the ClawGym framework achieve significant performance gains. For example, smaller models like Qwen3-8B saw improvements of over 38% on existing benchmarks and over 43% on the new ClawGym-Bench. Larger models also demonstrated clear progress, confirming that the framework effectively bridges the gap between general language modeling and practical, environment-grounded digital execution.
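One plausible reading of the automated difficulty calibration step is a pass-rate filter: run a reference agent several times on each candidate task and keep only instances that are solved sometimes but not always. The sketch below assumes a hypothetical solve() interface and threshold values; the actual criteria used to build ClawGym-Bench are not detailed here.

```python
# Hedged sketch of pass-rate-based difficulty calibration; the solve()
# interface, trial count, and thresholds are assumptions.
def calibrate(candidates, solve, trials=5, lo=0.1, hi=0.9):
    """Keep tasks a reference agent solves sometimes but not always,
    dropping instances that are trivially easy or effectively impossible."""
    kept = []
    for task in candidates:
        pass_rate = sum(solve(task) for _ in range(trials)) / trials
        if lo <= pass_rate <= hi:  # mid-band difficulty survives
            kept.append(task)      # survivors then go to human-LLM review
    return kept
```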