ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Developing AI agents capable of shopping online is difficult because existing testing environments force a choice between two extremes. Live websites offer high realism but are constantly changing, making them unpredictable and difficult to use for scientific research. Conversely, manually built sandbox environments are stable and easy to control, but they often lack the complexity and variety of real-world e-commerce sites. ShopGym addresses this bottleneck by providing a framework that creates realistic, self-contained, and reproducible simulation environments, allowing researchers to benchmark web agents with both consistency and accuracy.
Building Realistic Simulations with ShopArena
ShopArena is the simulation layer of the framework. It functions as a pipeline that transforms live e-commerce websites into "sandbox shops." Instead of cloning a specific store exactly, it extracts the structural and behavioral essence of a site (its navigation, product catalog, and filtering logic, for example) and uses that to build a synthetic, anonymized version. The process has two phases: an exploration phase that gathers data into a human-readable specification, and a generation phase that uses this specification to build a functional, code-based storefront. Because the specification is independent of the live site, researchers can edit these environments to test specific scenarios without re-crawling the original source.
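The two-phase split can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the summary does not describe ShopArena's actual specification format or generator, so `ShopSpec`, `Product`, and `generate_storefront` are invented names that only show how an editable specification decouples a sandbox shop from the live site.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Product:
    """One anonymized catalog entry (hypothetical schema)."""
    sku: str
    title: str
    category: str
    price: float


@dataclass(frozen=True)
class ShopSpec:
    """Stand-in for the specification produced by the exploration phase."""
    name: str
    categories: tuple[str, ...]
    filters: tuple[str, ...]
    products: tuple[Product, ...]


def generate_storefront(spec: ShopSpec) -> dict:
    """Stand-in for the generation phase: derive routes and filter widgets
    from the specification alone, without touching the live site."""
    routes = ["/", "/search"]
    routes += [f"/category/{c}" for c in spec.categories]
    routes += [f"/product/{p.sku}" for p in spec.products]
    return {"routes": routes, "filters": list(spec.filters)}


# Assumed output of a single exploration run (illustrative values only).
base_spec = ShopSpec(
    name="sandbox-shop",
    categories=("shoes", "bags"),
    filters=("price",),
    products=(
        Product("A1", "Trail Runner", "shoes", 59.0),
        Product("C3", "Canvas Tote", "bags", 35.0),
    ),
)

# Editing the specification yields a new scenario with no re-crawl.
variant_spec = replace(base_spec, filters=base_spec.filters + ("brand",))

base_shop = generate_storefront(base_spec)
variant_shop = generate_storefront(variant_spec)
```

In this sketch the variant shop gains a "brand" filter while its routes stay identical, which is the kind of controlled change the specification-first design is meant to allow.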
Generating Grounded Tasks with ShopGuru
Once a sandbox shop is created, ShopGuru generates the actual tasks for AI agents to perform. These tasks are "grounded," meaning they are specifically tailored to the unique catalog, policies, and navigation structure of the generated shop. ShopGuru creates two types of tasks: short-horizon tasks that test basic skills like searching for a product or applying a filter, and long-horizon tasks that simulate complex, multi-step shopping journeys. By using a verification loop, the framework ensures that every task is actually possible to complete within the specific shop, preventing the AI from being asked to perform actions that the environment does not support.
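As a rough illustration of grounding, the sketch below checks candidate tasks against a shop's catalog and keeps only those that can be completed. It is an assumption-laden simplification: the summary does not say how ShopGuru's verification loop works, so the `Task` fields, the `CATALOG` entries, and the feasibility rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A candidate short-horizon task (hypothetical fields)."""
    instruction: str
    category: str
    max_price: float


# Illustrative catalog for one sandbox shop; the values are made up.
CATALOG = [
    {"sku": "A1", "category": "shoes", "price": 59.0},
    {"sku": "B2", "category": "shoes", "price": 120.0},
    {"sku": "C3", "category": "bags", "price": 35.0},
]


def is_feasible(task: Task, catalog: list[dict]) -> bool:
    """A task counts as feasible if at least one product satisfies it."""
    return any(
        item["category"] == task.category and item["price"] <= task.max_price
        for item in catalog
    )


def verify_tasks(candidates: list[Task], catalog: list[dict]) -> list[Task]:
    """Keep only the tasks the shop can actually support."""
    return [task for task in candidates if is_feasible(task, catalog)]


candidates = [
    Task("Find shoes under $100", "shoes", 100.0),
    Task("Find a bag under $20", "bags", 20.0),   # no bag is that cheap
    Task("Find a hat under $50", "hats", 50.0),   # category does not exist
]

verified = verify_tasks(candidates, CATALOG)
```

Here only the first task survives, because the other two ask for products the catalog does not contain. A real verification loop would presumably also check navigation paths and shop policies, which this sketch omits.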
Validating Performance and Reliability
To evaluate the framework, the authors tested ShopGym using 224 tasks across six different sandbox shops, half of which were built using synthetic data and half using real-world data. The results showed that these simulated environments preserved the key structural properties of live storefronts. Most importantly, the research found a positive correlation between how well an agent performed in the ShopGym sandbox and how well it performed on live websites. This suggests that ShopGym provides a reliable, stable, and scalable way to evaluate agent capabilities without the noise and unpredictability of testing on live, changing websites.
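One way to check this kind of sandbox-to-live agreement is a correlation coefficient over per-agent success rates. The sketch below computes a Pearson correlation in plain Python. The summary does not state which statistic the authors used, and the success rates shown are placeholder values invented for illustration, not results from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Placeholder success rates for four hypothetical agents (NOT paper data).
sandbox_success = [0.42, 0.55, 0.61, 0.70]
live_success = [0.35, 0.50, 0.52, 0.66]

r = pearson(sandbox_success, live_success)
print(f"sandbox vs. live correlation: {r:.2f}")
```

A value near 1 would mean the sandbox ranks agents much as live sites do. With only a handful of agents, a rank-based measure such as Spearman's coefficient may be a more robust choice.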