Key Takeaways

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison.
We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible.
We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents.
ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks.

Paper Abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Developing AI agents capable of shopping online is difficult because existing testing environments force a choice between two extremes. Live websites offer high realism but are constantly changing, making them unpredictable and difficult to use for scientific research. Conversely, manually built sandbox environments are stable and easy to control, but they often lack the complexity and variety of real-world e-commerce sites. ShopGym addresses this bottleneck by providing a framework that creates realistic, self-contained, and reproducible simulation environments, allowing researchers to benchmark web agents with both consistency and accuracy.

Building Realistic Simulations with ShopArena

ShopArena is the simulation layer of the framework. It functions as a pipeline that transforms live e-commerce websites into "sandbox shops." Instead of trying to create an exact clone of a specific store, it extracts the structural and behavioral essence of a site (such as its navigation, product catalog, and filtering logic) to build a synthetic, anonymized version. This process is split into two phases: an exploration phase that gathers data into a human-readable specification, and a generation phase that uses this specification to build a functional, code-based storefront. Because the specification is independent of the live site, researchers can edit these environments to test specific scenarios without needing to re-crawl the original source.
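The two-phase split can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `ShopSpec`, `Product`, `explore`, and `generate` are hypothetical names, and the real ShopArena specification format and generation process are not described in this summary. The sketch only shows why a specification that stands apart from the live site can be edited and regenerated without another crawl.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Product:
    sku: str
    title: str
    category: str
    price: float


@dataclass(frozen=True)
class ShopSpec:
    """Hypothetical, simplified stand-in for an anonymized shop specification."""
    name: str
    categories: tuple[str, ...]
    filters: tuple[str, ...]  # facets the storefront exposes
    products: tuple[Product, ...] = field(default=())


def explore(seed_observations: dict) -> ShopSpec:
    """Exploration phase (toy): turn recorded observations into a spec."""
    products = tuple(Product(**p) for p in seed_observations["products"])
    categories = tuple(sorted({p.category for p in products}))
    return ShopSpec(
        name="shop-001",  # anonymized placeholder, not the seed shop's name
        categories=categories,
        filters=tuple(seed_observations["filters"]),
        products=products,
    )


def generate(spec: ShopSpec) -> dict:
    """Generation phase (toy): build storefront pages from the spec alone."""
    pages = {"/": list(spec.categories)}
    for category in spec.categories:
        pages[f"/c/{category}"] = [
            p.sku for p in spec.products if p.category == category
        ]
    return {"pages": pages, "filters": list(spec.filters)}


# Made-up observations standing in for what an exploration pass might record.
observations = {
    "filters": ["price", "brand"],
    "products": [
        {"sku": "A1", "title": "Trail Shoe", "category": "shoes", "price": 89.0},
        {"sku": "B2", "title": "Rain Jacket", "category": "outerwear", "price": 120.0},
    ],
}

spec = explore(observations)
# Edit the spec directly to test a scenario: the same shop without a brand filter.
variant = replace(spec, filters=("price",))
shop = generate(variant)
```

Note that `generate` reads only the spec, never the observations, so the variant shop is built without touching the seed data again.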

Generating Grounded Tasks with ShopGuru

Once a sandbox shop is created, ShopGuru generates the actual tasks for AI agents to perform. These tasks are "grounded," meaning they are specifically tailored to the unique catalog, policies, and navigation structure of the generated shop. ShopGuru creates two types of tasks: short-horizon tasks that test basic skills like searching for a product or applying a filter, and long-horizon tasks that simulate complex, multi-step shopping journeys. By using a verification loop, the framework ensures that every task is actually possible to complete within the specific shop, preventing the AI from being asked to perform actions that the environment does not support.
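A verification step of this kind might look like the sketch below. The summary does not say how ShopGuru checks feasibility, so this uses one plausible check as an assumption: a task is kept only if its goal page is reachable from its start page within a step budget, found by breadth-first search over the shop's page graph. `Task`, `shortest_path_length`, and `verify` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    instruction: str
    start: str
    goal: str       # page the agent must reach
    max_steps: int  # step budget for the task


def shortest_path_length(links, start, goal):
    """Breadth-first search over the page graph; returns None if unreachable."""
    if start not in links:
        return None
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if page == goal:
            return depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


def verify(tasks, links):
    """Keep only tasks whose goal is reachable within the step budget."""
    kept = []
    for task in tasks:
        dist = shortest_path_length(links, task.start, task.goal)
        if dist is not None and dist <= task.max_steps:
            kept.append(task)
    return kept


# Toy page graph for a sandbox shop (assumed structure, for illustration only).
links = {
    "/": ["/c/shoes", "/cart"],
    "/c/shoes": ["/p/A1"],
    "/p/A1": ["/cart"],
    "/cart": [],
}

candidates = [
    Task("Open the Trail Shoe product page", "/", "/p/A1", max_steps=3),
    Task("Open the gift card page", "/", "/gift-cards", max_steps=5),  # page does not exist
    Task("Open the Trail Shoe product page in one click", "/", "/p/A1", max_steps=1),
]

feasible = verify(candidates, links)
```

Only the first candidate survives: the second targets a page the shop does not have, and the third needs two clicks but allows one.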

Validating Performance and Reliability

To validate the framework, the authors combined graph-based structural analysis with agent-based behavioral evaluation, using 224 generated tasks across six sandbox shops: three built with synthetic data and three with real data. The results indicate that the synthetic shops preserve key structural properties of live storefronts. Most importantly, agent performance on the synthetic shops was positively correlated with performance on live storefronts. This suggests that ShopGym can serve as a stable, scalable proxy for evaluating agent capabilities without the noise and unpredictability of testing on live, changing websites.
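As a rough illustration of what a positive correlation between sandbox and live performance means, the sketch below computes a Pearson coefficient over per-agent success rates. The choice of Pearson is an assumption, since this summary does not say which statistic the authors used, and the numbers are placeholders for four hypothetical agents, not results from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Placeholder success rates for four hypothetical agents; NOT data from the paper.
sandbox_success = [0.2, 0.4, 0.6, 0.8]
live_success = [0.1, 0.3, 0.4, 0.7]

r = pearson_r(sandbox_success, live_success)
print(f"Pearson r between sandbox and live success rates: {r:.2f}")
```

A value near 1 would mean that agents ranking higher in the sandbox also tend to rank higher on live sites, which is the property that makes a sandbox useful as a proxy.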
