Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
In large-scale food-delivery marketplaces like DoorDash, dispatch systems must constantly balance competing goals: delivering food quickly to customers while using couriers efficiently through batching. Traditionally, these systems rely on manually tuned, static weights to manage this tradeoff. This paper introduces a reinforcement learning (RL) system that adapts these weights dynamically in real time. Instead of replacing the core dispatch optimizer, the system acts as an "outer layer" that adjusts the optimizer’s priorities based on local marketplace conditions, such as courier availability and order volume.
A Constrained Approach to Optimization
Rather than giving an AI full control over every delivery assignment, which could risk operational instability, the researchers designed a constrained interface. A store-level RL agent observes local data and selects a multiplier that shifts the dispatch optimizer’s balance between speed and batching. A lower multiplier makes the optimizer prioritize batching and efficiency; a higher multiplier makes it prioritize speed. This approach lets the platform benefit from machine learning while keeping the existing, proven optimization logic and safety guardrails intact.
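To make the interface concrete, here is a minimal Python sketch of how a single multiplier could tilt an optimizer between batching and speed. Everything in it is an illustrative assumption rather than the paper's implementation: the action set MULTIPLIERS, the Candidate fields, the two-term cost, and the clamping guardrail are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical discrete action set; the source does not give the actual values.
MULTIPLIERS = (0.5, 1.0, 1.5, 2.0)


@dataclass(frozen=True)
class Candidate:
    """A hypothetical assignment option scored by the dispatch optimizer."""
    name: str
    delivery_minutes: float  # customer-facing time cost
    courier_minutes: float   # courier-side time cost


def clamp_multiplier(raw: float) -> float:
    """Guardrail: keep the agent's output inside the allowed range."""
    return max(MULTIPLIERS[0], min(MULTIPLIERS[-1], raw))


def dispatch_cost(c: Candidate, multiplier: float) -> float:
    """Assumed two-term objective: the multiplier scales only the speed term."""
    return multiplier * c.delivery_minutes + c.courier_minutes


def best_candidate(candidates, multiplier: float) -> Candidate:
    """The optimizer itself is unchanged; only its speed weight moves."""
    m = clamp_multiplier(multiplier)
    return min(candidates, key=lambda c: dispatch_cost(c, m))


batched = Candidate("batched", delivery_minutes=34.0, courier_minutes=18.0)
solo = Candidate("solo", delivery_minutes=26.0, courier_minutes=30.0)
```

With these made-up numbers, a multiplier of 0.5 selects the batched option and a multiplier of 2.0 selects the solo option. The agent never touches individual assignments; it only moves one bounded scalar.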
Learning from Delayed Feedback
Training an AI for a marketplace is difficult because feedback is often delayed and "noisy." For example, a dispatch decision made now might not result in a completed delivery for some time, and that outcome is influenced by many external factors. To address this, the researchers used offline reinforcement learning. They trained a shared value function on historical data, combining local store states with regional performance outcomes. They also applied a "conservative regularizer" during training to keep the model from overestimating the value of actions that were not well supported by the logged data, which helps the policy remain stable when deployed.
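The summary does not say which conservative regularizer was used. As one plausible illustration, the sketch below uses a CQL-style penalty for a discrete action set: it adds the gap between the log-sum-exp of all action values and the value of the logged action to an ordinary squared TD error. The function names, the alpha weight, and the choice of penalty are assumptions made here for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def conservative_loss(q_values, logged_action, td_target, alpha=1.0):
    """Per-sample loss: squared TD error plus a CQL-style penalty.

    q_values      -- estimated value of each multiplier in this state
    logged_action -- index of the multiplier actually taken in the log
    td_target     -- bootstrapped target built from the delayed reward
    alpha         -- hypothetical weight on the conservative term
    """
    q_logged = q_values[logged_action]
    td_error = 0.5 * (q_logged - td_target) ** 2
    # The penalty grows when any action scores well above the logged one,
    # so minimizing it pushes down values that the data does not support.
    penalty = logsumexp(q_values) - q_logged
    return td_error + alpha * penalty
```

The penalty is never negative, because the log-sum-exp is at least as large as the biggest action value. An inflated estimate for an action that rarely appears in the logs therefore raises the loss even when the TD error on the logged action is zero.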
Real-World Performance
The system was tested in a production environment using a switchback experiment, where different geographic regions were randomly assigned to either the new RL-based policy or the traditional baseline. The RL-trained policy increased batching and reduced courier-side time costs without negatively impacting customer delivery speed. During peak dinner hours, it also reduced late deliveries.
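As a rough sketch of how such an assignment might be made reproducible, the Python below hashes a (region, time window) pair to pick an arm. The use of time windows, the window length, the salt, and the function names are assumptions for illustration; the summary only states that regions were randomly assigned to the two policies.

```python
import hashlib

ARMS = ("baseline", "rl_policy")


def window_index(epoch_seconds: float, window_minutes: int = 60) -> int:
    """Map a timestamp to a switchback window (hypothetical 60-minute default)."""
    return int(epoch_seconds // (window_minutes * 60))


def assign_arm(region_id: str, window: int, salt: str = "exp-v1") -> str:
    """Deterministic pseudo-random arm for one region in one time window.

    Hashing means any service can recompute the assignment without
    shared state, and changing the salt reshuffles the whole experiment.
    """
    key = f"{salt}:{region_id}:{window}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return ARMS[digest[0] % len(ARMS)]
```

For example, assign_arm("region-12", window_index(timestamp)) returns the same arm for every order from that region within the same window, which keeps all stores in a region on one policy at a time.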
Considerations for Deployment
While the system proved effective, the authors note that it currently operates through a narrow, low-dimensional control interface to ensure reliability. Because the rewards are based on regional outcomes, there is a degree of noise in how credit is assigned to individual store-level decisions. Future development will focus on better understanding how these agents interact in a multi-agent environment and improving the ability to detect when marketplace dynamics shift, ensuring the policy remains effective as the platform evolves.