Key Takeaways

  • We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals.
  • The policy controls only a discrete multiplier on the dispatch optimizer's objective; this narrow interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards.
  • We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation.
  • In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality.
Paper Abstract

Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

In large-scale food-delivery marketplaces like DoorDash, dispatch systems must constantly balance competing goals: delivering food quickly to customers while using couriers efficiently through batching. Traditionally, these systems rely on manually tuned, static weights to manage this tradeoff. This paper introduces a reinforcement learning (RL) system that adapts these weights dynamically. Instead of replacing the core dispatch optimizer, the system acts as an "outer layer" that adjusts the optimizer’s priorities based on local marketplace conditions, such as courier availability and order volume.

A Constrained Approach to Optimization

Rather than giving an AI full control over every delivery assignment, which could risk operational instability, the researchers designed a constrained interface. A store-level RL agent observes local data and selects a discrete multiplier that shifts the dispatch optimizer’s tradeoff between delivery quality and batching efficiency. If the agent selects a lower multiplier, the optimizer prioritizes batching and efficiency; a higher multiplier pushes it to prioritize speed. This approach lets the platform benefit from machine learning while keeping the existing, proven optimization logic and safety guardrails intact.
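
The interface described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the multiplier values, the `Candidate` fields, and the additive cost form are hypothetical, and the production optimizer solves a combinatorial assignment problem rather than picking from a short list.

```python
from dataclasses import dataclass

# Hypothetical discrete action set; the source does not list the real values.
MULTIPLIERS = (0.5, 0.75, 1.0, 1.25, 1.5)


@dataclass(frozen=True)
class Candidate:
    """One feasible assignment option produced by the existing optimizer."""
    name: str
    quality_cost: float  # customer-facing cost, e.g. expected lateness
    courier_cost: float  # courier-side time cost; batching usually lowers it


def weighted_cost(candidate: Candidate, multiplier: float) -> float:
    # The multiplier scales only the quality term, so it shifts the tradeoff
    # without changing which candidates are feasible.
    return multiplier * candidate.quality_cost + candidate.courier_cost


def choose_assignment(candidates, multiplier):
    # Reject anything outside the allowed set: the policy can only pick
    # from a small, pre-approved list of multipliers.
    if multiplier not in MULTIPLIERS:
        raise ValueError(f"multiplier {multiplier} is outside the allowed set")
    return min(candidates, key=lambda c: weighted_cost(c, multiplier))


options = [
    Candidate("batched", quality_cost=6.0, courier_cost=4.0),
    Candidate("solo", quality_cost=2.0, courier_cost=9.0),
]
print(choose_assignment(options, 0.5).name)  # batched (7.0 vs 10.0)
print(choose_assignment(options, 1.5).name)  # solo (12.0 vs 13.0)
```

Because the learned policy only chooses among a few multipliers, a poor choice can shift the tradeoff but cannot produce an infeasible assignment in this sketch.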

Learning from Delayed Feedback

Training an AI for a marketplace is difficult because feedback is often delayed and "noisy." For example, a dispatch decision made now might not result in a completed delivery for some time, and that outcome is influenced by many external factors. To address this, the researchers used offline reinforcement learning. They trained a shared value function centrally on logged historical data, combining local store states with regional performance outcomes, and executed the resulting policy separately at each store. They also used Double Q-learning targets and a "conservative regularizer" during training to keep the model from overestimating the value of actions that were poorly supported by the logged data, which helps the policy remain stable when deployed.
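
A rough sketch of the two training ingredients named above, for a single logged transition. The source does not specify the exact form of the regularizer, so the log-sum-exp penalty below (in the style of conservative Q-learning) is an assumption, as are the function names, `gamma`, and `alpha`.

```python
import math


def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double Q-learning target: the online network picks the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def conservative_penalty(q_values, logged_action):
    """Assumed CQL-style penalty: log-sum-exp over all actions minus the
    Q-value of the action actually taken in the logs. It is never negative
    and grows when unlogged actions are valued above the logged one."""
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[logged_action]


def transition_loss(q_values, logged_action, target, alpha=1.0):
    # Squared TD error on the logged action plus the weighted penalty.
    td_error = q_values[logged_action] - target
    return 0.5 * td_error ** 2 + alpha * conservative_penalty(q_values, logged_action)


# Example with three hypothetical multiplier actions.
target = double_q_target(1.0, [0.2, 0.9, 0.1], [0.5, 0.3, 2.0], gamma=0.5)
print(target)  # online net picks action 1; target net values it at 0.3
print(transition_loss([0.4, 1.0, 0.2], logged_action=1, target=target))
```

Note that the target uses 0.3 rather than the target network's own maximum of 2.0; decoupling action selection from evaluation is what limits overestimation in Double Q-learning.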

Real-World Performance

The system was tested in production using a switchback experiment, in which different geographic regions were randomly assigned to either the new RL-based policy or the traditional baseline. The RL-trained policy increased batching and reduced courier-side time costs without degrading customer-facing delivery quality. During peak dinner hours, it also reduced late deliveries.
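
As a rough illustration of how such an experiment can be set up and read out, the sketch below assumes the common switchback layout in which each pair of region and time window is randomized independently; the source only states that regions were randomly assigned. The arm names, the seed, and the simple difference-in-means estimate are all hypothetical.

```python
import random
from statistics import mean

ARMS = ("rl_policy", "baseline")


def assign_switchback(regions, n_windows, seed=0):
    """Randomly assign every (region, window) unit to one arm."""
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    return {(r, w): rng.choice(ARMS) for r in regions for w in range(n_windows)}


def effect_estimate(assignment, metric):
    """Difference in mean metric between treated and control units."""
    treated = [metric[u] for u, arm in assignment.items() if arm == "rl_policy"]
    control = [metric[u] for u, arm in assignment.items() if arm == "baseline"]
    if not treated or not control:
        raise ValueError("both arms need at least one unit")
    return mean(treated) - mean(control)


schedule = assign_switchback(["region_a", "region_b"], n_windows=4, seed=1)
print(len(schedule))  # 8 units: 2 regions x 4 windows

# Hand-built example: orders per courier trip in two units of one region.
manual = {("region_a", 0): "rl_policy", ("region_a", 1): "baseline"}
batching = {("region_a", 0): 1.5, ("region_a", 1): 1.0}
print(effect_estimate(manual, batching))  # 0.5
```

A real analysis would also need to account for carryover between adjacent windows and for correlation within a region, which this difference in means ignores.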

Considerations for Deployment

While the system proved effective, the authors note that it currently operates through a narrow, low-dimensional control interface to ensure reliability. Because the rewards are based on regional outcomes, there is a degree of noise in how credit is assigned to individual store-level decisions. Future development will focus on better understanding how these agents interact in a multi-agent environment and improving the ability to detect when marketplace dynamics shift, ensuring the policy remains effective as the platform evolves.
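
The credit-assignment noise mentioned above can be seen in a small sketch of how store-level training tuples might be built from regional outcomes. The record layout, the field names, and the rule of skipping windows whose outcome has not yet arrived are assumptions for illustration, not the paper's pipeline.

```python
from collections import namedtuple

Transition = namedtuple("Transition", "store_id state action reward next_state")


def attach_regional_rewards(store_logs, regional_reward):
    """Join store-level decisions with the reward observed for their region.

    store_logs: list of dicts with keys store_id, region, window, state,
                action, next_state (hypothetical schema).
    regional_reward: dict mapping (region, window) to a float reward.
    """
    transitions = []
    for row in store_logs:
        key = (row["region"], row["window"])
        if key not in regional_reward:
            continue  # delayed outcome not observed yet; leave the row out
        transitions.append(
            Transition(row["store_id"], row["state"], row["action"],
                       regional_reward[key], row["next_state"])
        )
    return transitions


logs = [
    {"store_id": "s1", "region": "r1", "window": 0,
     "state": (3, 5), "action": 0, "next_state": (2, 6)},
    {"store_id": "s2", "region": "r1", "window": 0,
     "state": (8, 1), "action": 2, "next_state": (7, 2)},
    {"store_id": "s1", "region": "r1", "window": 1,
     "state": (2, 6), "action": 1, "next_state": (4, 4)},
]
rewards = {("r1", 0): 0.7}  # window 1 has no outcome yet

for t in attach_regional_rewards(logs, rewards):
    print(t.store_id, t.action, t.reward)
```

Stores s1 and s2 took different actions but receive the same reward of 0.7, so the value function cannot tell from this window alone which store's choice drove the regional outcome.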
