Semi-Markov Reinforcement Learning for City-Scale E...

Key Takeaways

  • The paper introduces a framework for managing large-scale electric vehicle (EV) ride-hailing fleets, jointly optimizing dispatch, repositioning, and charging under charger and power-feeder limits.
  • To guarantee physical feasibility during both training and deployment, a masked, temperature-annealed actor produces high-level intentions, which a time-limited rolling mixed-integer linear program (MILP) projects at every decision step onto actions that strictly satisfy state-of-charge, port, and feeder constraints.
  • To mitigate distributional shift, a Soft Actor-Critic (SAC) agent is trained against a Wasserstein-1 ambiguity set whose graph-aligned Mahalanobis ground metric captures spatial correlations in demand.
  • The robust backup combines the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update.
  • On a large-scale simulator built from NYC taxi data, the resulting PD-RSAC agent reaches $1.22M in net profit versus $0.58M-$0.70M for heuristic, single-agent RL, and multi-agent RL baselines, while incurring zero feeder-limit violations.
Paper Abstract

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions (discrete actions for serving, repositioning, and charging, together with continuous charging power) and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M-$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.

This paper introduces a new framework for managing large-scale electric vehicle (EV) ride-hailing fleets. The goal is to optimize complex operational decisions—such as dispatching vehicles to passengers, moving idle cars to high-demand areas, and managing charging schedules—while strictly adhering to physical constraints like battery limits and power grid capacities. The authors propose a system that combines advanced machine learning with mathematical optimization to ensure that fleet operations remain profitable and safe, even when faced with unpredictable changes in city traffic and demand.

Balancing Flexibility and Safety

A major challenge in fleet management is that standard AI models often struggle to guarantee safety; they might suggest an action that is profitable but physically impossible, such as charging too many vehicles at once and overloading the local power grid. To solve this, the researchers use a "two-layer" approach. First, an AI agent learns to produce "intentions" or high-level strategies. These intentions are then passed through a mathematical filter called a rolling Mixed-Integer Linear Program (MILP). This filter acts as a safety guard, adjusting the AI’s suggestions in real-time to ensure they strictly obey all power and battery constraints before any action is actually taken.
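To make the safety layer concrete, here is a minimal sketch of the projection step in Python using PuLP, assuming a deliberately simplified setting: a single feeder, per-vehicle charging intentions expressed as power levels, and no dispatch or repositioning variables. The function name, variables, and greedy fallback are illustrative, not the paper's formulation, whose rolling MILP covers the full mixed action space.

```python
# pip install pulp
import pulp

def project_charging(p_desired, p_max, n_ports, feeder_cap, time_limit_s=1.0):
    """Toy version of a rolling-MILP safety layer (assumed, not the paper's):
    keep actual charging powers as close as possible (L1 distance) to the
    actor's intentions while respecting a port count and one feeder limit."""
    n = len(p_desired)
    prob = pulp.LpProblem("charging_projection", pulp.LpMinimize)

    z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(n)]  # port assigned?
    p = [pulp.LpVariable(f"p_{i}", lowBound=0) for i in range(n)]    # actual power
    d = [pulp.LpVariable(f"d_{i}", lowBound=0) for i in range(n)]    # |p - desired|

    for i in range(n):
        prob += p[i] <= p_max[i] * z[i]        # power only if a port is assigned
        prob += d[i] >= p[i] - p_desired[i]    # linearized absolute deviation
        prob += d[i] >= p_desired[i] - p[i]

    prob += pulp.lpSum(z) <= n_ports           # port (plug) limit
    prob += pulp.lpSum(p) <= feeder_cap        # feeder power limit
    prob += pulp.lpSum(d)                      # objective: stay close to intentions

    status = prob.solve(pulp.PULP_CBC_CMD(msg=False, timeLimit=time_limit_s))
    if pulp.LpStatus[status] != "Optimal":
        # Greedy fallback: serve the largest intentions first within both limits,
        # so a valid, safe action is always returned even on solver timeout.
        out, budget, ports = [0.0] * n, feeder_cap, n_ports
        for i in sorted(range(n), key=lambda i: -p_desired[i]):
            if ports == 0 or budget <= 0:
                break
            out[i] = min(p_desired[i], p_max[i], budget)
            budget -= out[i]
            ports -= 1
        return out
    return [pulp.value(pi) for pi in p]

# Example: four vehicles, two ports, 50 kW of feeder headroom
print(project_charging([30, 40, 20, 10], [50, 50, 50, 50], n_ports=2, feeder_cap=50))
```

The `timeLimit` argument mirrors the paper's time-limited solve, and the greedy branch corresponds to the fallback procedure discussed under Key Considerations below.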

Handling Uncertainty with Robust AI

Transportation systems are inherently volatile, with demand and travel times shifting constantly. To prevent the AI from becoming "brittle" or failing when real-world conditions differ from training data, the authors use a technique called Distributionally Robust Optimization. They define a "Wasserstein ambiguity set," which essentially creates a safety buffer around the training data. By using a specialized graph-based metric, the model accounts for the city’s spatial layout—recognizing, for example, that a surge in demand in one neighborhood is likely to affect its immediate neighbors. This makes the fleet controller much more resilient to unexpected fluctuations in city-wide activity.
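As an illustration of these ideas, the sketch below estimates a worst-case expected value over a Wasserstein-1 ball with discrete support, using the Kantorovich-Rubinstein dual and projected subgradient ascent on the dual multiplier, with a graph-aligned Mahalanobis ground metric built from a hex-grid Laplacian. The metric construction (M = I + alpha * L) and the restriction of the adversary to observed samples are simplifying assumptions; the paper embeds this machinery inside the SAC critic update rather than as a standalone routine.

```python
import numpy as np

def mahalanobis_metric(adjacency, alpha=1.0):
    """Illustrative graph-aligned ground metric: M = I + alpha * Laplacian.
    Moving probability mass between spatially adjacent zones costs less,
    so the adversary perturbs neighboring areas together. (Assumed form;
    the paper's exact construction of M may differ.)"""
    L = np.diag(adjacency.sum(axis=1)) - adjacency
    return np.eye(len(adjacency)) + alpha * L

def worst_case_value(values, features, M, radius, steps=200, lr=0.05):
    """inf over Q with W1(Q, P_hat) <= radius of E_Q[V], via the KR dual:
        max_{lam >= 0}  -lam*radius + mean_i min_j (V_j + lam * d_ij),
    assuming Q is supported on the observed samples."""
    diff = features[:, None, :] - features[None, :, :]
    d = np.sqrt(np.einsum("ijk,kl,ijl->ij", diff, M, diff))  # pairwise Mahalanobis
    lam = 1.0
    for _ in range(steps):
        j_star = np.argmin(values[None, :] + lam * d, axis=1)
        grad = -radius + d[np.arange(len(values)), j_star].mean()
        lam = max(0.0, lam + lr * grad)  # projected subgradient ascent on lam
    j_star = np.argmin(values[None, :] + lam * d, axis=1)
    return -lam * radius + (values[j_star]
                            + lam * d[np.arange(len(values)), j_star]).mean()

# Three zones on a line; hypothetical per-sample values and one-hot zone features
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
M = mahalanobis_metric(adj, alpha=0.5)
V, X = np.array([1.0, 0.4, 0.9]), np.eye(3)
print(worst_case_value(V, X, M, radius=0.2))  # pessimistic value <= V.mean()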

Performance and Real-World Impact

The researchers tested their framework using a simulator built on real-world NYC taxi data. The results showed that their proposed method, known as PD-RSAC, significantly outperformed existing approaches. While traditional heuristics and standard reinforcement learning models achieved net profits between $0.58M and $0.70M, the PD-RSAC framework reached $1.22M. Crucially, the system maintained zero violations of power grid limits, demonstrating that it is possible to achieve high economic efficiency without compromising the stability of the charging infrastructure.

Key Considerations

The framework is designed as a semi-Markov decision process, which is particularly well-suited for this problem because tasks like driving a passenger or charging a battery take different amounts of time. By accounting for these variable durations, the model makes better long-term decisions about when to prioritize charging versus serving a ride. While the system is highly effective, it relies on the ability to solve the MILP projection within a strict time limit. To ensure the system never stalls, the authors included a "greedy fallback" procedure that guarantees a valid, safe action is always produced, even if the primary solver cannot find an optimal solution in time.
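For readers implementing something similar, the duration-aware part of the value backup looks roughly like the following. This is the standard semi-MDP target; the function name and the per-step reward accumulation are assumptions for illustration, not the paper's exact update.

```python
import numpy as np

def semi_mdp_target(rewards, gamma, v_next, done):
    """Target for an action that lasted tau = len(rewards) steps.
    Rewards earned during the action are discounted within it, and the
    bootstrap value is discounted by gamma**tau, so long actions (e.g.,
    a full charge) trade off correctly against short ones."""
    tau = len(rewards)
    disc = gamma ** np.arange(tau)
    return float(disc @ np.asarray(rewards)) + (0.0 if done else gamma**tau * v_next)

# A 3-step ride earning 2.0 per step, then bootstrapping from the next state
print(semi_mdp_target([2.0, 2.0, 2.0], gamma=0.99, v_next=50.0, done=False))
```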
