PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
Large language model (LLM) inference is a major driver of energy consumption in modern data centers. While existing systems focus on throughput and latency, they typically treat GPU power as a fixed constraint rather than a flexible resource. PALS (Power-Aware LLM Serving) is a new runtime system that treats GPU power caps as a primary control knob. By jointly optimizing hardware power limits with software parameters like batch size, PALS enables data centers to balance performance targets with energy efficiency in real time.
Coordinating Hardware and Software
The core idea behind PALS is that hardware power settings and software scheduling parameters interact and should be tuned together. Current systems often manage them independently, which leads to inefficiency. For example, compute-bound models benefit from higher power caps because the higher clock speeds translate into more throughput. Communication-bound models, such as Mixture-of-Experts (MoE) architectures, often see diminishing returns or even efficiency losses when power is pushed too high, because the extra energy is spent on communication overhead rather than computation. PALS coordinates these knobs to navigate the trade-off and keep the system near its most efficient operating point for a given workload.
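The trade-off above can be made concrete with a small sketch. It is not the PALS model: the latency formula, the assumption that a GPU draws exactly its power cap, and every constant (MAX_CAP_W, BASE_STEP_MS, BATCH_COST, the comm_fraction values) are invented for illustration. It searches power cap and batch size jointly for the most energy-efficient configuration that still meets a latency target.

```python
from itertools import product

# All constants below are illustrative assumptions, not values from PALS.
MAX_CAP_W = 400.0     # assumed maximum GPU power cap
BASE_STEP_MS = 20.0   # assumed base decode step time at full power
BATCH_COST = 0.05     # assumed per-sequence growth in step time


def step_latency_ms(cap_w, batch, comm_fraction):
    """Toy step latency: only the compute share speeds up with the cap."""
    # comm_fraction is the share of a step spent on communication at full power.
    slowdown = (1.0 - comm_fraction) * MAX_CAP_W / cap_w + comm_fraction
    return BASE_STEP_MS * (1.0 + BATCH_COST * batch) * slowdown


def tokens_per_joule(cap_w, batch, comm_fraction):
    """Energy efficiency, assuming the GPU draws exactly its cap."""
    tokens_per_s = batch * 1000.0 / step_latency_ms(cap_w, batch, comm_fraction)
    return tokens_per_s / cap_w


def best_config(comm_fraction, slo_ms, caps_w, batches):
    """Return the (cap, batch) pair with the best tokens/J that meets the SLO."""
    feasible = [
        (cap, b)
        for cap, b in product(caps_w, batches)
        if step_latency_ms(cap, b, comm_fraction) <= slo_ms
    ]
    if not feasible:
        return None  # no configuration satisfies the latency target
    return max(feasible, key=lambda cfg: tokens_per_joule(cfg[0], cfg[1], comm_fraction))


if __name__ == "__main__":
    caps = [200, 250, 300, 350, 400]
    batches = [8, 16, 32, 64]
    for name, frac in [("compute-heavy", 0.1), ("communication-heavy", 0.6)]:
        print(name, best_config(frac, slo_ms=60.0, caps_w=caps, batches=batches))
```

With these made-up parameters the search settles on 350 W for the compute-heavy workload and 300 W for the communication-heavy one, both at batch size 32. The point is only that the best cap depends on the workload and on the batch size chosen alongside it, so tuning either knob in isolation can miss the efficient point.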
How PALS Works
PALS functions as a closed-loop control system that integrates directly into existing frameworks like vLLM without requiring model retraining or API changes. The system operates through three main layers:
Telemetry: It continuously monitors real-time GPU power, throughput, and utilization.
Control: It uses lightweight power-performance models, built offline, to predict the best configuration. A feedback loop then adjusts these decisions every 500 ms to account for workload fluctuations or changing power budgets.
Actuation: It dynamically enforces the chosen power caps and batch sizes, allowing the system to adapt to external signals like real-time electricity pricing or facility-level power limits.
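The three layers above can be sketched as a small control loop. This is a minimal simulation under stated assumptions, not the PALS implementation: SimulatedGpu, its latency formula, the decide rule, the use of latency as the QoS signal, and all constants are hypothetical. On real hardware the cap would be set through a driver interface such as NVML's nvmlDeviceSetPowerManagementLimit (which takes milliwatts), and the batch limit through whatever hook the serving engine exposes.

```python
import time
from dataclasses import dataclass

CONTROL_PERIOD_S = 0.5            # the 500 ms adjustment interval
MIN_CAP_W, MAX_CAP_W = 200, 400   # assumed cap range
CAP_STEP_W = 25                   # assumed cap increment
MIN_BATCH, MAX_BATCH = 4, 64      # assumed batch-size range


@dataclass
class Telemetry:
    power_w: float
    tokens_per_s: float
    latency_ms: float  # QoS signal; including it here is an assumption


class SimulatedGpu:
    """Stand-in for a GPU plus serving engine; the formulas are invented."""

    def __init__(self, cap_w, batch):
        self.cap_w, self.batch = cap_w, batch

    def apply(self, cap_w, batch):
        # Actuation layer: enforce the chosen power cap and batch size.
        self.cap_w, self.batch = cap_w, batch

    def read(self):
        # Telemetry layer: toy model in which latency grows with batch size
        # and shrinks with the cap, and the GPU draws exactly its cap.
        latency = 20.0 * (1.0 + 0.05 * self.batch) * (MAX_CAP_W / self.cap_w)
        return Telemetry(float(self.cap_w), self.batch * 1000.0 / latency, latency)


def decide(cap_w, batch, t, budget_w, slo_ms):
    """Control layer: one feedback correction to the current configuration."""
    ceiling = max(MIN_CAP_W, min(MAX_CAP_W, budget_w))
    if cap_w > ceiling:                      # budget dropped: honor it first
        return ceiling, batch
    if t.latency_ms > slo_ms:                # QoS at risk
        if cap_w + CAP_STEP_W <= ceiling:    # spend spare power if allowed
            return cap_w + CAP_STEP_W, batch
        return cap_w, max(MIN_BATCH, batch // 2)  # otherwise shrink the batch
    if t.latency_ms < 0.7 * slo_ms and batch < MAX_BATCH:
        return cap_w, min(MAX_BATCH, batch + 4)   # headroom: grow the batch
    return cap_w, batch


def run(gpu, budgets_w, slo_ms, sleep=time.sleep):
    """Run one control tick per entry in budgets_w and return the decisions."""
    history = []
    for budget in budgets_w:
        cap, batch = decide(gpu.cap_w, gpu.batch, gpu.read(), budget, slo_ms)
        gpu.apply(cap, batch)
        history.append((cap, batch))
        sleep(CONTROL_PERIOD_S)
    return history


if __name__ == "__main__":
    # The power budget drops from 400 W to 300 W partway through the run.
    gpu = SimulatedGpu(cap_w=400, batch=32)
    print(run(gpu, [400] * 4 + [300] * 8, slo_ms=60.0, sleep=lambda s: None))
```

In this sketch the starting configuration plays the role of the offline prediction, and decide is the feedback correction. When the budget drops, the loop lowers the cap immediately and then halves the batch size to bring latency back under the target, which is the kind of coordinated response the text describes.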
Key Performance Results
By treating power as a first-class control primitive, PALS expands the range of achievable performance, reaching operating points that were previously inaccessible. Across various multi-GPU systems and model types, the system achieves:
Up to 26.3% improvement in energy efficiency.
A 4x to 7x reduction in Quality-of-Service (QoS) violations when operating under strict power constraints.
Effective tracking of dynamic power budgets, making it suitable for grid-interactive and carbon-aware computing environments.
Considerations for Deployment
The effectiveness of PALS is highly dependent on the specific characteristics of the model being served. The research highlights that compute-heavy models and communication-heavy models respond differently to power scaling. Because MoE models can become communication-bound as they scale across multiple nodes, PALS is designed to account for these bottlenecks. By providing a plug-and-play solution that does not require modifications to model architecture, PALS offers a practical path for operators to improve the energy proportionality of their AI infrastructure.