PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Key Takeaways

  • Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption.
  • While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource.
  • PALS is a power-aware runtime for LLM serving that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size.
  • The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency.
Paper Abstract

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
Large language model (LLM) inference is a major driver of energy consumption in modern data centers. While existing systems focus on throughput and latency, they typically treat GPU power as a fixed constraint rather than a flexible resource. PALS (Power-Aware LLM Serving) is a new runtime system that treats GPU power caps as a primary control knob. By jointly optimizing hardware power limits with software parameters like batch size, PALS enables data centers to balance performance targets with energy efficiency in real time.

Coordinating Hardware and Software

The core idea behind PALS is that hardware power settings and software scheduling parameters are tightly coupled. Current systems often manage them independently, which leads to inefficiency. For example, compute-bound models benefit from higher power caps, which allow higher clock speeds, while communication-bound models, such as Mixture-of-Experts (MoE) architectures, often see diminishing returns or efficiency losses when power is pushed too high, as the extra energy goes to communication overhead rather than computation. PALS coordinates these knobs so that the system runs at the most efficient operating point that still meets its performance target for a given workload.
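The joint choice of a power cap and a batch size can be illustrated with a small lookup over profiled operating points. This is a minimal sketch, not the PALS implementation: the `OperatingPoint` type, the `select_config` function, and every number in `PROFILE` are hypothetical, and energy efficiency is assumed here to mean tokens per joule.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OperatingPoint:
    power_cap_w: int        # GPU power cap in watts (hardware knob)
    batch_size: int         # maximum batch size (software knob)
    throughput_tps: float   # profiled throughput, tokens per second
    power_w: float          # profiled average power draw, watts

    @property
    def tokens_per_joule(self) -> float:
        # (tokens/s) / (J/s) = tokens/J
        return self.throughput_tps / self.power_w


def select_config(points: Iterable[OperatingPoint],
                  target_tps: float) -> Optional[OperatingPoint]:
    """Return the most energy-efficient point that meets the throughput target."""
    feasible = [p for p in points if p.throughput_tps >= target_tps]
    if not feasible:
        return None  # no profiled configuration meets the target
    return max(feasible, key=lambda p: p.tokens_per_joule)


# Made-up profile for illustration only; these are not measurements from the paper.
PROFILE = [
    OperatingPoint(200, 16, 900.0, 190.0),
    OperatingPoint(200, 32, 1300.0, 198.0),
    OperatingPoint(300, 16, 1100.0, 270.0),
    OperatingPoint(300, 32, 1700.0, 290.0),
    OperatingPoint(400, 32, 1800.0, 380.0),
]

best = select_config(PROFILE, target_tps=1500.0)
```

In this invented profile, the 400 W point has the highest throughput but the 300 W point meets the target with more tokens per joule, which is the kind of trade-off the paragraph above describes.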

How PALS Works

PALS functions as a closed-loop control system that integrates directly into existing frameworks like vLLM without requiring model retraining or API changes. The system operates through three main layers:

  • Telemetry: It continuously monitors real-time GPU power, throughput, and utilization.

  • Control: It uses lightweight offline power-performance models to predict the most energy-efficient configuration that meets the throughput target. A feedback loop then adjusts these decisions every 500 ms to account for workload fluctuations or changing power budgets.

  • Actuation: It dynamically enforces the chosen power caps and batch sizes, allowing the system to adapt to external signals like real-time electricity pricing or facility-level power limits.
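The three layers above can be sketched as a simple feedback loop. The controller below is a hypothetical step-based rule chosen for illustration, not the control law used by PALS, and `simulated_gpu` is a toy stand-in for real telemetry so the example runs without a GPU. All class names, parameters, and constants are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Telemetry:
    power_w: float         # measured GPU power draw, watts
    throughput_tps: float  # measured throughput, tokens per second


class PowerCapController:
    """Hypothetical controller: nudges the power cap toward a throughput target."""

    def __init__(self, cap_w: float, min_cap_w: float, max_cap_w: float,
                 step_w: float = 10.0, slack: float = 0.05):
        self.cap_w = cap_w
        self.min_cap_w = min_cap_w
        self.max_cap_w = max_cap_w
        self.step_w = step_w
        self.slack = slack  # tolerated headroom above target before lowering the cap

    def step(self, sample: Telemetry, target_tps: float, budget_w: float) -> float:
        ceiling = min(self.max_cap_w, budget_w)  # never exceed the external budget
        if sample.throughput_tps < target_tps:
            self.cap_w += self.step_w            # below target: allow more power
        elif sample.throughput_tps > target_tps * (1 + self.slack):
            self.cap_w -= self.step_w            # comfortably above target: save energy
        self.cap_w = max(self.min_cap_w, min(self.cap_w, ceiling))
        return self.cap_w


def simulated_gpu(cap_w: float) -> Telemetry:
    """Toy plant: throughput grows linearly with the cap (an assumption)."""
    return Telemetry(power_w=0.95 * cap_w, throughput_tps=5.0 * cap_w)


def run(controller: PowerCapController, target_tps: float,
        budget_w: float, ticks: int) -> List[float]:
    history = []
    for _ in range(ticks):
        sample = simulated_gpu(controller.cap_w)             # telemetry
        cap = controller.step(sample, target_tps, budget_w)  # control
        history.append(cap)  # actuation: a real system would apply `cap` here
        # a real loop would wait for the 0.5 s control interval before the next tick
    return history


caps = run(PowerCapController(200.0, 150.0, 400.0), target_tps=1500.0,
           budget_w=400.0, ticks=20)
```

With the toy plant, the cap climbs from 200 W until the target is met and then holds. Passing a lower `budget_w` makes the controller clamp at the budget even if the target is missed, which mirrors the idea of tracking an external power limit. On real hardware the actuation step would go through the GPU vendor's power-management interface; how PALS does this is not described in this summary.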

Key Performance Results

By treating power as a first-class control primitive, PALS expands the range of achievable performance, reaching operating points that were previously inaccessible. Across various multi-GPU systems and model types, the system achieves:

  • Up to 26.3% improvement in energy efficiency.

  • A 4x to 7x reduction in Quality-of-Service (QoS) violations when operating under strict power constraints.

  • The ability to effectively track and adhere to dynamic power budgets, making it suitable for grid-interactive and carbon-aware computing environments.
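To put the efficiency figure in context, the short calculation below converts a gain in energy efficiency into the implied drop in energy per token. It assumes that energy efficiency means tokens per joule (throughput divided by power), a definition this summary does not state explicitly. The function names are illustrative.

```python
def tokens_per_joule(throughput_tps: float, power_w: float) -> float:
    """Energy efficiency as tokens per joule: (tokens/s) / (J/s)."""
    return throughput_tps / power_w


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional gain of `improved` over `baseline`."""
    return improved / baseline - 1.0


def energy_per_token_reduction(efficiency_gain: float) -> float:
    """Fractional drop in joules per token implied by a gain in tokens per joule."""
    return 1.0 - 1.0 / (1.0 + efficiency_gain)


# The reported upper bound of 26.3% more tokens per joule.
reduction = energy_per_token_reduction(0.263)
print(f"{reduction:.1%}")  # prints 20.8%
```

Under that assumption, a 26.3% efficiency gain corresponds to roughly 20.8% less energy per generated token, since energy per token is the reciprocal of tokens per joule.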

Considerations for Deployment

The effectiveness of PALS depends on the characteristics of the model being served. The research highlights that compute-heavy and communication-heavy models respond differently to power scaling. Because MoE models can become communication-bound as they scale across multiple nodes, PALS is designed to account for these bottlenecks. Because it integrates into an existing serving framework without model retraining or API changes, PALS offers a practical path for operators to improve the energy proportionality of their AI infrastructure.
