Google Gemma 4 Gains MTP Drafters for 3x Faster Inference

Key Takeaways

  • Mitigates the memory-bandwidth bottleneck of autoregressive decoding, enabling up to 3x faster inference for production LLM applications.
  • Maintains 100% output quality and reasoning accuracy by using a lossless speculative decoding verification process.
  • Optimizes performance for edge devices and mobile hardware through shared KV caching and efficient logit clustering.

Google has officially released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family, a significant advancement designed to address the persistent latency challenges inherent in large language model (LLM) deployment. By utilizing a specialized speculative decoding architecture, these new drafters enable up to 3x faster inference speeds while maintaining the full reasoning accuracy and output quality of the target models.

Overcoming the Memory-Bandwidth Bottleneck

The primary obstacle to LLM performance is the autoregressive nature of current models, which generate tokens sequentially. Because every single token requires loading billions of parameters from video RAM (VRAM) into compute units, the process is constrained by memory bandwidth rather than raw processing power. This creates a bottleneck where compute resources remain underutilized while the system is occupied with data transfer, regardless of the hardware's capabilities.
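A back-of-the-envelope calculation makes the bottleneck concrete: if every token requires streaming all the weights from VRAM once, memory bandwidth alone caps single-sequence decode speed. The figures below are illustrative assumptions, not official Gemma 4 or hardware specifications.

```python
# Rough upper bound on autoregressive decode speed when every token
# must load all weights from VRAM. Illustrative numbers only.

PARAMS = 26e9          # assumed parameter count (e.g., a 26B model)
BYTES_PER_PARAM = 2    # bf16/fp16 weights
BANDWIDTH = 2.0e12     # ~2 TB/s VRAM bandwidth (A100-class, approximate)

bytes_per_token = PARAMS * BYTES_PER_PARAM          # weights read per token
max_tokens_per_sec = BANDWIDTH / bytes_per_token    # bandwidth-bound ceiling
print(f"Upper bound: ~{max_tokens_per_sec:.1f} tokens/s per sequence")
```

Even in this idealized case the GPU's arithmetic units sit mostly idle: the ceiling is set entirely by how fast weights can be moved, not by compute throughput.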

The Mechanics of Speculative Decoding

Gemma 4’s MTP drafters resolve this inefficiency by decoupling token generation from verification. The system pairs a lightweight drafter model with a larger target model. The drafter proposes a sequence of future tokens in rapid succession, which the target model then verifies in parallel during a single forward pass. If the target model accepts the draft, the application outputs the entire sequence plus one additional token in the time it would normally take to generate a single token. Because the target model performs the final verification, the output remains identical to a standard autoregressive generation, ensuring a lossless speedup.

Architectural Enhancements for Efficiency

Google has optimized the Gemma 4 MTP drafters to share the target model’s activations and key-value (KV) cache, sparing the drafter the redundant recomputation of context the larger model has already processed. Furthermore, for the E2B and E4B edge models, Google implemented an efficient clustering technique in the embedder layer to accelerate final logit calculations, specifically targeting performance bottlenecks on mobile and edge hardware.
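The benefit of cache sharing can be illustrated with a toy model. A real KV cache holds per-layer attention key/value tensors; this conceptual sketch (all names are illustrative, not Gemma 4 APIs) just shows the drafter reading entries the target has already populated instead of recomputing them.

```python
# Conceptual sketch of KV-cache sharing between target and drafter.
# The cache counts how many positions had to be computed from scratch.

class SharedKVCache:
    def __init__(self):
        self.entries = {}    # position -> cached key/value "state"
        self.recomputed = 0  # positions computed rather than reused

    def get(self, pos, compute_fn):
        if pos not in self.entries:
            self.entries[pos] = compute_fn(pos)
            self.recomputed += 1
        return self.entries[pos]

cache = SharedKVCache()

# Target model processes an 8-token prompt, populating the cache.
for pos in range(8):
    cache.get(pos, compute_fn=lambda p: f"kv@{p}")

# Drafter walks the same context through the shared cache: every
# lookup hits, so it pays zero recomputation over the prompt.
before = cache.recomputed
for pos in range(8):
    cache.get(pos, compute_fn=lambda p: f"kv@{p}")
assert cache.recomputed == before
```

Without sharing, the drafter would re-encode the entire context itself before proposing its first token, eroding much of the latency advantage of running a small model.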

Performance and Availability

Performance gains vary based on hardware and configuration. For the Gemma 4 26B mixture-of-experts (MoE) model, increasing the batch size to between 4 and 8 on Apple Silicon or NVIDIA A100 hardware can unlock up to a 2.2x speedup. The MTP drafters are currently available under the Apache 2.0 license, with model weights accessible on Hugging Face and Kaggle.
