Rethinking Shrinkage Bias in LLM FP4 Pretraining: G...

Paper Abstract

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Training Large Language Models (LLMs) at 4-bit precision offers significant savings in memory and computation, but current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, are centered on a format called E2M1. This paper identifies a fundamental limitation of E2M1: it suffers from "Shrinkage Bias," a systematic negative rounding error caused by the uneven spacing of its representable values. The authors propose a more stable training recipe called UFP4, which uses a uniform 4-bit grid (E1M2/INT4) to avoid this bias and improve quantization quality.

The Problem: Shrinkage Bias

The researchers show that E2M1, a non-uniform floating-point format, has asymmetric "rounding bins." When a value is rounded to the nearest representable number, these asymmetries consistently push values toward zero, producing a systematic negative rounding error. This is not just a minor rounding error; it acts as a systematic signal decay that accumulates multiplicatively across the layers of a deep neural network. As a result, the model's signal fades as it passes through the layers, which the authors offer as a unified explanation for the instability observed in existing E2M1-based FP4 recipes.
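
The geometry behind this can be sketched in a few lines. The example below is an illustration only, not the authors' code: it computes, for each interior grid level, how far its round-to-nearest bin extends below and above that level. The E2M1 magnitudes are the standard ones for the format; the 8-level uniform grid standing in for E1M2/INT4, the helper names `bin_half_widths` and `round_to_nearest`, and the tie-breaking rule are assumptions made for this sketch.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (sign bit handled separately).
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Hypothetical uniform 8-level grid used here as a stand-in for E1M2/INT4.
UNIFORM_LEVELS = np.arange(8, dtype=np.float64)


def bin_half_widths(levels):
    """Return (lower, upper) distances from each interior level to the
    edges of its round-to-nearest bin. Edges are midpoints between levels."""
    edges = (levels[:-1] + levels[1:]) / 2.0
    lower = levels[1:-1] - edges[:-1]
    upper = edges[1:] - levels[1:-1]
    return lower, upper


def round_to_nearest(x, levels):
    """Round magnitudes to the nearest level and restore the sign.
    Ties go to the smaller level (an assumption; hardware may differ)."""
    x = np.asarray(x, dtype=np.float64)
    idx = np.argmin(np.abs(np.abs(x)[..., None] - levels), axis=-1)
    return np.sign(x) * levels[idx]


lower, upper = bin_half_widths(E2M1_LEVELS)
# Interior levels 0.5, 1, 1.5, 2, 3, 4: upper - lower = [0, 0, 0, 0.25, 0, 0.5]
print("E2M1 asymmetry:   ", upper - lower)

lower_u, upper_u = bin_half_widths(UNIFORM_LEVELS)
# Every interior bin of the uniform grid is symmetric: all zeros.
print("uniform asymmetry:", upper_u - lower_u)
```

In this sketch the E2M1 bins around 2 and 4 reach twice as far upward as downward, so they collect more values from above (rounded down in magnitude) than from below, while every interior bin of the uniform grid is symmetric. As a reading of the multiplicative accumulation: if each layer scaled the signal by a factor (1 - ε), then L layers would scale it by (1 - ε)^L.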

The Role of Random Hadamard Transforms

To handle outliers in the tensors being quantized, many low-precision training recipes apply the Random Hadamard Transform (RHT), an orthogonal rotation that spreads extreme values across many elements so that more of the quantization grid is used. The study reveals a hidden pitfall: although RHT is intended to help, it amplifies the Shrinkage Bias when used with E2M1. By shifting values into the most asymmetric parts of the E2M1 grid, RHT inadvertently increases the systematic rounding error and further destabilizes training.
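
For readers unfamiliar with RHT, here is a minimal sketch of the transform itself, assuming the common construction of random sign flips followed by a normalized Sylvester Hadamard matrix. The function names, the vector length of 16, and the single-outlier input are illustrative choices, not details taken from the paper.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def random_hadamard_transform(x, rng):
    """Flip signs at random, then rotate with the Hadamard matrix.
    Returns the rotated vector and the signs needed to invert it."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    return (x * signs) @ hadamard(n), signs


def inverse_rht(y, signs):
    """Undo the rotation (H is orthogonal), then undo the sign flips."""
    return (y @ hadamard(y.shape[-1]).T) * signs


rng = np.random.default_rng(0)
x = np.zeros(16)
x[3] = 8.0  # a single large outlier

y, signs = random_hadamard_transform(x, rng)
print(np.abs(y))  # every entry has magnitude 8 / sqrt(16) = 2
print(np.allclose(inverse_rht(y, signs), x))  # True: the transform is invertible
```

Because the rotation is orthogonal, it preserves the vector norm and cancels out inside a matrix product when both operands are rotated along the contraction axis. The outlier of magnitude 8 becomes sixteen entries of magnitude 2, which changes where values land on the quantization grid.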

The UFP4 Solution

The authors introduce UFP4, a 4-bit training recipe that replaces the non-uniform E2M1 grid with a uniform grid (E1M2/INT4). Because uniform grids have evenly spaced levels and therefore symmetric rounding bins, they bypass the grid-geometry error behind Shrinkage Bias. This allows the researchers to apply RHT to all three training GEMMs (forward, data-gradient, and weight-gradient) without the negative side effects seen with E2M1, and to convert the improved bucket utilization from RHT into higher quantization quality. Stochastic rounding is restricted to the upstream gradient (dY) alone, while the other operands use round-to-nearest.
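
The structure of such a recipe can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses simulated ("fake") quantization in floating point, a symmetric integer range of [-7, 7], one absmax scale per tensor rather than per-block scales, and a full-size rotation along each contraction axis (so every dimension must be a power of two). The names `quantize_uniform4`, `rht_matrix`, and `ufp4_linear` are hypothetical.

```python
import numpy as np

QMAX = 7  # assumed symmetric signed 4-bit range [-7, 7]


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def rht_matrix(n, rng):
    """Random sign flips followed by a Hadamard rotation (orthogonal)."""
    return rng.choice([-1.0, 1.0], size=n)[:, None] * hadamard(n)


def quantize_uniform4(x, rng=None):
    """Simulated uniform 4-bit quantization with a per-tensor absmax scale.
    rng=None uses round-to-nearest; passing an rng uses stochastic rounding."""
    amax = np.max(np.abs(x))
    scale = amax / QMAX if amax > 0 else 1.0
    v = x / scale
    if rng is None:
        q = np.rint(v)
    else:
        lo = np.floor(v)
        # Round up with probability equal to the fractional part.
        q = lo + (rng.random(v.shape) < (v - lo))
    return np.clip(q, -QMAX, QMAX) * scale


def ufp4_linear(X, W, dY, rng, quantize=quantize_uniform4):
    """Three GEMMs of a linear layer, each rotated along its contraction axis.
    Only dY is quantized with stochastic rounding (rng passed to quantize)."""
    b, d = X.shape
    o = W.shape[0]
    Rd, Ro, Rb = rht_matrix(d, rng), rht_matrix(o, rng), rht_matrix(b, rng)
    Y = quantize(X @ Rd) @ quantize(W @ Rd).T              # forward: X W^T
    dX = quantize(dY @ Ro, rng) @ quantize(Ro.T @ W)       # data gradient: dY W
    dW = quantize(Rb.T @ dY, rng).T @ quantize(Rb.T @ X)   # weight gradient: dY^T X
    return Y, dX, dW


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
W = rng.standard_normal((4, 16))
dY = rng.standard_normal((8, 4))
Y, dX, dW = ufp4_linear(X, W, dY, rng)
print(Y.shape, dX.shape, dW.shape)  # (8, 4) (8, 16) (4, 16)
```

With quantization disabled, the rotations cancel and the three products equal the exact ones, which is a useful sanity check for any variant of this sketch. Stochastic rounding is unbiased in expectation but adds variance, which is one plausible motivation for confining it to dY.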

Results and Future Outlook

The UFP4 recipe was evaluated on long-run pretraining of a 1.5B dense model and two MoE (Mixture-of-Experts) models with 7.9B and 124B parameters. In these experiments, UFP4 consistently achieved lower loss degradation relative to BF16 than strong E2M1-based baselines, a result the authors support with scaling-law analysis and ablation studies. They conclude that future AI hardware accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside the existing E2M1 format.