Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Training Large Language Models (LLMs) at 4-bit precision offers significant savings in memory and computation, but current industry standards, such as those used in NVIDIA Blackwell and AMD MI350 systems, rely on a format called E2M1 (2 exponent bits, 1 mantissa bit). This paper identifies a fundamental flaw in E2M1: it suffers from "Shrinkage Bias," a systematic rounding error caused by the uneven spacing of its representable values. The authors propose a new, more stable training recipe called UFP4, which uses a uniform 4-bit grid to eliminate this bias and improve overall training quality.
The Problem: Shrinkage Bias
The researchers discovered that E2M1, a non-uniform floating-point format, has asymmetric "rounding bins." When a value is rounded to the nearest representable number, these asymmetries consistently push values toward zero. This is not just a minor rounding error; it acts as a systematic signal decay that compounds across every layer of a deep neural network. Because this bias is multiplicative, it causes the model's signal to fade as it passes through the layers, leading to the instability often observed in current FP4 training.
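The bin asymmetry described above can be checked directly from the E2M1 value grid. The following is a minimal sketch, not the paper's analysis: it assumes round-to-nearest and a locally uniform density inside each bin, and the names `interior_bin_offsets` and `compounded_gain`, the uniform comparison grid, and the per-layer factor `eps` are hypothetical choices made for illustration. A positive offset means the bin's center lies above the representable value, so values in that bin are pulled toward zero on average.

```python
import numpy as np

# Non-negative E2M1 magnitudes; the sign bit mirrors them around zero.
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Hypothetical uniform comparison grid with the same number of magnitudes.
UNIFORM_LEVELS = np.arange(8, dtype=np.float64)


def interior_bin_offsets(levels):
    """Return (interior levels, bin center minus level) under round-to-nearest.

    The first and last levels are skipped because their bins are bounded by
    the sign mirror and by clipping rather than by two neighbors.
    """
    edges = (levels[:-1] + levels[1:]) / 2.0   # rounding decision boundaries
    lo, hi = edges[:-1], edges[1:]             # bin limits of interior levels
    centers = (lo + hi) / 2.0
    interior = levels[1:-1]
    return interior, centers - interior


def compounded_gain(eps, depth):
    """Signal gain after `depth` layers if each layer scales by (1 - eps).

    `eps` is an assumed per-layer shrink factor, not a measured value.
    """
    return (1.0 - eps) ** depth


if __name__ == "__main__":
    for name, levels in (("E2M1", E2M1_LEVELS), ("uniform", UNIFORM_LEVELS)):
        interior, offsets = interior_bin_offsets(levels)
        print(name, dict(zip(interior.tolist(), offsets.tolist())))
```

On the E2M1 grid the offsets are nonzero only where the spacing changes (at 2 and at 4), while every interior bin of the uniform grid is centered on its level. `compounded_gain` only illustrates why a small multiplicative shrink matters: the gain decays geometrically with depth.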
The Role of Random Hadamard Transforms
To handle outliers in the tensors being quantized, many systems use a technique called the Random Hadamard Transform (RHT), which spreads out extreme values to make them easier to process. The study reveals a hidden pitfall: while RHT is intended to help, it actually makes the Shrinkage Bias worse when used with E2M1. By shifting data into the most asymmetric parts of the E2M1 grid, RHT inadvertently amplifies the rounding errors, further destabilizing the training process.
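For readers unfamiliar with RHT, here is a minimal NumPy sketch of the transform itself, assuming a Sylvester-construction Hadamard matrix and a random diagonal sign flip. The function names and the dense-matrix implementation are illustrative assumptions; production kernels typically use fast block-wise transforms, and the summary does not specify the exact variant the paper uses. The sketch shows only the outlier-spreading property, not the interaction with E2M1.

```python
import numpy as np


def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix (entries +1/-1); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def random_hadamard_transform(x, rng):
    """Apply random sign flips, then an orthonormal Hadamard rotation, along the last axis."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    return (x * signs) @ sylvester_hadamard(n) / np.sqrt(n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.zeros(16)
    x[0] = 8.0  # a single large outlier
    y = random_hadamard_transform(x, rng)
    print("max |x| =", np.abs(x).max(), " max |y| =", np.abs(y).max())
    print("norms:", np.linalg.norm(x), np.linalg.norm(y))
```

Because the rotation is orthonormal, the vector's norm is preserved while the single outlier is spread evenly across all coordinates, which changes where values fall on the quantization grid after scaling.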
The UFP4 Solution
The authors introduce UFP4, a 4-bit training recipe that replaces the non-uniform E2M1 grid with a uniform grid (E1M2/INT4). Because uniform grids have perfectly symmetric rounding bins, they do not suffer from Shrinkage Bias. This stability allows the researchers to apply RHT across all three major matrix-multiplication paths (forward, data-gradient, and weight-gradient) without the negative side effects seen in previous methods. By restricting stochastic rounding to only the upstream gradient, UFP4 maintains high precision and better preserves the signal throughout the entire training process.
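The two rounding modes mentioned above can be sketched on a uniform integer grid as follows. This is an illustration under stated assumptions, not the UFP4 implementation: the symmetric range of -7 to 7, the `step` scale, and the function names are hypothetical, and the summary does not describe the recipe's block sizes or scaling scheme.

```python
import numpy as np


def quantize_uniform_rtn(x, step=1.0, qmax=7):
    """Round-to-nearest onto the uniform grid {-qmax, ..., qmax} * step."""
    return np.clip(np.round(x / step), -qmax, qmax) * step


def quantize_uniform_sr(x, rng, step=1.0, qmax=7):
    """Stochastic rounding: round up with probability equal to the fractional part."""
    scaled = x / step
    floor = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - floor)
    return np.clip(floor + round_up, -qmax, qmax) * step


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.full(200_000, 0.3)
    print("round-to-nearest mean:", quantize_uniform_rtn(x).mean())
    print("stochastic rounding mean:", quantize_uniform_sr(x, rng).mean())
```

Round-to-nearest is deterministic with an error of at most half a step inside the clipping range, while stochastic rounding is unbiased in expectation at the cost of extra variance. That trade-off is one plausible reading of why a recipe would confine stochastic rounding to a single path, as the summary says UFP4 does for the upstream gradient.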
Results and Future Outlook
The UFP4 recipe was tested on a variety of models, including a 1.5B dense model and MoE (Mixture-of-Experts) models up to 124B parameters. In these long-run pretraining experiments, UFP4 consistently showed less loss degradation than strong E2M1-based baselines. The authors conclude that because uniform grids provide superior quantization quality and stability, future AI hardware accelerators should support E1M2/INT4-style uniform 4-bit grids as a first-class standard alongside the existing E2M1 format.