Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
Even when developers set a Large Language Model (LLM) to a temperature of $T=0$—the standard setting for "greedy" or deterministic output—the model may still produce different responses to the same input. This paper formalizes this persistent, unintended variability by introducing the concept of "background temperature" ($T_{\mathrm{bg}}$). The authors argue that this hidden randomness is not a flaw in the model's logic, but rather a byproduct of the underlying computing environment, such as how hardware processes data, the specific kernels used, and floating-point math operations. By quantifying this effect, the researchers aim to provide a standardized way to measure and eventually mitigate the unpredictability of LLM deployments.
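One of the environmental factors named above, floating-point arithmetic, is easy to see in isolation: addition of floats is not associative, so the order in which a kernel reduces partial sums can change a result in its last bits. A minimal, stack-independent illustration (not tied to any particular LLM infrastructure):

```python
# Floating-point addition is not associative: the reduction order a kernel
# happens to use can change a result in its last bits. In an LLM, a
# last-bit perturbation of a logit can flip the argmax between two
# near-tied tokens, producing a different token even at T=0.
a = (0.1 + 0.2) + 0.3   # one reduction order
b = 0.1 + (0.2 + 0.3)   # a different reduction order

print(a == b)           # False
print(repr(a), repr(b)) # 0.6000000000000001 vs 0.6
```

The same effect appears at scale whenever a parallel reduction (e.g. a matrix-multiply accumulation) combines partial sums in an order that depends on batch size, scheduling, or hardware.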
Defining Background Temperature
The paper models the nondeterminism of an LLM as a stochastic perturbation that occurs during the inference process. Even when a user requests $T=0$, the actual system environment—which includes factors like batch size, hardware precision, and parallel processing order—acts as a hidden source of noise. The authors define $T_{\mathrm{bg}}$ as the "equivalent temperature" of an ideal, environment-free system that would produce the same level of output variability as the real-world system. Essentially, $T_{\mathrm{bg}}$ represents the "floor" of randomness that exists in a deployment stack regardless of the user's settings.
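In symbols, this definition can be sketched as follows; the divergence $D$ and the distribution notation here are shorthand for the idea summarized above, not the paper's exact formalism:

$$
T_{\mathrm{bg}} \;=\; \arg\min_{T \ge 0} \; D\big(P_{\mathrm{sys}}(\cdot \mid x,\, T{=}0) \,\big\|\, P_{\mathrm{ideal}}(\cdot \mid x,\, T)\big),
$$

where $P_{\mathrm{sys}}$ is the real deployment's output distribution for input $x$ at a requested temperature of zero, and $P_{\mathrm{ideal}}$ is the environment-free system's output distribution at temperature $T$.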
A Protocol for Measurement
To estimate $T_{\mathrm{bg}}$, the authors propose a practical empirical protocol. Since a perfectly deterministic "ideal" system is often unavailable, the protocol suggests using a stable, local "anchor" model as a reference. By running the same prompts through both the system under test and the stable reference model at a range of known temperatures, researchers can compare the resulting output distributions. Using a statistical divergence metric, such as the Kolmogorov–Smirnov distance, they can then identify which temperature setting on the reference model most closely matches the variability observed in the target system. This mapping allows them to assign an equivalent temperature value to the target system's hidden noise.
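The matching step can be sketched in code. This is a hypothetical illustration rather than the authors' implementation: it assumes each response has already been reduced to a scalar summary (a score, a length, or similar), and the function names are invented:

```python
import bisect
from typing import Dict, Sequence

def ks_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    def ecdf(sample, t):
        # Fraction of the sample that is <= t.
        return bisect.bisect_right(sample, t) / len(sample)
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in sorted(set(xs) | set(ys)))

def estimate_background_temperature(
        target_samples: Sequence[float],
        reference_samples_by_T: Dict[float, Sequence[float]]) -> float:
    """Return the reference-model temperature whose output distribution
    most closely matches the target system's T=0 outputs."""
    return min(reference_samples_by_T,
               key=lambda T: ks_distance(target_samples, reference_samples_by_T[T]))

# Toy example: scalar summaries of repeated T=0 runs on the target system,
# matched against reference-model runs at a grid of known temperatures.
target = [0.1, 0.12, 0.11, 0.3]
ref = {0.0: [0.1, 0.1, 0.1, 0.1],
       0.2: [0.1, 0.1, 0.13, 0.28],
       0.8: [0.0, 0.2, 0.5, 0.9]}
print(estimate_background_temperature(target, ref))  # → 0.2
```

The KS statistic is one natural choice here because it compares whole empirical distributions without assuming any parametric form; any other divergence over the reference temperature grid would slot into the same structure.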
Implications for Deployment
The researchers emphasize that understanding $T_{\mathrm{bg}}$ is critical for reproducibility, evaluation, and reliable deployment. If a system has a high background temperature, it may fail to provide consistent results, which is problematic for applications requiring strict determinism, such as code generation or automated reasoning. The paper suggests that once $T_{\mathrm{bg}}$ is quantified, engineers can take targeted steps to reduce it. These interventions include using batch-invariant kernels, enforcing deterministic reduction orders, and capping concurrency to minimize the impact of the inference environment on the model's output.
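One of these mitigations, enforcing a deterministic reduction order, can be illustrated with a small sketch; the helper names are invented, and a real fix would live inside the GPU kernels rather than in Python:

```python
import math
import random

def chunked_sum(values, chunks):
    """Mimics a parallel reduction: partial sums over chunks, then a
    combine. Different chunkings (e.g. from different batch sizes) can
    yield results that differ in the last bits."""
    k = math.ceil(len(values) / chunks)
    partials = [sum(values[i:i + k]) for i in range(0, len(values), k)]
    return sum(partials)

def deterministic_sum(values):
    """A batch-invariant reduction: always accumulate in one canonical
    order (ascending), regardless of how the work was split. Any
    permutation of the input gives a bit-identical result."""
    return sum(sorted(values))

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# The canonical-order sum is identical no matter how the input arrives;
# chunked sums with different chunk counts may disagree in the last bits.
assert deterministic_sum(xs) == deterministic_sum(list(reversed(xs)))
```

Batch-invariant kernels apply the same idea at the hardware level: the accumulation order is fixed by the kernel, not by how many requests happen to share a batch, so the cost of determinism is paid in scheduling flexibility rather than correctness.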
Pilot Findings
In a pilot experiment using the gpt-4.1-nano model, the researchers applied their protocol to estimate its background temperature. By comparing the model's output variability against a reference model (SmolLM3-3B) across a range of temperatures, they were able to observe how the distribution of identical answers shifted. The results demonstrated that the target model exhibited a measurable level of nondeterminism even at $T=0$, confirming that the concept of background temperature provides a viable framework for characterizing and eventually controlling the hidden randomness inherent in modern LLM infrastructure.
