Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Key Takeaways

  • Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs.
  • Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity.
  • We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
Paper Abstract

Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
Even when developers set a Large Language Model (LLM) to a temperature of $T=0$—the standard setting for "greedy" or deterministic output—the model may still produce different responses to the same input. This paper formalizes this persistent, unintended variability by introducing the concept of "background temperature" ($T_{\mathrm{bg}}$). The authors argue that this hidden randomness is not a flaw in the model's logic, but rather a byproduct of the underlying computing environment, such as how hardware processes data, the specific kernels used, and floating-point math operations. By quantifying this effect, the researchers aim to provide a standardized way to measure and eventually mitigate the unpredictability of LLM deployments.

Defining Background Temperature

The paper models the nondeterminism of an LLM as a stochastic perturbation that occurs during the inference process. Even when a user requests $T=0$, the actual system environment—which includes factors like batch size, hardware precision, and parallel processing order—acts as a hidden source of noise. The authors define $T_{\mathrm{bg}}$ as the "equivalent temperature" of an ideal, environment-free system that would produce the same level of output variability as the real-world system. Essentially, $T_{\mathrm{bg}}$ represents the "floor" of randomness that exists in a deployment stack regardless of the user's settings.
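To see why the computing environment alone can inject noise, consider floating-point non-associativity, one of the sources the paper cites. This is a minimal illustrative sketch, not part of the paper's formalism: regrouping the same additions (as a parallel reduction or a different batch size might) changes the rounded result, which can flip which token has the highest logit.

```python
# Floating-point addition is not associative: the same three operands,
# grouped differently, round to different results. In an inference
# stack the grouping depends on kernel tiling and batch layout, so
# identical requests can see slightly different logits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # grouped like a serial left-to-right reduction
right = a + (b + c)  # grouped like a differently-tiled reduction

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

When two candidate tokens have nearly equal logits, a discrepancy of this size is enough for greedy decoding to pick different tokens across runs, after which the completions diverge entirely.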

A Protocol for Measurement

To estimate $T_{\mathrm{bg}}$, the authors propose a practical empirical protocol. Since a perfectly deterministic "ideal" system is often unavailable, the protocol suggests using a stable, local "anchor" model as a reference. By running the same prompts through both the system under test and the stable reference model at various known temperatures, researchers can compare the resulting output distributions. Using statistical divergence metrics—such as the Kolmogorov–Smirnov distance—the team can identify which temperature setting on the reference model most closely matches the variability observed in the target system. This mapping allows them to assign an equivalent temperature value to the target system's hidden noise.
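The matching step above can be sketched in code. Everything below is an illustrative assumption rather than the authors' implementation: the scalar "output statistic" per run, the grid of candidate temperatures, and the Gaussian toy stand-ins for real model samples are all placeholders for actual LLM outputs.

```python
import random

def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d, i, j = 0.0, 0, 0
    for p in sorted(set(xs + ys)):
        while i < len(xs) and xs[i] <= p:
            i += 1
        while j < len(ys) and ys[j] <= p:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def estimate_background_temperature(target_samples, reference_sampler, grid):
    """Return the grid temperature at which the reference model's output
    variability most closely matches the target system's (smallest KS)."""
    best_t, best_d = None, float("inf")
    for t in grid:
        d = ks_distance(target_samples, reference_sampler(t))
        if d < best_d:
            best_t, best_d = t, d
    return best_t, best_d

# Toy stand-in: a per-run scalar statistic whose spread grows with T.
random.seed(0)
def reference_sampler(t, n=500):
    return [random.gauss(0.0, 0.05 + t) for _ in range(n)]

# Pretend the target system hides an effective temperature of 0.3.
target = [random.gauss(0.0, 0.05 + 0.3) for _ in range(500)]
grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
t_hat, d_min = estimate_background_temperature(target, reference_sampler, grid)
print(t_hat, d_min)
```

In practice the samples would come from repeated queries to the system under test and to the local anchor model at each grid temperature, with the statistic chosen to summarize output variability (e.g., an answer distribution over repeated runs).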

Implications for Deployment

The researchers emphasize that understanding $T_{\mathrm{bg}}$ is critical for reproducibility, evaluation, and reliable deployment. If a system has a high background temperature, it may fail to provide consistent results, which is problematic for applications requiring strict determinism, such as code generation or automated reasoning. The paper suggests that once $T_{\mathrm{bg}}$ is quantified, engineers can take targeted steps to reduce it. These interventions include using batch-invariant kernels, enforcing deterministic reduction orders, and capping concurrency to minimize the impact of the inference environment on the model's output.
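As a toy analogue of "enforcing deterministic reduction orders" (a sketch of the idea only; real inference kernels implement this at the GPU-kernel level, not in Python), reducing in a canonical order makes the result independent of how the work happened to be scheduled or sharded:

```python
import random

def canonical_sum(values):
    # Reduce in a fixed, canonical order (here: sorted), so the result
    # does not depend on how values arrived from threads or batches.
    return sum(sorted(values))

random.seed(1)
vals = [random.uniform(-1.0, 1.0) * 10 ** random.randint(-6, 6)
        for _ in range(1000)]
shuffled = vals[:]
random.shuffle(shuffled)

# A naive sum may differ across arrival orders due to rounding, but the
# canonical-order reduction is bit-identical for any permutation.
assert canonical_sum(vals) == canonical_sum(shuffled)
```

Determinism here trades some throughput for reproducibility, which mirrors the paper's point: interventions like batch-invariant kernels and concurrency caps shrink $T_{\mathrm{bg}}$ at a performance cost.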

Pilot Findings

In a pilot experiment using the gpt-4.1-nano model, the researchers applied their protocol to estimate its background temperature. By comparing the model's output variability against a reference model (SmolLM3-3B) across a range of temperatures, they were able to observe how the distribution of identical answers shifted. The results demonstrated that the target model exhibited a measurable level of nondeterminism even at $T=0$, confirming that the concept of background temperature provides a viable framework for characterizing and eventually controlling the hidden randomness inherent in modern LLM infrastructure.
