Neural Thermodynamic Laws: A New Framework for Understanding LLM Training Dynamics

Large language models (LLMs) are often described as black boxes, with their training dynamics governed by empirical observations rather than fundamental laws. But a new paper from researchers at MIT proposes a surprising connection: the training of LLMs may follow principles analogous to the laws of thermodynamics.

In Neural Thermodynamic Laws for Large Language Model Training, Ziming Liu and colleagues introduce a framework in which key thermodynamic quantities—temperature, entropy, heat capacity, and thermal conduction—emerge naturally from the dynamics of LLM training. The work offers more than a theoretical curiosity: it yields practical insights, particularly for designing learning rate schedules.

The River-Valley Landscape

The paper builds on recent observations that LLM loss landscapes resemble river valleys: flat, slow-changing directions (the river running along the valley floor) flanked by sharp, fast-changing directions (the steep valley walls). The fast directions equilibrate quickly, while the slow directions evolve gradually, a separation of timescales that mirrors quasi-static processes in thermodynamics.
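
As a mental model, consider a toy two-dimensional landscape of this shape. This is my illustration, not the paper's exact construction: one slow direction along which the loss falls gently, and one sharp direction with steep (and possibly sharpening) walls.

```python
# Toy 2-D river-valley landscape (illustrative; not the paper's construction).
# x is the slow "river" direction, y the sharp "valley wall" direction.
def river_valley_loss(x: float, y: float) -> float:
    slow = 1.0 / (1.0 + 0.01 * x)            # river: loss decays slowly along x
    sharpness = 100.0 * (1.0 + 0.001 * x)    # walls may sharpen as training proceeds
    fast = 0.5 * sharpness * y ** 2          # steep quadratic walls in y
    return slow + fast

# The y-gradient (~100 * y) dwarfs the x-gradient (at most ~0.01), so
# fluctuations across the valley equilibrate almost instantly relative
# to the slow drift downriver.
print(river_valley_loss(0.0, 0.1))    # 1.5: mostly "wall" loss
print(river_valley_loss(100.0, 0.0))  # 0.5: pure "river" loss, far downstream
```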

Under this framework, the learning rate (η) plays the role of temperature, controlling the "thermal" fluctuations in the valley directions. The fast loss component (ℓf) behaves like thermal energy, obeying an equipartition-like principle: at equilibrium its value is set by the learning rate, independent of the sharpness of individual directions. Meanwhile, the slow loss (ℓs) corresponds to macroscopic work done along the river.
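
The equipartition claim is easy to see in a toy setting. Below is a minimal sketch (my construction, not the paper's experiment): SGD on a 1-D quadratic "valley wall" with additive gradient noise of scale σ. Standard SGD analysis gives a steady-state loss of roughly η·σ²/4 for small η, independent of the wall's sharpness, which is exactly the equipartition-like behavior described above.

```python
import numpy as np

# SGD on loss(x) = 0.5 * sharpness * x**2 with additive gradient noise sigma.
# At stationarity, E[loss] ~= eta * sigma**2 / 4 (for eta * sharpness << 2),
# independent of sharpness: an equipartition-like result.
rng = np.random.default_rng(0)
eta, sigma, steps = 0.01, 1.0, 200_000

for sharpness in (1.0, 4.0, 16.0):            # compare valley curvatures
    x, tail = 1.0, []
    for t in range(steps):
        grad = sharpness * x + sigma * rng.standard_normal()
        x -= eta * grad                        # plain SGD step
        if t > steps // 2:                     # discard burn-in
            tail.append(0.5 * sharpness * x * x)
    print(f"sharpness={sharpness:5.1f}  mean fast loss={np.mean(tail):.5f}  "
          f"eta*sigma^2/4={eta * sigma ** 2 / 4:.5f}")
```

All three curvatures settle to nearly the same fast loss, even though the sharpest wall is sixteen times steeper than the flattest.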

Practical Implications: Learning Rate as Temperature

One of the most actionable insights is the relationship between learning rate and temperature. The paper shows that:

  • Higher η → Higher "temperature": Larger learning rates increase fluctuations in sharp directions, akin to heating a system.
  • Optimal annealing requires η ∝ 1/t: Just as cooling a system too quickly can trap it in non-equilibrium states, decaying the learning rate too fast harms convergence. The derived optimal schedule decays as η(t) ≈ η₀/(1 + t/tₕ), where tₕ depends on sharpness and noise (see the sketch after this list).
  • Entropic forces emerge: Sharpening valleys during training creates an entropic force that resists optimization, analogous to thermodynamic systems favoring states with higher entropy.
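
To make the annealing law concrete, here is the 1/t schedule as a plain Python function. The constants eta0 and t_h are placeholder values I chose for illustration; in the paper, tₕ is set by the landscape's sharpness and gradient-noise level, which you would estimate for your own run.

```python
def one_over_t_schedule(step: int, eta0: float = 3e-4, t_h: float = 10_000.0) -> float:
    """Learning rate eta(t) = eta0 / (1 + t / t_h).

    eta0 -- peak learning rate before decay begins (illustrative value).
    t_h  -- decay timescale; per the paper it depends on sharpness and noise.
    """
    return eta0 / (1.0 + step / t_h)

# The schedule halves eta at t = t_h and then develops a long, slow tail:
for t in (0, 10_000, 100_000):
    print(t, one_over_t_schedule(t))  # eta0, eta0/2, eta0/11
```

The long tail is the point: it keeps the "cooling" quasi-static, whereas schedules that crash to zero risk freezing the system out of equilibrium.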

Experiments Validate the Theory

The team tested these ideas on GPT-2 training, finding that:

  1. The final validation loss scales linearly with the minimum learning rate (η_min), matching the thermal loss prediction (reproduced in miniature in the sketch after this list).
  2. The 1/t decay schedule outperforms common alternatives like linear or cosine decay.
  3. Entropic forces, while small in early training, could become significant in longer runs as valleys sharpen.
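
Point 1 can be reproduced in miniature on the same toy quadratic from earlier (my sketch, not the GPT-2 experiment): anneal η with the 1/t schedule down to different floors η_min and check that the residual loss tracks η_min·σ²/4 linearly.

```python
import numpy as np

# Anneal eta from eta0 down to eta_min with a 1/t schedule, then measure the
# residual "thermal" loss near the floor. Prediction: roughly
# eta_min * sigma**2 / 4, i.e. linear in eta_min (a toy analogue of the
# GPT-2 result, not the real experiment).
rng = np.random.default_rng(1)
sharpness, sigma, eta0, steps = 4.0, 1.0, 0.05, 400_000

for eta_min in (0.01, 0.005, 0.0025):
    t_h = steps / (eta0 / eta_min - 1.0)     # 1/t decay reaching eta_min at the end
    x, tail = 1.0, []
    for t in range(steps):
        eta = eta0 / (1.0 + t / t_h)
        x -= eta * (sharpness * x + sigma * rng.standard_normal())
        if t > 0.9 * steps:                  # measure near the learning-rate floor
            tail.append(0.5 * sharpness * x * x)
    print(f"eta_min={eta_min:.4f}  final loss={np.mean(tail):.5f}  "
          f"eta_min*sigma^2/4={eta_min * sigma ** 2 / 4:.5f}")
```

Halving η_min roughly halves the final loss in this toy, which is the linear scaling the paper reports for GPT-2.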

Why This Matters

This work shifts LLM optimization from a largely empirical endeavor to one grounded in mechanistic principles. By framing training dynamics in thermodynamic terms, it offers:

  • Better learning rate schedules: The 1/t decay law provides a principled alternative to heuristic schedules.
  • Deeper understanding of scaling: The thermal analogy suggests why larger models might need careful annealing—sharp directions accumulate "heat" that must be dissipated.
  • A bridge to physics: The framework opens the door to applying tools from statistical mechanics to deep learning, potentially unlocking new optimization techniques.

As LLMs grow larger and more expensive to train, such theoretical insights could prove invaluable—turning the art of training into more of a science.