Weight Ensembling Boosts Reasoning in Language Models, Overcoming Diversity Collapse
Large language models (LLMs) have shown remarkable reasoning abilities, but a new study reveals a critical flaw in their training process: diversity collapse. When models are fine-tuned for reasoning tasks, their ability to generate varied solutions deteriorates, even as their accuracy on single attempts (Pass@1) improves. This is a major problem for real-world applications where the practical recipe is to sample multiple candidate solutions and keep the best one, a regime measured by Pass@k.
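To pin the metric down, a common way to estimate Pass@k is the unbiased estimator from the HumanEval/Codex evaluation: sample n completions per problem, count the c correct ones, and estimate the probability that at least one of k random draws would be correct. The sketch below is illustrative, and the toy numbers are not from the study.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n samples of which c are correct.

    Equivalent to 1 - C(n - c, k) / C(n, k), computed as a stable product.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy numbers: with 4 correct samples out of 16, Pass@1 is 0.25
# while Pass@8 is about 0.96, assuming the samples are genuinely varied.
print(pass_at_k(n=16, c=4, k=1))
print(pass_at_k(n=16, c=4, k=8))
```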
The Problem: Diversity Collapse
Researchers from Tsinghua University, Carnegie Mellon, and Stanford discovered that during supervised fine-tuning (SFT), models increasingly converge on a single reasoning path. While Pass@1 rises, Pass@k—the probability that at least one of k attempts is correct—plummets. This happens because models become overconfident in specific solutions, reducing their exploratory capability.
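One back-of-the-envelope way to see the collapse (an illustration on our part, not a procedure from the paper) is to sample several solutions per problem at successive SFT checkpoints and track how many distinct final answers appear, or the entropy of the answer distribution; as training goes on, the distribution concentrates on a single answer.

```python
from collections import Counter
import math

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over final answers.

    Low entropy across many samples means the model has collapsed
    onto one or two solution paths.
    """
    counts = Counter(answers)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical samples from an early vs. a late SFT checkpoint on one problem.
early = ["42", "41", "42", "40", "43", "42", "39", "42"]
late = ["42"] * 8
print(answer_entropy(early), len(set(early)))  # 2.0 bits, 5 distinct answers
print(answer_entropy(late), len(set(late)))    # 0.0 bits, 1 answer
```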
The Solution: WiSE-FT
The team found a surprisingly simple fix: weight ensembling (WiSE-FT), where the latest model checkpoint is interpolated with an earlier, more diverse one. This hybrid model retains the accuracy of late-stage training while recovering the diversity of earlier stages; a minimal weight-interpolation sketch follows the list below. Key benefits include:
- Better Test-Time Scaling: WiSE-FT outperforms standard models in majority voting and reward-guided search.
- Improved Reinforcement Learning: Starting RL from a WiSE-FT checkpoint yields better results with less data.
- A Better Bias-Variance Tradeoff: Unlike temperature scaling, which trades bias for variance, WiSE-FT reduces both simultaneously.
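Mechanically, WiSE-FT is just linear interpolation of model parameters. Below is a minimal PyTorch-style sketch, assuming two SFT checkpoints of the same architecture saved as state dicts; the file names and the mixing coefficient alpha = 0.5 are placeholders, and the paper's exact choice of checkpoints and ratio may differ.

```python
import torch

def wise_ft_interpolate(early_sd: dict, late_sd: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints of the same model.

    alpha = 0.0 returns the early (more diverse) checkpoint,
    alpha = 1.0 returns the late (more accurate) checkpoint.
    """
    assert early_sd.keys() == late_sd.keys(), "checkpoints must share an architecture"
    merged = {}
    for name, late_param in late_sd.items():
        early_param = early_sd[name]
        if torch.is_floating_point(late_param):
            merged[name] = (1.0 - alpha) * early_param + alpha * late_param
        else:
            merged[name] = late_param  # integer buffers etc.: keep the late value
    return merged

# Hypothetical usage: merge an early and a final SFT checkpoint, then evaluate
# Pass@k (or start RL) from the merged weights.
early_sd = torch.load("sft_epoch_1.pt", map_location="cpu")
late_sd = torch.load("sft_final.pt", map_location="cpu")
merged_sd = wise_ft_interpolate(early_sd, late_sd, alpha=0.5)
# model.load_state_dict(merged_sd)
```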
Why It Matters
This work challenges conventional wisdom that decoding strategies (like temperature scaling) are sufficient for diverse reasoning. Instead, it shows that diversity must be baked into the training process. For businesses leveraging LLMs in complex problem-solving, WiSE-FT offers a scalable way to enhance performance without additional computational cost.
Practical Takeaways
- Train longer, then ensemble: Don’t stop SFT early; instead, interpolate the final checkpoint’s weights with those of an earlier one.
- RL loves diversity: WiSE-FT provides richer training signals for reinforcement learning.
- Decoding isn’t enough: Token-level tricks such as temperature scaling can’t fully compensate for diversity lost during training (see the sketch after this list).
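To see why, recall what temperature scaling actually does at decode time: it only flattens or sharpens the model's next-token distribution, so it cannot resurrect solution modes the fine-tuned model no longer assigns meaningful probability to. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Convert logits into sampling probabilities at a given temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # shift for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Made-up logits for a model that has collapsed onto one continuation.
logits = np.array([10.0, 2.0, 1.5, 1.0])
for t in (0.7, 1.0, 1.5):
    print(t, np.round(softmax_with_temperature(logits, t), 4))
# Higher temperature spreads probability mass more evenly, but much of it goes
# to low-quality continuations; it does not recover the distinct, coherent
# reasoning paths that were lost during fine-tuning.
```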
The findings suggest a new paradigm for reasoning models: train aggressively, ensemble wisely, and scale compute intelligently. For AI teams, this could mean faster iteration and more reliable outputs in math, coding, and logical reasoning tasks.