Does Reinforcement Learning Really Expand LLM Reasoning Beyond Base Models?

Reinforcement Learning with Verifiable Rewards (RLVR) has been hailed as a breakthrough for enhancing reasoning in large language models (LLMs), particularly in math and coding tasks. But a new study from researchers at Tsinghua University and Shanghai Jiao Tong University challenges this assumption, revealing that RL-trained models may not actually develop fundamentally new reasoning abilities—they just get better at sampling existing ones.

The Surprising Findings

The team measured reasoning performance using the pass@k metric—which counts a problem as solved if any of k sampled responses is correct—across multiple model families, RL algorithms, and benchmarks. Their key discovery? While RL-trained models outperform base models at low k (e.g., k=1), base models catch up or even surpass RL models when allowed more samples (k=256 or higher). This suggests RLVR doesn’t expand reasoning boundaries—it just makes models more efficient at finding correct answers already within their base capabilities.
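
To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator commonly used for this style of evaluation (popularized by OpenAI's HumanEval benchmark); the study's exact evaluation code may differ, and the sample counts below are purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    is itself correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy numbers (illustrative only): a model that is right on ~5% of
# individual samples still approaches pass@k = 1 as k grows.
n, c = 256, 13
for k in (1, 8, 64, 256):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
```

The point of the estimator is that pass@k rewards coverage: even a model that is rarely right on any single sample can score well once k is large, which is exactly the regime where the base models catch up.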

Why This Matters

  1. RLVR Biases, Not Expands, Reasoning Paths
    The study found that RL-trained models don’t generate novel reasoning patterns. Instead, they bias the model’s output distribution toward paths that yield rewards, improving success rates for problems the base model could already solve—just less frequently. However, this comes at a cost: RL-trained models explore less, narrowing their overall reasoning scope (a toy simulation after this list illustrates the trade-off).
  2. Distillation Works Differently
    Unlike RLVR, distillation (training smaller models on outputs from stronger ones) does introduce new knowledge, genuinely expanding reasoning boundaries. This highlights a critical limitation of RLVR’s current form.
  3. All RL Algorithms Hit the Same Wall
    Whether using PPO, GRPO, or Reinforce++, RL methods only marginally improved sampling efficiency—none came close to the base model’s upper-bound performance at high k.
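
To see why sharpening the output distribution trades coverage for efficiency, here is a toy simulation with made-up per-problem success probabilities (none of these numbers come from the paper): a hypothetical "base" model that can solve many problems but only occasionally, versus an "RL-tuned" model whose probability mass is concentrated on fewer problems it solves reliably.

```python
import numpy as np

rng = np.random.default_rng(0)
K = np.array([1, 8, 64, 256])

# Hypothetical per-problem probabilities of sampling a correct answer.
# Base model: can in principle solve 80% of problems, but each only rarely.
base_p = np.where(rng.random(1000) < 0.8, 0.02, 0.0)
# RL-tuned model: mass concentrated on 50% of problems, solved reliably.
rl_p = np.where(rng.random(1000) < 0.5, 0.6, 0.0)

for name, p in [("base", base_p), ("RL-tuned", rl_p)]:
    # pass@k per problem is 1 - (1 - p)^k; average over the problem set.
    scores = 1.0 - (1.0 - p[:, None]) ** K[None, :]
    print(name, dict(zip(K.tolist(), scores.mean(axis=0).round(3))))
```

With these invented numbers the RL-tuned model wins comfortably at pass@1 yet falls behind the base model once k reaches the hundreds, which is the qualitative pattern the study reports across model families and RL algorithms.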

The Bigger Picture

These findings force us to rethink RLVR’s role in advancing LLMs. If the goal is true reasoning breakthroughs, RLVR in its current form may be insufficient. The study suggests future work should explore:

  • Hybrid approaches combining RL with distillation or other paradigms.
  • Better exploration strategies to escape the base model’s reasoning limits.
  • Alternative training frameworks that don’t sacrifice coverage for efficiency.

For businesses relying on LLMs for complex reasoning tasks, this research underscores the importance of evaluating models beyond single-sample metrics—what matters isn’t just how often a model gets the right answer, but whether it can ever get it with enough tries.