
ProRL: How Prolonged Reinforcement Learning is Expanding the Reasoning Boundaries of AI Models

In a groundbreaking study from NVIDIA, researchers have introduced ProRL (Prolonged Reinforcement Learning), a novel training methodology that challenges prevailing assumptions about the limits of reinforcement learning (RL) in large language models (LLMs). The team demonstrates that extended RL training can uncover novel reasoning strategies inaccessible to base models—even under extensive sampling—effectively expanding what these models can do.

The Big Question: Does RL Really Improve Reasoning?

Recent advances in reasoning-centric language models have highlighted RL as a promising method for aligning models with verifiable rewards. But a fundamental debate persists: does RL truly expand a model’s reasoning capabilities, or does it merely amplify high-reward outputs already latent in the base model’s distribution? Previous studies, relying on pass@k metrics, argued that RL-trained models don’t acquire reasoning skills beyond those already present in their base models. The NVIDIA team posits that these conclusions stem from methodological constraints rather than fundamental limitations of RL.
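
For context, pass@k is usually computed with the standard unbiased estimator popularized alongside the HumanEval benchmark. The snippet below is a minimal sketch of that calculation; the function name and example numbers are illustrative and are not taken from the ProRL paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (of which c are correct) passes.

    n: total samples generated per problem
    c: number of correct samples among them
    k: sampling budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 correct answers out of 64 samples, evaluated at k=8
print(round(pass_at_k(n=64, c=4, k=8), 3))  # ~0.422
```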

Introducing ProRL

The researchers identify two key limitations in existing RL approaches:

  1. Overreliance on specialized domains (like mathematics), where models are often overtrained, restricting exploration.
  2. Premature termination of RL training, typically after just hundreds of steps, before models can fully develop new reasoning capabilities.

To address these limitations, they introduce ProRL, which incorporates the following (a simplified code sketch of the first two mechanisms appears after the list):

  • KL divergence control to prevent entropy collapse (where models become overly narrow in their outputs).
  • Reference policy resetting to stabilize long-term training.
  • A diverse suite of tasks (math, code, STEM, logic puzzles, and instruction following) to encourage generalization.
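
The snippet below is only a simplified sketch of how the first two mechanisms, a KL penalty against a reference policy and periodic reference resets, could be wired into a generic policy-gradient loop. It is not the authors’ implementation; every function name, tensor shape, and hyperparameter here is illustrative.

```python
import copy
import torch.nn.functional as F

def kl_regularized_pg_loss(policy_logits, ref_logits, actions, advantages, kl_coef=0.01):
    """Policy-gradient loss with a KL penalty toward a frozen reference policy.

    policy_logits, ref_logits: (batch, seq_len, vocab) token logits
    actions:                   (batch, seq_len) sampled token ids
    advantages:                (batch, seq_len) per-token advantage estimates
    The KL term keeps the trained policy close to the reference, one way to
    limit the entropy collapse that long RL runs tend to suffer from.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    action_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    pg_loss = -(advantages * action_logp).mean()              # REINFORCE-style objective
    kl = (logp.exp() * (logp - ref_logp)).sum(dim=-1).mean()  # KL(policy || reference)
    return pg_loss + kl_coef * kl

def maybe_reset_reference(policy, reference, step, reset_every=500):
    """Periodically re-anchor the reference to the current policy, so the KL
    penalty restrains drift without permanently freezing progress."""
    if step > 0 and step % reset_every == 0:
        reference.load_state_dict(copy.deepcopy(policy.state_dict()))
    return reference
```

The intuition behind occasionally resetting the reference, rather than fixing it forever, is that the KL anchor can restrain short-term drift while still letting the policy make progress over thousands of steps.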

Key Findings

  1. Performance Gains Across the Board
  • The resulting model, Nemotron-Research-Reasoning-Qwen-1.5B, outperforms its base model (DeepSeek-R1-Distill-Qwen-1.5B) by 14.7% on math, 13.9% on coding, 54.8% on logic puzzles, 25.1% on STEM reasoning, and 18.1% on instruction-following tasks.
  • Remarkably, it even matches or surpasses the performance of DeepSeek-R1-Distill-Qwen-7B, a model nearly five times its size.
  2. RL Discovers Genuinely New Solutions
  • On tasks where the base model fails entirely—regardless of sampling attempts—the ProRL-trained model achieves 100% pass rates.
  • The team quantifies reasoning novelty using the Creativity Index, showing that prolonged RL leads to higher originality in solutions.
  3. Scaling with Compute
  • Unlike prior work, ProRL shows continued performance improvements even after 2,000 training steps, suggesting RL training scales effectively with increased compute.

Why This Matters for Business

These findings have significant implications for AI deployment in enterprise settings:

  • Smaller, More Efficient Models: A 1.5B-parameter model achieving performance comparable to larger models reduces computational costs for businesses.
  • Generalization to New Tasks: ProRL’s ability to handle out-of-distribution tasks suggests robustness in real-world applications where input variability is high.
  • No Additional Data Needed: RL can enhance capabilities without requiring new training data, making it cost-effective.

The Future of RL in AI

The study challenges the notion that RL merely optimizes existing behaviors, showing instead that it can genuinely expand a model’s reasoning boundaries. This opens new pathways for developing more capable and generalizable AI systems—particularly in domains like automated problem-solving, code generation, and complex decision-making.

For businesses, the message is clear: RL isn’t just a fine-tuning tool—it’s a way to unlock new AI potential.


Model weights and further details are available on Hugging Face.