How AI Models Can Improve Their Reasoning by Just Being More Confident

In the ever-evolving landscape of artificial intelligence, one of the most persistent challenges has been improving the reasoning capabilities of large language models (LLMs). Traditionally, this has required extensive supervision—feeding models labeled data or correct answers to learn from. But what if models could get better at reasoning without any external feedback? A new paper from researchers at Carnegie Mellon University suggests they can—by simply learning to be more confident in their answers.

The Problem with Supervision

Reinforcement learning (RL) has been a cornerstone in advancing AI, particularly in reasoning tasks like math, science, and coding. But RL relies heavily on reward functions—signals that tell the model whether it’s doing well. Crafting these rewards is notoriously difficult, especially in open-ended or real-world scenarios where ground-truth answers aren’t available. Current methods often require human-labeled data or predefined correct answers, limiting their scalability and applicability.

A Radical Idea: Confidence as Reward

The CMU team proposes a novel approach called RENT (Reinforcement Learning via Entropy Minimization), which sidesteps the need for external supervision entirely. Instead of relying on correct answers, RENT uses the model’s own confidence—measured by the entropy of its token predictions—as an intrinsic reward. Lower entropy means the model is more certain about its output; higher entropy means it’s less sure. By reinforcing low-entropy (high-confidence) responses, the model gradually improves its reasoning abilities.

“Imagine you’re taking an exam. With no external help, you rely on your own confidence to guide your reasoning. Similarly, we let the model optimize for confidence—its own internal signal of certainty.” — Mihir Prabhudesai, lead author.

How It Works

When a language model generates text, it predicts a probability distribution over possible next tokens at each step. The entropy of this distribution quantifies uncertainty: a flat distribution (high entropy) means the model is unsure; a peaked one (low entropy) means it’s confident. RENT computes the average entropy across all tokens in a response and uses its negative as the reward. The model then learns to generate responses with lower entropy—effectively, more confident answers.
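
To make this concrete, here is a minimal sketch of the reward described above: the negative of the average per-token entropy of a generated response. It assumes access to the model's per-step next-token logits (e.g., in PyTorch); the function name and interface are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def entropy_reward(logits: torch.Tensor) -> torch.Tensor:
    """Negative mean token entropy of a generated response.

    logits: (seq_len, vocab_size) tensor of the model's per-step
    next-token logits for the tokens it generated.
    Returns a scalar reward: higher when the model is more confident.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # (seq_len, vocab)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)    # (seq_len,)
    return -token_entropy.mean()                        # negative average entropy
```

Maximizing this reward is the same as minimizing entropy, so the RL objective nudges the model toward more peaked, confident token distributions.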

Crucially, the researchers found that confidence in the final steps of reasoning matters most. Early tokens in a response showed little correlation with accuracy, but tokens near the end—especially those corresponding to the final answer—were strongly predictive of correctness. This mirrors human problem-solving: we revise our reasoning until we feel certain about the conclusion.
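
A natural variant, following this observation, is to compute the entropy reward only over the tokens near the end of the response. The sketch below is an assumption-laden illustration: the `last_k` cutoff and the slicing scheme are hypothetical choices for demonstration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def final_answer_entropy_reward(logits: torch.Tensor, last_k: int = 16) -> torch.Tensor:
    """Reward confidence only on the last `last_k` generated tokens,
    where confidence is most predictive of correctness.

    logits: (seq_len, vocab_size) per-step next-token logits.
    """
    tail = logits[-last_k:]                              # final-answer region (illustrative cutoff)
    log_probs = F.log_softmax(tail, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -token_entropy.mean()
```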

Benchmarking RENT

The team tested RENT across five challenging reasoning benchmarks:

  • GSM8K (grade-school math)
  • MATH500 (competition-level math)
  • AMC and AIME (high-school math competitions)
  • GPQA (PhD-level science questions)

They evaluated models from the Mistral and Qwen families, ranging from 1.5B to 7B parameters. Across the board, RENT improved performance—sometimes dramatically. For example:

  • Qwen2.5-Math-1.5B went from 0% to 15.9% accuracy on GSM8K.
  • Qwen2.5-7B-Instruct improved from 76.2% to 82.3% on MATH500.
  • Even specialized math models like Qwen2.5-Math-7B saw gains, jumping from 65.2% to 82.7% on MATH500.

Is It Just Learning to Format Answers?

One concern was whether RENT simply teaches models to format answers correctly (e.g., boxing final answers) rather than improving reasoning. To test this, the researchers compared RENT to a baseline that only rewarded proper formatting. RENT consistently outperformed it, confirming that the model wasn’t just “gaming” the format but actually refining its reasoning.

Confidence vs. Correctness

A key question is whether confidence reliably correlates with accuracy. The team found that, yes, as models became more confident via RENT, their accuracy also improved. This suggests that, at least in reasoning tasks, confidence is a well-calibrated signal—though the authors caution that overconfidence remains a risk in open-ended settings.
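
One way to sanity-check this calibration claim on your own model is to compare per-response confidence (the negative mean entropy reward) against correctness over an evaluation set. A hedged sketch, assuming you already have per-response rewards and 0/1 correctness labels:

```python
import numpy as np

def confidence_accuracy_correlation(rewards, correct):
    """Pearson correlation between per-response confidence rewards
    (negative mean entropy) and 0/1 correctness labels."""
    rewards = np.asarray(rewards, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return np.corrcoef(rewards, correct)[0, 1]

# A clearly positive correlation suggests confidence is a useful proxy
# for correctness on this task; values near zero suggest it is not.
```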

Why This Matters

RENT opens up new possibilities for unsupervised RL in AI. It’s especially promising for domains where labeled data is scarce, such as advanced mathematics, scientific research, or real-world decision-making. The method is also computationally lightweight, requiring no additional reward models or human feedback.

Limitations

Of course, unsupervised learning has limits. RENT can’t match the performance of models trained with ground-truth answers, and overconfidence could lead to errors in high-stakes scenarios. The researchers emphasize that safeguards are needed before deploying such systems in the wild.

The Future of Self-Improving AI

RENT is part of a growing trend toward self-improving AI systems. By leveraging intrinsic signals like confidence, models can refine their abilities without constant human oversight. The CMU team is now exploring how to combine RENT with other unsupervised techniques—potentially unlocking even greater reasoning gains.

For businesses, this research underscores a critical insight: sometimes, the best way to improve AI isn’t more data, but better ways for models to learn from themselves.


Read the full paper here.