RISE: How Self-Verification is Reinforcing AI’s Problem-Solving Skills

Large language models (LLMs) have shown impressive reasoning abilities, but they often suffer from one critical flaw: superficial self-reflection. Even when these models generate correct answers, they frequently fail to robustly verify their own outputs, a problem that undermines trust and reliability in AI systems. A new framework called RISE (Reinforcing Reasoning with Self-Verification) aims to fix this by integrating self-verification directly into reinforcement learning (RL) training.

The Problem: Superficial Self-Reflection

Reinforcement Learning with Verifiable Rewards (RLVR) has been a key method for improving LLMs in domains like mathematics, where correctness can be programmatically checked. However, models trained this way often learn to produce correct answers without truly understanding their reasoning—or being able to critique it. This leads to “superficial self-reflection”, where models generate plausible-sounding justifications but can’t reliably identify their own mistakes.

The Solution: RISE

Developed by researchers at Tencent and The Chinese University of Hong Kong, Shenzhen, RISE trains models to both solve problems and verify their solutions in a single, unified RL process. The key innovation? Using verifiable rewards not just to improve answer accuracy, but also to teach the model to critique its own work on the fly.

Here’s how it works (a rough code sketch follows the steps):

  1. Problem-Solving Phase: The model generates multiple solutions (with chain-of-thought reasoning) for a given problem.
  2. Self-Verification Phase: The same model then evaluates its own solutions, assigning a correctness score based on predefined criteria (e.g., whether the final answer is properly boxed and matches the ground truth).
  3. RL Optimization: Both the problem-solving and verification trajectories contribute to policy updates, ensuring the model improves at both tasks simultaneously.
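To make the loop concrete, here is a minimal, hypothetical sketch of how the two phases could produce reward-labeled trajectories for a single problem. The `generate` callable, prompt templates, answer-extraction regex, and reward values are illustrative assumptions rather than the authors' implementation, and the actual policy update (PPO in the paper's experiments) is omitted.

```python
# Minimal sketch of RISE-style trajectory collection (not the authors' code).
import re
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float  # verifiable reward fed to the RL update

def extract_final_answer(solution: str):
    """Pull the final boxed answer out of a chain-of-thought solution (assumed format)."""
    match = re.search(r"\\boxed\{(.+?)\}", solution)
    return match.group(1).strip() if match else None

def collect_rise_trajectories(generate, problem: str, gold_answer: str, k: int = 4):
    """One RISE-style rollout: k solution attempts, each followed by a self-verification
    pass; both kinds of trajectories carry verifiable rewards for the same policy update."""
    trajectories = []
    for _ in range(k):
        # 1. Problem-solving phase: sample a chain-of-thought solution.
        solution = generate(f"Solve step by step, put the final answer in \\boxed{{}}:\n{problem}")
        answer = extract_final_answer(solution)
        is_correct = answer is not None and answer == gold_answer
        trajectories.append(Trajectory(problem, solution, reward=1.0 if is_correct else 0.0))

        # 2. Self-verification phase: the same model judges its own solution.
        verify_prompt = (f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
                         "Is the solution correct? Answer yes or no.")
        verdict = generate(verify_prompt)
        says_correct = "yes" in verdict.lower()
        # Verification reward: the self-judgment must agree with the ground truth.
        trajectories.append(Trajectory(verify_prompt, verdict,
                                       reward=1.0 if says_correct == is_correct else 0.0))

    # 3. Both trajectory types would then be passed to the RL optimizer (e.g., PPO).
    return trajectories

# Toy usage with a hard-coded "model" that always answers 4 and says "yes":
def dummy(prompt):
    return "The answer is \\boxed{4}" if prompt.startswith("Solve") else "yes"

trajs = collect_rise_trajectories(dummy, "What is 2 + 2?", "4", k=1)
print([t.reward for t in trajs])  # -> [1.0, 1.0]
```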

Results: Better Reasoning, Stronger Verification

Experiments on mathematical reasoning benchmarks (MATH, AIME, AMC) show that RISE significantly outperforms baseline methods:

  • Higher Accuracy: RISE-7B achieved 42.9% average reasoning accuracy, a massive jump over the 11.3% achieved by standard supervised fine-tuning (SFT) baselines.
  • Better Self-Verification: RISE models were up to 2.8× more accurate at verifying their own solutions than non-RISE models.
  • Test-Time Gains: When self-verification scores are used to weight majority voting (see the sketch below), RISE-7B improved accuracy by +1.9% over standard voting.
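As a rough illustration of that last point, here is a hypothetical sketch of verification-weighted majority voting; the function name and the simple additive weighting are assumptions, not the paper's exact formula.

```python
# Hypothetical sketch of verification-weighted majority voting at test time.
from collections import defaultdict

def weighted_majority_vote(candidates):
    """candidates: (final_answer, self_verification_score) pairs from k sampled solutions.
    Standard majority voting is the special case where every score equals 1.0."""
    scores = defaultdict(float)
    for answer, verify_score in candidates:
        scores[answer] += verify_score
    return max(scores, key=scores.get)

# Example: the wrong answer "41" appears twice, but the verifier trusts it less.
print(weighted_majority_vote([("42", 0.9), ("41", 0.3), ("41", 0.2)]))  # -> "42"
```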

Why This Matters

RISE isn’t just about making models better at math—it’s about making them more reliable and self-aware. By learning to verify their own reasoning, models can:

  • Detect errors before they output incorrect answers.
  • Improve inference-time performance by filtering out bad solutions.
  • Scale to harder problems where human feedback isn’t available.

The framework is also flexible: it works with different RL algorithms (tested with PPO) and could extend to other domains with verifiable rewards, like code generation or scientific reasoning.
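For instance, a verifiable reward for code generation might look like the following hypothetical check; RISE only requires some programmatic correctness signal, and the function names, test format, and sandboxing caveat below are illustrative assumptions.

```python
# Hypothetical verifiable reward for code generation: run the candidate against unit tests.
def code_reward(candidate_source: str, tests, fn_name: str = "solve") -> float:
    """Return 1.0 if the candidate function passes every (args, expected) test, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # note: run inside a sandbox in practice
        fn = namespace[fn_name]
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0

# Example usage with a toy problem ("return the square of x"):
print(code_reward("def solve(x):\n    return x * x", [((3,), 9), ((0,), 0)]))  # -> 1.0
```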

The Bigger Picture

As AI systems take on more complex tasks, self-verification will be crucial for trust and safety. RISE is a step toward models that don’t just generate answers—they validate them. The next frontier? Applying this to open-ended domains where correctness isn’t as easily defined.

For more details, check out the full paper on arXiv.