2 min read

How Strategic Games Are Revealing the Hidden Reasoning Processes of Large Language Models

Large language models (LLMs) are increasingly being used for complex reasoning tasks, but most benchmarks only evaluate the final outcomes, not the internal processes that lead to those outcomes. A new study proposes using strategic games to make these reasoning processes observable and measurable, offering a fresh perspective on how LLMs plan, revise, and make decisions under constraints.

The Problem with Traditional Benchmarks

Traditional benchmarks like GSM8K or MMLU focus on single-turn questions and measure correctness in isolation. They provide little insight into how a model generates hypotheses, updates them in response to feedback, or adjusts its strategy over time. This gap is problematic for real-world applications where the quality of reasoning depends not just on the final answer but on the steps taken to get there.

A Game-Based Solution

The study introduces AdvGameBench, a framework that embeds LLMs in interactive, resource-constrained strategy games. These games—tower defense, auto battler, and turn-based combat—are designed to expose different cognitive and strategic demands. The framework logs full model outputs and action traces, enabling detailed inspection of decision quality, revision behavior, and adherence to constraints.
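To make the logging idea concrete, here is a minimal sketch of what such an interaction loop could look like. Everything in it (the env methods build_prompt, parse_actions, and resolve, the model.generate call, and the log dataclasses) is an illustrative assumption, not AdvGameBench's actual API.

```python
# Hypothetical sketch of a game-evaluation loop in the spirit of AdvGameBench.
# Names and signatures are assumed for illustration only.
from dataclasses import dataclass, field


@dataclass
class RoundLog:
    round_id: int
    prompt: str
    raw_output: str     # full model response, kept verbatim for inspection
    actions: list       # parsed action trace (e.g. unit placements, upgrades)
    budget_used: int
    won: bool


@dataclass
class MatchLog:
    model_name: str
    rounds: list = field(default_factory=list)


def play_match(env, model, n_rounds: int, budget: int) -> MatchLog:
    """Run repeated rounds, feeding results back so revision behavior is observable."""
    log = MatchLog(model_name=model.name)
    feedback = ""
    for i in range(n_rounds):
        prompt = env.build_prompt(budget=budget, feedback=feedback)
        raw = model.generate(prompt)      # full output is logged, not just the answer
        actions = env.parse_actions(raw)  # extract the structured action trace
        result = env.resolve(actions)     # simulate the round against the opponent
        log.rounds.append(RoundLog(i, prompt, raw, actions, result.cost, result.won))
        feedback = result.summary         # drives the next round's revision
    return log
```

Keeping the raw output alongside the parsed actions is what lets the framework score the process (planning, revision, budget use) rather than only the final win/loss outcome.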

Key Metrics

The study evaluates LLMs along three core dimensions:

  1. Planning: How well a model formulates an initial strategy.
  2. Revision: How effectively it corrects mistakes in response to feedback.
  3. Resource-Constrained Decision Making: How well it operates under strict budget limits.

To measure these, the study introduces metrics such as the following (a brief computation sketch appears after the list):

  • Over-correction Risk Rate (ORR): How often a model revises its strategy unnecessarily.
  • Correction Success Rate (CSR): How often revisions lead to improved outcomes.
  • Improvement Slope (β): Whether a model learns from repeated interactions.
  • Over-Budget Rate (OBR): How often a model exceeds resource constraints.
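The paper defines these quantities over the logged rounds; the sketch below is one plausible reading of those definitions, computed from per-round flags. The function names, signatures, and the revision/improvement flags are assumptions for illustration, not the benchmark's actual code.

```python
# Hedged sketch of how the process metrics could be computed from round logs.
import numpy as np


def over_correction_risk_rate(revised: list, needed: list) -> float:
    """ORR: fraction of revisions made in rounds where no revision was needed."""
    unnecessary = sum(r and not n for r, n in zip(revised, needed))
    return unnecessary / max(sum(revised), 1)


def correction_success_rate(revised: list, improved: list) -> float:
    """CSR: fraction of revisions that led to an improved outcome."""
    successes = sum(r and i for r, i in zip(revised, improved))
    return successes / max(sum(revised), 1)


def improvement_slope(win_indicators: list) -> float:
    """beta: least-squares slope of outcome (0/1) against round index."""
    x = np.arange(len(win_indicators))
    return float(np.polyfit(x, win_indicators, deg=1)[0])


def over_budget_rate(costs: list, budget: int) -> float:
    """OBR: fraction of rounds in which spending exceeded the budget."""
    return sum(c > budget for c in costs) / len(costs)
```

A positive improvement slope indicates the model's outcomes trend upward across repeated rounds, i.e. it is actually learning from feedback rather than oscillating.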

Findings from 4,320 Adversarial Rounds

The study tested 12 state-of-the-art models, including variants of ChatGPT, Claude, Gemini, and Qwen. Key findings include:

  • ChatGPT-o3-mini demonstrated strong planning capabilities, achieving the highest composite process score (74.7% win rate, 78.6% correction success, and a +0.041 improvement slope).
  • Qwen-Plus, which had a high Over-correction Risk Rate of 81.6%, won only 25.6% of its matches, primarily due to excessive resource use.
  • There was a negative correlation between Over-correction Risk Rate and Correction Success Rate (Pearson r = –0.51), suggesting that more frequent corrections don’t always improve outcomes; a sketch of this kind of correlation check follows the list.
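For readers who want to run the same kind of check on their own evaluation logs, the snippet below shows the computation. The (ORR, CSR) pairs are placeholders, not the paper's per-model results.

```python
# Correlation check across models; values below are placeholders, not paper data.
import numpy as np

orr = np.array([0.82, 0.35, 0.50, 0.61])  # hypothetical per-model ORR values
csr = np.array([0.41, 0.79, 0.66, 0.58])  # hypothetical per-model CSR values

r = np.corrcoef(orr, csr)[0, 1]           # Pearson correlation coefficient
print(f"Pearson r = {r:.2f}")
```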

Implications for LLM Development

The study highlights that top-performing models balance planning, revision, and constraint adherence—not just excel in one area. For example, models like ChatGPT-o3-mini revise less frequently but more effectively, while others like Qwen-Plus make frequent, often ineffective corrections.

Limitations and Future Work

AdvGameBench currently covers only turn-based genres and relies on synthetic opponents. Future work could expand to real-time or cooperative play and incorporate human opponents for greater external validity.

Conclusion

By shifting the evaluation paradigm from static, outcome-based tests to dynamic, process-aware environments, this study offers a new direction for LLM evaluation. Understanding not just what models decide, but how they decide it, is essential for building reliable, accountable, and aligned AI systems.