How Reinforcement Learning Can Fix LLMs’ Greedy Decision-Making
Large Language Models (LLMs) have shown remarkable capabilities in text generation and reasoning tasks, but their performance in decision-making scenarios often falls short. A recent study from researchers at Google DeepMind and JKU Linz systematically examines why LLMs struggle with exploration and decision-making, identifying three key failure modes: greediness, frequency bias, and the knowing-doing gap. The paper, titled “LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities,” proposes Reinforcement Learning Fine-Tuning (RLFT) as a solution to improve LLMs’ exploration and decision-making abilities.
The Problem: Why LLMs Make Bad Decisions
When deployed as agents, LLMs often exhibit suboptimal behavior in interactive environments. The study highlights three core issues:
- Greediness: LLMs prematurely commit to high-reward actions, neglecting exploration. For instance, in a 10-armed bandit problem, even a 27B-parameter model explored only ~65% of possible actions, leaving significant portions of the action space untried.
- Frequency Bias: Smaller models (e.g., 2B parameters) tend to mimic the most frequent actions in their context, regardless of reward. Larger models reduce this bias but remain prone to greedy exploitation.
- Knowing-Doing Gap: LLMs can describe optimal strategies (e.g., the Upper Confidence Bound algorithm) but fail to execute them. In experiments, models computed UCB values correctly 87% of the time, yet still chose the greedy action 64% of the time even when they ‘knew’ better (a toy greedy-vs-UCB comparison follows this list).
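To make the greediness and knowing-doing failures concrete, here is a small, self-contained bandit sketch. It is not from the paper; the arm probabilities, horizon, and tie-breaking are illustrative assumptions. A purely greedy rule locks onto the first arm that pays off, while UCB keeps pulling under-explored arms:

```python
import math
import random

# Illustrative 10-armed Bernoulli bandit; the reward probabilities are made up.
random.seed(0)
probs = [random.uniform(0.1, 0.9) for _ in range(10)]

def pull(arm):
    return 1.0 if random.random() < probs[arm] else 0.0

def greedy(t, counts, values):
    # Exploit the current best estimate; break ties randomly.
    best = max(values)
    return random.choice([a for a, v in enumerate(values) if v == best])

def ucb(t, counts, values, c=2.0):
    # UCB1: pull every arm once, then add a confidence bonus to each estimate.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(values)),
               key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))

def run(select, horizon=1000):
    counts = [0] * len(probs)
    values = [0.0] * len(probs)
    for t in range(1, horizon + 1):
        arm = select(t, counts, values)
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    coverage = sum(c > 0 for c in counts) / len(counts)
    return coverage

for name, policy in [("greedy", greedy), ("UCB", ucb)]:
    print(f"{name:6s} action coverage = {run(policy):.0%}")
```

The gap in action coverage between the two rules mirrors the paper’s observation that LLMs behave like the greedy rule unless something pushes them to explore.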
The Fix: Reinforcement Learning Fine-Tuning (RLFT)
The researchers propose RLFT, where LLMs are fine-tuned using reinforcement learning on self-generated Chain-of-Thought (CoT) rationales. The approach incentivizes models to refine their reasoning processes based on environmental rewards (a simplified training-loop sketch follows the list below). Key findings:
- Improved Exploration: RLFT increased action coverage by 12-13% in bandit tasks, reducing regret.
- Mitigated Frequency Bias: The fraction of frequent-but-suboptimal actions dropped from 70% to 35%.
- Narrowed Knowing-Doing Gap: Models better aligned their actions with their reasoning.
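As a rough intuition for what the RL objective is doing, the toy below runs a REINFORCE-style update on a bandit. It is only an analogue, not the paper’s method: in the study the policy is the LLM itself, the sampled “action” is a whole CoT rationale plus an action token, and the fine-tuning objective is more involved. The arm probabilities, learning rate, and baseline here are assumptions.

```python
import math
import random

# Toy REINFORCE analogue of RLFT on a 5-armed bandit. In the paper the policy
# is the LLM: it samples a CoT rationale plus an action and is updated from the
# environment's reward. Here a softmax over five logits stands in for it.
random.seed(0)
probs = [0.2, 0.5, 0.3, 0.8, 0.4]   # assumed arm reward probabilities
logits = [0.0] * len(probs)         # stand-in "policy parameters"
lr, baseline = 0.1, 0.0

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(2000):
    pi = softmax(logits)
    # Sample an action (analogous to sampling a rationale + action from the LLM).
    r, cum, action = random.random(), 0.0, len(pi) - 1
    for a, p in enumerate(pi):
        cum += p
        if r <= cum:
            action = a
            break
    reward = 1.0 if random.random() < probs[action] else 0.0
    baseline += 0.01 * (reward - baseline)   # running-average reward baseline
    advantage = reward - baseline
    # Policy-gradient step: grad of log pi(action) w.r.t. logits is one-hot - pi.
    for a in range(len(logits)):
        grad = (1.0 if a == action else 0.0) - pi[a]
        logits[a] += lr * advantage * grad

print("learned action distribution:", [round(p, 2) for p in softmax(logits)])
```

With enough steps the probability mass concentrates on the best arm; in RLFT the same reward signal instead shifts which rationales and actions the LLM generates.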
Beyond RLFT: Enhancing Exploration
While RLFT helps, classic exploration techniques like ε-greedy and exploration bonuses (+1 reward for untried actions) further boost performance. The ‘try-all’ strategy—forcing initial exploration of all actions—nearly closed the gap to optimal UCB performance, suggesting LLMs can act near-optimally if given sufficient information.
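For reference, these exploration mechanisms have simple classical forms. The sketch below shows bandit-level versions of ε-greedy, an untried-action bonus, and try-all as action-selection rules; in the study they are applied to the LLM agent (the +1 bonus, for example, is described as a reward for untried actions), so treat this as an illustration of the ideas rather than the paper’s implementation:

```python
import random

def epsilon_greedy(values, counts, epsilon=0.1):
    # With probability epsilon pick a random action, otherwise exploit.
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def untried_action_bonus(values, counts, bonus=1.0):
    # Add a bonus to every action that has not been tried yet.
    shaped = [v + (bonus if c == 0 else 0.0) for v, c in zip(values, counts)]
    return max(range(len(shaped)), key=lambda a: shaped[a])

def try_all_then_greedy(values, counts):
    # Force one trial of every action before falling back to greedy choice.
    for a, c in enumerate(counts):
        if c == 0:
            return a
    return max(range(len(values)), key=lambda a: values[a])

values, counts = [0.2, 0.0, 0.7], [3, 0, 5]   # toy value estimates / pull counts
print(epsilon_greedy(values, counts),
      untried_action_bonus(values, counts),
      try_all_then_greedy(values, counts))
```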
Practical Implications
The study underscores that:
- CoT is Critical: RLFT without CoT performed poorly, highlighting reasoning as a key exploration mechanism.
- Expert Data Helps: Supervised fine-tuning on expert trajectories (e.g., UCB rollouts) achieved near-optimal performance, but RLFT offers a scalable alternative when such data is unavailable (a sketch of generating such rollouts follows this list).
- Thinking Time Matters: Allowing more tokens for reasoning improved results, but at higher computational cost.
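To illustrate what “expert trajectories” might look like in the bandit setting, here is a hypothetical sketch that rolls out a UCB policy and records prompt/target pairs for supervised fine-tuning. The prompt format, horizon, and dataset size are invented for illustration, not taken from the paper:

```python
import math
import random

random.seed(1)
probs = [random.uniform(0.1, 0.9) for _ in range(5)]   # assumed 5-armed bandit

def ucb_action(t, counts, values, c=2.0):
    # Standard UCB1 rule used as the "expert" policy.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(values)),
               key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))

def expert_rollout(horizon=20):
    # Roll out UCB and record (history -> action) pairs as SFT examples.
    counts, values = [0] * len(probs), [0.0] * len(probs)
    history, examples = [], []
    for t in range(1, horizon + 1):
        action = ucb_action(t, counts, values)
        prompt = "History: " + "; ".join(history) + ". Which arm should you pull next?"
        examples.append({"prompt": prompt, "target": f"arm {action}"})
        reward = 1.0 if random.random() < probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        history.append(f"arm {action} -> reward {reward:.0f}")
    return examples

dataset = [ex for _ in range(100) for ex in expert_rollout()]
print(len(dataset), "SFT examples; e.g.", dataset[3])
```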
The Bottom Line
RLFT is a promising direction for building more capable LLM agents, particularly in scenarios requiring exploration and strategic decision-making. The findings also highlight the need for better exploration mechanisms tailored to LLMs, as their default behavior often leads to suboptimal, greedy outcomes.