How Reinforcement Learning Can Fix LLMs’ Greedy Decision-Making

Large Language Models (LLMs) have shown remarkable capabilities in text generation and reasoning tasks, but their performance in decision-making scenarios often falls short. A recent study from researchers at Google DeepMind and JKU Linz systematically examines why LLMs struggle with exploration and decision-making, identifying three key failure modes: greediness, frequency bias, and the knowing-doing gap. The paper, titled LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities, proposes Reinforcement Learning Fine-Tuning (RLFT) as a solution to improve LLMs’ exploration and decision-making abilities.

The Problem: Why LLMs Make Bad Decisions

When deployed as agents, LLMs often exhibit suboptimal behavior in interactive environments. The study highlights three core issues:

  1. Greediness: LLMs prematurely commit to high-reward actions and stop exploring. In a 10-armed bandit problem, even a 27B-parameter model tried only ~65% of the available actions, leaving much of the action space untouched (see the greedy-vs-UCB sketch after this list).
  2. Frequency Bias: Smaller models (e.g., 2B parameters) tend to mimic the most frequent actions in their context, regardless of reward. Larger models reduce this bias but remain prone to greedy exploitation.
  3. Knowing-Doing Gap: LLMs can describe the optimal strategy (e.g., the Upper Confidence Bound algorithm) but fail to act on it. In experiments, models computed correct UCB values 87% of the time, yet still chose the greedy action 64% of the time, acting against what they ‘knew’.
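To make the bandit setting concrete, here is a minimal, self-contained sketch (not the paper’s code) contrasting greedy action selection with UCB on a 10-armed Gaussian bandit; the reward distributions, horizon, and exploration constant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, horizon = 10, 500
true_means = rng.normal(0.0, 1.0, n_arms)        # illustrative Gaussian bandit

def ucb_action(counts, values, t, c=2.0):
    """Upper Confidence Bound: favour arms that look good OR are under-explored."""
    if 0 in counts:
        return int(np.argmin(counts))             # try every arm at least once
    bonus = c * np.sqrt(np.log(t + 1) / counts)
    return int(np.argmax(values + bonus))

def greedy_action(counts, values):
    """Greedy: always exploit the current best estimate (the failure mode)."""
    return int(np.argmax(values))

def run(select):
    counts, values, total = np.zeros(n_arms), np.zeros(n_arms), 0.0
    for t in range(horizon):
        a = select(counts, values, t)
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean estimate
        total += r
    return total, int((counts > 0).sum())

for name, policy in [("greedy", lambda c, v, t: greedy_action(c, v)), ("UCB", ucb_action)]:
    reward, coverage = run(policy)
    print(f"{name}: total reward={reward:.1f}, arms tried={coverage}/{n_arms}")
```

Running this typically shows the greedy policy locking onto the first arm whose estimate looks positive while UCB covers all ten arms, the same qualitative gap the paper measures in LLMs.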

The Fix: Reinforcement Learning Fine-Tuning (RLFT)

The researchers propose RLFT, in which the LLM is fine-tuned with reinforcement learning on its own self-generated Chain-of-Thought (CoT) rationales, so that environmental rewards shape the model’s reasoning process (a stripped-down sketch of the update follows the findings below). Key findings:

  • Improved Exploration: RLFT increased action coverage by 12-13% in bandit tasks, reducing regret.
  • Mitigated Frequency Bias: The fraction of frequent-but-suboptimal actions dropped from 70% to 35%.
  • Narrowed Knowing-Doing Gap: Models better aligned their actions with their reasoning.
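The paper fine-tunes the LLM with an on-policy RL objective over its self-generated CoT tokens. The sketch below strips that down to the core idea, a reward-weighted log-likelihood (REINFORCE-style) update with a running baseline, applied to a toy categorical policy instead of an LLM; the rationale-generation machinery is deliberately abstracted away, so treat this as an illustration of the update rule, not the paper’s implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for the policy being fine-tuned: a learnable categorical
# distribution over 10 bandit arms. In RLFT the policy is the LLM and the
# sampled "action" is the whole CoT rationale plus the final chosen arm;
# here generation is abstracted away so the update itself stays visible.
n_arms = 10
logits = nn.Parameter(torch.zeros(n_arms))
optimizer = torch.optim.Adam([logits], lr=0.05)

true_means = torch.randn(n_arms)                 # illustrative environment
baseline = 0.0                                   # running reward baseline

for step in range(2000):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = torch.normal(true_means[action], 1.0).item()

    # Reward-weighted update: raise the log-probability of the sampled
    # action (in RLFT: the sampled CoT + action tokens) in proportion to
    # how much better than the baseline the environment reward was.
    advantage = reward - baseline
    loss = -dist.log_prob(action) * advantage

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline = 0.99 * baseline + 0.01 * reward

print("learned action preferences:", torch.softmax(logits.detach(), dim=0))
```

The same environmental reward signal that drives this toy update is what pushes the fine-tuned LLM away from purely greedy, frequency-driven choices.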

Beyond RLFT: Enhancing Exploration

While RLFT helps, classic exploration techniques such as ε-greedy and exploration bonuses (+1 reward for untried actions) further boost performance. The ‘try-all’ strategy, which forces an initial pull of every action, nearly closed the gap to optimal UCB performance, suggesting LLMs can act near-optimally once they have gathered sufficient information. Hedged sketches of these mechanisms follow below.
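For concreteness, here are sketches of those three mechanisms, written to plug into the `run` harness from the earlier bandit example. Note that in the paper the exploration bonus is a reward-shaping term applied during RLFT, whereas here it is rendered as a simple value bonus for untried arms.

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms = 10

def epsilon_greedy(counts, values, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmax(values))

def untried_bonus(counts, values, bonus=1.0):
    """Boost arms that have never been tried (stand-in for the +1 reward bonus)."""
    return int(np.argmax(values + bonus * (counts == 0)))

def try_all_then_greedy(counts, values):
    """'Try-all': force one pull of every arm before exploiting."""
    untried = np.flatnonzero(counts == 0)
    return int(untried[0]) if untried.size > 0 else int(np.argmax(values))
```

Each drops into the earlier harness via a small adapter, e.g. `run(lambda c, v, t: try_all_then_greedy(c, v))`.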

Practical Implications

The study underscores that:

  1. CoT is Critical: RLFT without CoT performed poorly, highlighting reasoning as a key exploration mechanism.
  2. Expert Data Helps: Supervised fine-tuning on expert trajectories (e.g., UCB rollouts; a sketch of collecting such data follows this list) achieved near-optimal performance, but RLFT offers a scalable alternative when such data is unavailable.
  3. Thinking Time Matters: Allowing more tokens for reasoning improved results, but at higher computational cost.
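As a sketch of what ‘expert data’ means here, the snippet below rolls out a UCB policy on a toy bandit and records (interaction history, expert action) pairs of the kind that could be serialized into text prompts for supervised fine-tuning; the data format is an illustrative assumption, not the paper’s exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(2)
n_arms, horizon = 10, 100

def collect_ucb_trajectory(c=2.0):
    """Roll out UCB once and record (history, expert action) pairs for SFT."""
    true_means = rng.normal(0.0, 1.0, n_arms)
    counts, values = np.zeros(n_arms), np.zeros(n_arms)
    history, dataset = [], []
    for t in range(horizon):
        if 0 in counts:                           # pull each arm once first
            a = int(np.argmin(counts))
        else:
            a = int(np.argmax(values + c * np.sqrt(np.log(t + 1) / counts)))
        dataset.append((list(history), a))        # supervised (input, target) pair
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]
        history.append((a, round(float(r), 2)))
    return dataset

pairs = collect_ucb_trajectory()
print(f"collected {len(pairs)} (history, expert action) pairs")
```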

The Bottom Line

RLFT is a promising direction for building more capable LLM agents, particularly in scenarios requiring exploration and strategic decision-making. The findings also highlight the need for better exploration mechanisms tailored to LLMs, as their default behavior often leads to suboptimal, greedy outcomes.