
DPO vs. GRPO: A Deep Dive into Reinforcement Learning for Autoregressive Image Generation

The world of AI-powered image generation is evolving rapidly, and reinforcement learning (RL) is playing an increasingly pivotal role in shaping how models create visuals. A new study from researchers at CUHK, Shanghai AI Lab, and Peking University dives deep into two leading RL algorithms—Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO)—to uncover their strengths, weaknesses, and the nuances of applying them to autoregressive image generation.

The Battle of RL Algorithms: DPO vs. GRPO

At the heart of this research is a comparison between DPO and GRPO, two RL techniques that have shown promise in fine-tuning large language models (LLMs) for tasks requiring Chain-of-Thought (CoT) reasoning. But how do they fare when applied to image generation, which also involves a sequential, step-by-step process?

  • DPO: Known for its efficiency, DPO optimizes preferences directly without needing a separate reward model during training. It is faster and computationally cheaper, but it can struggle with complex reasoning tasks because it relies on static, pre-collected preference data.
  • GRPO: An on-policy method that iteratively refines the policy using self-generated data. It excels at intricate tasks but comes with higher computational costs and longer training times. (A minimal sketch of both objectives follows this list.)
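
To make the contrast concrete, the sketch below shows the two objectives in PyTorch. The names and values (logp_chosen, group_rewards, a beta of 0.1) are illustrative assumptions rather than the paper's actual code: DPO maximizes the margin between the implicit rewards of a preferred and a dispreferred image, while GRPO normalizes rewards within a group of samples for the same prompt instead of learning a value baseline.

```python
# Minimal sketch of the two objectives; tensor shapes and names are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization on a (chosen, rejected) image pair.

    Each argument is the summed token log-probability of the image sequence
    under the policy or the frozen reference model.
    """
    # Implicit reward: beta * log(pi / pi_ref) for each sample in the pair.
    chosen_rewards = beta * (logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between preferred and dispreferred generations.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def grpo_advantages(group_rewards):
    """Group Relative Policy Optimization: normalize rewards within the group
    of images sampled for the same prompt, so no learned value baseline is needed."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)
```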

The study reveals that DPO outperforms GRPO in in-domain evaluations, achieving an average 11.53% improvement on the T2I-CompBench benchmark. However, GRPO shines in out-of-domain generalization, consistently delivering better results on the GenEval dataset, which tests robustness with simpler, templated prompts.

The Role of Reward Models

One of the most intriguing findings is how the choice of reward model impacts performance. The researchers tested several reward models, including:

  • Human Preference Models (HPS, ImageReward): Trained on human-annotated rankings to assess aesthetic appeal and text-image alignment.
  • Visual Question Answering Models (UnifiedReward): Leverage multimodal LLMs to evaluate images based on detailed reasoning.
  • Metric Rewards: Domain-specific evaluation tools tailored to specific attributes like color or spatial relationships.

The results show that DPO is more sensitive to reward model variations than GRPO, with its generalization performance fluctuating more dramatically based on the reward model used. Crucially, the study found that a reward model with strong intrinsic generalization capabilities can enhance the generalization potential of the RL algorithm itself—a key insight for future model development.
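
In practice, any of these reward models can be dropped into the same training loop, which is what makes this comparison possible. The sketch below assumes a generic score(prompt, image) wrapper standing in for HPS, ImageReward, or a VQA-style scorer; it is a hypothetical interface for illustration, not an actual API from those projects.

```python
# Hypothetical interface showing how a swappable reward model feeds either
# algorithm; score(prompt, image) is an assumed wrapper, not a real library call.
from typing import Callable, List

def build_dpo_pair(prompt: str, images: List, score: Callable):
    """Pick the best- and worst-scored images as a (chosen, rejected) pair for DPO."""
    ranked = sorted(images, key=lambda img: score(prompt, img), reverse=True)
    return ranked[0], ranked[-1]

def build_grpo_rewards(prompt: str, images: List, score: Callable):
    """Score every image in the sampled group; GRPO later normalizes these
    rewards within the group to form advantages."""
    return [score(prompt, img) for img in images]
```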

Scaling Strategies: What Works Best?

The researchers also explored three scaling strategies to optimize performance:

  1. Scaling Sampled Images per Prompt: Increasing the number of images generated per prompt (for DPO) or the group size (for GRPO) can improve in-domain performance, but excessive scaling risks overfitting.
  2. Scaling In-Domain Training Data: Expanding the diversity and volume of training data helps both algorithms, but GRPO benefits more from moderate scaling, while DPO sees consistent gains.
  3. Iterative Training: DPO’s in-domain performance improves with iterative training, but its generalization can degrade after multiple cycles. GRPO, meanwhile, maintains steadier out-of-domain performance. (A sketch of these three knobs follows this list.)
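
The sketch below gathers the three scaling knobs into one hypothetical configuration and training skeleton. The field names, default values, and the policy.sample / update_step hooks are assumptions for illustration, not the settings used in the paper.

```python
# Illustrative configuration for the three scaling axes discussed above;
# names and defaults are assumptions for the sketch.
from dataclasses import dataclass

@dataclass
class ScalingConfig:
    images_per_prompt: int = 8           # DPO candidates / GRPO group size
    num_training_prompts: int = 10_000   # in-domain data volume
    num_iterations: int = 2              # rounds of sample -> score -> update

def iterative_training(policy, prompts, score, update_step, cfg: ScalingConfig):
    """Skeleton of iterative RL fine-tuning: each round samples fresh images
    with the current policy, scores them, and runs one optimization pass."""
    for _ in range(cfg.num_iterations):
        for prompt in prompts[: cfg.num_training_prompts]:
            images = [policy.sample(prompt) for _ in range(cfg.images_per_prompt)]
            rewards = [score(prompt, img) for img in images]
            # update_step applies either a DPO pair update or a GRPO group update.
            update_step(policy, prompt, images, rewards)
    return policy
```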

Key Takeaways

  • DPO is the go-to for in-domain tasks, offering strong performance with lower computational overhead.
  • GRPO is better for generalization, making it ideal for applications where adaptability to diverse prompts is critical.
  • Reward model choice matters, especially for DPO, and investing in high-quality, generalizable reward models can pay dividends.
  • Scaling requires balance: More data and samples help, but too much can lead to diminishing returns or overfitting.

Why This Matters

Autoregressive image generation is increasingly being framed as a CoT reasoning problem, where each step in the generation process builds on the last. Understanding how RL algorithms like DPO and GRPO perform in this context—and how factors like reward models and scaling strategies influence outcomes—is crucial for advancing the field. This study not only provides a comprehensive comparison but also offers practical insights for developers looking to fine-tune their models for specific use cases.

For those eager to dive deeper, the full paper is available on arXiv, and the code has been released on GitHub.