
ComPO: A New Approach to Aligning AI with Human Preferences Using Noisy Data

Large language models (LLMs) have become indispensable tools across many sectors, but aligning them with human preferences remains a critical challenge. Traditional methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often suffer from verbosity and likelihood displacement: models generate overly long responses, or probability mass unintentionally shifts away from preferred responses toward undesired outputs.

A new paper titled "ComPO: Preference Alignment via Comparison Oracles" introduces a novel approach to address these limitations. The researchers propose a method that leverages comparison oracles to extract meaningful signals from noisy preference pairs—data where the preferred and dispreferred responses are similar in likelihood. This is a common issue in existing datasets, where weak or ambiguous preference signals can lead to suboptimal alignment.
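To make "noisy preference pairs" concrete, here is a minimal sketch (not taken from the paper) that flags pairs whose chosen and rejected responses receive nearly equal log-likelihood under the current policy. The data layout, helper names, and margin threshold are illustrative assumptions rather than ComPO's actual criterion.

```python
# Sketch: flag "noisy" preference pairs whose chosen and rejected responses
# have nearly equal log-likelihood under the current policy.
# The margin threshold and data layout are illustrative assumptions,
# not taken from the ComPO paper.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    logp_chosen: float    # log-likelihood of the preferred response
    logp_rejected: float  # log-likelihood of the dispreferred response

def is_noisy(pair: PreferencePair, margin: float = 0.5) -> bool:
    """A pair is 'noisy' when the policy assigns similar likelihood to both responses."""
    return abs(pair.logp_chosen - pair.logp_rejected) < margin

def split_pairs(pairs):
    """Separate clear preference pairs from noisy ones so each can be handled differently."""
    clear = [p for p in pairs if not is_noisy(p)]
    noisy = [p for p in pairs if is_noisy(p)]
    return clear, noisy

# Example usage with made-up numbers:
pairs = [
    PreferencePair("Q1", "A", "B", logp_chosen=-12.3, logp_rejected=-20.1),  # clear signal
    PreferencePair("Q2", "A", "B", logp_chosen=-15.0, logp_rejected=-15.2),  # ambiguous / noisy
]
clear, noisy = split_pairs(pairs)
print(len(clear), len(noisy))  # -> 1 1
```

One natural use, consistent with the post's description of "ComPO-augmented" models, would be to route the clear pairs through a standard objective such as DPO and reserve the oracle-guided updates for the noisy ones.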

Key Contributions

  1. Comparison Oracle-Based Alignment: ComPO treats preference pairs as oracle outputs, guiding model updates without relying on an explicit proxy objective. This allows the method to effectively utilize even noisy data, which often contains valuable but overlooked information.
  2. Convergence Guarantees: The authors provide theoretical guarantees for their method under non-convex, smooth settings, ensuring robust performance.
  3. Practical Enhancements: To make the method scalable, the team introduces techniques like gradient clipping and output-layer weight perturbations, reducing computational overhead while maintaining effectiveness (a rough sketch of these ideas follows this list).
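To give a feel for how contributions 1 and 3 fit together, the sketch below shows a generic comparison-oracle-style update: perturb only the output-layer weights, ask an oracle which perturbation is preferred, step toward the winner, and clip the step norm. It is a toy illustration under assumed names and hyperparameters, not the paper's exact algorithm.

```python
# Schematic sketch (not the paper's exact algorithm) of a comparison-oracle-style
# update: perturb only the output-layer weights, ask an oracle which perturbed
# model is preferred, step in the winning direction, and clip the step norm.
# All names, shapes, and hyperparameters here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def comparison_oracle(score_plus: float, score_minus: float) -> int:
    """Return +1 if the '+' perturbation is preferred, -1 otherwise.
    In practice this signal would come from preference data, not a scalar score."""
    return 1 if score_plus >= score_minus else -1

def compo_style_step(W_out, score_fn, sigma=0.01, lr=0.1, clip_norm=1.0):
    """One oracle-guided update on output-layer weights W_out."""
    u = rng.standard_normal(W_out.shape)          # random perturbation direction
    sign = comparison_oracle(score_fn(W_out + sigma * u),
                             score_fn(W_out - sigma * u))
    step = lr * sign * u                          # move toward the preferred perturbation
    norm = np.linalg.norm(step)
    if norm > clip_norm:                          # clip the step norm for stability
        step *= clip_norm / norm
    return W_out + step

# Toy usage: "score" is a stand-in for a preference signal on a 2x3 output layer.
W = np.zeros((2, 3))
target = np.ones((2, 3))
score = lambda w: -np.sum((w - target) ** 2)      # higher is better (illustrative only)
for _ in range(200):
    W = compo_style_step(W, score)
print(np.round(W, 2))  # drifts toward the target under oracle feedback alone
```

Because each update needs only a preference judgment between two perturbed models rather than an explicit proxy objective, even weak or ambiguous pairs can contribute a usable signal, which is the intuition behind the oracle-based framing.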

Experimental Results

ComPO was tested across multiple models (Mistral-7B, Llama-3-8B, and Gemma-2-9B) and benchmarks (AlpacaEval 2, MT-Bench, and Arena-Hard). The results show significant improvements in length-controlled win rates (a metric that discounts wins earned merely through longer responses) and mitigation of likelihood displacement. ComPO-augmented models consistently outperformed baseline DPO and SimPO variants, particularly in generating concise, high-quality responses.

Why This Matters

ComPO’s ability to harness noisy preference data complements recent findings by Razin et al. (2025), who highlighted the importance of filtering similar preference pairs. However, ComPO goes a step further by actively leveraging this data to improve alignment, offering a more flexible and efficient solution.

Future Directions

The authors suggest extending ComPO to other alignment tasks, such as safety and reasoning, and exploring its potential in diffusion model alignment. The method’s lightweight fine-tuning approach also opens doors for broader applications in resource-constrained environments.

For more details, check out the full paper on arXiv.