ALE-Bench: A New Benchmark for Evaluating AI’s Long-Horizon Algorithm Engineering Skills
ALE-Bench: Pushing AI to Solve Real-World Optimization Problems
In the rapidly evolving field of AI, benchmarks that once seemed challenging quickly become obsolete as models reach near-human performance. To keep pace, researchers from Sakana AI, the University of Tokyo, and AtCoder have introduced ALE-Bench, a new benchmark that evaluates AI systems on long-horizon, score-based algorithmic programming contests. Unlike traditional coding benchmarks built around short pass/fail tasks, ALE-Bench targets computationally hard optimization problems, such as package-delivery routing, crew scheduling, and power-grid balancing, for which no efficient exact algorithm is known: solutions can only be scored and improved, not simply marked right or wrong.
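To make the setting concrete, here is a toy sketch of the kind of score-based heuristic these contests reward: a simulated-annealing loop on a small, made-up delivery-routing instance. The instance, scoring rule, and parameters are illustrative only and are not taken from any ALE-Bench task; the point is that the score keeps improving with more search rather than flipping from "wrong" to "correct".

```python
# Toy illustration of a score-based optimization task: simulated annealing
# on a random delivery-routing instance. Not an actual ALE-Bench problem.
import math
import random

random.seed(0)

# Depot at the origin plus 30 random delivery points.
points = [(0.0, 0.0)] + [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(30)]

def tour_length(order):
    """Total length of the closed tour: depot -> deliveries in `order` -> depot."""
    route = [0] + list(order) + [0]
    return sum(math.dist(points[a], points[b]) for a, b in zip(route, route[1:]))

order = list(range(1, len(points)))   # initial visiting order
best = current = tour_length(order)
temperature = 50.0

for step in range(50_000):
    # Propose a small change: reverse a random segment (a 2-opt move).
    i, j = sorted(random.sample(range(len(order)), 2))
    candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
    cand_len = tour_length(candidate)
    # Always accept improvements; accept worsenings with a temperature-dependent probability.
    if cand_len < current or random.random() < math.exp((current - cand_len) / temperature):
        order, current = candidate, cand_len
        best = min(best, current)
    temperature *= 0.9999  # geometric cooling

print(f"tour length improved to {best:.1f}")  # lower is better; more search keeps nudging the score
```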
Why ALE-Bench Matters
ALE-Bench draws from real tasks in the AtCoder Heuristic Contest (AHC) series, where human participants spend days or even weeks iteratively refining solutions to push scores higher. These problems mirror real-world industrial challenges, making the benchmark a practical measure of AI’s ability to assist in complex decision-making. It also probes advanced reasoning capabilities, testing whether AI can sustain the kind of long-horizon, trial-and-error problem-solving that humans excel at.
Key Features of ALE-Bench
- Real-World Relevance: Tasks are drawn from AHC, ensuring they reflect genuine optimization challenges.
- Interactive Framework: Supports iterative refinement with test-run feedback and visualizations, mimicking human contestant workflows (a minimal version of this loop is sketched after this list).
- Open-Ended Evaluation: Since true optima are often out of reach, the benchmark allows for continuous improvement, even after surpassing human performance.
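The sketch below shows the propose-test-revise loop that the interactive framework is built around. The helper names (`generate_solution`, `run_public_tests`) and the feedback format are hypothetical stand-ins, not ALE-Bench’s actual API; consult the GitHub repository for the real interface.

```python
# Minimal, self-contained sketch of an iterative-refinement loop.
# All helpers here are placeholders for the model call and the test harness.
from dataclasses import dataclass

@dataclass
class TestRunResult:
    score: float   # score on the public test cases (higher is better in this sketch)
    feedback: str  # e.g. per-case scores, runtimes, or visualizer output

def generate_solution(problem_statement: str, history: list) -> str:
    # Stand-in for an LLM call that writes or revises contest code using past feedback.
    return "# candidate solution source code"

def run_public_tests(code: str) -> TestRunResult:
    # Stand-in for compiling the candidate and scoring it on public test inputs.
    return TestRunResult(score=0.0, feedback="stub feedback")

def refine(problem_statement: str, budget: int = 10) -> str:
    """Run `budget` propose-test-revise rounds and keep the best-scoring submission."""
    history: list[tuple[str, TestRunResult]] = []
    best_code, best_score = "", float("-inf")
    for _ in range(budget):
        code = generate_solution(problem_statement, history)   # propose or revise a solution
        result = run_public_tests(code)                         # score it and gather feedback
        history.append((code, result))                          # the model sees this next round
        if result.score > best_score:
            best_code, best_score = code, result.score
    return best_code

if __name__ == "__main__":
    print(refine("toy problem statement", budget=3))
```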
How Do Current AI Models Fare?
The researchers evaluated frontier LLMs such as GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet in both one-shot and iterative-refinement settings. While these models perform strongly on specific problems, they still lag behind human experts in consistency across problem categories and in sustained, long-term improvement. For example, in the iterative-refinement setting, OpenAI’s o4-mini-high achieved an average performance score of 1520 (placing it in the top 11.8% of human contestants), but its performance distribution revealed gaps in handling diverse problem categories.
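As a rough illustration of what a “top X%” placement means, the sketch below compares a model’s performance score against a pool of human contestant scores. The human scores here are made up for illustration; the paper’s percentiles are derived from real AtCoder contestant data.

```python
# Illustrative only: place a model's performance score within a sample of human scores.
def top_percentile(model_score: float, human_scores: list[float]) -> float:
    """Percentage of human contestants whose score is at or above the model's."""
    at_or_above = sum(1 for s in human_scores if s >= model_score)
    return 100.0 * at_or_above / len(human_scores)

human_scores = [400, 800, 950, 1100, 1250, 1400, 1500, 1600, 1800, 2100]  # made-up sample
print(f"top {top_percentile(1520, human_scores):.1f}% of this sample")     # -> top 30.0%
```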
The Road Ahead
ALE-Bench highlights the need for AI systems to develop more robust, generalizable problem-solving strategies. The benchmark’s open-ended nature ensures it will remain relevant as AI capabilities grow, providing a yardstick for measuring progress in algorithmic engineering and long-horizon reasoning.
For more details, check out the ALE-Bench GitHub repo and dataset on Hugging Face.