LLMs Still Struggle with Structured Outputs: New Benchmark Reveals Performance Gaps

Large Language Models (LLMs) have become indispensable tools in software development, but their ability to generate precise structured outputs remains inconsistent. A new benchmark called StructEval reveals significant performance gaps, even among state-of-the-art models like GPT-4o and Gemini.

The Challenge of Structured Outputs

Structured outputs—such as JSON, YAML, HTML, and React code—are essential for real-world applications. Yet, generating them correctly requires LLMs to adhere to strict syntax rules while preserving semantic intent. StructEval, introduced by researchers from the University of Waterloo and Vector Institute, evaluates LLMs across 18 formats and 44 task types, including:

  • Generation tasks (natural language → structured output)
  • Conversion tasks (structured format → another structured format)

The benchmark introduces novel metrics for format adherence and structural correctness, going beyond traditional semantic evaluation.
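The paper's exact metrics aren't reproduced here, but the core idea of format adherence can be sketched as a parse-based check: does the model's output even parse as the requested format? A minimal illustration (the format names and validator choices are assumptions, not StructEval's actual implementation):

```python
import json
import xml.etree.ElementTree as ET


def check_format_adherence(text: str, fmt: str) -> bool:
    """Return True if `text` parses as the requested format.

    Illustrative only: a real benchmark would also score structural
    correctness (keys, nesting, rendering), not just parseability.
    """
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"no validator for format: {fmt}")
        return True
    except (json.JSONDecodeError, ET.ParseError):
        return False
```

For example, `check_format_adherence('{"a": 1}', "json")` passes while `check_format_adherence('{a: 1}', "json")` fails, even though the two strings differ by only two characters of quoting; that strictness is what makes structured output hard.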

Key Findings

  1. Commercial models outperform open-source alternatives—but not by as much as you’d think. GPT-4o leads with an average score of 76.02%, while the best open-source model, Qwen3-4B, trails at 67.04%.
  2. Generation tasks are harder than conversions. Models struggle more when creating structured outputs from scratch rather than translating between formats.
  3. Visual rendering is tougher than text-only structures. Formats like SVG, Mermaid, and TikZ proved particularly challenging, with most models scoring below 50%.
  4. Some tasks are already ‘solved.’ JSON, HTML, and Markdown generation see near-perfect scores, suggesting LLMs excel at common formats.

Why This Matters

Structured outputs are critical for:

  • API integrations (JSON, XML)
  • Configuration files (YAML, TOML)
  • UI development (HTML, React, Vue)
  • Data visualization (SVG, Matplotlib)

Yet, even GPT-4o struggles with niche formats like TOML and Mermaid, highlighting room for improvement.

The Takeaway

While LLMs are getting better at structured generation, developers should still verify outputs—especially for less common formats. The full paper, available on arXiv, provides deeper insights into model capabilities and limitations.
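In practice, "verify outputs" can be as lightweight as a parse-and-retry loop around the model call. A minimal sketch, where `call_llm` is a hypothetical stand-in for whatever client library you use:

```python
import json


def generate_json(call_llm, prompt: str, max_retries: int = 3) -> dict:
    """Ask an LLM for JSON and re-prompt until the reply parses.

    `call_llm` is a hypothetical callable (str -> str) wrapping your
    actual model client; it is not part of any specific library.
    """
    feedback = ""
    for _ in range(max_retries):
        raw = call_llm(prompt + feedback)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Feed the parse error back so the model can self-correct.
            feedback = (
                f"\nYour previous output was invalid JSON ({exc}). "
                "Respond with valid JSON only."
            )
    raise ValueError("model never produced valid JSON")
```

The same pattern extends to other formats by swapping the parser, which matters most for exactly the niche formats (TOML, Mermaid, SVG) where the benchmark shows models failing.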

Want to test your own LLM? Check out the StructEval GitHub.