
Text-Aware Image Restoration: How Diffusion Models Are Learning to Read and Reconstruct Degraded Text

Image restoration has long been a cornerstone of computer vision, aiming to recover high-quality images from degraded inputs. While recent advances in diffusion models have revolutionized natural image restoration, one critical challenge remains largely unaddressed: faithfully reconstructing textual content in degraded images. A new arXiv paper titled "Text-Aware Image Restoration with Diffusion Models" introduces a breakthrough approach to this problem, proposing a method that not only restores visual quality but also preserves textual fidelity, a task the authors dub Text-Aware Image Restoration (TAIR).

The Problem: Text-Image Hallucination

Traditional diffusion-based restoration models excel at generating plausible textures but often fail to accurately reconstruct text regions. Instead, they produce text-image hallucination—synthesizing text-like patterns that are visually coherent but incorrect. This is particularly problematic for applications like document digitization, street sign understanding, or AR navigation, where even minor text distortions can lead to significant information loss.

Introducing TAIR and SA-Text

The paper introduces TAIR, a novel task that explicitly requires the simultaneous recovery of visual content and textual fidelity. To tackle this, the authors present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. SA-Text is curated using a scalable pipeline that leverages vision-language models (VLMs) to validate and filter text regions, ensuring high-quality annotations.
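The paper does not spell out the pipeline code, but the VLM-based validation step can be sketched roughly as below. Everything here is an illustrative assumption rather than the authors' implementation: the TextRegion structure, the query_vlm placeholder, and the simple agreement check stand in for whatever detector, VLM, and filtering criterion the curation pipeline actually uses.

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    bbox: tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
    transcription: str                # text read by an off-the-shelf spotter

def query_vlm(image_path: str, region: TextRegion) -> str:
    """Placeholder for a vision-language model call that reads the cropped
    region; a real pipeline would plug an actual VLM inference call in here."""
    raise NotImplementedError

def validate_regions(image_path: str, regions: list[TextRegion]) -> list[TextRegion]:
    """Keep only regions where the VLM's reading agrees with the detector's,
    dropping illegible or ambiguous text instances from the dataset."""
    kept = []
    for region in regions:
        vlm_reading = query_vlm(image_path, region)
        if vlm_reading.strip().lower() == region.transcription.strip().lower():
            kept.append(region)
    return kept
```

The point of such a filter is scalability: automatic detectors propose candidate text regions at volume, and the VLM acts as a cheap second opinion that weeds out unreadable or mislabeled instances before they enter the benchmark.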

TeReDiff: A Multi-Task Diffusion Framework

The core innovation is TeReDiff, a multi-task diffusion framework that integrates internal features from diffusion models into a text-spotting module. During training, the model learns to extract rich text representations, which are then used as prompts in subsequent denoising steps. This joint training allows the model to benefit from both generative priors and explicit text supervision, significantly improving text recognition accuracy in restored images.
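The released code is not reproduced here, but the feedback loop the authors describe can be sketched as a sampling routine like the one below. Every name in this snippet (unet, spotter, text_encoder, scheduler, the return_features flag) is a hypothetical stand-in, not the paper's API; the sketch only illustrates the mechanism: read text from the denoiser's internal features, encode it, and use it as the prompt for the next denoising step. During training, the same spotting head would additionally receive explicit text supervision alongside the usual denoising loss.

```python
import torch

@torch.no_grad()
def restore(unet, spotter, text_encoder, scheduler, degraded_img):
    """Hypothetical TAIR-style sampling loop: text spotted from the U-Net's
    internal features at each step is fed back as the prompt for the next step,
    steering the generative prior with the recognized text."""
    x = torch.randn_like(degraded_img)           # start from pure noise
    prompt_emb = text_encoder([""])              # empty prompt at the first step
    for t in scheduler.timesteps:                # e.g. T-1, T-2, ..., 0
        # Denoise conditioned on the degraded image and the current text prompt,
        # exposing the intermediate features that the spotting head reads.
        pred_noise, feats = unet(x, t, cond=degraded_img, prompt=prompt_emb,
                                 return_features=True)
        x = scheduler.step(pred_noise, t, x)     # standard reverse-diffusion update
        spotted_text = spotter(feats)            # detect and recognize text instances
        prompt_emb = text_encoder(spotted_text)  # condition the next step on them
    return x
```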

Key Results

  • Quantitative Gains: TeReDiff outperforms state-of-the-art restoration methods across multiple metrics, achieving higher text recognition accuracy (F1-scores) along with improved image quality (PSNR, SSIM).
  • Real-World Performance: On the Real-Text dataset, TeReDiff demonstrates robust performance in real-world degradation scenarios, where previous methods often fail.
  • User Study: Human evaluators overwhelmingly preferred TeReDiff’s outputs, with 98.5% favoring its text restoration quality and 89% preferring its overall image quality.

Why This Matters

The implications are vast. From improving OCR in low-quality documents to enhancing readability in AR applications, TAIR bridges the gap between generative image restoration and practical text-based use cases. The release of SA-Text also opens doors for further research in text-conditioned restoration.

Future Directions

The authors highlight challenges like small text restoration and complex scene layouts, suggesting future work could explore advanced prompting techniques or larger datasets. One thing is clear: TAIR represents a significant step toward models that don’t just see but also read.