
StableMTL: How Latent Diffusion Models Are Revolutionizing Multi-Task Learning

The Challenge of Multi-Task Learning in AI

Multi-task learning (MTL) is a cornerstone of modern AI systems, especially in computer vision applications like robotics and virtual reality. These systems often need to estimate multiple scene cues—semantic segmentation, depth, optical flow—simultaneously and efficiently. However, MTL has long been hampered by the need for extensive, pixel-level annotations for every task, which are costly and time-consuming to produce. Recent work has explored training with partial labels, but these methods are typically confined to single domains and struggle with balancing conflicting task objectives.

Enter StableMTL, a groundbreaking approach from researchers at Valeo.ai and Inria that repurposes Latent Diffusion Models (LDMs) for multi-task learning. By leveraging the generalization power of diffusion models, StableMTL trains on multiple synthetic datasets, each labeled for only a subset of tasks, and achieves remarkable zero-shot generalization to real-world data. The results? A model that outperforms existing methods on seven tasks across eight benchmarks, with an overall improvement of +83.54 Δm.

How StableMTL Works

StableMTL builds on recent advances in repurposing diffusion models for dense prediction tasks. Instead of relying on per-task losses—which require careful balancing—StableMTL adopts a unified latent loss, enabling seamless scaling to more tasks. The method consists of two key stages:

  1. Single-Stream Architecture: A UNet is fine-tuned to predict task-specific latents from input image latents, conditioned on task tokens. This stage uses a task-gradient isolation scheme to prevent dominant tasks from overwhelming others (see the sketch after this list).
  2. Multi-Stream Architecture: To encourage inter-task synergy, StableMTL introduces a multi-stream model with a task-attention mechanism. This converts the traditional N-to-N task interactions into a more efficient 1-to-N attention, promoting effective cross-task sharing.
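To make the first stage concrete, here is a minimal PyTorch sketch. The `vae`, `unet`, `task_tokens`, and `training_step` names and signatures are illustrative assumptions, not the authors' code. It shows the unified latent loss (a plain MSE between the predicted latent and the encoded label latent) and one plausible reading of task-gradient isolation, where each available task label is backpropagated separately so no single task dominates the shared update; partially labeled samples simply contribute fewer terms.

```python
# Illustrative sketch only: names and signatures are assumptions, not the paper's API.
import torch
import torch.nn.functional as F

TASKS = ["segmentation", "depth", "optical_flow"]

def training_step(unet, vae, task_tokens, optimizer, image, labels):
    """One step on a sample that may carry only a subset of task labels.

    labels:      dict mapping task name -> dense ground-truth map (partial annotation).
    task_tokens: dict mapping task name -> learned conditioning embedding.
    """
    with torch.no_grad():
        z_img = vae.encode(image)              # image latent, e.g. (B, C, h, w)

    optimizer.zero_grad()
    for task, gt in labels.items():            # only the tasks this dataset annotates
        with torch.no_grad():
            z_gt = vae.encode(gt)              # encode the label map into latent space

        z_pred = unet(z_img, cond=task_tokens[task])   # task-token conditioning

        # Unified latent loss: the same MSE in latent space for every task,
        # so no per-task loss weighting or balancing is needed.
        loss = F.mse_loss(z_pred, z_gt)

        # Task-gradient isolation (one possible reading): backpropagate each task
        # separately, so a high-magnitude task cannot swamp the shared update.
        loss.backward()

    optimizer.step()
```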

Key Innovations

  • Partial Labels, Multiple Datasets: StableMTL trains on an ensemble of synthetic datasets (Hypersim, Virtual KITTI 2, FlyingThings3D), each with partial annotations for different tasks. This setup is far more realistic than single-dataset training.
  • Latent Regression: By framing multi-task learning as a latent regression problem, StableMTL avoids the need for task-specific losses and complex balancing.
  • Task-Attention Mechanism: The model dynamically attends to relevant features from auxiliary tasks, enhancing performance without the computational overhead of full N-to-N attention (a code sketch follows below).
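The 1-to-N task attention can be sketched as a standard cross-attention in which the stream of the task currently being predicted queries the features of all auxiliary task streams, rather than every stream attending to every other one. The module below is an illustrative PyTorch sketch under that assumption; `TaskAttention`, the tensor shapes, and the residual fusion are assumptions, not the paper's implementation.

```python
# Illustrative 1-to-N task attention: one query stream attends over N auxiliary streams.
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, main_feats, aux_feats):
        """main_feats: (B, L, D) tokens of the task being predicted.
        aux_feats:  list of N tensors (B, L, D), one per auxiliary task stream."""
        context = torch.cat(aux_feats, dim=1)       # (B, N*L, D): keys/values
        attended, _ = self.attn(query=main_feats,   # 1 query stream ...
                                key=context,        # ... attends over N streams
                                value=context)
        return self.norm(main_feats + attended)     # residual fusion

# Example: one main stream and two auxiliary streams of 256 tokens each, dim 320
fuse = TaskAttention(dim=320)
main = torch.randn(2, 256, 320)
aux = [torch.randn(2, 256, 320) for _ in range(2)]
out = fuse(main, aux)                               # (2, 256, 320)
```

Because only the main stream issues queries, the cost grows linearly with the number of auxiliary tasks instead of quadratically, which is what makes the 1-to-N formulation cheaper than full N-to-N interaction.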

Performance and Generalization

StableMTL demonstrates robust generalization to real-world data, even on domains that are significantly out of distribution (e.g., DAVIS and YouTube-VOS). Qualitative results show sharp, accurate predictions across tasks like semantic segmentation, depth estimation, and optical flow. Quantitatively, StableMTL outperforms baselines like DiffusionMTL and JTR, with improvements such as +9.87 mIoU for semantic segmentation and a 12.37-point reduction in mean angular error for normal estimation.

Why This Matters

StableMTL represents a significant leap forward in multi-task learning, particularly for applications where annotation costs are prohibitive. By repurposing diffusion models, the method unlocks new possibilities for training AI systems on synthetic data while maintaining strong real-world performance. This could accelerate development in autonomous driving, robotics, and augmented reality, where multi-task models are essential.

For more details, check out the GitHub repository and the full paper on arXiv.