
Self Forcing: How Autoregressive Video Diffusion Models Are Closing the Train-Test Gap

The world of video generation is undergoing a quiet revolution. While diffusion models have dominated the field with their ability to produce high-quality, temporally coherent videos, they’ve long struggled with a fundamental limitation: the gap between how they’re trained and how they’re used in the real world. Enter Self Forcing, a new training paradigm from researchers at Adobe and UT Austin that’s bridging this divide—and doing it at speeds fast enough for real-time applications.

The Problem: Exposure Bias in Video Generation

Traditional autoregressive video diffusion models—those that generate videos frame by frame—are typically trained using one of two methods:

  • Teacher Forcing (TF): The model learns to predict the next frame conditioned on perfect ground-truth previous frames.
  • Diffusion Forcing (DF): The model is trained to denoise frames whose noise levels are sampled independently per frame, learning to condition on noisy context frames (see the sketch after this list).
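
To make the two regimes concrete, here is a minimal PyTorch sketch of how each one builds the context a frame is denoised against. The tensor shapes, noise schedule, and variable names are illustrative choices, not the paper's:

```python
import torch

# Toy setup: a 16-frame video and a simple signal-level schedule.
video = torch.randn(16, 3, 32, 32)          # [T, C, H, W]
num_steps = 1000
alpha_bar = torch.linspace(1.0, 0.0, num_steps)

# Teacher Forcing: the target frame gets noised, but the context
# the model conditions on is clean ground truth.
t_frame = 5
context_tf = video[:t_frame]                # pristine previous frames

# Diffusion Forcing: every frame gets its own independently sampled
# noise level, so the model learns to condition on noisy context frames.
t = torch.randint(0, num_steps, (video.shape[0],))   # one level per frame
a = alpha_bar[t].view(-1, 1, 1, 1)
noisy_video = a.sqrt() * video + (1 - a).sqrt() * torch.randn_like(video)
context_df = noisy_video[:t_frame]          # noisy previous frames
```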

Both approaches suffer from exposure bias: during inference, the model must generate frames based on its own (imperfect) previous outputs, not the pristine training data. This mismatch leads to error accumulation—subtle mistakes compound over time, causing videos to degrade in quality as generation progresses.
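
At inference time, by contrast, the context can only come from the model itself. Here is a minimal sketch of that autoregressive rollout, using a toy stand-in generator rather than the actual architecture; the point is simply that each step feeds on the previous output:

```python
import torch
import torch.nn as nn

# Toy next-frame generator (illustrative; the real model is a causal video diffusion transformer).
generator = nn.Conv2d(2 * 3, 3, kernel_size=3, padding=1)

@torch.no_grad()
def rollout(first_frame, num_frames):
    """Inference: each new frame is conditioned on the model's own previous
    output, not on ground truth, so any small error is fed back into the next step."""
    frames = [first_frame]                           # [1, C, H, W]
    for _ in range(num_frames - 1):
        noise = torch.randn_like(first_frame)
        frames.append(generator(torch.cat([frames[-1], noise], dim=1)))
    return torch.stack(frames, dim=1)                # [1, T, C, H, W]

video = rollout(torch.randn(1, 3, 32, 32), num_frames=16)
```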

The Solution: Self Forcing

Self Forcing tackles this by mirroring the inference process during training. Instead of relying on ground-truth or noisy frames, the model generates each frame conditioned on its own previous outputs, just as it will during actual use (sketched in code after the list below). This is achieved through:

  1. Autoregressive Rollout with KV Caching: The model unrolls generation sequentially during training, caching key-value (KV) embeddings to maintain context efficiently.
  2. Holistic Video-Level Loss: Rather than optimizing frame-by-frame denoising, Self Forcing evaluates the entire generated sequence using distribution-matching objectives like DMD (Distribution Matching Distillation), SiD (Score Identity Distillation), or adversarial losses.
  3. Efficiency Tricks: A few-step diffusion backbone and gradient truncation keep training computationally feasible, while a rolling KV cache enables streaming generation of arbitrarily long videos.
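
Putting the pieces together, here is a rough sketch of what one Self-Forcing-style training step could look like. The toy generator, the toy critic, and the `keep_grad_last` truncation rule are stand-ins invented for illustration; the actual method uses a few-step causal video diffusion transformer with a KV cache and DMD, SiD, or adversarial objectives as described above:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the real components.
generator = nn.Conv2d(2 * 3, 3, kernel_size=3, padding=1)   # next-frame generator
critic = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))      # video-level scorer (stand-in for DMD/GAN teacher)
opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

def self_forcing_step(first_frame, num_frames=8, keep_grad_last=4):
    """One Self-Forcing-style update: roll out autoregressively during training,
    then score the whole generated video rather than individual denoising steps."""
    frames = [first_frame]
    for t in range(1, num_frames):
        context = frames[-1]
        # Gradient truncation (illustrative): only the last few steps keep a
        # computation graph, which keeps backprop through the rollout affordable.
        if t <= num_frames - keep_grad_last:
            context = context.detach()
        noise = torch.randn_like(first_frame)
        frames.append(generator(torch.cat([context, noise], dim=1)))
    video = torch.stack(frames, dim=1)                       # [B, T, C, H, W]
    # Holistic video-level loss on the *generated* sequence
    # (a real implementation would use DMD, SiD, or an adversarial loss).
    loss = -critic(video).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(self_forcing_step(torch.randn(1, 3, 32, 32)))
```

The key point is that the loss is computed on frames the model itself produced, so the gradients directly penalize the kind of drift that teacher-forced training never sees.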

Why It Matters: Real-Time, High-Quality Video

The results are striking. Self Forcing achieves 17 FPS generation with sub-second latency on a single H100 GPU—fast enough for live applications like interactive content creation or game simulation. Even more impressively, it matches or surpasses the quality of slower, non-causal diffusion models. In user studies, it was preferred over alternatives like Wan2.1 and CausVid by significant margins (see Figure 4 in the paper).

The Bigger Picture

Self Forcing isn’t just a technical improvement; it represents a philosophical shift in how we train generative models. By aligning training with inference, it addresses a core weakness of autoregressive models trained in parallel on ground-truth context: their tendency to falter as errors accumulate over sequential generation steps. The authors suggest this approach could extend beyond video to other domains where continuous data is generated autoregressively.

For businesses, the implications are clear. Real-time, high-quality video generation is no longer a distant dream but an imminent reality. Whether for marketing, entertainment, or simulation, models like this will soon enable applications we’re only beginning to imagine.

Read the full paper on arXiv: Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion