Distillation Robustifies Unlearning: A New Method to Make AI Forget

Large language models (LLMs) are trained on massive datasets, which means they inevitably learn things we’d rather they didn’t—like how to build cyberweapons or generate harmful content. Current methods to "unlearn" these capabilities are flimsy; a few steps of fine-tuning can bring the unwanted knowledge right back. But a new paper from researchers at the University of Pennsylvania, MIT, and Brown University proposes a surprisingly simple solution: distillation.

The Problem with Unlearning

Machine unlearning aims to remove specific knowledge or capabilities from a trained model while preserving its overall functionality. Existing methods, like gradient ascent or maximizing entropy on "forget" data, suppress unwanted behaviors but don’t truly erase the underlying capabilities. As a result, adversarial fine-tuning—even just a few steps—can quickly restore the model’s original, problematic abilities.
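
To make that distinction concrete, here is a minimal PyTorch-style sketch of the two suppression objectives mentioned above, gradient ascent and entropy maximization on forget data. The `model` and `forget_batch` objects are placeholders assumed to follow a Hugging Face causal-LM interface; this is not code from the paper.

```python
import torch.nn.functional as F

def gradient_ascent_loss(model, forget_batch):
    """Negated next-token loss: training *raises* the LM loss on forget data."""
    input_ids = forget_batch["input_ids"]
    logits = model(input_ids).logits
    nll = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 1..T
        input_ids[:, 1:].reshape(-1),                 # shifted targets
    )
    return -nll  # minimizing the negative = gradient ascent on the forget set

def max_entropy_loss(model, forget_batch):
    """Push next-token predictions toward uniform (maximum entropy) on forget data."""
    logits = model(forget_batch["input_ids"]).logits
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return -entropy  # minimizing -entropy maximizes predictive entropy
```

Either objective stops the model from producing the unwanted behavior, but the weights that encode the capability are largely untouched, which is why a few fine-tuning steps can bring it back.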

The Distillation Solution

The key insight from the paper is that distillation robustifies unlearning. When an "unlearned" model (one that’s been fine-tuned to suppress unwanted behavior) is distilled into a fresh, randomly initialized student model, the student inherits the desired behavior but leaves the unwanted capabilities behind. Because the student learns only from the teacher’s outputs, which no longer express the unwanted behavior, the latent weights that stored the original knowledge are never transferred.
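
A minimal sketch of that distillation step, assuming a standard KL-based objective between the unlearned teacher and a freshly initialized student of the same architecture (the names and the temperature parameter are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, batch, temperature=1.0):
    """Match the student's next-token distribution to the unlearned teacher's."""
    input_ids = batch["input_ids"]
    with torch.no_grad():  # the teacher is frozen; only the student is trained
        teacher_logits = teacher(input_ids).logits / temperature
    student_logits = student(input_ids).logits / temperature
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature**2
```

Because the student starts from random weights and only ever sees the teacher's already-suppressed outputs, there is nothing latent for a later fine-tuning attack to reactivate.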

In experiments, distilled models resisted relearning attacks far better than their unlearned-but-not-distilled counterparts. Remarkably, they performed almost as well as models retrained from scratch with perfect data filtering—the gold standard for unlearning—but at a fraction of the computational cost.

Introducing UNDO: A Scalable Method

Building on this insight, the researchers propose Unlearn-Noise-Distill-on-Outputs (UNDO), a method that introduces a tunable tradeoff between compute cost and robustness. Here’s how it works (a code sketch follows the list):

  1. Unlearn: Apply standard unlearning methods (like MaxEnt or GradDiff) to suppress unwanted behavior.
  2. Noise: Corrupt the model’s weights by mixing them with random noise. The more noise, the more robust the unlearning—but the more compute is needed to recover performance.
  3. Distill: Train the noised model to mimic its own original (unlearned) outputs, repairing the damage while preserving robustness.
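
Here is a hedged sketch of steps 2 and 3, reusing the `distillation_loss` above. Interpolating toward a freshly initialized copy of the same architecture is one way to realize the noise step; `alpha` is the illustrative knob for the robustness-versus-compute tradeoff.

```python
import copy
import torch

def noise_weights(unlearned_model, fresh_init, alpha=0.5):
    """Step 2: mix each unlearned weight with a randomly initialized counterpart."""
    student = copy.deepcopy(unlearned_model)
    with torch.no_grad():
        for p_student, p_fresh in zip(student.parameters(), fresh_init.parameters()):
            # (1 - alpha) * w_unlearned + alpha * w_random
            p_student.mul_(1.0 - alpha).add_(alpha * p_fresh)
    return student

# Step 3: repair the damaged copy against its own pre-noise (unlearned) outputs, e.g.
#   student = noise_weights(unlearned_model, fresh_init, alpha=0.7)
#   loss = distillation_loss(student, teacher=unlearned_model, batch=general_batch)
```

Larger `alpha` destroys more of whatever latent structure survived unlearning, at the cost of more distillation compute to recover benign performance; that is the tunable tradeoff described above.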

UNDO establishes a new Pareto frontier for unlearning, offering near-perfect robustness at 60-80% of the compute cost of full retraining. It also requires only 0.01% of the pretraining data to be labeled, making it practical for real-world deployment.

Real-World Implications

Distillation is already widely used to make models smaller and faster. By adding an unlearning step beforehand, developers can get robust capability removal almost for free. The method has been tested on synthetic tasks (like forgetting arithmetic operations) and real-world benchmarks (like the Weapons of Mass Destruction Proxy, WMDP), showing consistent improvements over existing techniques.

Limitations and Future Work

The biggest drawback is compute cost: distillation is more expensive than fine-tuning alone. But for high-stakes applications—like preventing misuse of AI for biosecurity threats—the tradeoff may be worth it. Future work could explore ways to make UNDO even more efficient or apply it to larger models.

Why This Matters

As AI systems become more powerful, the ability to reliably remove dangerous capabilities will be critical. This paper shows that distillation isn’t just a tool for efficiency—it’s also a powerful safeguard against misuse. By baking unlearning into the distillation pipeline, we can make AI safer without starting from scratch.