How Sparse Autoencoders Are Unlocking CLIP’s Vision Transformer
Vision transformers like CLIP have become foundational to modern AI, powering everything from image recognition to multimodal systems. But despite their widespread use, how these models process visual information internally remains a black box. A new study from researchers at Mila, McGill University, and Fraunhofer HHI is changing that—using sparse autoencoders (SAEs) to crack open CLIP’s vision transformer and even steer its behavior.
The Key Findings
The team trained SAEs on CLIP’s vision transformer (ViT-B-32) and discovered several critical insights:
- Vision vs. Language Sparsity: SAEs revealed stark differences in how vision and language models process information. CLIP’s spatial tokens (which handle image patches) showed 3-14x higher activation density than GPT-2’s tokens, suggesting vision models retain richer local features. Meanwhile, CLIP’s CLS token (which aggregates global information) behaved more like language model tokens, starting sparse before expanding in later layers. (A sketch of how activation density can be measured appears after this list.)
- Steerable Features: The researchers introduced a new metric, steerability (S), quantifying how precisely SAE features can manipulate CLIP’s outputs. They found 10-15% of SAE features are steerable—meaning they can reliably shift CLIP’s predictions when activated or suppressed. While the proportion of steerable features was similar between SAEs and the base model, SAEs offered thousands more steerable directions due to their higher dimensionality.
- Practical Applications: By suppressing specific SAE features, the team improved performance on three vision tasks:
  - CelebA gender classification: Reduced bias from spurious correlations (e.g., blond hair).
  - Waterbirds background suppression: Improved accuracy by ignoring misleading environmental cues.
  - Typographic attack defense: Achieved state-of-the-art robustness against adversarial text overlays in images.
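To make the sparsity comparison concrete, here is a minimal sketch of measuring activation density with an SAE over CLIP ViT-B/32 activations. The SparseAutoencoder class, the 16x expansion factor, and the random tensors standing in for activations captured via a forward hook are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: an overcomplete dictionary over transformer activations."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.encode(x))

# Stand-in for activations hooked from one CLIP ViT-B/32 block:
# shape (batch, 1 + num_patches, d_model); token 0 is CLS, the rest are spatial.
d_model, d_sae = 768, 768 * 16       # 16x expansion is an illustrative choice
acts = torch.randn(8, 50, d_model)   # ViT-B/32 at 224 px -> 49 patches + CLS

sae = SparseAutoencoder(d_model, d_sae)
latents = sae.encode(acts)           # (8, 50, d_sae) sparse codes

# Activation density: fraction of SAE latents that fire for each token.
density = (latents > 0).float().mean(dim=-1)   # (8, 50)
print(f"CLS density:     {density[:, 0].mean().item():.4f}")
print(f"spatial density: {density[:, 1:].mean().item():.4f}")
```

In practice the random tensor would be replaced by activations captured with a forward hook on the chosen transformer block, and the SAE would be trained with a reconstruction plus sparsity objective before measuring density.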
Why This Matters
Understanding and controlling vision transformers is becoming crucial as they underpin more AI systems. This work not only advances interpretability but also provides concrete tools for:
- Debiasing models by suppressing unwanted feature activations (see the suppression sketch after this list).
- Improving robustness against adversarial attacks.
- Enabling fine-grained control over model behavior without retraining.
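The suppression interventions amount to a small edit in the hooked activations: encode with the SAE, zero the chosen latents, decode, and splice the result back into the forward pass. The sketch below reuses the SparseAutoencoder class from the earlier snippet; the error-splicing detail and the example feature index are assumptions for illustration, not specifics from the paper.

```python
import torch

@torch.no_grad()
def suppress_features(sae, acts: torch.Tensor, feature_ids: list[int]) -> torch.Tensor:
    """Zero out selected SAE latents and splice the edit back into the activations.

    Adding back the SAE's reconstruction error leaves everything the SAE does not
    model untouched, so only the targeted features change (an assumption borrowed
    from common SAE-intervention practice, not a detail confirmed by this paper).
    """
    latents = sae.encode(acts)         # sparse codes, shape (..., d_sae)
    error = acts - sae.dec(latents)    # residual the SAE fails to reconstruct
    latents[..., feature_ids] = 0.0    # suppress e.g. a spurious "blond hair" feature
    return sae.dec(latents) + error    # edited activations, same shape as acts

# Hypothetical usage with the `sae` and `acts` from the sketch above:
# edited = suppress_features(sae, acts, feature_ids=[1234])
# Write `edited` back via the same forward hook and compare CLIP's zero-shot
# logits before and after the intervention.
```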
The researchers have open-sourced their SAEs, paving the way for safer, more reliable vision models. For AI practitioners, this is a major step toward demystifying—and mastering—CLIP’s inner workings.
Read the full paper on arXiv.