How CLIP Models Rely on Unexpected Features: A Deep Dive into Latent Component Attribution

Transformer-based CLIP models have become a cornerstone for text-image probing and feature extraction, but understanding the internal mechanisms behind their predictions remains a challenge. A recent study by Dreyer et al. introduces a scalable framework that not only reveals what latent components activate for but also how they drive predictions, uncovering surprising semantic reliance in CLIP models.

The Challenge of Understanding CLIP

CLIP models are trained on paired images and text to learn a wide range of visual concepts without explicit human annotation. While this natural-language supervision enables strong generalization, it also raises questions about which patterns CLIP relies on and how these align, or fail to align, with human expectations. Prior work has focused on explaining CLIP's representations via predefined concept sets, but this approach often overlooks features that emerge from biases in the image-text data, such as spurious correlations and ambiguous linguistic cues.

A New Framework for Interpretability

The study introduces a holistic framework combining Sparse Autoencoders (SAEs) and attribution patching to analyze CLIP’s latent components. SAEs extract interpretable latent components from CLIP representations, while attribution patching quantifies each component’s contribution to model predictions. This dual approach allows researchers to uncover both what concepts are encoded and how they influence outputs.
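To make the first half of that pipeline concrete, here is a minimal sketch (not the authors' code) of how a sparse autoencoder can decompose a CLIP image embedding into sparse latent components. The architecture, dimensions, and the random stand-in embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: a 512-d CLIP ViT-B/32 image embedding expanded
# into an overcomplete, sparse latent space.
D_MODEL, D_LATENT = 512, 8192

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode to a sparse latent code, decode back."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # sparse, non-negative latent activations
        x_hat = self.decoder(z)           # reconstruction of the original embedding
        return z, x_hat

sae = SparseAutoencoder(D_MODEL, D_LATENT)
clip_embedding = torch.randn(1, D_MODEL)  # stand-in for a real CLIP image embedding
latents, reconstruction = sae(clip_embedding)

# Each active latent is one candidate "component"; the few with the largest
# activations are the ones inspected for interpretable concepts.
top_vals, top_idx = latents.topk(5, dim=-1)
print(top_idx, top_vals)
```

In practice the SAE is trained with a reconstruction loss plus a sparsity penalty so that only a handful of latents fire per image; the sketch above shows only the forward pass.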

Key findings include:

  1. Instance-wise Component Attribution: The framework adapts attribution patching to CLIP, enabling instance-specific attribution of latent components (a minimal sketch of the idea follows this list). This method outperforms the widely used Logit Lens technique, which provides only a global approximation of component relevance.
  2. Diverse and Interpretable Components: SAEs discover a rich set of interpretable components within CLIP models. Highly relevant latents tend to be more interpretable than weakly relevant ones, and larger CLIP models encode a broader range of semantic concepts.
  3. Unexpected Concept Reliance: The framework automatically identifies reliance on components that encode semantically unexpected or spurious concepts. These include polysemous words, compound nouns, visual typography, and dataset artifacts.
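The contrast between instance-wise attribution and a global readout can be pictured with a short, hedged sketch that builds on the SAE above. It uses a first-order approximation of zero-ablating each latent (attribution of latent i roughly equals its activation times the score gradient); the similarity-style score against a text direction and the zero baseline are illustrative assumptions, not the paper's exact setup.

```python
import torch

def attribute_components(latents: torch.Tensor, decoder: torch.nn.Module,
                         text_direction: torch.Tensor) -> torch.Tensor:
    """Hypothetical instance-wise attribution of SAE latents.

    latents:        (1, d_latent) sparse code for one image
    decoder:        SAE decoder mapping latents back to embedding space
    text_direction: (d_model,) text embedding acting as the score direction
    """
    z = latents.clone().detach().requires_grad_(True)
    reconstruction = decoder(z)              # back to embedding space
    score = reconstruction @ text_direction  # similarity-style logit
    score.sum().backward()
    # (z - 0) * gradient: estimated change in the score if latent i were ablated
    return (z * z.grad).detach()

# Usage (assuming `sae` and `latents` from the earlier sketch and a text
# embedding `text_emb` for the class of interest):
# attributions = attribute_components(latents, sae.decoder, text_emb)
# top_components = attributions.abs().topk(10, dim=-1)
```

The point of the approximation is scalability: one backward pass scores every component for a given input, instead of re-running the model once per patched component.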

Surprising Discoveries

Applied across multiple CLIP variants, the method uncovers hundreds of surprising components. For example, some components activate for game renderings, clipart, or even US states, none of which are part of standard dataset annotations. Text embeddings, while prone to semantic ambiguity, prove more robust to spurious correlations than linear classifiers trained on image embeddings.

A Case Study in Medical AI

The study also examines a skin lesion detection task, revealing how linear classifiers trained on CLIP embeddings can amplify hidden shortcuts. For instance, a classifier incorrectly associated red-hued backgrounds with non-melanoma samples. By identifying this reliance and correcting for it, the researchers improved the model's robustness, demonstrating the practical benefit of mechanistic interpretability.
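One simple way such a shortcut could be suppressed, sketched here under stated assumptions rather than as the authors' pipeline, is to project the spurious component's decoder direction out of the image embeddings before training the linear classifier. The specific latent index and the "red background" direction are hypothetical.

```python
import torch

def remove_component_direction(embeddings: torch.Tensor,
                               direction: torch.Tensor) -> torch.Tensor:
    """Project a spurious component's direction out of CLIP image embeddings.

    embeddings: (n, d_model) image embeddings used to train the linear classifier
    direction:  (d_model,) decoder column of the spurious SAE latent
                (e.g., a hypothetical "red background" component)
    """
    d = direction / direction.norm()
    # Subtract each embedding's projection onto the spurious direction.
    return embeddings - (embeddings @ d).unsqueeze(-1) * d

# Usage (assuming `sae` from the earlier sketch and a spurious latent index `idx`):
# cleaned = remove_component_direction(train_embeddings, sae.decoder.weight[:, idx])
# A linear classifier retrained on `cleaned` can no longer exploit that direction.
```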

Implications for AI Safety

These findings underscore the importance of moving beyond global summaries to understand how foundation models like CLIP make decisions. The framework provides a scalable tool for debugging and ensuring the reliable deployment of multimodal AI systems. As the authors note, "holistic, mechanistic interpretability" is essential for uncovering and mitigating unexpected model behaviors.

For more details, check out the full paper on arXiv and the accompanying code on GitHub.