Cartridges: A Memory-Efficient Alternative to In-Context Learning for Long Documents
Large language models (LLMs) are increasingly being used to answer queries grounded in extensive text corpora—whether that's codebases, legal documents, or chat histories. The standard approach is to place the entire corpus in the model's context window and rely on in-context learning (ICL). But as context windows grow to 100K–1M tokens, the memory demands of maintaining a key-value (KV) cache for these contexts become prohibitive, severely limiting throughput and scalability.
A team from Stanford University, Caltech, University at Buffalo, and Google DeepMind has proposed an intriguing alternative: Cartridges. Instead of loading the full corpus into the KV cache at inference time, Cartridges are small, trained KV caches that distill the essential information from a corpus offline. These Cartridges can then be loaded at inference time, drastically reducing memory consumption while maintaining the model's ability to answer diverse queries about the corpus.
The Problem with In-Context Learning
ICL is flexible but costly. For example, serving LLaMA 70B with a 128K-token context requires 84 GB of memory (at 16-bit precision), and this KV cache load translates into a 77× drop in peak throughput when scaling from 1K to 120K tokens. Prior attempts to mitigate this, such as prompt compression or KV cache compression, often degrade performance at higher compression ratios.
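For intuition on where that memory goes, here is a back-of-the-envelope sketch of KV cache sizing. The configuration values are placeholder assumptions (roughly what a LLaMA-70B-class model with grouped-query attention might use), not figures taken from the paper, whose 84 GB number reflects its specific serving setup:

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Illustrative, assumed config (not taken from the paper).
size_gb = kv_cache_bytes(
    num_tokens=128_000,
    num_layers=80,
    num_kv_heads=8,    # grouped-query attention
    head_dim=128,
) / 1e9
print(f"~{size_gb:.0f} GB of KV cache for one 128K-token sequence")  # ~42 GB
```

Whatever the exact constants, the cost grows linearly with context length and with the number of concurrent sequences, which is what makes long-context serving so expensive.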
How Cartridges Work
Cartridges are trained via a two-step process called Self-Study:
- Synthetic Data Generation: The model generates synthetic conversations about the corpus, chunking long documents to handle contexts beyond the model's window.
- Context Distillation: The Cartridge is trained to mimic the model's behavior when the full corpus is in context, using a distillation objective that aligns next-token distributions.
This approach ensures Cartridges generalize across diverse query types—factual recall, summarization, mathematical reasoning—while preserving structural awareness of the document.
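To make the distillation step concrete, here is a minimal, self-contained PyTorch sketch. It uses a toy single attention layer as a stand-in for a frozen pretrained LLM and random vectors as stand-ins for the corpus and a synthetic query; the tensor names, sizes, and training loop are illustrative assumptions, not the paper's implementation. What it does capture is the core mechanic: only the Cartridge's key/value tensors receive gradients, trained so that the model's next-token distribution matches what it would produce with the full corpus in context.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab, corpus_len, cart_len = 64, 1000, 512, 16

# Frozen projections standing in for a pretrained transformer layer.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
Wo = torch.randn(d_model, vocab)

def logits_with_text_context(query_h, context_h):
    """Teacher path: keys/values are computed from the raw corpus."""
    q = query_h @ Wq
    k, v = context_h @ Wk, context_h @ Wv
    attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)
    return attn @ v @ Wo

def logits_with_kv_cache(query_h, k, v):
    """Student path: keys/values come straight from the Cartridge."""
    q = query_h @ Wq
    attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)
    return attn @ v @ Wo

corpus_h = torch.randn(corpus_len, d_model)   # stand-in for corpus tokens
query_h = torch.randn(8, d_model)             # stand-in for a synthetic query

teacher = logits_with_text_context(query_h, corpus_h).detach()

# The Cartridge: a small trainable KV cache (16 slots vs. 512 corpus tokens).
cart_k = torch.nn.Parameter(0.02 * torch.randn(cart_len, d_model))
cart_v = torch.nn.Parameter(0.02 * torch.randn(cart_len, d_model))
opt = torch.optim.Adam([cart_k, cart_v], lr=1e-2)

for _ in range(200):
    student = logits_with_kv_cache(query_h, cart_k, cart_v)
    loss = F.kl_div(F.log_softmax(student, dim=-1),
                    F.log_softmax(teacher, dim=-1),
                    log_target=True, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the actual method, the queries come from the Self-Study synthetic conversations, the teacher and student share the same frozen pretrained model, and the Cartridge plays the role of `cart_k`/`cart_v` at every layer.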
Key Results
- Memory Efficiency: Cartridges match ICL performance while using 38.6× less memory and enabling 26.4× higher throughput.
- Context Length Extrapolation: On the MTOB benchmark (a low-resource translation task), a Cartridge trained on a 484K-token textbook outperformed ICL on a truncated version of the textbook (the full text exceeds the model's context window) by 11.0 chrF points.
- Composability: Multiple Cartridges can be concatenated at inference time without retraining, enabling queries that span multiple documents (see the sketch below).
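To illustrate what composition might look like mechanically, here is a small sketch. It assumes a Cartridge is stored as per-layer key/value tensors and that composing two of them simply concatenates those tensors along the sequence axis before they are handed to the attention layers; the shapes and helper names are placeholders, not the paper's code.

```python
import torch

num_layers, cart_len, num_kv_heads, head_dim = 32, 128, 8, 128

def make_cartridge(seed):
    """Random stand-in for a trained Cartridge: per-layer (keys, values)."""
    g = torch.Generator().manual_seed(seed)
    return [
        (torch.randn(cart_len, num_kv_heads, head_dim, generator=g),
         torch.randn(cart_len, num_kv_heads, head_dim, generator=g))
        for _ in range(num_layers)
    ]

def compose(*cartridges):
    """Stack per-layer keys and values along the sequence dimension."""
    return [
        (torch.cat([c[layer][0] for c in cartridges], dim=0),
         torch.cat([c[layer][1] for c in cartridges], dim=0))
        for layer in range(num_layers)
    ]

doc_a, doc_b = make_cartridge(0), make_cartridge(1)  # e.g. two separate corpora
combined = compose(doc_a, doc_b)                     # no retraining required
print(combined[0][0].shape)  # torch.Size([256, 8, 128])
```

At serving time, the combined tensors would be loaded as the prefix of the KV cache, after which a single query can draw on material from both documents.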
Why This Matters
Cartridges offer a practical solution for applications where users repeatedly query the same corpus, such as medical records, legal filings, or code repositories. By amortizing the one-time cost of training a Cartridge across those many queries, they make long-context LLM applications more feasible.
Read the full paper on arXiv: http://arxiv.org/abs/2506.06266v1