
Beyond Transformers: How Miras is Redefining Sequence Models with Attentional Bias and Retention Gates

In a groundbreaking paper titled "It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization," researchers from Google introduce Miras, a framework for designing sequence models that unifies and extends modern architectures such as Transformers, Titans, and linear RNNs. The paper reimagines neural networks as associative memory modules driven by attentional bias, a concept inspired by how humans prioritize what to remember, and introduces retention mechanisms that balance new learning against memory stability.

The Core Idea: Associative Memory with Attentional Bias

The team redefines sequence models as associative memories that learn mappings between keys and values using an internal objective called attentional bias. Surprisingly, most existing models (e.g., Transformers, linear RNNs) rely on just two types of attentional bias: dot-product similarity or ℓ₂ regression. Miras expands this by proposing alternative attentional biases, such as Huber loss and robust optimization objectives, which prioritize different aspects of memory recall and noise resilience.
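To make this framing concrete, here is a minimal sketch (not code from the paper) of a linear associative memory whose write operation is a single gradient step on an ℓ₂ regression attentional bias. The class and function names, the single-matrix memory, and the learning rate are illustrative assumptions.

```python
import numpy as np

def l2_bias_grad(M, k, v):
    """Gradient w.r.t. M of the l2 regression attentional bias ||M @ k - v||^2."""
    recall_error = M @ k - v
    return 2.0 * np.outer(recall_error, k)

class LinearAssociativeMemory:
    """Toy key->value memory updated at test time by online gradient descent."""

    def __init__(self, key_dim, value_dim, lr=0.1):
        self.M = np.zeros((value_dim, key_dim))
        self.lr = lr

    def write(self, k, v):
        # Memorize one (key, value) pair: one step on the internal objective.
        self.M -= self.lr * l2_bias_grad(self.M, k, v)

    def read(self, q):
        # Recall: project a query through the stored associations.
        return self.M @ q
```

Roughly speaking, dot-product-similarity attention corresponds to a different choice of write rule in this picture, so the design question Miras raises is which internal objective the write step should optimize.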

Key Innovations

  1. Attentional Bias Configurations: The paper introduces variants like ℓₚ-norm objectives (e.g., ℓ₁ for sparse memories) and Huber loss for outlier robustness. These biases determine how models prioritize and store information.
  2. Retention Gates: Forgetting mechanisms are reinterpreted as retention regularization, with new gates like KL-divergence-based retention and elastic net regularization. These gates control how models balance new learning with past memory retention.
  3. Miras Framework: A unified design space spanning four choices (memory architecture, attentional bias, retention gate, and learning algorithm), letting models be tailored to specific tasks; a minimal sketch of how these pieces compose follows this list.
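The sketch below uses a simplified single-matrix memory with plain SGD as the learning algorithm to show how these choices might plug together; the specific functions and constants are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# --- Attentional bias choices (each returns the gradient w.r.t. the memory M) ---
def l2_bias_grad(M, k, v):
    """l2 regression objective ||M @ k - v||^2."""
    return 2.0 * np.outer(M @ k - v, k)

def l1_bias_grad(M, k, v):
    """l1 objective ||M @ k - v||_1, favoring sparser recall errors."""
    return np.outer(np.sign(M @ k - v), k)

# --- Retention gate choices (how much of the old memory survives the update) ---
def l2_retention(M, retain=0.9):
    """l2 retention toward zero, which behaves like a classic decay/forget gate."""
    return retain * M

def no_retention(M):
    """Keep the old memory untouched (no forgetting)."""
    return M

# --- One Miras-style step: memory architecture + bias + retention + optimizer ---
def miras_step(M, k, v, bias_grad=l2_bias_grad, retention=l2_retention, lr=0.1):
    return retention(M) - lr * bias_grad(M, k, v)
```

Swapping `bias_grad` or `retention` changes the model family without touching the rest of the update, which is the design-space point the framework makes.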

New Models: Moneta, Yaad, and Memora

Three new sequence models emerge from Miras:

  • Moneta: Pairs ℓₚ-norm attentional bias with ℓ_q-norm retention gates for robust memory management.
  • Yaad: Leverages a Huber-loss attentional bias to guard against extreme events such as noisy tokens (a minimal sketch follows this list).
  • Memora: Employs KL-divergence retention for stable memory updates over long sequences.
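As one example of how such a choice changes behavior, here is a hedged sketch of a Huber-style attentional bias in the spirit of Yaad; the threshold `delta` and the single-matrix memory are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def huber_bias_grad(M, k, v, delta=1.0):
    """Gradient of a Huber-loss attentional bias on the recall error M @ k - v.

    The loss is quadratic for small errors and linear beyond `delta`, so a single
    outlier token cannot dominate the memory update the way it can under l2.
    """
    err = M @ k - v
    clipped = np.where(np.abs(err) <= delta, err, delta * np.sign(err))
    return np.outer(clipped, k)
```

A Moneta- or Memora-style variant would swap a different bias or retention choice into the same slot of the update.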

Performance Highlights

Experiments show these models outperform Transformers and linear RNNs in:

  • Language modeling: Lower perplexity on FineWeb-Edu and C4 datasets.
  • Long-context tasks: Superior needle-in-haystack retrieval (up to 8K tokens).
  • Scaling: Efficient parallel training with chunk-based optimization tricks.

Why It Matters

Miras not only explains existing architectures but also opens doors to more efficient, robust sequence models. By decoupling memory design from optimization, it offers a roadmap for future AI systems that need to handle long contexts, noisy data, or dynamic retention requirements.

For businesses, this means:

  • Faster inference: Linear-time recurrent variants like Yaad reduce compute costs.
  • Better memory management: Retention gates optimize context window usage.
  • Task-specific tuning: Miras’ modularity lets teams customize models for domains like finance (high recall) or robotics (noise resilience).

The Verdict

Miras is a leap toward interpretable, adaptable sequence models. Its blend of cognitive inspiration and optimization rigor could redefine how we build AI for long-context challenges—from legal document analysis to real-time sensor data processing.