
mTSBench: The Largest Benchmark for Multivariate Time Series Anomaly Detection and Model Selection

Multivariate time series anomaly detection (MTS-AD) is a critical task in domains like healthcare, cybersecurity, and industrial monitoring. Yet, it remains challenging due to complex inter-variable dependencies, temporal dynamics, and sparse anomaly labels. Enter mTSBench, the largest benchmark to date for MTS-AD and unsupervised model selection, introduced in a new arXiv paper by researchers from the University of Illinois Urbana-Champaign and Sandia National Laboratories.
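To make the task concrete, here is a toy baseline in Python (not code from the paper or its benchmark): it scores each timestep of a multivariate series by its Mahalanobis distance from an initial "normal" window, one simple way to account for the inter-variable dependencies mentioned above. All names, parameters, and data below are illustrative.

```python
# Toy MTS-AD baseline: Mahalanobis distance from an initial "normal" window.
# Purely illustrative; not part of mTSBench.
import numpy as np

def mahalanobis_scores(X: np.ndarray, train_frac: float = 0.5) -> np.ndarray:
    """X has shape (timesteps, variables); returns one anomaly score per timestep."""
    n_train = int(len(X) * train_frac)
    mu = X[:n_train].mean(axis=0)            # per-variable mean of the "normal" window
    cov = np.cov(X[:n_train], rowvar=False)  # inter-variable covariance
    cov_inv = np.linalg.pinv(cov)            # pseudo-inverse for numerical stability
    diff = X - mu
    # Mahalanobis distance flags deviations that violate the learned correlations
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[400:405] += 4.0                            # inject a short anomalous segment
scores = mahalanobis_scores(X)
print(scores[400:405].mean() > scores[:400].mean())  # anomalous span scores higher
```

Real detectors in the benchmark are far more sophisticated, but they all reduce to the same contract: map a multivariate series to per-timestep anomaly scores, which is what makes standardized comparison possible.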

What is mTSBench?

mTSBench is a comprehensive evaluation suite designed to standardize and accelerate research in MTS-AD. It spans 344 labeled time series across 19 datasets and 12 diverse application domains, including healthcare, finance, and industrial systems. The benchmark evaluates 24 anomaly detection methods, including large language model (LLM)-based detectors, and systematically benchmarks unsupervised model selection techniques under standardized conditions.

Key Findings

  1. No Single Detector Dominates: The study confirms prior findings that no single anomaly detector excels across all datasets. Performance varies significantly depending on the dataset and anomaly type, underscoring the importance of adaptive model selection.
  2. Model Selection Gaps: Even state-of-the-art selection methods fall far short of optimal performance, revealing critical gaps in current techniques. For example, the best unsupervised selectors (FMMS and Orthus) still lag the "near-optimal" baseline by 15-30% on key metrics such as AUC-PR and VUS-PR (a minimal sketch of AUC-PR follows this list).
  3. LLMs Show Promise (But Aren’t Perfect): LLM-based detectors like OFA and ALLM4TS demonstrate competitive performance, particularly in capturing temporal dependencies. However, their results are inconsistent, with ALLM4TS showing sensitivity to dataset-specific noise.
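Finding 2 is stated in terms of AUC-PR and VUS-PR. As a minimal sketch on synthetic data (not benchmark results), AUC-PR can be computed with scikit-learn's average_precision_score; VUS-PR is a range-aware variant that requires specialized tooling and is omitted here. The two "detectors" are just hypothetical score vectors.

```python
# Sketch of the AUC-PR metric on synthetic anomaly scores (not mTSBench data).
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
labels = np.zeros(1000, dtype=int)
labels[700:710] = 1                      # a sparse anomalous segment, as in MTS-AD

# Two hypothetical detectors: one ranks the anomaly highly, one barely does.
good = rng.normal(size=1000); good[700:710] += 3.0
weak = rng.normal(size=1000); weak[700:710] += 0.5

for name, scores in [("detector_a", good), ("detector_b", weak)]:
    print(name, round(average_precision_score(labels, scores), 3))
```

With anomalies this sparse, a ranking metric like AUC-PR separates detectors far more meaningfully than accuracy would, which is why the benchmark leans on it.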

Why This Matters

Anomaly detection is often a high-stakes task: think detecting equipment failures in a power plant or flagging fraudulent transactions in real time. But deploying the wrong detector can lead to missed anomalies or false alarms. mTSBench provides a unified framework to:

  • Compare detectors fairly: By evaluating all methods on the same datasets and metrics, researchers can identify strengths and weaknesses more clearly (see the comparison sketch after this list).
  • Improve model selection: The benchmark highlights where current selection strategies fail, paving the way for more robust adaptive methods.
  • Catalyze innovation: With open-source code and standardized evaluations, mTSBench lowers barriers for future work in adaptive anomaly detection.
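To ground the first two bullets, here is a hedged sketch of the compare-then-select loop such a framework standardizes. The two detectors and the label-free surrogate criterion (score kurtosis, one illustrative heuristic) are hypothetical stand-ins, not the mTSBench API.

```python
# Hedged sketch of "compare fairly, then select": run several detectors on the
# same data, score them with a labeled (oracle) metric, and contrast that with
# an unsupervised surrogate pick. All components here are illustrative.
import numpy as np
from sklearn.metrics import average_precision_score

def zscore_detector(X):
    # Per-variable z-scores; anomaly score is the worst variable at each step
    return np.abs((X - X.mean(0)) / X.std(0)).max(axis=1)

def diff_detector(X):
    # Scores sudden jumps between consecutive timesteps
    d = np.vstack([np.zeros((1, X.shape[1])), np.diff(X, axis=0)])
    return np.abs(d).max(axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(800, 4)); X[600:605] += 5.0
labels = np.zeros(800, dtype=int); labels[600:605] = 1

results = {}
for name, fn in {"zscore": zscore_detector, "diff": diff_detector}.items():
    s = fn(X)
    # Oracle metric uses labels; an unsupervised selector cannot, so it relies
    # on a label-free surrogate instead (here: kurtosis of the score distribution).
    surrogate = ((s - s.mean()) ** 4).mean() / s.var() ** 2
    results[name] = (average_precision_score(labels, s), surrogate)

oracle_pick = max(results, key=lambda k: results[k][0])
surrogate_pick = max(results, key=lambda k: results[k][1])
print("oracle:", oracle_pick, "surrogate:", surrogate_pick)
```

When the oracle and surrogate picks disagree, the performance lost is exactly the model-selection gap the benchmark quantifies in finding 2.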

The Bottom Line

mTSBench is a milestone for time-series research, offering the most extensive evaluation of MTS-AD methods to date. Its findings challenge the community to develop better model selection techniques and explore the untapped potential of foundation models in this space. For practitioners, the benchmark provides actionable insights into which detectors work best for specific scenarios.

Check out the full paper and code on GitHub to dive deeper into the results and methodology.