Decrypto: A New Benchmark for Testing Multi-Agent Reasoning and Theory of Mind in AI
Large Language Models (LLMs) are increasingly being deployed in multi-agent scenarios, where they must interact with humans and other AI systems in both cooperative and competitive settings. A critical skill for these interactions is theory of mind (ToM)—the ability to reason about the mental states of other agents. However, current benchmarks for evaluating ToM in LLMs suffer from narrow scope, data leakage, and lack of interactivity. A new paper titled "The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind" introduces Decrypto, a game-based benchmark designed to address these gaps.
What is Decrypto?
Decrypto is inspired by the award-winning board game of the same name. It’s a pragmatic inference game where two agents (Alice and Bob) must exchange secret messages while a third agent (Eve) tries to intercept them. The game is designed to be as simple as possible in all dimensions except multi-agent reasoning, eliminating confounding factors like tokenization, long contexts, or embodied scenarios that often skew benchmark results.
Here’s how it works:
- Alice (Encoder) receives a random three-digit code (e.g., 2-3-4), where each digit points to one of four secret keywords (e.g., star, jazz, thunder, plane), and must give one hint per digit that alludes to the corresponding keyword's meaning.
- Bob (Decoder) and Eve (Interceptor) receive the hints and independently guess the code.
- The guesses and the true code are then revealed, both are appended to the shared hint and code histories, and the next round begins with a new code.
The challenge lies in Alice crafting hints that Bob can decode but Eve cannot, a task that requires second-order ToM: Alice must reason not only about what Eve knows, but about what Eve believes Alice's hints refer to.
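To make the information flow concrete, here is a minimal Python sketch of a single round. The agent objects and method names (give_hints, guess) are illustrative assumptions, not the benchmark's actual interface; the point is only who sees what.

```python
# Minimal sketch of one Decrypto round (hypothetical interfaces, not the paper's API).
import random

KEYWORDS = ["star", "jazz", "thunder", "plane"]  # known to Alice and Bob, hidden from Eve

def sample_code(num_digits: int = 3, num_keywords: int = 4) -> list[int]:
    """Draw three distinct digits in 1..4, each indexing a keyword, e.g. [2, 3, 4]."""
    return random.sample(range(1, num_keywords + 1), num_digits)

def play_round(alice, bob, eve, history: list[dict]) -> tuple[bool, bool]:
    """One round: Alice hints, Bob and Eve guess, then everything is revealed."""
    code = sample_code()
    hints = alice.give_hints(KEYWORDS, code, history)   # Alice sees keywords + code
    bob_guess = bob.guess(KEYWORDS, hints, history)     # Bob sees keywords + hints
    eve_guess = eve.guess(hints, history)               # Eve sees only hints + history
    history.append({"hints": hints, "code": code,
                    "bob_guess": bob_guess, "eve_guess": eve_guess})
    return bob_guess == code, eve_guess == code         # (decoded?, intercepted?)
```

The full game also aggregates interception and miscommunication outcomes across rounds to decide who wins; this sketch only reports them per round.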
Why Decrypto Stands Out
- Future-Proof: Unlike static benchmarks, Decrypto’s difficulty scales with the agents’ capabilities, making it resistant to saturation.
- Interactive: It’s the first platform for designing interactive ToM experiments, enabling studies on cooperation, competition, and human-AI coordination.
- Controlled Complexity: The game strips away unnecessary challenges (e.g., math, spatial reasoning) to focus purely on language-based reasoning and ToM.
Key Findings
- LLMs Lag Behind Humans: Even state-of-the-art models like GPT-4o and Claude 3.7 Sonnet struggle to match human performance in Decrypto. Simple word-embedding baselines (GloVe, Word2Vec) often outperform LLMs in cooperative settings; a sketch of such a baseline follows this list.
- ToM Failures: In variants of classic cognitive science experiments (e.g., the Smarties Task), newer reasoning models like Claude 3.7 and o1-high performed worse than older models like Llama 3.1-70B at tasks requiring representational change and false belief reasoning.
- Perspective-Taking Flaws: Most models failed to accurately predict Eve’s guesses, often assuming Eve had access to privileged information (the keywords).
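For intuition, a decoder built on static embeddings can simply match each hint to its nearest keyword. The sketch below assumes an `embed` function wrapping a pretrained GloVe or Word2Vec model and a greedy nearest-neighbour rule; the paper's actual baseline may differ in detail.

```python
# Sketch of an embedding-based decoder baseline: greedy cosine matching of hints to keywords.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def decode(hints: list[str], keywords: list[str], embed) -> list[int]:
    """Map each hint to the 1-indexed keyword whose embedding is most similar."""
    guess = []
    for hint in hints:
        sims = [cosine(embed(hint), embed(kw)) for kw in keywords]
        guess.append(int(np.argmax(sims)) + 1)
    return guess
```

With GloVe vectors, a call like decode(["nova", "saxophone", "wing"], ["star", "jazz", "thunder", "plane"], embed) would be expected to recover a code such as 1-2-4, though the exact output depends on the embedding model used.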
Implications for AI Development
Decrypto highlights a critical gap in current AI benchmarks: the lack of robust evaluations for multi-agent reasoning and ToM. The paper suggests that:
- Training Methods Matter: Models fine-tuned for verifiable tasks (e.g., math) may sacrifice ToM abilities.
- Human-AI Coordination is Hard: LLMs struggle to align their reasoning with humans, even in simple games.
- Interactive Benchmarks Are Essential: Static datasets can’t capture the dynamics of real-world multi-agent interactions.
What’s Next?
The authors open-source Decrypto as a platform for future research, including:
- Fine-tuning LLMs with multi-agent reinforcement learning.
- Studying cross-cultural pragmatics in AI communication.
- Developing methods to improve ToM in LLMs.
Decrypto isn’t just a benchmark—it’s a call to rethink how we evaluate and develop AI agents for a future where they’ll need to navigate complex social and strategic interactions.
For more details, check out the full paper on arXiv or the GitHub repository.