
Inside You Are Many Wolves: How LLMs Balance Truth and Politeness Like Humans

Large language models (LLMs) are increasingly being used in social contexts where they must navigate delicate trade-offs—like telling a friend their cake is terrible while still being kind. But how do these models actually weigh competing values like truthfulness and politeness? A new study from Harvard and DeepMind researchers uses cognitive science to peek inside the "minds" of LLMs and see how they make these decisions.

The Wolf Inside Every LLM

The study, titled Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs, applies a well-established model of human polite speech to LLMs. The model, from cognitive scientist Erica Yoon and colleagues, formalizes how people balance competing goals when giving feedback. For example, when critiquing a friend’s cake, you might soften your language ("not amazing") to preserve their feelings, even if the literal truth ("terrible") would be more informative.

The researchers tested this model on a range of LLMs, including closed-source models like Claude, Gemini, and GPT-4, as well as open-source models like Llama and Qwen. They presented the models with scenarios in which a speaker had to give feedback on something whose true quality fell on a 1-5 star scale (e.g., a cake, a painting), and measured how often the models chose direct language ("amazing," "terrible") versus softened, indirect language ("not bad," "not amazing").
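To make the trade-off concrete, here is a minimal sketch (in Python) of the kind of utility model the study builds on: a speaker scores each candidate utterance by a weighted sum of an informational utility and a social utility, then softly maximizes. The utterance set, the literal-semantics numbers, and the weights phi_info and phi_social are invented for illustration; they are not the paper's actual parameters.

```python
import numpy as np

# True quality of the item being judged, on the study's 1-5 star scale.
STATES = np.array([1, 2, 3, 4, 5])

# Candidate utterances with an assumed literal semantics: how compatible each
# phrase is with each star rating. These values are illustrative only.
LITERAL_SEMANTICS = {
    "terrible":     np.array([0.95, 0.85, 0.02, 0.02, 0.02]),
    "bad":          np.array([0.85, 0.95, 0.10, 0.02, 0.02]),
    "good":         np.array([0.02, 0.10, 0.55, 0.95, 0.85]),
    "amazing":      np.array([0.02, 0.02, 0.10, 0.55, 0.95]),
    "not terrible": np.array([0.05, 0.15, 0.98, 0.98, 0.98]),
    "not amazing":  np.array([0.98, 0.98, 0.90, 0.45, 0.05]),
}

def listener(utterance):
    """Literal listener: P(state | utterance) by Bayes' rule with a uniform prior."""
    likelihood = LITERAL_SEMANTICS[utterance]
    return likelihood / likelihood.sum()

def speaker_probs(true_state, phi_info, phi_social, alpha=5.0):
    """Distribution over utterances for a speaker trading off two utilities:
    informational utility (log-probability the listener assigns to the truth)
    and social utility (the expected rating the listener walks away believing)."""
    utterances = list(LITERAL_SEMANTICS)
    utilities = []
    for utt in utterances:
        posterior = listener(utt)
        informational = np.log(posterior[true_state - 1] + 1e-9)
        social = float(posterior @ STATES)
        utilities.append(phi_info * informational + phi_social * social)
    utilities = np.array(utilities)
    weights = np.exp(alpha * (utilities - utilities.max()))  # softmax decision rule
    return dict(zip(utterances, weights / weights.sum()))

# A speaker that weights truth heavily is blunt about a 1-star cake...
print(speaker_probs(true_state=1, phi_info=1.0, phi_social=0.1))
# ...while shifting weight toward the listener's feelings moves probability
# mass onto softer, even inflated, feedback ("not amazing", "amazing").
print(speaker_probs(true_state=1, phi_info=0.4, phi_social=0.6))
```

The study's interpretability step runs in the opposite direction: rather than choosing the weights, the researchers ask which weights best explain the feedback an LLM actually produces (a sketch of that idea appears further below).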

Key Findings: Reasoning Models Prioritize Truth

One striking result: models optimized for reasoning—like those fine-tuned for math or coding—tended to prioritize informational utility (truth) over social utility (politeness). For example, Claude-3.7 with "medium reasoning effort" was significantly more likely to give direct feedback than its non-reasoning counterpart. The same pattern held for OpenAI’s GPT-4o and its reasoning variant, o4-mini.

"This suggests that training models for reasoning doesn’t just improve their math skills—it might also make them more blunt," says Sonia Murthy, the paper’s lead author. "That could be good or bad depending on the context."

Open-Source Models: Base Model Matters Most

The team also studied how alignment methods (like DPO or PPO) and datasets (like UltraFeedback or Anthropic’s HH-RLHF) affect these trade-offs. Surprisingly, the biggest factor wasn’t the alignment method or dataset—it was the base model itself. Qwen, which excels at math, consistently weighted truth higher than Llama, regardless of how it was fine-tuned.

Another insight: most of the shifts in value trade-offs happened early in training. "The first 25% of training is where the big changes happen," says Murthy. "After that, the model’s behavior stabilizes."

Why This Matters

Understanding how LLMs balance values isn’t just academic—it’s critical for alignment. For instance, the study’s method could help detect sycophancy, where models prioritize pleasing users over telling the truth. The researchers didn’t find strong evidence of sycophancy in current models, but their framework could spot it if it emerges.

"We’re not just asking, ‘Is this model helpful or harmless?’" says Murthy. "We’re asking, ‘How does it decide between being helpful and harmless when those goals conflict?’ That’s what humans do every day."

The study is part of a growing effort to bridge cognitive science and AI alignment. By treating LLMs as if they have "goals" (even if they don’t in a human sense), researchers can reverse-engineer the implicit rewards driving their behavior—and maybe build models that handle social nuance as deftly as humans do.
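Reverse-engineering those implicit rewards amounts to fitting the cognitive model's weights to an LLM's observed choices. Below is a minimal sketch of that idea, reusing the hypothetical speaker_probs function from the earlier snippet; the observation counts are invented and the grid search stands in for the paper's actual (more involved) fitting procedure.

```python
import numpy as np
from itertools import product

# speaker_probs(true_state, phi_info, phi_social) is the toy polite-speaker
# model from the earlier sketch; the observation counts below are invented.
observed = {
    1: {"terrible": 4, "not amazing": 14, "bad": 2},  # feedback on a 1-star cake
    2: {"bad": 8, "not amazing": 10, "good": 2},       # feedback on a 2-star cake
    4: {"good": 12, "amazing": 8},                     # feedback on a 4-star cake
}

def log_likelihood(phi_info, phi_social):
    """Log-likelihood of the observed utterance choices under the speaker model."""
    total = 0.0
    for state, counts in observed.items():
        probs = speaker_probs(state, phi_info, phi_social)
        for utterance, n in counts.items():
            total += n * np.log(probs[utterance] + 1e-12)
    return total

# Grid search over weight settings; the best-fitting pair is a readout of how
# heavily this (hypothetical) model trades truthfulness against kindness.
grid = np.linspace(0.0, 1.0, 21)
phi_info_hat, phi_social_hat = max(product(grid, grid),
                                   key=lambda w: log_likelihood(*w))
print(f"inferred weights: informational={phi_info_hat:.2f}, social={phi_social_hat:.2f}")
```

Tracked across models or training checkpoints, a social weight that grows to dominate the informational weight would be the kind of quantitative signature the framework could use to flag sycophancy if it emerges.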

The Big Picture

This work highlights a fundamental tension in AI alignment: optimizing for one value (like truthfulness) can inadvertently suppress others (like politeness). As LLMs take on more social roles—from customer service to therapy—understanding these trade-offs will only become more important.

Or, as the paper’s title suggests: inside every LLM, there are many wolves. The question is which one we’re feeding.