How Tokenization Choices Skew Language Model Probabilities

Modern language models don’t just learn from data—they’re also shaped by how that data is chopped up into tokens. A new paper from researchers at the University of Cambridge and ETH Zürich quantifies this phenomenon, which they call tokenization bias, showing that the mere presence or absence of a subword in a model’s vocabulary can dramatically alter the probabilities it assigns to text.

The Tokenization Bias Problem

Language models (LMs) like GPT-4 or LLaMA don’t process raw text directly. Instead, they rely on tokenizers—algorithms that break down text into subword units (like "hello" or "he" + "llo"). Ideally, the choice of tokenizer shouldn’t affect the probability a model assigns to a given string of characters. But in practice, it does—sometimes by orders of magnitude.

The researchers frame this as a causal question: How does including (or excluding) a subword in the tokenizer’s vocabulary affect the probability a trained model assigns to that word? For example, if "hello" is a single token in the vocabulary, does the model assign it higher probability than if it’s split into "he" + "llo"?
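
To make that comparison concrete, here is a minimal sketch of how one might score the same characters under two hand-picked segmentations with a Hugging Face causal LM. The model name and the token strings are placeholders, not from the paper; which subwords actually exist depends on the tokenizer’s vocabulary.

```python
# Minimal sketch (assumptions: a Hugging Face causal LM, "gpt2" as a
# placeholder, and token strings that may or may not be in its vocabulary).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with its tokenizer works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sequence_log_prob(token_ids):
    """Sum of log P(token_t | earlier tokens) for a fixed list of token ids."""
    bos = tokenizer.bos_token_id or tokenizer.eos_token_id
    ids = torch.tensor([bos] + token_ids)
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    # logits at position t predict the token at position t + 1
    return sum(log_probs[t, ids[t + 1]].item() for t in range(len(token_ids)))

# Two segmentations of the same characters. The token strings are illustrative;
# inspect tokenizer.get_vocab() to see which subwords actually exist.
single = tokenizer.convert_tokens_to_ids(["hello"])
split = tokenizer.convert_tokens_to_ids(["he", "llo"])

print("log P as one token:  ", sequence_log_prob(single))
print("log P as two tokens: ", sequence_log_prob(split))
```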

Measuring the Effect with Regression Discontinuity

Estimating this effect is tricky because each model is trained with only one tokenizer. You can’t simply compare models trained with different tokenizers, since they’d differ in many other ways too. Instead, the team used a clever workaround: regression discontinuity design (RDD).

Tokenizers like Byte-Pair Encoding (BPE) and WordPiece build vocabularies incrementally, ranking subwords by frequency or other metrics and adding them until hitting a fixed vocabulary size (say, 32K subwords). This creates a natural experiment: subwords just above and below the cutoff are similar, but one group gets included in the vocabulary while the other doesn’t. By comparing these groups, the researchers could isolate the effect of tokenization itself.
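
The following toy sketch illustrates the regression-discontinuity idea on simulated data (it is not the paper’s estimator): assign each candidate subword a rank, give in-vocabulary subwords a jump in outcome at the cutoff, fit a linear trend on each side within a bandwidth, and read off the difference at the cutoff.

```python
# Toy sketch of the regression-discontinuity idea (simulated data, not the
# paper's estimator): subwords are ranked by the tokenizer's merge order, only
# ranks below the cutoff enter the vocabulary, and the treatment effect is the
# jump in the outcome at the cutoff.
import numpy as np

rng = np.random.default_rng(0)
cutoff = 32_000  # vocabulary size

# Hypothetical data: each candidate subword's rank and the log-probability the
# trained model assigns to it, simulated with a +2.0 jump for in-vocab subwords.
ranks = rng.integers(28_000, 36_000, size=5_000)
in_vocab = ranks < cutoff
log_prob = -8.0 - 2e-4 * (ranks - cutoff) + 2.0 * in_vocab \
           + rng.normal(0.0, 0.5, size=ranks.size)

# Fit a linear trend on each side of the cutoff within a bandwidth, then take
# the difference between the two fits evaluated *at* the cutoff.
bandwidth = 2_000
left = (ranks >= cutoff - bandwidth) & (ranks < cutoff)
right = (ranks >= cutoff) & (ranks < cutoff + bandwidth)

fit_left = np.polyfit(ranks[left], log_prob[left], deg=1)
fit_right = np.polyfit(ranks[right], log_prob[right], deg=1)

effect = np.polyval(fit_left, cutoff) - np.polyval(fit_right, cutoff)
print(f"estimated jump in log-probability at the cutoff: {effect:.2f}")
```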

Key Findings

  • Tokenization has a massive impact on probabilities. In smaller models (~57M parameters), including a subword in the vocabulary increased its probability by up to 17x compared to when it was split into multiple tokens.
  • The bias grows during training. Counterintuitively, models don’t “learn away” this effect—it actually increases as training progresses.
  • Larger models are less affected, but the bias persists even in 850M-parameter models, where a subword’s presence still roughly doubles its probability.
  • Different tokenizers (BPE vs. WordPiece) show similar effects, suggesting this isn’t just an artifact of one algorithm.

Why This Matters

Tokenization bias isn’t just a theoretical curiosity. It has real-world implications:

  1. Multilingual fairness: Low-resource languages often get longer tokenizations, which the study shows leads to systematically lower probabilities. This could exacerbate performance disparities (see the sketch after this list).
  2. Length bias: Models already prefer shorter outputs; tokenization bias may amplify this by penalizing multi-token sequences.
  3. Vocabulary design: The findings suggest that expanding a model’s vocabulary (to include more subwords) can significantly boost the probability of those terms—a consideration for future LM development.
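
As a rough illustration of the first two points, the snippet below counts how many tokens an English-centric tokenizer needs for roughly the same sentence in English and Swahili; every extra token is one more factor at most 1 multiplied into the string’s probability. The tokenizer and sentences are illustrative choices, not taken from the paper.

```python
# Illustrative only: count tokens for (roughly) the same sentence in English
# and Swahili with an English-centric tokenizer. The tokenizer and sentences
# are placeholder choices, not taken from the paper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

english = "The weather is nice today."
swahili = "Hali ya hewa ni nzuri leo."  # roughly the same sentence in Swahili

print(len(tokenizer.tokenize(english)), "tokens for the English sentence")
print(len(tokenizer.tokenize(swahili)), "tokens for the Swahili sentence")
```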

The work underscores that tokenization isn’t neutral. It’s a key architectural choice that directly shapes what models learn—and what they prefer to generate.