Large Language Models Struggle with Non-English Educational Tasks, Study Finds
Large language models (LLMs) like GPT-4o and Gemini are increasingly used in educational settings, but a new study reveals significant performance gaps when these models operate outside English. Researchers from ETH Zurich and Bocconi University evaluated six popular LLMs on four educational tasks in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, and Czech) and found that, although the models often handle these languages reasonably well, their accuracy frequently falls well short of their English results.
The Study
The research team tested the models on four key educational tasks:
- Identifying student misconceptions (e.g., diagnosing why a student selected an incorrect answer in a math problem).
- Providing targeted feedback (selecting the most appropriate feedback for a student’s incorrect answer).
- Interactive tutoring (engaging in a multi-turn conversation to guide a student toward the correct solution).
- Grading translations (evaluating the quality of a student’s translation compared to a perturbed version).
The models evaluated included GPT-4o, Gemini 2.0 Flash, Claude 3.7 Sonnet, Llama 3.1 405B, Mistral Large 2407, and Command-A. The results showed that performance in non-English languages often lagged behind English, with the drop most pronounced in lower-resource languages like Telugu and Czech.
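For readers who want to probe these gaps themselves, the sketch below shows how a per-language, per-task accuracy harness might be scripted. The dataset paths, JSONL field names, and the query_model stub are hypothetical placeholders for illustration, not the authors' actual evaluation code.

```python
# Minimal sketch of a per-language, per-task accuracy harness.
# File layout, field names, and query_model() are hypothetical placeholders.
import json
from collections import defaultdict

LANGUAGES = ["en", "hi", "ar", "fa", "te", "uk", "cs"]
TASKS = ["misconception", "feedback", "tutoring", "translation_grading"]

def query_model(prompt: str) -> str:
    """Replace with a call to whichever LLM API you are evaluating."""
    raise NotImplementedError

def load_items(task: str, lang: str):
    """Assumes one JSON object per line with 'prompt' and 'gold' (expected answer) fields."""
    with open(f"data/{task}.{lang}.jsonl", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def evaluate() -> None:
    correct = defaultdict(int)
    total = defaultdict(int)
    for task in TASKS:
        for lang in LANGUAGES:
            for item in load_items(task, lang):
                answer = query_model(item["prompt"]).strip()
                total[(task, lang)] += 1
                correct[(task, lang)] += int(answer == item["gold"])
    # Report accuracy per (task, language) pair; averaging over tasks gives
    # per-language figures like those quoted in the findings below.
    for key in sorted(total):
        print(key, f"{correct[key] / total[key]:.1%}")

if __name__ == "__main__":
    evaluate()
```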
Key Findings
- English Dominates: On average, models achieved 70.9% accuracy in English but only 49.7% in Telugu and 55.3% in Czech. GPT-4o and Gemini 2.0 Flash were the top performers, while Claude 3.7 Sonnet struggled across the board.
- Feedback Tasks Are Hardest: Models frequently chose the feedback written for the correct answer rather than feedback addressing the student's actual mistake, suggesting they struggle with nuanced pedagogical reasoning.
- Translated Prompts Don't Help: Surprisingly, translating prompts into the target language rarely improved performance, and sometimes made it worse; English prompts often yielded better results (see the sketch after this list).
- Tutoring Is Inconsistent: Multi-turn tutoring was particularly challenging, with models frequently either failing to guide students or outright revealing the answer too soon.
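To make the prompt-language finding concrete, the small sketch below contrasts the two conditions being compared: English instructions versus instructions in the task language, both wrapping the same student work. The instruction strings and the build_prompt helper are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative contrast of the two prompting conditions: only the instruction
# changes language; the question and student answer stay in the task language.
# Both instruction strings are hypothetical, not the paper's prompts.
INSTRUCTIONS = {
    "en": "Identify the misconception behind the student's incorrect answer.",
    "hi": "<the same instruction, machine-translated into Hindi>",  # placeholder
}

def build_prompt(instruction_lang: str, question: str, student_answer: str) -> str:
    return (
        f"{INSTRUCTIONS[instruction_lang]}\n\n"
        f"Question: {question}\n"
        f"Student answer: {student_answer}"
    )
```

In the study, keeping the instructions in English while leaving the student content in the task language was often the stronger of the two conditions.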
Implications for Educators and Developers
The study highlights a critical gap in deploying LLMs for multilingual education. While models like GPT-4o and Gemini show promise, their performance varies widely by language; developers should test them rigorously in each target language before deploying them in non-English classrooms.
“Without proper evaluation, deploying LLMs in multilingual education risks exacerbating inequalities,” the authors warn. “A model that works well in English may fail in Hindi or Arabic, leading to misinformation or culturally inappropriate content.”
Limitations
The study acknowledges that translation quality may have affected results—machine-translated tasks could introduce noise, making it harder to isolate model weaknesses. Future work could involve human-translated datasets for more precise benchmarking.
Bottom Line
LLMs are powerful tools, but their educational utility is still uneven across languages. Before rolling them out globally, developers need to ensure they perform just as well in Telugu as they do in English.
For more details, check out the full paper on arXiv.