Most LLM benchmarks are in English, so they do not give us an accurate assessment of how well LLMs perform tasks in other languages. Rankings from English benchmarks do not necessarily carry over to how well tasks in **other languages** are handled.
Most LLM benchmarks in other languages (like German) are based on large data sets of chat messages, customer support requests, or similar. The issue with these is that they are publicly available and therefore potentially part of an LLM's training data. I consider them invalid for benchmarking for that reason.
To solve this issue, I developed ML•LLM, a German-language LLM benchmark, split into two parts: logic and non-logic.
- ML•LLM•L is a set of questions and riddles that require logic and reasoning to find the correct answer.
- ML•LLM•NL is a set of questions that require knowledge of the German language or laws/customs in Germany.
All questions are in German.
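As a rough sketch, a two-part benchmark like this could be represented and scored as follows. The structure below is illustrative only, not the actual ML•LLM format, and every name in it is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Hypothetical structure; the actual ML•LLM data format is not published.
    part: str      # "L" (logic) or "NL" (non-logic)
    question: str  # always phrased in German
    answer: str    # expected answer

def score(items: list[BenchmarkItem], model_answers: list[str]) -> dict[str, float]:
    """Return the fraction of correct answers per benchmark part."""
    correct = {"L": 0, "NL": 0}
    total = {"L": 0, "NL": 0}
    for item, given in zip(items, model_answers):
        total[item.part] += 1
        if given.strip().lower() == item.answer.strip().lower():
            correct[item.part] += 1
    return {part: correct[part] / total[part] for part in total if total[part]}
```

In the results below, "answered all questions correctly" simply means a score of 1.0 on that part.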
ML•LLM•L (Logic)
Only Grok 3 DeepSearch managed to answer all questions correctly.
Second place is shared by Qwen2.5-Max, DeepSeek R1, and ChatGPT o3-mini-high. After that come multiple Qwen, OpenAI, Grok, and DeepSeek models.
Gemini (Google), Claude, and Mistral all had disappointing results in this benchmark.
ML•LLM•NL (Non-Logic)
Here, four models managed to answer all questions correctly: Grok 3 Think, Grok 3 DeepSearch, DeepSeek V3, and ChatGPT 4o.
Overall results
The overall results show that Grok from xAI is the clear leader, with DeepSeek and some of the OpenAI models close behind.
With OpenAI and Google, we can observe steady progress from model to model.
Cohere (Command R), Tencent (Hunyuan Large), Claude, and Mistral need to catch up.
Surprising Results
While most LLMs have learned to count the number of R's in the word strawberry, they often cannot do the same in German, probably because the "R in strawberry" example became so famous that some labs specifically fine-tuned for it or decided to handle it via the system prompt.
With the reasoning models, I encountered a few questionable chains of thought. One of the worst ones came from Qwen's QwQ-32B. Counting the number of letters in the German word fünf (five) should have been a simple thing. But it led to a lot of confusion.
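For reference, both counting tasks are trivial as plain string operations. A quick Python check, using Erdbeere as an assumed German counterpart to strawberry (the post does not name the exact word used):

```python
# Letter counting is a one-line string operation in code,
# even though tokenization makes it surprisingly hard for LLMs.
print("strawberry".count("r"))        # 3
print("Erdbeere".lower().count("r"))  # 2 -- assumed German test word
print(len("fünf"))                    # 4 letters in the German word for five
```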
QwQ-32B is also the only model that got stuck in an endless chain, filling my screen with a stream of "hmm, hmm, hmm, hmm, hmm" or rotating between the same three solutions dozens of times before I pulled the plug.
Whenever I checked, all reasoning models did their reasoning in English, even with very long and detailed German questions. Funnily enough, I have encountered both Chinese and Spanish reasoning steps when using English prompts. I was surprised I did not observe more German reasoning attempts. I guess this makes transparent how little of the training data is in German. I expect this effect to be even stronger for languages like Finnish or Czech.
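A check like this could also be automated, for example with the langdetect package (just one possible approach, not how the checks above were done):

```python
# pip install langdetect
from langdetect import detect

def reasoning_language(trace: str) -> str:
    """Classify the dominant language of a chain-of-thought trace.

    Note: detection on very short snippets can be unreliable.
    """
    return detect(trace)  # ISO 639-1 codes such as "en" or "de"

print(reasoning_language("Okay, let's think step by step about this riddle."))  # "en"
print(reasoning_language("Zuerst zähle ich die Buchstaben im Wort fünf."))      # "de"
```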
Next Steps
Do you also see the need for non-English LLM benchmarks?
Are you aware of anyone else working on this?