Last week, the Chinese big-tech company Alibaba released its best model to date: Qwen2.5-Max. It is a Mixture-of-Experts model competing with the "non-reasoning" GPT-4o, Claude 3.5, and Deepseek-V3.
While everyone was busy discussing Deepseek, Alibaba's advancement is no less impressive: Qwen2.5-Max, a direct competitor to the "non-reasoning" V3 chat model, demonstrates better performance.
Rock Solid Stability
I have evaluated the new model, as well as a number of smaller Qwen models (closed Qwen Plus, Qwen Turbo, and open Qwen2.5 72B), via LLM Chess.
LLM Chess simulates multiple games of a random bot playing against an LLM. With thousands of prompts and millions of tokens, every game is unique (unlike most evals, which have fixed sets of prompts and pass conditions). Several metrics are collected and aggregated across multiple runs. A model is evaluated on reasoning (the percentage of wins/draws) and on steerability/durability (how often it fails to follow instructions or drops out of the game after multiple erroneous replies).
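To make the setup concrete, here is a minimal sketch of what such a game loop could look like. This is my own illustration built on the python-chess package, not LLM Chess's actual code; `ask_llm` is a hypothetical stand-in for the proxy that prompts the evaluated model:

```python
import random

import chess  # python-chess


def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for the client that prompts the evaluated model."""
    raise NotImplementedError


def play_one_game(max_moves: int = 200) -> chess.Board:
    """One simulated game: a random bot (White) vs. an LLM (Black)."""
    board = chess.Board()
    while not board.is_game_over() and board.fullmove_number <= max_moves:
        if board.turn == chess.WHITE:
            # The opponent is a random bot, so every game diverges immediately.
            move = random.choice(list(board.legal_moves))
        else:
            # The LLM plays Black; the legal moves are listed in the prompt.
            legal = ",".join(m.uci() for m in board.legal_moves)
            reply = ask_llm(f"Pick your move. Legal moves: {legal}")
            move = chess.Move.from_uci(reply.strip())
        board.push(move)
    return board
```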
The eval places the model near the top of the leaderboard, alongside SOTA models from the top AI labs.
Rank | Player | Wins | Draws | Mistakes | Tokens |
---|---|---|---|---|---|
1 | o1-preview-2024-09-12 | 46.67% | 43.33% | 9.29 | 2660.07 |
2 | o1-mini-2024-09-12 | 30.00% | 50.00% | 4.29 | 1221.14 |
3 | deepseek-reasoner-r1 | 22.58% | 19.35% | 52.93 | 4584.97 |
4 | anthropic.claude-v3-5-sonnet-v1 | 6.67% | 80.00% | 1.67 | 80.42 |
5 | gpt-4o-2024-11-20 | 4.23% | 87.32% | 0.14 | 50.58 |
6 | grok-2-1212 | 4.08% | 30.61% | 32.79 | 66.23 |
7 | anthropic.claude-v3-5-sonnet-v2 | 3.33% | 88.33% | 2.98 | 90.85 |
8 | gpt-4o-2024-08-06 | 1.67% | 83.33% | 0.16 | 7.70 |
9 | qwen-max-2025-01-25 | 0.00% | 96.67% | 0.00 | 6.06 |
10 | gpt-4-turbo-2024-04-09 | 0.00% | 93.33% | 0.00 | 6.03 |
11 | anthropic.claude-v3-opus | 0.00% | 83.33% | 9.31 | 72.86 |
12 | gpt-4o-2024-05-13 | 0.00% | 80.00% | 3.54 | 31.33 |
13 | gpt-4o-mini-2024-07-18 | 0.00% | 60.00% | 35.78 | 108.22 |
.. | .. | .. | .. | .. | ... |
19 | qwen-plus-2025-01-25 | 0.00% | 9.09% | 66.03 | 440.41 |
.. | .. | .. | .. | .. | ... |
21 | qwen2.5-72b-instruct | 0.00% | 6.67% | 89.38 | 219.47 |
.. | .. | .. | .. | .. | ... |
24 | deepseek-chat-v3 | 0.00% | 2.86% | 98.32 | 246.93 |
.. | .. | .. | .. | .. | ... |
31 | qwen-turbo-2024-11-01 | 0.00% | 0.00% | 230.66 | 192.37 |
.. | .. | .. | .. | .. | ... |
While the model doesn't shine in reasoning (it is not a breakthrough in win rates like the o1/R1 models), it is no worse at chess than other non-reasoning models (such as GPT-4o and Claude 3.5), especially if you adjust for losses (Win-Loss below is the win rate minus the loss rate):
Rank | Model | Win-Loss |
---|---|---|
4 | anthropic.claude-v3-5-sonnet-v1 | -6.67% |
5 | gpt-4o-2024-11-20 | -4.24% |
9 | qwen-max-2025-01-25 | -3.33% |
Speaking of instruction following, there were zero erroneous replies from the model: across 60 game simulations it was prompted over 17,000 times, and not once did it hallucinate an illegal move or violate the prompt instructions by returning a non-existent action name.
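To make "erroneous reply" concrete, here's a rough sketch of the kind of validation such a harness performs on every reply (again my own illustration, not LLM Chess's actual parsing code):

```python
import re

import chess  # python-chess


def validate_reply(reply: str, board: chess.Board) -> str | None:
    """Return the UCI move if the reply is well-formed and legal, else None.

    A reply counts as a mistake when the action name is malformed or missing,
    or when the named move is not in the current list of legal moves.
    """
    match = re.search(r"make_move\s+([a-h][1-8][a-h][1-8][qrbn]?)", reply)
    if match is None:
        return None  # non-existent action name or malformed reply
    move = match.group(1)
    if chess.Move.from_uci(move) not in board.legal_moves:
        return None  # hallucinated an illegal move
    return move
```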
Here's an example of Qwen Max's dialog; there were thousands like it during the eval run (each one a bit different due to the changing board state).
Deepseek's V3, on the other hand, was very unstable. The reason it scored so low on the leaderboard is that it broke the game loop too often. As you can see in this example of a dialog, V3 repeatedly failed to pick a legal move, even though a list of legal moves was provided in the prompt:
```
Proxy (to Player_Black):

h8h7,h8h6,h8h5,h8h4,h8h3,h8h2,h8h1,g8h6,g8f6,e8f7,e8d7,d8d7,d8d6,d8d5,c8d7,c8b7,c8e6,c8a6,a8b8,f4g6,f4e6,f4h5,f4d5,f4h3,f4d3,f4g2,f4e2,d4e3,d4c3,g7g6,e7e6,c7c6,a7a6,d4d3,e7e5,a7a5

--------------------------------------------------------------------------------
Player_Black (to Proxy):

Thank you for providing the list of legal moves. Let me analyze the position further and decide on the best move for Black.

...

**Action:** `make_move f6d5`

Let me know the updated board position after this move!
```
Another side effect noticed with V3 is its verbosity, and how the model unravels as the game progresses. In the initial board state its replies are quite concise, but with further moves it tends to slip into a running commentary on the game, spitting out more tokens and hallucinating illegal moves.
Progress and Significance
Pay attention to the leap the new Qwen model made in terms of steerability/stability. The older Plus/Turbo and open-weight Qwen2.5 72B models were horrible in this regard: the number of mistakes skyrocketed, and the average game was too short to score a single draw:
Rank | Model | Average Moves in Game* |
---|---|---|
9 | qwen-max-2025-01-25 | 196.50 |
19 | qwen-plus-2025-01-25 | 87.58 |
21 | qwen2.5-72b-instruct | 64.10 |
31 | qwen-turbo-2024-11-01 | 20.42 |
* a single game is limited to 200 moves
Why should you even care whether a model can stay in some synthetic chess game loop in the first place? If you deploy an LLM-based product with a prompt chain integrated into a workflow, or with external data sources feeding data into the prompt (e.g. RAG), model stability plays a crucial role.
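In production, the equivalent of "staying in the game loop" is the model reliably returning parseable output and recovering when it doesn't. Here is a minimal sketch of such a defensive wrapper, assuming a generic `call_model` client (a hypothetical name); the retry budget mirrors how LLM Chess drops a game after repeated errors:

```python
import json

MAX_RETRIES = 3  # analogous to LLM Chess dropping a game after repeated errors


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client of choice."""
    raise NotImplementedError


def get_structured_reply(prompt: str) -> dict:
    """Ask for JSON, feed parse errors back to the model, fail after N attempts."""
    for _ in range(MAX_RETRIES):
        reply = call_model(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Self-correction prompt: stable models recover from this,
            # brittle ones keep looping until the retry budget runs out.
            prompt = (f"Your previous reply was not valid JSON ({err}). "
                      "Respond again with JSON only.")
    raise RuntimeError(f"No valid JSON after {MAX_RETRIES} attempts")
```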
Dozens of models have gone through LLM Chess, and it is fascinating to observe how a slight variation in prompts can trigger some of them into a failure mode. Even the best SOTA models can go blind to the list of legal moves provided in the dialog, refuse to respond with correct action names, or fail to recover when an error message prompts them to self-correct. There can be multiple reasons for such anomalies, e.g. tokenizer glitches or out-of-distribution inputs the model has never seen.
Last summer I was experimenting with agentic workflows (based on CrewAI) and tried to rely on local models... The moment I switched from GPT-4o/mini to various top smaller models (e.g. Llama fine-tunes from the newsletters of the time), it was a disappointment: a fairly simple, sequential flow broke down, although for some models (e.g. Gemma 2) the results were quite decent. The LLM Chess Leaderboard gives a good perspective on which models will likely fail in agentic scenarios.
Availability and Price
The key disadvantage of Qwen Max is that there are no open weights. Unlike Deepseek V3, the new Qwen Max model cannot be self-hosted or fine-tuned. Besides, very little detail about the model has been shared; there's not even information on the number of parameters. We get used to the good stuff quickly: a release of a commercial closed model, available only through an API, now brings up the feeling that something is missing...
The Qwen Max API is roughly half the price of its OpenAI/Anthropic competitors, yet ~5 times more expensive than the Deepseek V3 API.
Model | Input Price (per MTok) | Output Price (per MTok) | Notes |
---|---|---|---|
Claude 3.5 Sonnet | $3.00 | $15.00 | 50% off with Batches API |
gpt-4o | $2.50 | $10.00 | 50% off with Batch API |
deepseek-chat (Discounted) | $0.14 | $0.28 | Discounted until 2025-02-08 |
deepseek-chat (Normal) | $0.27 | $1.10 | - |
Qwen-Max-2025-01-25 | $1.60 | $6.40 | - |
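A quick back-of-the-envelope check of those ratios, using the output prices from the table above:

```python
# Output prices per MTok, taken from the table above.
prices = {
    "Claude 3.5 Sonnet": 15.00,
    "gpt-4o": 10.00,
    "deepseek-chat (normal)": 1.10,
}
qwen_max = 6.40
for name, price in prices.items():
    print(f"Qwen-Max output costs {qwen_max / price:.1f}x {name}")
# -> 0.4x Claude 3.5 Sonnet, 0.6x gpt-4o, 5.8x deepseek-chat (normal)
```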
Deepseek's servers have seen a lot of traffic lately, and users have complained about the unavailability of its services. In this regard, Qwen is in a better position: after all, it is hosted by Alibaba Cloud, the Chinese alternative to AWS.