Maxim Saplin

Deepseek R1 vs OpenAI o1

Deepseek R1 is out - available via the Deepseek API or the free Deepseek chat. If you follow the LLM/Gen AI space, you've likely seen titles, read posts, or watched videos praising the model: a 671B MoE model, open weights, lots of detail on the training process. It challenges OpenAI's reasoning models (o1/o1-mini) across many benchmarks at a fraction of the cost... There are even smaller "distilled" versions of R1 available for local runs (via llama.cpp/ollama/lmstudio etc.).

I have been stress-testing models with LLM Chess since autumn, and so far none of the "reasoning" (or "thinking") models have impressed me (except OpenAI's o1). Right away I launched the benchmark, but I had to wait a few days to collect enough data (it seems the API was throttled; it was extremely slow).

LLM Chess simulates multiple games of a random bot playing against an LLM. Thousands of prompts, millions of tokens, and every game is unique (unlike most evals, which have fixed sets of prompts/pass conditions). Several metrics are collected and aggregated across multiple runs. A model's performance is evaluated on reasoning (% of Wins/Draws) and steerability/durability (how often it fails to follow instructions or drops out of the game due to multiple erroneous replies).
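To make the setup more concrete, here is a minimal sketch of such a game loop using the python-chess library. The `ask_llm_for_move` callback and the retry limit are assumptions for illustration; the actual LLM Chess harness has its own prompts and dialog protocol.

```python
import random
import chess

MAX_MOVES = 200      # games are capped; reaching the cap counts as a Draw
MAX_RETRIES = 3      # erroneous replies tolerated before the LLM drops out (assumed value)

def play_one_game(ask_llm_for_move) -> str:
    """Random bot (White) vs. an LLM (Black); returns a result string."""
    board = chess.Board()
    mistakes = 0
    for _ply in range(MAX_MOVES * 2):
        if board.is_game_over():
            return board.result()                      # "1-0", "0-1" or "1/2-1/2"
        if board.turn == chess.WHITE:                  # random bot's turn
            board.push(random.choice(list(board.legal_moves)))
        else:                                          # LLM's turn
            for _ in range(MAX_RETRIES):
                reply = ask_llm_for_move(board)        # model replies with a move, e.g. "e7e5"
                try:
                    board.push_uci(reply.strip())
                    break
                except ValueError:                     # malformed or illegal move
                    mistakes += 1
            else:
                return "LLM dropped out"               # too many erroneous replies -> loss
    return "Draw (move limit reached)"
```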

Reasoning Models

Before o1, LLMs couldn't beat a random player in chess. GPT-4o? Zero wins. Claude 3.5? Zero. They'd either collapse early or drag games to the 200-move limit (with an automatic Draw assigned).

Then came o1. OpenAI’s "reasoning" models broke the record:

  • o1-preview: 46.67% wins
  • o1-mini: 30% wins

Other "reasoning" models? After the o1 release in late 2024 came the controversy around OpenAI's secrecy... The hidden "reasoning" tokens discussion (invisible yet billed) and how people were banned since OpenAI suspected they tried to crack their secrets. At the time we have seen attempts to reproduce the success of o1 with AI labs introducing "reasoning" models. E.g. Qwen's QwQ, Sky T1. Even Google released their experimental Gemini Thinking model in December 2024.

None of the alternative "reasoning" or "thinking" models came close to OpenAI's - they struggled even with basic instruction following, drowning in verbosity and dropping out of the game loop after just a few moves: games lasted 2 to 14 moves on average. Compare that to a non-reasoning, old and out-of-fashion GPT-4 Turbo, which lasted 192 moves on average (before losing to the random player due to a checkmate :)

Those late-2024 non-OpenAI reasoning models turned out to be surrogates. That set my expectations for R1 low...

R1

Deepseek's reasoning model turned out to be the real deal. It scored a meaningful number of wins while keeping the number of mistakes relatively low.

| Model | Wins | Draws | Mistakes | Tokens/move |
|---|---|---|---|---|
| o1-preview | 46.67% | 43.33% | 3.74 | 2660 |
| o1-mini | 30.00% | 50.00% | 2.34 | 1221 |
| Deepseek-R1 | 22.58% | 19.35% | 18.63 | 4585 |

Mistakes - # of erroneous LLM replies per 1000 moves
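The normalization behind that metric is simple; a trivial sketch (the function name is assumed for illustration):

```python
def mistakes_per_1000_moves(erroneous_replies: int, total_moves: int) -> float:
    # e.g. 7 erroneous replies over 3000 moves -> ~2.33 mistakes per 1000 moves
    return 1000 * erroneous_replies / total_moves
```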

R1 did well, but not great. Pay attention to how few Draws it had compared to the o1 models. That's due to R1 breaking the protocol, violating prompt instructions, OR hallucinating illegal moves (and hence scoring a loss). It struggles with instruction following and is susceptible to prompt variations, randomly falling out of the game loop.

For reference, here are the top non-reasoning models as of January 2025:

| Model | Wins ▼ | Draws | Mistakes | Tokens/move |
|---|---|---|---|---|
| anthropic.claude-v3-5-sonnet-v1 | 6.67% | 80.00% | 0.27 | 80.42 |
| gpt-4o-2024-11-20 | 4.23% | 87.32% | 0.15 | 50.58 |
| gpt-4-turbo-2024-04-09 | 0.00% | 93.33% | 0.00 | 6.03 |
| anthropic.claude-v3-opus | 0.00% | 83.33% | 1.61 | 72.86 |

Reasoning Models - a League of their Own

Besides a significant number of wins, the reasoning models maintained a positive average material difference. Material count in a chess game is a weighted score of all the pieces (e.g. a Pawn is 1 unit of material and a Queen is 9). Each player starts the game with a material count of 39. The eval calculates the difference in material at the end of each game - if a player loses more material than it captures, the difference is negative. Other non-reasoning models (and reasoning "surrogates") typically end up with a negative material diff or one around 0 (when they fail to progress in the game, breaking the loop).
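As a rough illustration of the metric, here is a minimal sketch of a material count using python-chess (standard piece values; the function names are assumptions, not the benchmark's actual code):

```python
import chess

# Standard piece values; the King is not counted as material.
PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
}

def material(board: chess.Board, color: chess.Color) -> int:
    """Weighted piece score for one side; 39 at the start of a game."""
    return sum(
        PIECE_VALUES.get(piece.piece_type, 0)
        for piece in board.piece_map().values()
        if piece.color == color
    )

def material_diff(board: chess.Board, llm_color: chess.Color) -> int:
    """Material difference from the LLM player's perspective at game end."""
    return material(board, llm_color) - material(board, not llm_color)
```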

Here are the average material diffs at the end of the game:

| Model | Material Diff | Avg Game Duration (moves) |
|---|---|---|
| o1-preview-2024-09-12 | 9.99 | 124.8 |
| o1-mini-2024-09-12 | 10.77 | 142.73 |
| deepseek-reasoner-r1 | 10.83 | 91.77 |
| anthropic.claude-v3-5-sonnet-v1 | -4.48 | 183.38 |
| gpt-4o-2024-11-20 | -8.23 | 189.72 |
| qwq-32b-preview@q4_k_m | -0.07 | 7.97 |
| gemini-2.0-flash-thinking-exp-1219 | 0.00 | 2.33 |

Distilled R1

I have also tested a few quantized versions of distilled R1. What Deepseek did was fine-tune several smaller (70B, 14B, 8B etc.) Qwen 2.5 and Llama 3.1 models on the outputs of the full-size R1 model. Supposedly they should have gained reasoning skills. There's also a special <think></think> section in the output that keeps all the reasoning tokens isolated from the final answer (something important that the earlier thinking models missed).
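As a rough illustration, separating the reasoning from the final answer comes down to splitting the reply on those tags; a minimal sketch, assuming at most one think block per reply:

```python
import re

def split_think(reply: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from an R1-style reply."""
    match = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = (reply[:match.start()] + reply[match.end():]).strip()
        return reasoning, answer
    # Fallback for a missing opening <think> (observed with the distilled models):
    # treat everything before a lone closing tag as reasoning.
    if "</think>" in reply:
        reasoning, _, answer = reply.partition("</think>")
        return reasoning.strip(), answer.strip()
    return "", reply.strip()
```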

They didn't do well:

| Model | Wins ▼ | Draws | Mistakes | Tokens |
|---|---|---|---|---|
| deepseek-r1-distill-qwen-32b@q4_k_m | 0.00% | 0.00% | 727.27 | 2173.83 |
| deepseek-r1-distill-qwen-14b@q8_0 | 0.00% | 0.00% | 1153.85 | 3073.06 |

Besides, I noticed that these models sometimes failed to properly open and close the think tags (missing the opening <think>).

P.S.

Google also dropped an update to Gemini Thinking the day after the R1 release!

It did much better than the December version! At least it is now steerable and can last for ~40 moves in a game. They have also separated the thinking part, so the reasoning tokens no longer bloat the response. And yet, it is still a thinking surrogate...

| Model | Wins ▼ | Draws | Mistakes | Tokens |
|---|---|---|---|---|
| gemini-2.0-flash-thinking-exp-01-21 | 0.00% | 6.06% | 5.97 | 17.77 |
| gemini-2.0-flash-thinking-exp-1219 | 0.00% | 0.00% | 1285.71 | 724.54 |

Curiously, most of the game drop-outs happened due to server errors (e.g. copyright filters) or empty completions - there are definitely stability issues with the model.

Top comments (1)

Peter Truchly

My first thought was: why would anyone even try to play chess with an LLM? There are better "algorithms" for that.
But yes, especially in today's world where everybody expects AGI with every new LLM, why shouldn't it play chess after all?

Where I see the problem (and a limitation) of the current approach is in the misuse of the general-purpose reasoning capabilities of LLMs, which are undeniably there, but only in emergent form. What would an average person do if confronted with this task? Most of us would just use some software, a smaller portion would implement their own software, and only a handful (of chess masters) would play by themselves.

Unless we equip these 'AI' models with a complete set of tools, an environment, and the ability to use workflow patterns to model and execute a workflow designed for a given task, the results are going to be quite disappointing, at least for the near future.