DEV Community

Shrijal Acharya for Composio

Posted on • Originally published at composio.dev

Grok 3 vs. Deepseek r1: A deep analysis

Almost everyone now knows about the DeepSeek R1 model, an open-source AI from China that took the internet by storm.

The main selling point of DeepSeek R1 is that it's completely free and open-source, yet it can rival some of OpenAI's paid models, like o1.

Deepseek slapping OpenAI meme

Then, on February 16, Elon Musk announced Grok 3, labelling it as the "Smartest AI on Earth."

But how does it compare to our free model, DeepSeek R1? I was curious, so I decided to test the two side by side in coding, reasoning, math, and creative writing.

Let's find out if all the hype around Grok 3 holds any weight.

I tested both models against a list of prompts I created. Below, I share the results and my thoughts on how each model handled the various tasks.

TL;DR

If you want to jump straight to the conclusion, here's a quick summary of the findings comparing DeepSeek R1 and Grok 3:

  • Reasoning and Math: DeepSeek R1 and Grok 3 perform similarly on reasoning and math questions.
  • Coding: Grok 3 outperforms DeepSeek R1 and writes much better code.
  • Creative Writing: Both models perform well in creative writing. DeepSeek R1 is known to be a great writing model, but I prefer Grok 3's responses.

Brief on Grok 3 AI Model

Grok 3 is the latest language model from xAI, which says it was trained with about 10x the compute of earlier models. It includes tools like DeepSearch, a search agent that works through a query step by step, and Big Brain Mode for handling complex tasks.

Grok 3 is currently in beta, but is it really that much better than DeepSeek R1? We'll find out by the end of this article.

According to the official benchmarks shared by the xAI team at the launch event, Grok 3 appears to be a game-changer, outperforming all its competitors in almost every benchmark.

Grok 3 AI Model stats


DeepSeek R1 vs. Grok 3

I've used Chatbot Arena to test both of these models side by side, as it is the only available and trusted third-party site that allows testing the early Grok 3 model.

Grok 3 vs Deepseek R1

Reasoning Problems

ℹ️ Here, we will check the reasoning capabilities of both models.

1. Russian Roulette

Let's kick things off with an interesting question:

You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five bullets, spins the chambers and fires at himself, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing at you. Should he spin again?

  • Response from Grok 3:

Russian Roulette problem answer by Grok 3 Model

  • Response from DeepSeek R1:

Russian Roulette problem answer by Deepseek R1 Model

Final Verdict: Both of the models answered the problem correctly with correct reasoning. ✅
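
For readers who want to check the answer themselves: with five bullets in a six-chamber revolver, the blank that was just fired was the only empty chamber, so not spinning means the next chamber is certainly loaded, while a fresh spin gives a 5-in-6 chance. So he should spin again. Here is a minimal Python sketch of that calculation. The function name is my own, and it assumes the cylinder advances one chamber per trigger pull.

```python
from fractions import Fraction


def hit_probabilities(chambers: int = 6, bullets: int = 5):
    """Return (P(hit) without a new spin, P(hit) after a new spin).

    Assumes the cylinder advances one chamber per trigger pull and that
    the first pull landed on an empty chamber, as in the puzzle.
    """
    # Assumption: the bullets sit in consecutive chambers 0..bullets-1.
    loaded = [i < bullets for i in range(chambers)]
    # Positions the first pull could have landed on, given it was a blank.
    blanks = [i for i in range(chambers) if not loaded[i]]
    # Of those, how many are followed by a loaded chamber?
    next_loaded = sum(loaded[(i + 1) % chambers] for i in blanks)
    no_spin = Fraction(next_loaded, len(blanks))
    spin = Fraction(bullets, chambers)
    return no_spin, spin


no_spin, spin = hit_probabilities()
print(no_spin, spin)  # 1 5/6
```

Using `Fraction` keeps the probabilities exact, which makes the comparison between 1 and 5/6 easy to read.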

2. Olympic Athlete Siblings

I am the sister of two Olympic athletes. But these two athletes are not my sisters. How is this possible?

  • Response from Grok 3:

Olympic Athlete problem answer by Grok 3 Model

  • Response from DeepSeek R1:

Olympic Athlete problem answer by Deepseek R1 Model

Final Verdict: Here as well, both models answered the problem correctly with correct reasoning. ✅

3. Find out the Liar

The first two questions were straightforward. Let's conclude the reasoning test with a slightly trickier question:

You meet three people: Haris, Antony, and Michael. Haris says, "At least one of us is a liar." Antony says, "Haris is lying." Michael says, "Antony is telling the truth." Determine who is lying and who is telling the truth.

  • Response from Grok 3:

Find out the Liar problem answer by Grok 3 Model - part 1

Find out the Liar problem answer by Grok 3 Model - part 2

Here, Grok 3 provided thorough reasoning, considering each case and breaking down the problem step by step to support its conclusion.

  • Response from DeepSeek R1:

Find out the Liar problem answer by Deepseek R1 model

Final Verdict: Both models answered the problem correctly ✅, but the reasoning of the Grok 3 model stands out to me.
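
The case-by-case reasoning can also be checked mechanically. The sketch below is my own illustration, not either model's output. It assumes each person is either fully truthful or lying, tries all eight combinations, and keeps the ones where every statement agrees with its speaker's status.

```python
from itertools import product


def solve_liar_puzzle():
    """Return every consistent assignment (True = telling the truth)."""
    solutions = []
    for haris, antony, michael in product([True, False], repeat=3):
        # Haris: "At least one of us is a liar."
        claim_haris = not (haris and antony and michael)
        # Antony: "Haris is lying."
        claim_antony = not haris
        # Michael: "Antony is telling the truth."
        claim_michael = antony
        # A truthful person's claim is true; a liar's claim is false.
        if (haris, antony, michael) == (claim_haris, claim_antony, claim_michael):
            solutions.append(
                {"Haris": haris, "Antony": antony, "Michael": michael}
            )
    return solutions


print(solve_liar_puzzle())
# [{'Haris': True, 'Antony': False, 'Michael': False}]
```

Only one assignment survives: Haris tells the truth, while Antony and Michael are lying.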

Summary:

From this, we can see that both models are quite strong in reasoning capabilities, as they both provided correct answers to all my reasoning questions. Still, the reasoning capability of the Grok 3 model seems to outperform R1 at times.

Mathematics

💁 Looking at the reasoning question answers from both models, I was convinced that both of these models should also pass the math questions.

Starting with an easy question, let's see how they perform:

1. Number of people in the train

There were some people on a train. 19 people get off the train at the first stop. 17 people get on the train. Now there are 63 people on the train. How many people were on the train to begin with?

  • Response from Grok 3:

Number of people on the train problem answer by Grok 3 Model

  • Response from DeepSeek R1:

Number of people on the train problem answer by Deepseek R1 Model

Final Verdict: Both models chose a similar approach and ended up with the correct answer. ✅

2. Prime Number Theorem Approximation

Estimate the number of prime numbers less than 10^8.

  • Response from Grok 3:

Prime Number Theorem Approximation problem answer by Grok 3 Model- part 1

Prime Number Theorem Approximation problem answer by Grok 3 Model - part 2

  • Response from DeepSeek R1:

Prime Number Theorem Approximation problem answer by Deepseek R1 Model - part 1

Prime Number Theorem Approximation problem answer by Deepseek R1 Model - part 2

Final Verdict: Both models chose the correct approach of using the Prime Number Theorem, but the answer from the DeepSeek R1 model was off by roughly 1K, while Grok 3 was off by roughly 20K. So, it's fair to say DeepSeek R1 ✅ outperformed Grok 3 ❌ here.
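
For context, the simplest form of the Prime Number Theorem estimates the number of primes below x as x / ln(x). The exact count of primes below 10^8 is 5,761,455, so this basic formula undershoots noticeably. The Python sketch below uses function names of my own choosing and is not taken from either model's answer. It computes the estimate and, for a smaller input, compares it against an exact sieve count.

```python
import math


def pnt_estimate(x: float) -> float:
    """Simplest Prime Number Theorem estimate: pi(x) ~ x / ln(x)."""
    return x / math.log(x)


def prime_count(n: int) -> int:
    """Exact number of primes <= n, via a Sieve of Eratosthenes."""
    if n < 2:
        return 0
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    for p in range(2, math.isqrt(n) + 1):
        if is_prime[p]:
            # Mark every multiple of p, starting at p*p, as composite.
            is_prime[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return sum(is_prime)


print(round(pnt_estimate(10**8)))  # about 5.43 million
# Exact count vs. estimate at a size that sieves quickly:
print(prime_count(10**6), round(pnt_estimate(10**6)))
```

The sieve is only run at 10^6 here to keep the example fast. Running it at 10^8 works too, but needs around 100 MB of memory and more time.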

3. Count vowels and exponentiate

This is definitely the toughest problem for the models to crack. It puts a math twist on a very famous trick question for LLMs: "Calculate the number of r's in strawberry."

Calculate (x - 14)^10 where x is the number of vowels in the answer.

  • Response from Grok 3:
To solve the problem, we need to compute \\((x - 14)^{10}\\), where \\(x\\) is the number of vowels in the answer. This is a self-referential problem, so we need to find a value of the answer such that the number of vowels in the answer matches the \\(x\\) used in the expression.

### Step 1: Define the problem
- Let \\(y = (x - 14)^{10}\\), where \\(x\\) is the number of vowels in \\(y\\).
- Vowels are the letters "a," "e," "i," "o," "u" (we'll consider only lowercase for simplicity, unless otherwise specified).
- We need to find a \\(y\\) such that the number of vowels in \\(y\\) equals \\(x\\), and \\(y = (x - 14)^{10}\\).

### Step 2: Trial and error approach
Since \\(x\\) is the number of vowels, and vowels are non-negative integers, \\(x \\geq 0\\). However, for \\((x - 14)^{10}\\) to be meaningful, \\(x\\) should be such that the expression produces a number whose vowel count matches \\(x\\).

Let's try different values of \\(x\\) and compute \\(y\\), then check the number of vowels in \\(y\\).

#### Case 1: \\(x = 16\\)
- \\(y = (16 - 14)^{10} = 2^{10} = 1024\\).
- Write \\(y\\) in words: "one thousand twenty-four."
- Count vowels in "one thousand twenty-four":
  - o (1), e (1), o (2), u (1), a (1), o (2) → 6 vowels.
- \\(x = 16\\), but the number of vowels in \\(y\\) is 6, so \\(16 \\neq 6\\). This doesn't work.

#### Case 2: \\(x = 18\\)
- \\(y = (18 - 14)^{10} = 4^{10}\\).
- \\(4^5 = 1024\\), so \\(4^{10} = (4^5)^2 = 1024^2\\).
- \\(1024^2 = 1,048,576\\).
- Write \\(y\\) in words: "one million forty-eight thousand five hundred seventy-six."
- Count vowels:
  - o (1), e (1), i (1), i (2), o (2), a (1), o (3), u (1), a (2), e (2), u (2), e (3), i (3) → 13 vowels.
- \\(x = 18\\), but the number of vowels in \\(y\\) is 13, so \\(18 \\neq 13\\). This doesn't work.

#### Case 3: \\(x = 15\\)
- \\(y = (15 - 14)^{10} = 1^{10} = 1\\).
- Write \\(y\\) in words: "one."
- Count vowels in "one": **o**n**e** → 2 vowels.
- \\(x = 15\\), but the number of vowels in \\(y\\) is 2, so \\(15 \\neq 2\\). This doesn't work.

#### Case 4: \\(x = 17\\)
- \\(y = (17 - 14)^{10} = 3^{10}\\).
- \\(3^5 = 243\\), so \\(3^{10} = (3^5)^2 = 243^2\\).
- \\(243^2 = 59,049\\).
- Write \\(y\\) in words: "fifty-nine thousand forty-nine."
- Count vowels:
  - i (1), i (2), e (1), o (1), u (1), a (1), i (3), e (2) → 8 vowels.
- \\(x = 17\\), but the number of vowels in \\(y\\) is 8, so \\(17 \\neq 8\\). This doesn't work.

### Step 3: Look for patterns
Let's summarize the results so far:
- \\(x = 14\\), \\(y = 0\\), vowels = 2.
- \\(x = 15\\), \\(y = 1\\), vowels = 2.
- \\(x = 16\\), \\(y = 1024\\), vowels = 6.
- \\(x = 17\\), \\(y = 59,049\\), vowels = 8.
- \\(x = 18\\), \\(y = 1,048,576\\), vowels = 13.

It seems that as \\(x\\) increases, \\(y\\) grows exponentially (since it's \\(x - 14\\) raised to the 10th power), and the number of vowels in \\(y\\) also increases, but not necessarily matching \\(x\\).

### Step 4: Try smaller \\(x\\)
Let's try \\(x\\) closer to 14 but with a focus on smaller vowel counts.

#### Case 5: \\(x = 12\\)
- \\(y = (12 - 14)^{10} = (-2)^{10} = 1024\\).
- Write \\(y\\) in words: "one thousand twenty-four."
- Count vowels: 6 vowels (as calculated earlier).
- \\(x = 12\\), but the number of vowels in \\(y\\) is 6, so \\(12 \\neq 6\\). This doesn't work.

### Step 5: Try to find a fixed point
We need to find an \\(x\\) such that the number of vowels in \\((x - 14)^{10}\\) equals \\(x\\). Let's try \\(x = 2\\) (since we saw small vowel counts earlier).

#### Case 6: \\(x = 2\\)
- \\(y = (2 - 14)^{10} = (-12)^{10}\\).
- \\((-12)^{10} = 12^{10}\\) (since 10 is even).
- \\(12^5 = 248,832\\), so \\(12^{10} = (12^5)^2 = 248,832^2\\).
- \\(248,832^2 = 61,917,364,224\\).
- Write \\(y\\) in words: "sixty-one billion nine hundred seventeen million three hundred sixty-four thousand two hundred twenty-four."
- Count vowels:
  - i (1), o (1), e (1), i (2), i (3), o (2), e (2), e (3), i (4), o (3), o (4), u (1), a (1), o (5), e (4), o (6) → 16 vowels.
- \\(x = 2\\), but the number of vowels in \\(y\\) is 16, so \\(2 \\neq 16\\). This doesn't work.

### Step 6: Conclusion
After trying multiple values of \\(x\\), we observe that the number of vowels in \\(y = (x - 14)^{10}\\) does not consistently match \\(x\\). The problem may not have a solution where the number of vowels in the result exactly equals \\(x\\), or it may require a specific \\(x\\) that we haven't found yet.

Given the complexity and the lack of a clear fixed point, it seems likely that the problem might not have a solution within the constraints, or the solution requires a more advanced approach (e.g., considering different number naming conventions or constraints on \\(x\\)).

If we must provide an answer, we note that no \\(x\\) tried so far satisfies the condition. Further exploration or clarification (e.g., constraints on \\(x\\), naming conventions) may be needed.
  • Response from DeepSeek R1:

Count vowels and exponentiate problem answer by Deepseek R1 Model

Final Verdict: As expected, neither of the models could reach the solution. ❌

This question seems to have been cracked by OpenAI's o1 model, though.

Summary:

As for math, both models performed quite well on the first two questions, which were easier, but both failed to solve a tricky question that required a bit of thought ❌. Based on these results, both models seem to perform similarly, so it's hard to favour one over the other.

Coding

Now, let's check how these models handle a tricky LeetCode question with an acceptance rate of just 12.8%: Find the Occurrence of First Almost Equal Substring. It was added recently, so it's unlikely that either model has been trained on it.

You are given two strings s and pattern.

A string x is called almost equal to y if you can change at most one character in x to make it identical to y.

Return the smallest starting index of a substring in s that is almost equal to pattern. If no such index exists, return -1.

A substring is a contiguous non-empty sequence of characters within a string.

Example 1:
Input: s = "abcdefg", pattern = "bcdffg"
Output: 1

Explanation:
The substring s[1..6] == "bcdefg" can be converted to "bcdffg" by changing s[4] to "f".

Example 2:
Input: s = "ababbababa", pattern = "bacaba"
Output: 4

Explanation:
The substring s[4..9] == "bababa" can be converted to "bacaba" by changing s[6] to "c".

Example 3:
Input: s = "abcd", pattern = "dba"
Output: -1

Example 4:
Input: s = "dde", pattern = "d"
Output: 0

Constraints:
1 <= pattern.length < s.length <= 10^5
s and pattern consist only of lowercase English letters.

  • Response from Grok 3:
class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        n = len(pattern)

        # Helper function to check if two strings differ by at most one character
        def differsByAtMostOne(s1: str, s2: str) -> bool:
            if len(s1) != len(s2):
                return False

            # Count differences
            diff_count = 0
            for c1, c2 in zip(s1, s2):
                if c1 != c2:
                    diff_count += 1
                if diff_count > 1:
                    return False
            return True

        # Slide window of size n (pattern length) through s
        for i in range(len(s) - n + 1):
            # Get current substring
            substring = s[i:i+n]
            # Check if it differs by at most one character
            if differsByAtMostOne(substring, pattern):
                return i

        return -1


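The code does what was asked, but it hit Time Limit Exceeded on some test cases. Its sliding-window check compares up to the full pattern at every position, which is O(n·m) in the worst case and too slow for inputs of up to 10^5 characters. So, while it solves the problem, it isn't an optimal solution.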

LeetCode TLE on Grok 3 AI model answer response

  • Response from DeepSeek R1:
class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        ans = self.find_almost_equal_substring(s, pattern)
        return ans

    def compute_z(self, s):
        n = len(s)
        Z = [0] * n
        Z[0] = 0  # The first character is the entire string, which is not considered here
        l, r = 0, 0
        for i in range(1, n):
            if i > r:
                l = r = i
                while r < n and s[r - l] == s[r]:
                    r += 1
                Z[i] = r - l
                r -= 1
            else:
                k = i - l
                if Z[k] < r - i + 1:
                    Z[i] = Z[k]
                else:
                    l = i
                    while r < n and s[r - l] == s[r]:
                        r += 1
                    Z[i] = r - l
                    r -= 1
        return Z

    def find_almost_equal_substring(self, s, pattern):
        m = len(pattern)
        n = len(s)
        if m == 0:
            return 0
        if n < m:
            return -1

        # Step 1: Check for exact match
        exact = s.find(pattern)
        if exact != -1:
            return exact

        # Step 2: Compute lcp and lcs arrays using Z-algorithm
        # Compute lcp (longest prefix of pattern starting at s[i])
        concat_lcp = pattern + '#' + s
        Z_lcp = self.compute_z(concat_lcp)
        lcp = [0] * n
        for i in range(n):
            lcp[i] = Z_lcp[m + 1 + i]

        # Compute lcs (longest suffix of pattern ending at s[i])
        reversed_pattern = pattern[::-1]
        reversed_s = s[::-1]
        concat_lcs = reversed_pattern + '#' + reversed_s
        Z_lcs = self.compute_z(concat_lcs)
        lcs = [0] * n
        len_rev_pat = len(reversed_pattern)
        for i in range(n):
            pos_in_rev_s = i
            pos_in_concat = len_rev_pat + 1 + pos_in_rev_s
            if pos_in_concat >= len(Z_lcs):
                lcs_val = 0
            else:
                lcs_val = Z_lcs[pos_in_concat]
            lcs[n - 1 - i] = lcs_val

        min_index = float('inf')

        # Iterate over each possible j (position in pattern)
        for j in range(m):
            required_prefix = j
            required_suffix = m - j - 1

            # Iterate possible starting indices i in s
            for i in range(n - m + 1):
                # Check prefix condition
                if lcp[i] < required_prefix:
                    continue
                # Check suffix condition
                suffix_start = i + j + 1
                if suffix_start >= n:
                    continue
                if lcs[suffix_start] < required_suffix:
                    continue
                # Check if current character is a mismatch
                if s[i + j] != pattern[j]:
                    if i < min_index:
                        min_index = i
                        # Early exit if possible
                        if min_index == 0:
                            return 0
                        # Break the inner loop as we're looking for the earliest i
                        break

        return min_index if min_index != float('inf') else -1


This was very unexpected from DeepSeek R1. The code failed on the very first test case. In terms of code quality, it is also very poorly written.

LeetCode TLE on Deepseek R1 AI model answer response

Couldn't agree more with this tweet:

'simp 4 satoshi' tweet on AI Models comparison

Summary:

Here, Grok 3 is clearly the winner. ✅ It at least provided working code, though it wasn't the most optimal. On the other hand, DeepSeek R1 wrote code that couldn't pass the very first test case and was unnecessarily long and poorly written. ❌
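
For reference, here is a sketch of one well-known linear-time approach to this problem, based on the Z-algorithm. It is my own illustration, not the output of either model and not LeetCode's official solution, and the function names are hypothetical. The idea: for each window, measure how many characters match from the left end and how many match from the right end. If those two runs together cover at least m - 1 characters, at most one position can differ.

```python
def z_function(t: str) -> list[int]:
    """For i >= 1, z[i] is the longest common prefix of t and t[i:]."""
    n = len(t)
    z = [0] * n
    left = right = 0  # [left, right) is the rightmost matched segment
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and t[z[i]] == t[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def min_starting_index(s: str, pattern: str) -> int:
    n, m = len(s), len(pattern)
    # '#' never occurs in the inputs (lowercase letters only),
    # so a match can never run across the separator.
    fwd = z_function(pattern + "#" + s)
    bwd = z_function(pattern[::-1] + "#" + s[::-1])
    for i in range(n - m + 1):
        prefix = fwd[m + 1 + i]  # chars matching from the window's left end
        suffix = bwd[n + 1 - i]  # chars matching from the window's right end
        if prefix + suffix >= m - 1:
            return i
    return -1


print(min_starting_index("abcdefg", "bcdffg"))     # 1
print(min_starting_index("ababbababa", "bacaba"))  # 4
print(min_starting_index("abcd", "dba"))           # -1
print(min_starting_index("dde", "d"))              # 0
```

Both Z-arrays are built in O(n + m) time, and the final loop does constant work per window, so the whole thing stays linear in the input size.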

Creative Writing

Write a story on "A man buys an old typewriter from a flea market, only to realize that whatever he types on it comes true, but with a dark twist.". Summarize the entire story with the twist in three paragraphs.

  • Response from Grok 3:

Creative Writing answer by Grok 3 Model

  • Response from DeepSeek R1:

Creative Writing answer by Deepseek R1 Model

Summary:

DeepSeek R1 is often said to be the best model for creative writing:

Deepseek R1 tops the AI comparison chart for Creative Writing

Even so, both models did a great job crafting this story and wrapping up the entire twist within three paragraphs. I prefer the response from Grok 3 ✅ over the one from DeepSeek R1, as its story simply flows better.


Final Verdict!

Based on these observations, here’s my final verdict:

  • Both Grok 3 and the DeepSeek R1 model perform very similarly for Reasoning questions and Mathematics. Choosing one over the other doesn't seem to make much difference.
  • For Coding, Grok 3 is the clear winner compared to the DeepSeek R1 model. DeepSeek R1 doesn't come close regarding code quality or overall answers.
  • Both models are quite strong for Creative Writing, but I prefer Grok 3’s responses. They feel more engaging, natural, and polished.

I pretty much agree with Satoshi on the Grok 3 and DeepSeek R1 part of this comparison:

What do you think? Let me know your thoughts in the comments below! 👇🏻

Top comments (14)

Bhaskar Prajapati

I've been using the DeepSeek model ever since it was released. The only issue I have with it is that it just always has this issue of "Server is busy, please try again later"

Otherwise, everything about DeepSeek feels good, and I don't think I need any other model for my use case.
This is a great comparison post. Thank you for sharing! 👍🏽

Shrijal Acharya • Edited

Thanks for checking it out, Bhaskar 🙌

The only issue I have with it is that it just always has this issue of "Server is busy, please try again later"

What else to expect from a free model, though. 😮‍💨

Personally, I don't think Grok 3 deserves all the hype around it. Even though it's being called the "best AI in the world right now", the metrics don't differ by much. And considering that DeepSeek R1 is a completely open-source model built as a developer's side project, the way it performs is phenomenal.

But hey, it is what it is! 🤷‍♂️

Anmol Baranwal

Awesome Shrijal! 🔥 This looks quite detailed. Let me share it further :)

Shrijal Acharya

Thank you for checking it out, Anmol! Go ahead. 🙌

Shekhar Rajput

Really nice read, @shricodev! Even though I'm not into AI, the comparison feels to the point. 👏🏼

Shrijal Acharya

Thank you for checking it out, @shekharrr 🙌

Really appreciate it.

Muhammad Usman

Nice analysis, great work.
Also,
🚀 I have Just finished my very first frontend challenge for Dev Community!
dev.to/web_dev-usman/discover-your...
Give your feedback there, and support me.

founder of codemaster

great

Shrijal Acharya

Thank you! 🙌

Lara Stewart - DevOps Cloud Engineer

That is a great comparison 🙂

The thing is that I am not so sure if we can rate an LLM based on a few questions.

Matin Yuspi

Deep seek model is the way. I love open source, you love open source, everyone loves open source.

Runs even on mobile phones

Shrijal Acharya

Woah! This has to be one of the coolest things. Running a complete LLM locally on a phone is something I had never imagined.

Thanks for sharing this, @martin_yuspi1976! ✌️

Aayush Pokharel

Good one, my friend! 😍💥

How do you find the time to do all this, man? Weren't you at college this morning?

Shrijal Acharya

Thank you, Aayush! 🙌