Shrijal Acharya for Composio

Posted on Feb 21 • Originally published at composio.dev

Grok 3 vs. Deepseek r1: A deep analysis

#ai #opensource #productivity #discuss

Almost everyone now knows about the DeepSeek R1 model, an open-source AI from China that took the internet by storm.

The main selling point of DeepSeek is that it's completely free and open-source and can rival some of OpenAI's paid models, like the o1.

Then, on February 16, Elon Musk announced Grok 3, labelling it as the "Smartest AI on Earth."

But how does it compare to our free model, DeepSeek R1? This made me curious, and I decided to test how Grok 3 compares against DeepSeek R1 in areas like coding, reasoning, math, and creative writing.

Let's find out if all the hype around Grok 3 holds any weight.

I tested both models against a list of prompts I created and showed the results and my thoughts on each model's ability in various tasks.

TL;DR

If you want to jump straight to the conclusion, here's a quick summary of the findings comparing DeepSeek R1 and Grok 3:

Reasoning and Math: DeepSeek R1 and Grok 3 models perform similarly on reasoning questions.

Coding: Grok 3 outperforms DeepSeek R1 and writes much better code.

Creative Writing: Both models perform well in creative writing. DeepSeek is known to be a great model, but I prefer Grok 3.

Brief on Grok 3 AI Model

Grok 3 is the latest language model from xAI, offering 10x the computational power. It includes tools like DeepSearch for step-by-step reasoning and Big Brain Mode for handling complex tasks.

Currently, the Grok 3 model is in beta mode, but is it really that much better than the DeepSeek R1 model? We'll find out by the end of this article.

According to the official benchmarks shared by the xAI team at the launch event, Grok 3 appears to be a game-changer, outperforming all its competitors in almost every benchmark.

DeepSeek R1 vs. Grok 3

I've used Chatbot Arena to test both of these models side by side, as it is the only available and trusted third-party site that allows testing the early Grok 3 model.

Reasoning Problems

ℹ️ Here, we will check the reasoning capabilities of both the models.

1. Russian Roulette

Let's start up the show with an interesting question:

You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five bullets, spins the chambers and fires at himself, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing at you. Should he spin again?

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Both of the models answered the problem correctly with correct reasoning. ✅

2. Olympic Athlete Siblings

I am the sister of two Olympic athletes. But these two athletes are not my sisters. How is this possible?

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Here as well, both the models answered the problem correctly with correct reasoning. ✅

3. Find out the Liar

The first two questions were straightforward. Let's conclude the reasoning test with a slightly trickier question:

You meet three people: Haris, Antony, and Michael. Haris says, "At least one of us is a liar." Antony says, "Haris is lying." Michael says, "Antony is telling the truth." Determine who is lying and who is telling the truth.

Response from Grok 3:

Here, Grok 3 provided thorough reasoning, considering each case and breaking down the problem step by step to support its conclusion.

Response from DeepSeek R1:

Final Verdict: Both models answered the problem correctly ✅, but the reasoning of the Grok 3 model stands out to me.

Summary:

From this, we can see that both models are quite strong in reasoning capabilities, as they both provided correct answers to all my reasoning questions. Still, the reasoning capability of the Grok 3 model seems to outperform R1 at times.

Mathematics

💁 Looking at the reasoning question answers from both models, I was convinced that both of these models should also pass the math questions.

Starting with an easy question, let's see how they perform:

1. Number of people in the train

There were some people on a train. 19 people get off the train at the first stop. 17 people get on the train. Now there are 63 people on the train. How many people were on the train to begin with?

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Both models chose a similar approach and ended up with the correct answer. ✅

2. Prime Number Theorem Approximation

Estimate the number of prime numbers less than 10^8.

Response from Grok 3:

Response from DeepSeek R1:

Final Verdict: Both models chose the correct approach of using the Prime Number Theorem, but the answer from the DeepSeek R1 model was off by around ~1K, while Grok 3 was off by around ~20K. So, it's fair to say DeepSeek R1 ✅ outperformed Grok 3 ❌ here.

3. Count vowels and exponentiate

This is definitely the toughest problem for the models to crack. It is a somewhat modified version of a very famous question for LLMs with a math twist: "Calculate the number of r's in strawberry."

Calculate (x - 14)^10 where x is the number of vowels in the answer.

Response from Grok 3:

To solve the problem, we need to compute \\((x - 14)^{10}\\), where \\(x\\) is the number of vowels in the answer. This is a self-referential problem, so we need to find a value of the answer such that the number of vowels in the answer matches the \\(x\\) used in the expression.

### Step 1: Define the problem
- Let \\(y = (x - 14)^{10}\\), where \\(x\\) is the number of vowels in \\(y\\).
- Vowels are the letters "a," "e," "i," "o," "u" (we'll consider only lowercase for simplicity, unless otherwise specified).
- We need to find a \\(y\\) such that the number of vowels in \\(y\\) equals \\(x\\), and \\(y = (x - 14)^{10}\\).

### Step 2: Trial and error approach
Since \\(x\\) is the number of vowels, and vowels are non-negative integers, \\(x \\geq 0\\). However, for \\((x - 14)^{10}\\) to be meaningful, \\(x\\) should be such that the expression produces a number whose vowel count matches \\(x\\).

Let's try different values of \\(x\\) and compute \\(y\\), then check the number of vowels in \\(y\\).

#### Case 1: \\(x = 16\\)
- \\(y = (16 - 14)^{10} = 2^{10} = 1024\\).
- Write \\(y\\) in words: "one thousand twenty-four."
- Count vowels in "one thousand twenty-four":
  - o (1), e (1), o (2), u (1), a (1), o (2) → 6 vowels.
- \\(x = 16\\), but the number of vowels in \\(y\\) is 6, so \\(16 \\neq 6\\). This doesn't work.

#### Case 2: \\(x = 18\\)
- \\(y = (18 - 14)^{10} = 4^{10}\\).
- \\(4^5 = 1024\\), so \\(4^{10} = (4^5)^2 = 1024^2\\).
- \\(1024^2 = 1,048,576\\).
- Write \\(y\\) in words: "one million forty-eight thousand five hundred seventy-six."
- Count vowels:
  - o (1), e (1), i (1), i (2), o (2), a (1), o (3), u (1), a (2), e (2), u (2), e (3), i (3) → 13 vowels.
- \\(x = 18\\), but the number of vowels in \\(y\\) is 13, so \\(18 \\neq 13\\). This doesn't work.

#### Case 3: \\(x = 15\\)
- \\(y = (15 - 14)^{10} = 1^{10} = 1\\).
- Write \\(y\\) in words: "one."
- Count vowels in "one": **o**n**e** → 2 vowels.
- \\(x = 15\\), but the number of vowels in \\(y\\) is 2, so \\(15 \\neq 2\\). This doesn't work.

#### Case 4: \\(x = 17\\)
- \\(y = (17 - 14)^{10} = 3^{10}\\).
- \\(3^5 = 243\\), so \\(3^{10} = (3^5)^2 = 243^2\\).
- \\(243^2 = 59,049\\).
- Write \\(y\\) in words: "fifty-nine thousand forty-nine."
- Count vowels:
  - i (1), i (2), e (1), o (1), u (1), a (1), i (3), e (2) → 8 vowels.
- \\(x = 17\\), but the number of vowels in \\(y\\) is 8, so \\(17 \\neq 8\\). This doesn't work.

### Step 3: Look for patterns
Let's summarize the results so far:
- \\(x = 14\\), \\(y = 0\\), vowels = 2.
- \\(x = 15\\), \\(y = 1\\), vowels = 2.
- \\(x = 16\\), \\(y = 1024\\), vowels = 6.
- \\(x = 17\\), \\(y = 59,049\\), vowels = 8.
- \\(x = 18\\), \\(y = 1,048,576\\), vowels = 13.

It seems that as \\(x\\) increases, \\(y\\) grows exponentially (since it's \\(x - 14\\) raised to the 10th power), and the number of vowels in \\(y\\) also increases, but not necessarily matching \\(x\\).

### Step 4: Try smaller \\(x\\)
Let's try \\(x\\) closer to 14 but with a focus on smaller vowel counts.

#### Case 5: \\(x = 12\\)
- \\(y = (12 - 14)^{10} = (-2)^{10} = 1024\\).
- Write \\(y\\) in words: "one thousand twenty-four."
- Count vowels: 6 vowels (as calculated earlier).
- \\(x = 12\\), but the number of vowels in \\(y\\) is 6, so \\(12 \\neq 6\\). This doesn't work.

### Step 5: Try to find a fixed point
We need to find an \\(x\\) such that the number of vowels in \\((x - 14)^{10}\\) equals \\(x\\). Let's try \\(x = 2\\) (since we saw small vowel counts earlier).

#### Case 6: \\(x = 2\\)
- \\(y = (2 - 14)^{10} = (-12)^{10}\\).
- \\((-12)^{10} = 12^{10}\\) (since 10 is even).
- \\(12^5 = 248,832\\), so \\(12^{10} = (12^5)^2 = 248,832^2\\).
- \\(248,832^2 = 61,917,364,224\\).
- Write \\(y\\) in words: "sixty-one billion nine hundred seventeen million three hundred sixty-four thousand two hundred twenty-four."
- Count vowels:
  - i (1), o (1), e (1), i (2), i (3), o (2), e (2), e (3), i (4), o (3), o (4), u (1), a (1), o (5), e (4), o (6) → 16 vowels.
- \\(x = 2\\), but the number of vowels in \\(y\\) is 16, so \\(2 \\neq 16\\). This doesn't work.

### Step 6: Conclusion
After trying multiple values of \\(x\\), we observe that the number of vowels in \\(y = (x - 14)^{10}\\) does not consistently match \\(x\\). The problem may not have a solution where the number of vowels in the result exactly equals \\(x\\), or it may require a specific \\(x\\) that we haven't found yet.

Given the complexity and the lack of a clear fixed point, it seems likely that the problem might not have a solution within the constraints, or the solution requires a more advanced approach (e.g., considering different number naming conventions or constraints on \\(x\\)).

If we must provide an answer, we note that no \\(x\\) tried so far satisfies the condition. Further exploration or clarification (e.g., constraints on \\(x\\), naming conventions) may be needed.

Response from DeepSeek R1:

Final Verdict: As expected, neither of the models could reach the solution. ❌

This question seems to have been cracked by OpenAI's o1 model, though.

Summary:

As for math, both models performed quite well on the first two questions, which were easier, but both failed to solve a tricky question that required a bit of thought ❌. Based on these results, both models seem to perform similarly, so it's hard to favour one over the other.

Coding

Now, let's check how these models handle solving a tricky LeetCode question with an acceptance rate of just 12.8%: Find the Occurrence of First Almost Equal Substring. It was added recently, so it's pretty rare that they have been trained on it.

You are given two strings s and pattern.

A string x is called almost equal to y if you can change at most one character in x to make it identical to y.

Return the smallest starting index of a substring in s that is almost equal to pattern. If no such index exists, return -1.

A substring is a contiguous non-empty sequence of characters within a string.

Example 1:
Input: s = "abcdefg", pattern = "bcdffg"
Output: 1

Explanation:
The substring s[1..6] == "bcdefg" can be converted to "bcdffg" by changing s[4] to "f".

Example 2:
Input: s = "ababbababa", pattern = "bacaba"
Output: 4

Explanation:
The substring s[4..9] == "bababa" can be converted to "bacaba" by changing s[6] to "c".

Example 3:
Input: s = "abcd", pattern = "dba"
Output: -1

Example 4:
Input: s = "dde", pattern = "d"
Output: 0

Constraints:
1 <= pattern.length < s.length <= 105
s and pattern consist only of lowercase English letters.

Response from Grok 3:

class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        n = len(pattern)

        # Helper function to check if two strings differ by at most one character
        def differsByAtMostOne(s1: str, s2: str) -> bool:
            if len(s1) != len(s2):
                return False

            # Count differences
            diff_count = 0
            for c1, c2 in zip(s1, s2):
                if c1 != c2:
                    diff_count += 1
                if diff_count > 1:
                    return False
            return True

        # Slide window of size n (pattern length) through s
        for i in range(len(s) - n + 1):
            # Get current substring
            substring = s[i:i+n]
            # Check if it differs by at most one character
            if differsByAtMostOne(substring, pattern):
                return i

        return -1

The code definitely accomplished what was asked for, but it hit Time Limit Exceeded on some test sets. So, while it solved the problem, it isn't the most optimal solution to this problem.

Response from DeepSeek R1:

class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        ans = self.find_almost_equal_substring(s, pattern)
        return ans

    def compute_z(self, s):
        n = len(s)
        Z = [0] * n
        Z[0] = 0  # The first character is the entire string, which is not considered here
        l, r = 0, 0
        for i in range(1, n):
            if i > r:
                l = r = i
                while r < n and s[r - l] == s[r]:
                    r += 1
                Z[i] = r - l
                r -= 1
            else:
                k = i - l
                if Z[k] < r - i + 1:
                    Z[i] = Z[k]
                else:
                    l = i
                    while r < n and s[r - l] == s[r]:
                        r += 1
                    Z[i] = r - l
                    r -= 1
        return Z

    def find_almost_equal_substring(self, s, pattern):
        m = len(pattern)
        n = len(s)
        if m == 0:
            return 0
        if n < m:
            return -1

        # Step 1: Check for exact match
        exact = s.find(pattern)
        if exact != -1:
            return exact

        # Step 2: Compute lcp and lcs arrays using Z-algorithm
        # Compute lcp (longest prefix of pattern starting at s[i])
        concat_lcp = pattern + '#' + s
        Z_lcp = self.compute_z(concat_lcp)
        lcp = [0] * n
        for i in range(n):
            lcp[i] = Z_lcp[m + 1 + i]

        # Compute lcs (longest suffix of pattern ending at s[i])
        reversed_pattern = pattern[::-1]
        reversed_s = s[::-1]
        concat_lcs = reversed_pattern + '#' + reversed_s
        Z_lcs = self.compute_z(concat_lcs)
        lcs = [0] * n
        len_rev_pat = len(reversed_pattern)
        for i in range(n):
            pos_in_rev_s = i
            pos_in_concat = len_rev_pat + 1 + pos_in_rev_s
            if pos_in_concat >= len(Z_lcs):
                lcs_val = 0
            else:
                lcs_val = Z_lcs[pos_in_concat]
            lcs[n - 1 - i] = lcs_val

        min_index = float('inf')

        # Iterate over each possible j (position in pattern)
        for j in range(m):
            required_prefix = j
            required_suffix = m - j - 1

            # Iterate possible starting indices i in s
            for i in range(n - m + 1):
                # Check prefix condition
                if lcp[i] < required_prefix:
                    continue
                # Check suffix condition
                suffix_start = i + j + 1
                if suffix_start >= n:
                    continue
                if lcs[suffix_start] < required_suffix:
                    continue
                # Check if current character is a mismatch
                if s[i + j] != pattern[j]:
                    if i < min_index:
                        min_index = i
                        # Early exit if possible
                        if min_index == 0:
                            return 0
                        # Break the inner loop as we're looking for the earliest i
                        break

        return min_index if min_index != float('inf') else -1

This was very unexpected from DeepSeek R1. The code failed at the very first test case. If I have to compare the code quality, it is very poorly written as well.

Couldn't agree more with this tweet:

Summary:

Here, Grok 3 is clearly the winner. ✅ It at least provided working code, though it wasn't the most optimal. On the other hand, DeepSeek R1 wrote code that couldn't pass the very first test case and was unnecessarily long and poorly written. ❌

Creative Writing

Write a story on "A man buys an old typewriter from a flea market, only to realize that whatever he types on it comes true, but with a dark twist.". Summarize the entire story with the twist in three paragraphs.

Response from Grok 3:

Response from DeepSeek R1:

Summary:

Even though DeepSeek R1 is said to be the best model for writing:

Here, I must say that both did a great job crafting this story and wrapping up the entire twist within 3 paragraphs, but I prefer the response from the Grok 3 model more ✅ than the DeepSeek R1 model.

The story just felt to have a better flow.

Final Verdict!

Based on these observations, here’s my final verdict:

Both Grok 3 and the DeepSeek R1 model perform very similarly for Reasoning questions and Mathematics. Choosing one over the other doesn't seem to make much difference.

For Coding, Grok 3 is the clear winner compared to the DeepSeek R1 model. DeepSeek R1 doesn't come close regarding code quality or overall answers.

Both models are quite strong for Creative Writing, but I prefer Grok 3’s responses. They feel more engaging, natural, and polished.

I pretty much agree with Satoshi on the Grok 3 and DeepSeek R1 part of this comparison:

What do you think? Let me know your thoughts in the comments below! 👇🏻

Shrijal Acharya

Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration

Top comments (14)

Bhaskar Prajapati • Feb 21

Since Deepseek model is release, I've been using it since then. The only issue I have with it is that it just always has this issue of "Server is busy, please try again later"

Else everything about Deepseek feels good and I think I need no other models in my use case.
This is a great comparision post. Thank you for sharing! 👍🏽

Shrijal Acharya • Feb 21 • Edited

Thanks for checking out, Bhaskar 🙌

The only issue I have with it is that it just always has this issue of "Server is busy, please try again later"

What else to expect from a free model, though. 😮‍💨

I myself don't see much hype that should be around Grok 3. Even though it is being said, "best AI in the world right now", the metrics don't differ by much. And comparing that Deepseek r1 is a complete open-source model built as a side project of a developer, the way it is performing is phenomenal.

But hey, it is what it is! 🤷‍♂️

Anmol Baranwal • Feb 21

Awesome Shrijal! 🔥 This looks quite detailed. Let me share it further :)

Shrijal Acharya • Feb 21

Thank you for checking it out, Anmol! Go ahead. 🙌

Shekhar Rajput • Feb 21

Really nice read, @shricodev! Even though I'm not into AI, the comparison feels to the point. 👏🏼

Shrijal Acharya • Feb 21

Thank you for checking out, @shekharrr 🙌

Really appreciate it.

Muhammad Usman • Feb 22

Nice analysis, great work.
Also,
🚀 I have Just finished my very first frontend challenge for Dev Community!
dev.to/web_dev-usman/discover-your...
Give your feedback there, and support me.