Andrew Baisden
I Tested the Top AI Models to Build the Same App - Here are the Shocking Results!

Tech moves at a rapid pace, and it seems like we get new and improved large language models (LLMs) all the time. Claude 3.5 Sonnet has led the way for a long time when it comes to programming-related work. But is it still the top choice in this category, or has it been surpassed by one of the many new LLMs available today?

I wanted to see how much these LLMs had progressed, so I created a test to see which one would come out on top. This article is a follow-up to a post I made on my socials a while back, in which I tasked various AI models with building a simple Pokemon game, with surprising results.

This was the prompt I used:

Create a simple 1 v 1 Pokemon battle game using JavaScript and use sprites from this website for the Pokemon https://pokemondb.net/sprites

And here is the thread I created on my socials:

Threads
Twitter/X

In the first phase of testing, I used Claude 3.5 Sonnet, DeepSeek R1 and ChatGPT-4o. In the second phase, I used significantly more LLMs to get a better overview of the current capabilities available to us. The LLMs tested include:

  • DeepSeek R1
  • Gemini 2.0 Flash Thinking Experimental
  • Grok 2
  • Mistral
  • o3-mini (medium reasoning - Windsurf)
  • Qwen2.5-Max
  • Claude 3.5 Sonnet

Building a Pokemon game

For the second-phase test, I created a more advanced prompt to see how intelligent these models are when it comes to building more complex applications that require higher-level logic and reasoning. I believe a game is always a great way to test these kinds of use cases.

The aim of these tests was to see what the AI could accomplish after just one prompt. Of course, I would expect them all to accomplish a lot more after further iterations of chain prompting from a user.

This was the prompt which I used:

Create a 1 v 1 Pokemon battle game using JavaScript and use sprites from this website for the Pokemon https://pokemondb.net/sprites Make sure that the player can switch between 2 different Pokemon during battle and that there is type and elemental damage based on the Pokemon being used. Each Pokemon should have at least four available attacks to use. The player's Pokemon should be at level 5, and the enemy Pokemon should be at level 7. Factor in how the difference in level should play a part in the battle, including a difference in health, etc...

You can find all of the Pokemon games here on my GitHub https://github.com/andrewbaisden/pokemon-battle-game.

The _battle.js files were the original files generated by the LLM, which were broken. Claude fixed the battle.js files in that folder.

These were the results of my testing. I will rate all of them out of 5 stars so you can see which ones excelled and which ones could have done better with more work.

DeepSeek R1

LLM Performance
It took a while for DeepSeek R1 to come up with a system and start writing code. The response speed was slow because the task required a lot of thinking. DeepSeek R1 thought for 300 seconds, approximately 5 minutes, which is the longest I have seen yet for a DeepSeek R1 task. The chain-of-thought process is interesting to watch, though, and since I did not set a time limit, I did not mind how long it took as long as it could fulfil the prompt.

Game UX & Logic
Unfortunately, the game has only basic functionality and is not fully working. It's possible to switch between Pokemon, the Pokemon have health bars, and there are four moves available, but the moves are all generic and don't have names like "Thunderbolt" or "Ember" as in the real games. It's also only possible to use one move before all the buttons are greyed out, meaning the game can't be played. And there is no image or GIF for the enemy Pokemon, just an empty box. The design is simple, but more prompts would be required to get this game into a working state.
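For context on the greyed-out buttons: in a working turn loop, player input is locked while the enemy takes its turn and unlocked again afterwards. A hypothetical sketch of the state handling that appears to be missing here (all names are mine, not DeepSeek's):

```javascript
// Minimal turn-state sketch: the bug described above looks like the
// "unlock" step after the enemy's move never runs.
function createBattleState() {
  return { inputLocked: false };
}

function onPlayerMove(state) {
  // Grey out the move buttons while the enemy acts.
  state.inputLocked = true;
}

function onEnemyMoveResolved(state) {
  // Re-enable input for the next turn - the step that seems to be missing.
  state.inputLocked = false;
}
```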

DeepSeek R1 Pokemon Game

Gemini 2.0 Flash Thinking Experimental

LLM Performance
Gemini 2.0 Flash took about 15 seconds to respond to the prompt, which was fairly quick.

Game UX & Logic
The fast response did not diminish Gemini's output, because it created a fully functional game with a pretty decent design: animated Pokemon, health bars, four moves, the ability to switch Pokemon, plus an output box logging all of the moves that happen during the battle. It's definitely one of the best games created in this test.

Gemini 2.0 Pokemon Game

Grok 2

LLM Performance
Grok 2 does not have reasoning or a chain of thought. It took about 1 minute to complete the prompt request.

Game UX & Logic
Unfortunately, the codebase it provided was broken and did not work. I decided to use Claude 3.5 Sonnet via the Windsurf IDE to debug the codebase, and it got it working after one prompt. The reason I did not do this for DeepSeek R1 was that its game was somewhat playable already, whereas the version Grok 2 created had bugs that meant it was not playable at all.

After fixing the codebase, I could see that Grok 2 had actually designed and built a pretty beautiful game. The game more or less achieved the basics that I outlined in the initial prompt, which was good. However, it loses points because the codebase was broken, and Claude had to fix it.

Grok 2 Pokemon Game

Mistral

LLM Performance
It took about 2 seconds to generate the codebase, which was by far the fastest of all the LLMs I tested.

Game UX & Logic
Mistral created a fully functional game after just 2 seconds! The design was pretty simple, but the basic logic worked as expected.

Mistral Pokemon Game

o3-mini (medium reasoning - Windsurf)

LLM Performance
It took about 5 seconds to create an action plan for building the app, then about another 10 seconds to create the codebase after I created empty index.html, styles.css and battle.js files so that it could add the code to them.

Game UX & Logic
After the setup, it successfully created a working application on the first attempt! The game works as expected and fulfils the requirements I set in the prompt. If I had one comment, it would be that all the move buttons have generic names like "Attack 1", "Attack 2", etc., even though the output screen shows which move is being used. It would be better if the button labels matched the move names in the output.

o3-mini Pokemon Game

Qwen2.5-Max

LLM Performance
It took about 1 minute to generate a codebase, which was not too bad.

Game UX & Logic
The JavaScript file had an error, although the HTML was able to work in the browser. The functionality did not work, though, so I used Claude 3.5 Sonnet via the Windsurf IDE to debug the codebase, and it got it working after one prompt.

The game works and does what I outlined in the initial prompt. However, the game logic needs a lot of improvement. Firstly, when switching Pokemon, the attack moves don't change, so they are not relevant to the new Pokemon. Secondly, the damage seems to be stuck at 1, and when each Pokemon has 100 health, that means the battle will go on for a very long time...
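The first of those bugs comes down to the switch handler not rebuilding the move list. A hypothetical sketch of what switching should do, with an illustrative team that is not taken from Qwen's generated code:

```javascript
// When the player switches, the active Pokemon changes AND the move list
// shown to the player is rebuilt from the new Pokemon's data.
const team = [
  { name: 'Charmander', moves: ['Ember', 'Scratch', 'Growl', 'Smokescreen'] },
  { name: 'Squirtle', moves: ['Water Gun', 'Tackle', 'Tail Whip', 'Bubble'] },
];

let activeIndex = 0;

function switchPokemon() {
  activeIndex = (activeIndex + 1) % team.length;
  // Returning the new Pokemon's move set is the step the generated game
  // skipped, which is why the old Pokemon's moves stayed on screen.
  return team[activeIndex].moves;
}
```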

Qwen Pokemon Game

Claude 3.5 Sonnet

LLM Performance
It took about 1 minute to generate a codebase, which is more than acceptable.

Game UX & Logic
The game was functional. However, it created placeholder images for the Pokemon and required the user to download sprites and replace the placeholders manually. At least it provided instructions on how to do it. This is probably because Claude cannot search the web like other LLMs can, so it was unable to read the sprite pages. It's worth noting that I used the Claude website for this test; if I had used an IDE like Windsurf, which can search the web, it might have worked.
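For reference, the sprite images on pokemondb.net follow a predictable URL pattern, so a model that knows (or is told) the pattern doesn't strictly need web access to wire them up. A hedged sketch, assuming the animated black-white sprite path is still current:

```javascript
// Builds a pokemondb.net sprite URL from a Pokemon name. The path pattern is
// an assumption based on the site's structure at the time of writing and may
// change.
function spriteUrl(name) {
  return `https://img.pokemondb.net/sprites/black-white/anim/normal/${name.toLowerCase()}.gif`;
}

// In the browser, the result would be assigned to an <img> element, e.g.:
// document.querySelector('#player-sprite').src = spriteUrl('Pikachu');
```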

This was the only game that had animated health bars, which was cool. I'm not so sure about the game logic, though. Either the enemy Pokemon is just that strong, or the player's Pokemon are doing damage to themselves every time they attack, because their health bars go down way too quickly. 😂 Also, there are no electric Pokemon in this game, but there are electric attacks, which makes no sense. 😂

Claude Pokemon Game

Conclusion

I think it's incredible to see how far AI has come and the direction it's heading in. Today, we learned about the current capabilities of some of the leading LLMs. The fact that it's possible to create a fairly sophisticated working codebase from one prompt is truly great to see. And considering that my prompt was detailed but still left out some information, the models were able to figure out most of what I was referring to, which shows how useful they have become for this type of work.

This test was not super scientific but a quick, fun one to gain insight into how well these models can build something from scratch with little human intervention. Based on this short study, I would give each LLM the following ratings and rankings for this particular test.

AI LLM Rating
DeepSeek R1 ⭐️
Gemini 2.0 Flash Thinking Experimental ⭐️⭐️⭐️⭐️⭐️
Grok 2 ⭐️⭐️⭐️
Mistral ⭐️⭐️⭐️⭐️
o3-mini (medium reasoning - Windsurf) ⭐️⭐️⭐️⭐️
Qwen2.5-Max ⭐️⭐️
Claude 3.5 Sonnet ⭐️⭐️⭐️

So, unfortunately, DeepSeek R1 only scored 1 star in this particular test because the game was not fully functional. Surprisingly, it was Gemini 2.0 Flash that came out on top with 5 stars. Grok 2 only managed 3 stars because the codebase needed to be fixed by Claude before it worked.

Mistral and o3-mini (medium reasoning) produced fairly good all-around games. Qwen2.5-Max created a game that only worked after Claude debugged the codebase, and its logic needed improvement because attacks only did 1 damage, so winning a game would be tiresome and boring... 😂

And lastly, Claude only scored 3 stars because the game logic was a bit weird and it could not display any Pokemon images like the other games, due to not being able to search the web. However, it gets an honourable mention because it fixed two broken codebases and got those games working after one prompt! If I had used Claude 3.5 Sonnet inside an IDE like Windsurf or Cursor, which can access the web, it likely would have produced even better results when building this game.

Stay up to date with tech, programming, productivity, and AI

If you enjoyed these articles, connect and follow me across social media, where I share content related to all of these topics 🔥


Top comments (4)

Anmol Baranwal

Awesome! 🔥 This is slightly more practical than just comparing on paper stats.

Andrew Baisden

Yep, you're right, and thanks for reading!

Fayaz

In my experience, model performance like this isn't consistent. So, if you ask the same model with the exact same prompt, you may get neither the same result nor the same final rating (benchmark).

Also, different models perform differently in visual programming vs. mission critical business logic based programming.

In visual programming, 80% accurate result is a far better outcome than 60% accurate result. However, in mission critical business logic based programming (e.g. banking software, medical software where error tolerance is 0%), both 80% and 60% accuracy are complete failures.

Also, in a different scenario, an LLM producing an 80% accurate result may produce hard-to-debug code in the remaining 20%, while an LLM producing a 60% accurate result may produce very easy-to-debug code in the remaining 40%. If that is the case, then as a developer, the second result would be preferable! So this scenario must also be a consideration when judging the code-generation performance of an LLM.

I've written my own experience in this post: State of generative AI in Software Development: The reality check - please read it if you get time and let me know what you think!

Madza

Great practical comparison, well written mate! 👍💯