Hello, everyone!
Recently, OpenAI launched its new model, o3-mini. With so many options emerging, the big question for every developer is: which model should I use?
To answer this question, I spent the last few hours testing o3-mini and DeepSeek R1 on common tasks that we developers perform daily. These tasks are:
- Building a program from scratch;
- Adding a feature to existing code;
- Refactoring code and generating tests.
In this article, I will share my recommendations and insights. My goal is for all of us to become better developers by leveraging AI to our advantage.
Performance, Price, and Context Window
Before diving into practical tests, it is essential to understand the specifications of each model, as they are crucial in determining which one aligns best with your project's needs.
1. Performance
- o3-mini and DeepSeek R1 lead on SWE-bench (a benchmark that evaluates the ability to resolve GitHub issues), both scoring above 49%.
- Claude 3.5 Sonnet initially showed good scores, but as the tests below revealed, it demonstrated significant limitations in executing complex tasks.
2. Cost per Million Tokens
- DeepSeek R1: $0.55 input / $2.19 output (the most economical);
- o3-mini: $1.10 input / $4.40 output;
- Claude 3.5 Sonnet: $3.00 input / $15.00 output.
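To make these prices concrete, here is a small sketch that estimates the dollar cost of a single request under each model's published rates. The token counts in the example are hypothetical, chosen only for illustration:

```python
# Price per million tokens (input, output), taken from the list above.
PRICING = {
    "DeepSeek R1": (0.55, 2.19),
    "o3-mini": (1.10, 4.40),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in dollars of one request for a given model."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical example: a refactoring request with 20k input and 5k output tokens.
for model in PRICING:
    print(f"{model}: ${request_cost(model, 20_000, 5_000):.4f}")
```

Even at small request sizes, the gap compounds quickly over a day of iterative prompting.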
3. Context Window
- o3-mini and Claude 3.5: Up to 200k tokens (better for larger and more complex requests).
- DeepSeek R1: Up to 128k tokens.
Practical Test 1: Building a Project from Scratch
Task: Create an interface to chat with local LLMs via Ollama, with chat functionalities, conversation history, and model selection.
Results:
| Model | Files Generated | Functional Features | Observations |
| --- | --- | --- | --- |
| o3-mini using Cursor | 3 (HTML, CSS, and JS separated) | All | Code organized, but UI and styling very basic |
| DeepSeek R1 on the Web | 1 (HTML, CSS, and JS condensed) | Chat and model selection | No conversation history; UI and styling were better |
| DeepSeek R1 using Cursor | 0 | - | Failed to create multiple files; many manual adjustments |
| Claude 3.5 using Cursor | 0 | - | Completely failed |
Winner: o3-mini, for its consistency and ability to generate complex projects in a single request.
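For reference, the core of the interface all models were asked to build is a single call to Ollama's local `/api/chat` endpoint. Below is a minimal Python sketch of that logic; the endpoint and payload shape follow Ollama's documented API, while the history handling and model name are simplifying assumptions:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_payload(model: str, history: list[dict], user_message: str) -> dict:
    """Append the user's message to the conversation history and build a chat request."""
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages, "stream": False}

def chat(model: str, history: list[dict], user_message: str) -> str:
    """Send one chat turn to a locally running Ollama server (requires Ollama)."""
    payload = build_payload(model, history, user_message)
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        # Ollama returns the assistant reply under "message" -> "content".
        return json.loads(resp.read())["message"]["content"]

if __name__ == "__main__":
    history: list[dict] = []  # hypothetical empty conversation
    print(chat("llama3", history, "Hello!"))
```

Everything beyond this call — rendering the history, switching models — is UI work, which is where the models' outputs diverged most.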
Practical Test 2: Adding a Feature to Existing Code
Task: Integrate a user interface (UI) into an existing CLI to interact with AI agents.
Results:
- o3-mini using Cursor:
  - Generated new files and added the feature, but only after more than 20 iterations.
  - Struggled to understand UI state management, requiring prompt adjustments and manual fixes to the generated result.
- DeepSeek R1 using Cursor:
  - Generated new files and added the feature in just 9 iterations, with cleaner, more organized code than o3-mini's.
  - Needed guidance to adjust some integrations, but was faster than o3-mini at understanding the requirements.
Winner: DeepSeek R1. Although o3-mini is more "autonomous," it struggled to understand key functionality for the integration. DeepSeek R1 required more "supervision," but it grasped the requirements better and delivered the new feature quickly.
Practical Test 3: Refactoring Code and Generating Tests
Task: Refactor functions in a React/TypeScript web application and add unit tests.
Results:
- o3-mini using Cursor:
  - Refactored the code, followed best practices, and generated working tests (with minor adjustments needed).
- DeepSeek R1 using Cursor:
  - Introduced critical bugs by removing essential functions.
  - Generated valid tests but failed at the refactoring itself.
Winner: o3-mini, for its precision and lower risk of breaking existing code.
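The article doesn't show the actual React/TypeScript code, but to make the task concrete, here is a small illustrative sketch (in Python, with hypothetical names) of the pattern being evaluated: refactor a function without changing its behaviour, then cover it with a unit test:

```python
def total_price(items: list[dict]) -> float:
    """Refactored version: a single comprehension replaces a manual loop,
    and a missing quantity defaults to 1 instead of raising KeyError."""
    return sum(item["price"] * item.get("quantity", 1) for item in items)

# A unit test of the kind the models were asked to generate:
def test_total_price() -> None:
    items = [
        {"price": 2.0, "quantity": 3},
        {"price": 5.0},  # missing quantity defaults to 1
    ]
    assert total_price(items) == 11.0

test_total_price()
```

This is exactly where DeepSeek R1 stumbled: its refactor changed behaviour (removing functions the rest of the app relied on), which is the one thing a refactor must never do.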
Final Recommendations
- For New Projects: Use o3-mini in Cursor. Its ability to generate structured code in a single pass is unmatched.
- For Complex Features: Combine o3-mini (for architecture) with DeepSeek R1 (for specific snippets).
- For Tight Budgets: DeepSeek R1 is the most economical choice but requires more attention and supervision during development.
What About Claude 3.5?
With costs up to roughly 7x higher than DeepSeek R1's and inferior performance from the very first practical test, Claude 3.5 Sonnet is not a viable option for daily development. I recommend focusing on o3-mini and DeepSeek R1, which offer a better balance of cost and performance.
How to Use Both Models Together
- Planning Phase: Use o3-mini to outline the overall project structure. Its ability to handle large context windows allows for comprehensive planning.
- Optimization and Final Adjustments: After structuring the project, use DeepSeek R1 with continuous "supervision" to fine-tune specific functions, improve code efficiency, and reduce costs in specific tasks.
Final Considerations
The integration of AI models like o3-mini and DeepSeek R1 into the development workflow can completely transform the way we create and maintain projects.
While o3-mini stands out for its consistency and ability to handle complex tasks, DeepSeek R1 offers an economical solution for fine-tuning and specific tasks.
So, which model will you test first? 👨💻
Did you like it? Share your experiences in the comments! 🚀