Mahmoud Ayoub

How DeepSeek Narrowed the Gap to OpenAI’s o1 Model: A Revolutionary Step in Reasoning AI

In January 2025, DeepSeek-AI introduced its reasoning model, DeepSeek-R1, claiming performance on par with OpenAI's o1-1217 model. By combining large-scale reinforcement learning (RL) with a small amount of curated cold-start data and targeted distillation, DeepSeek achieved remarkable reasoning performance without the vast computational resources typically associated with pretraining. This article explores how DeepSeek brought its model within striking distance of OpenAI's and highlights key insights for the AI community.


Figure: Benchmark performance of DeepSeek-R1

Superiority of DeepSeek's Approach

  1. Reinforcement Learning as the Core Training Strategy

    DeepSeek leveraged Group Relative Policy Optimization (GRPO), a cost-effective RL algorithm, to optimize reasoning capabilities. Unlike PPO, GRPO drops the separate critic (value) model and instead estimates each sampled output's advantage relative to the other outputs in its group, which made large-scale RL on math, coding, and logical reasoning far cheaper (a minimal sketch of the idea follows this list).

  2. Two-Tiered Model Development

    • DeepSeek-R1-Zero: Trained purely with RL, this model displayed self-evolution, developing advanced problem-solving behaviors such as reflection and iterative re-evaluation.
    • DeepSeek-R1: Built upon R1-Zero, this version added a cold-start phase, utilizing curated Chain-of-Thought (CoT) datasets to produce coherent, user-friendly outputs and align with human preferences.
  3. Cold Start Data for Readability and Accuracy

    The cold-start phase addressed RL’s training instability by incorporating a small set of high-quality CoT examples. This improved both readability and alignment with user expectations, ensuring the model produced clearer and more accurate outputs.

  4. Revolutionizing Distillation

    DeepSeek demonstrated the power of distillation, transferring reasoning capabilities from the full-scale DeepSeek-R1 into smaller dense models such as Qwen-14B and Qwen-32B. These smaller models outperformed many larger counterparts, achieving state-of-the-art results on benchmarks such as AIME 2024 and MATH-500 without requiring expensive RL training.

  5. Benchmark Excellence

    • Achieved 97.3% on MATH-500 and 79.8% on AIME 2024, matching OpenAI-o1-1217.
    • Excelled on Codeforces, with an Elo rating of 2029, outperforming 96% of human participants.
    • Delivered strong results on non-reasoning tasks like creative writing, summarization, and editing, with a 92.3% win-rate on ArenaHard.
  6. Emergent Behaviors

    During RL training, DeepSeek-R1-Zero developed advanced reasoning strategies like reflection, verification, and prolonged thinking time. These unprogrammed emergent behaviors underscored RL’s potential to drive high-level intelligence.

  7. Open-Source Contributions

    DeepSeek went beyond the norm by open-sourcing not only its primary models but also six smaller dense models distilled from DeepSeek-R1. This decision enables researchers to build on its achievements without facing prohibitive computational costs.
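
To make point 1 concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO. The reward values, group size, and function name are made up for illustration; this is not DeepSeek's training code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Core GRPO idea: score each sampled output against its own group,
    advantage_i = (r_i - mean(group)) / std(group), so no learned critic is needed."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:                      # every output scored the same -> no learning signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Hypothetical group of four answers to one math prompt, scored 1.0 if the final
# answer was correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```

In the full objective these advantages weight a clipped, PPO-style policy-gradient term with a KL penalty toward a reference model; the group normalization above is what lets GRPO dispense with a separate value network.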


Challenges Faced and Overcome

  1. Instability in Early RL Training

    • Challenge: Pure RL training led to unstable outputs, including poor readability and language mixing.
    • Solution: The cold-start phase stabilized training by giving the model a structured foundation, significantly improving output quality.
  2. Language Mixing in Chain-of-Thought (CoT)

    • Challenge: RL training often resulted in mixed-language responses, reducing accessibility.
    • Solution: A language consistency reward was introduced to encourage single-language chains of thought, aligning outputs with user preferences (see the reward sketch after this list).
  3. Scaling RL for Smaller Models

    • Challenge: Direct RL on smaller models was computationally expensive and yielded limited results.
    • Solution: Reasoning patterns were distilled from DeepSeek-R1 to smaller models like Qwen and Llama, achieving strong performance with far lower costs.
  4. Cold-Start Data Challenges

    • Challenge: Curating high-quality cold-start datasets was time-intensive but necessary.
    • Solution: Effective datasets were assembled by prompting with long CoT exemplars, collecting DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
  5. Sensitivity to Prompts

    • Challenge: DeepSeek-R1’s performance was highly sensitive to how prompts were phrased.
    • Solution: Users were advised to use zero-shot prompting, describing the problem and the desired output format directly; few-shot prompts were observed to degrade DeepSeek-R1's performance.
  6. Impact of Safety RL

    • Challenge: Safety-focused RL caused overly cautious behavior, such as refusing to answer certain queries on the Chinese SimpleQA benchmark.
    • Solution: Plans are in place to fine-tune safety mechanisms to better balance task performance and risk management.
  7. Complexity of Software Engineering Tasks

    • Challenge: Long evaluation times limited RL’s effectiveness for coding and engineering tasks.
    • Solution: Future iterations will implement asynchronous evaluations and rejection sampling to boost efficiency in these areas.
  8. Challenges with Fine-Grained Rewards

    • Challenge: Process-based reward models struggled to define intermediate steps and were prone to reward hacking.
    • Solution: DeepSeek fell back on simple rule-based accuracy and format rewards, which kept the RL pipeline robust against reward hacking (a toy version of such rewards is sketched after this list).
  9. Monte Carlo Tree Search (MCTS) Limitations

    • Challenge: MCTS failed to scale due to the large search space in token generation.
    • Solution: RL with CoT was more practical and effective for handling complex reasoning tasks.
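
Challenges 2 and 8 both come down to simple, rule-based reward signals. The sketch below illustrates the idea with two hypothetical checks: an accuracy reward that compares a \boxed{...} final answer against a reference, and a crude language-consistency reward based on the share of CJK characters in the text. The exact rules DeepSeek used are not published at this level of detail, so treat this purely as an illustration.

```python
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Rule-based accuracy check: pull the last \\boxed{...} answer out of the
    completion and compare it to the reference (1.0 for a match, 0.0 otherwise)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0

def language_consistency_reward(completion: str) -> float:
    """Crude consistency signal for English targets: the fraction of alphabetic
    characters that are NOT in the CJK Unicode range."""
    letters = [ch for ch in completion if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return 1.0 - cjk / len(letters)

completion = "The sum is 3 + 4 = 7, so the answer is \\boxed{7}."
print(accuracy_reward(completion, "7"))          # 1.0
print(language_consistency_reward(completion))   # 1.0 for all-English text
```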

Key Takeaways

  1. Reinforcement Learning Alone Can Drive Reasoning

    DeepSeek proved that RL alone can develop strong reasoning capabilities, challenging the reliance on supervised fine-tuning.

  2. Cold Start Data Makes a Big Difference

    Introducing a small, high-quality dataset as a cold start greatly improved training stability and output clarity, solving major RL-only issues.

  3. Distillation Expands Access

    By distilling reasoning capabilities into smaller models, DeepSeek made high-performance AI accessible without massive computational requirements; a minimal sketch of this distillation-as-fine-tuning recipe follows these takeaways.

  4. Emergent Behaviors Show RL’s Power

    Spontaneous behaviors like reflection and iterative problem-solving highlight the potential of RL to unlock sophisticated reasoning in AI.

  5. Open Source Accelerates Progress

    DeepSeek’s open-source models invite collaboration and innovation, speeding up advancements in reasoning AI.

  6. Competitive Results Validate the Approach

    With performance rivaling OpenAI’s o1-1217 on reasoning and coding benchmarks, DeepSeek-R1 proved itself as a serious contender in the AI space.
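
The distillation mentioned in takeaway 3 is, in practice, plain supervised fine-tuning of a small "student" model on reasoning traces generated by DeepSeek-R1. Below is a minimal, hypothetical sketch using Hugging Face transformers; the student checkpoint, the single training example, and the hyperparameters are placeholders, not DeepSeek's actual recipe.

```python
# Distillation-as-SFT sketch: fine-tune a small student on teacher-generated traces.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-1.5B"                      # placeholder student model
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# Each record pairs a prompt with a teacher-generated chain of thought and answer.
teacher_traces = [
    {"prompt": "What is 17 * 24?",
     "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think> The answer is 408."},
]

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for record in teacher_traces:
    text = record["prompt"] + "\n" + record["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token prediction on the teacher's trace: the "distillation" here
    # is simply SFT on teacher outputs, with no RL step on the student.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```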


Benchmark Analysis

Table: Comparison between DeepSeek-R1 and other representative models

General Knowledge Performance

  • Achieved 90.8% on MMLU (Pass@1), surpassing GPT-4 (88.5%) and Claude-3.5 (88.3%)
  • Exceptional performance on MMLU-Pro with 84.0%, significantly ahead of competitors
  • Strong showing on DROP with 92.2% F1 score, outperforming all tested models

Mathematical Reasoning

  • Demonstrated remarkable mathematical abilities:
    • MATH-500: 97.3% (versus OpenAI o1-1217's 96.4%)
    • AIME 2024: 79.8% (edging past o1-1217's 79.2%)
    • CNMO 2024: 78.8% (significantly higher than GPT-4's 43.2%)

Coding Capabilities

  • Achieved elite-level performance on Codeforces:
    • 2029 rating (96.3 percentile)
    • Nearly matched OpenAI o1-1217's 2061 rating
  • Strong performance on LiveCodeBench with 65.9% Pass@1-COT
  • Solid results on SWE Verified tasks at 49.2% resolution rate

Multilingual Understanding

  • Demonstrated strong Chinese language capabilities:
    • C-Eval: 91.8%
    • CLUEWSC: 92.8%
    • C-SimpleQA: 63.7%
  • Outperformed most competitors in Chinese-language tasks

DeepSeek-R1 represents a landmark achievement in AI development, demonstrating that sophisticated reasoning capabilities can be achieved through innovative RL approaches without requiring massive computational resources. By combining GRPO with cold-start training and successful distillation strategies, DeepSeek has not only matched industry leaders but also made these capabilities more accessible to the broader AI community.

The success of DeepSeek-R1 suggests a promising future where advanced AI reasoning becomes more democratized. As the field continues to evolve, the lessons learned from DeepSeek's approach—particularly around RL training stability, model distillation, and open-source collaboration—will likely shape the next generation of AI development.
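
Because the distilled checkpoints are open-sourced, trying one locally is straightforward. Here is a hedged example that assumes the checkpoint is published on Hugging Face under a name like deepseek-ai/DeepSeek-R1-Distill-Qwen-14B and uses the zero-shot prompting style recommended above; double-check the official release for the exact model names and sampling settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"   # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Zero-shot prompt: state the problem and the desired output format directly,
# with no few-shot examples.
messages = [{"role": "user",
             "content": "Solve the problem and put the final answer in \\boxed{}: What is 7 * 13 + 5?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True,
                         temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```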

Top comments (1)

Marc Khaled

Interesting article!🔥