AI research is evolving fast, but training massive models is still a tough challenge because of the huge computing power needed. That’s where DeepSeek is changing the game. They’ve found a way to build top-tier AI models without burning through an enormous number of GPUs. By using a smart mix of cost-effective training strategies, Nvidia’s PTX assembly, and reinforcement learning, they’ve created cutting-edge models like DeepSeek-R1-Zero and DeepSeek-R1—proving that innovation doesn’t always have to come with an extreme price tag.
Optimizing GPU Usage: Cost-Effective Training at Scale
DeepSeek has set a new benchmark in efficient AI model training. For instance, DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster of 2,048 Nvidia H800 GPUs in just two months—totaling 2.8 million GPU hours, according to its research paper.
In comparison, OpenAI’s GPT-4, one of the most advanced language models, is estimated to have been trained on tens of thousands of Nvidia A100/H100 GPUs over several months, at a significantly higher compute cost. Similarly, Meta’s Llama 3 model with 405 billion parameters required 30.8 million GPU hours, roughly 11 times the compute of DeepSeek-V3, using 16,384 H100 GPUs over 54 days.
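A quick back-of-the-envelope check of the reported numbers (approximate, for illustration only):

```python
# Approximate sanity check of the figures quoted above.
gpu_count = 2_048        # H800 GPUs used for DeepSeek-V3
gpu_hours = 2.8e6        # total GPU hours reported by DeepSeek

print(gpu_hours / gpu_count / 24)    # ~57 days of wall-clock time, i.e. roughly two months
print(30.8e6 / gpu_hours)            # ~11: Llama 3 405B's reported GPU hours vs. DeepSeek-V3's
```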
DeepSeek’s approach stands out because of its cost-effective training strategies and efficient utilization of Nvidia’s PTX assembly and reinforcement learning techniques. By optimizing GPU usage and computational efficiency, it is proving that cutting-edge AI models don’t have to come with an astronomical price tag. This shift challenges the traditional belief that only massive GPU clusters can produce state-of-the-art models, making high-performance AI development more accessible and sustainable.
The Real Cost of DeepSeek-R1: Beyond the $6M Hype
There's been a lot of hype around DeepSeek's claim that they trained their latest model for just $6 million. But let’s be real—this number only covers the GPU rental costs for the final pre-training run. The actual investment behind DeepSeek R1 is much, much bigger.
Born from High-Flyer, a Chinese hedge fund that embraced AI early, DeepSeek made a bold move in 2021—acquiring 10,000 A100 GPUs before export restrictions tightened. This early bet secured a massive computational advantage. By 2023, they spun off as an independent AI lab, self-funded and ahead of the curve.
Today, DeepSeek operates with around 50,000 Hopper GPUs, including H800s, H100s, and incoming H20s, shared with High-Flyer’s trading operations. Their actual investment? Likely over $500 million, with total infrastructure costs nearing $1.3 billion.
Beyond hardware, DeepSeek’s strength lies in its elite team of roughly 150 people, handpicked for skill over credentials. Top engineers reportedly earn over $1.3 million annually, outpacing salaries at Chinese tech giants. Free from corporate bureaucracy, they optimize everything in-house, running their own data centers to push AI research further.
DeepSeek isn’t just cost-efficient—it’s strategically built for dominance. The "$6M" figure is a footnote in a much bigger story of foresight, risk-taking, and deep R&D.
Unlocking Maximum Efficiency: How DeepSeek Used PTX to Push GPU Limits
While most AI companies stick to Nvidia’s CUDA framework to train large models, DeepSeek took a bold and unconventional approach—leveraging PTX (Parallel Thread Execution) assembly to unlock previously untapped efficiency in GPU operations. This decision played a crucial role in the success of DeepSeek-R1-Zero and DeepSeek-R1, allowing them to be trained with fewer GPUs while maintaining high performance.
But what exactly is PTX, and why does it matter?
PTX: The Assembly Language of Nvidia GPUs
CUDA is often the go-to framework for AI development because it provides an easy-to-use interface for GPU programming. However, CUDA is essentially a high-level abstraction: the compiler lowers CUDA code into PTX, Nvidia’s intermediate assembly language (a virtual instruction set) that is in turn compiled into the GPU’s native machine code.
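You can see this lowering for yourself. The sketch below (illustrative only; it assumes the CUDA toolkit with nvcc is installed) compiles a trivial CUDA kernel to PTX and prints the resulting intermediate assembly:

```python
import pathlib
import subprocess
import tempfile

# A trivial CUDA kernel; `nvcc --ptx` stops at the PTX stage so we can inspect
# the intermediate assembly that CUDA code is normally lowered to.
cuda_src = r'''
extern "C" __global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
'''

with tempfile.TemporaryDirectory() as tmp:
    cu_file = pathlib.Path(tmp) / "axpy.cu"
    ptx_file = pathlib.Path(tmp) / "axpy.ptx"
    cu_file.write_text(cuda_src)
    subprocess.run(["nvcc", "--ptx", str(cu_file), "-o", str(ptx_file)], check=True)
    # The PTX output contains instructions such as ld.global.f32, fma.rn.f32, st.global.f32
    print(ptx_file.read_text()[:600])
```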
Think of it like this:
CUDA is like writing in Python: easy to use, but not always the fastest.
PTX is like writing in assembly language: harder to master, but it gives much finer control over how the hardware runs your code.
DeepSeek went beyond CUDA, rewriting key computational operations directly in PTX. This allowed them to optimize memory access, reduce instruction overhead, and control exactly which instructions the GPU executes, pushing performance closer to the hardware’s limits.
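For a taste of what working at this level looks like, here is a toy sketch (illustrative only, not DeepSeek’s code) that embeds a single hand-written PTX instruction, a fused multiply-add, inside a CUDA kernel and launches it from Python with CuPy (it assumes CuPy and an Nvidia GPU are available):

```python
import cupy as cp
import numpy as np

# CUDA C source containing one hand-written inline-PTX instruction:
# fma.rn.f32 = fused multiply-add, round-to-nearest, fp32.
kernel_src = r'''
extern "C" __global__
void square_add(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        asm volatile("fma.rn.f32 %0, %1, %1, %2;"   // r = x[i] * x[i] + y[i]
                     : "=f"(r) : "f"(x[i]), "f"(y[i]));
        out[i] = r;
    }
}
'''
square_add = cp.RawKernel(kernel_src, "square_add")

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
square_add((blocks,), (threads,), (x, y, out, np.int32(n)))

assert cp.allclose(out, x * x + y)   # matches the plain high-level expression
```

In real workloads the payoff comes from hand-tuning much larger kernels, but the mechanism is the same: drop below CUDA C and dictate the exact instructions the GPU runs.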
Why PTX Gave DeepSeek-R1-Zero and R1 an Edge
By tapping into PTX, DeepSeek unlocked three major advantages that purely CUDA-based training pipelines often leave on the table:
1. Ultra-Fine Hardware Control
   - The CUDA compiler optimizes code automatically, but it doesn’t always make the most efficient choices.
   - With PTX, DeepSeek manually fine-tuned GPU instructions, ensuring every computational cycle was used efficiently.
2. Optimized Memory Utilization
   - One of the biggest bottlenecks in training AI models is memory overhead. CUDA’s default memory handling can be inefficient, wasting bandwidth and GPU memory.
   - DeepSeek restructured tensor operations at the PTX level, reducing memory bottlenecks and increasing throughput (a simplified sketch of this kind of memory-level tuning follows this list).
3. Better Instruction Scheduling & Parallel Execution
   - GPUs are designed to process thousands of operations in parallel, but CUDA’s compiler doesn’t always schedule instructions optimally.
   - DeepSeek rewrote key computational kernels in PTX, achieving faster execution times and fewer processing stalls.
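To make the memory point concrete, here is another toy sketch (again illustrative, not DeepSeek’s tensor kernels; it assumes CuPy and an Nvidia GPU) of vectorized memory access: each thread moves one 128-bit float4 per load and store instead of four separate floats, cutting the number of memory transactions:

```python
import cupy as cp
import numpy as np

# Vectorized copy: one float4 (128 bits) per thread per load/store,
# instead of four separate 32-bit accesses.
src = r'''
extern "C" __global__
void copy_float4(const float4* __restrict__ src, float4* __restrict__ dst, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) dst[i] = src[i];
}
'''
copy_float4 = cp.RawKernel(src, "copy_float4")

n = 1 << 22                                   # number of floats (multiple of 4)
a = cp.random.rand(n, dtype=cp.float32)
b = cp.empty_like(a)

threads, n4 = 256, n // 4
copy_float4(((n4 + threads - 1) // threads,), (threads,), (a, b, np.int32(n4)))
assert cp.allclose(a, b)
```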
PTX in Action: The DeepSeek Difference
Most AI companies throw more GPUs at the problem to speed up training. DeepSeek, on the other hand, focused on efficiency. By bypassing CUDA in key areas and directly optimizing PTX execution, they maximized GPU utilization without increasing hardware costs.
This shift in approach redefines what’s possible in AI training. Instead of relying on brute-force compute power, DeepSeek proved that smart software optimizations can be just as impactful as expensive hardware upgrades.
By mastering PTX, DeepSeek is not just developing AI models—it’s reshaping how AI is built, proving that next-generation models can be trained smarter, not harder.
DeepSeek-R1-Zero: Reinforcement Learning Without Supervised Fine-Tuning
A Bold Step in AI Development
With DeepSeek-R1-Zero, DeepSeek introduced an AI model trained purely through Reinforcement Learning (RL), without any Supervised Fine-Tuning (SFT). Traditionally, AI models undergo supervised fine-tuning before reinforcement learning to improve their reasoning skills. DeepSeek-R1-Zero skipped this step entirely, showing that a model can develop strong reasoning abilities autonomously through trial and feedback. This unconventional method challenges the long-held belief that supervised fine-tuning is essential for high-performance AI.
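The DeepSeek-R1 technical report describes this pure-RL recipe as Group Relative Policy Optimization (GRPO) driven by simple rule-based rewards: an accuracy reward for the final answer plus a format reward for wrapping the reasoning in <think>...</think> tags. The sketch below is a heavily simplified illustration of those two ingredients, not the actual training code:

```python
import re
from statistics import mean, pstdev

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy reward: 1.0 if the final \\boxed{} answer matches, plus a small format bonus."""
    fmt_bonus = 0.1 if "<think>" in completion and "</think>" in completion else 0.0
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    answer_ok = 1.0 if match and match.group(1).strip() == reference_answer else 0.0
    return answer_ok + fmt_bonus

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled completion's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

completion = "<think> 2 + 2 = 4, so the result is 4. </think> The answer is \\boxed{4}"
print(rule_based_reward(completion, "4"))                 # 1.1
print(group_relative_advantages([1.1, 0.1, 0.0, 1.0]))    # above-average samples get positive advantages
```

Completions with above-average rewards in their group are reinforced and the rest are discouraged, with no labeled reasoning traces required.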
Exceptional Reasoning Without Supervised Training
One of the most remarkable aspects of DeepSeek-R1-Zero is its ability to generalize and solve complex problems using RL alone. Unlike traditional models that rely on large amounts of labeled data, this model learned organically by interacting with its environment. Additionally, its performance could be further improved using majority voting, a technique that refines responses by selecting the most common answer across multiple attempts. On the AIME benchmark, DeepSeek-R1-Zero’s accuracy increased from 71.0% to 86.7% when majority voting was applied, even surpassing OpenAI-o1-0912. This achievement highlights the true potential of reinforcement learning in building highly capable AI systems.
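Majority voting (also called self-consistency) is simple to sketch: sample several independent answers to the same problem and keep the most common final answer. A minimal illustration:

```python
from collections import Counter

def majority_vote(sampled_answers: list[str]) -> str:
    """Return the most frequent final answer among multiple sampled attempts."""
    return Counter(sampled_answers).most_common(1)[0][0]

# e.g. five sampled attempts at one AIME-style problem
samples = ["42", "42", "17", "42", "17"]
print(majority_vote(samples))   # -> "42"
```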
The Self-Evolution Process
A key aspect of DeepSeek-R1-Zero’s development was its ability to self-evolve without human intervention. Since the model was trained purely with RL, researchers could closely observe how it progressed and refined its reasoning over time. Instead of improving based on human-provided examples, DeepSeek-R1-Zero learned from reinforcement feedback alone. By increasing test-time computation—giving itself more time to process and generate reasoning tokens—the model naturally improved its problem-solving strategies. This process demonstrated that AI can teach itself to think more deeply without requiring external adjustments.
Emergent Behaviors: Reflection and Alternative Problem-Solving
One of the most fascinating discoveries during DeepSeek-R1-Zero’s training was the spontaneous emergence of advanced reasoning behaviors. The model began to reflect on its own responses, revisiting and improving previous answers. It also started exploring multiple ways to solve a problem, rather than sticking to a single fixed approach. These behaviors weren’t explicitly programmed but emerged organically as a result of the RL training process. This milestone suggests that reinforcement learning can lead AI to develop structured thinking on its own, a significant step toward more autonomous and intelligent models.
The “Aha Moment” – A Breakthrough in AI Reasoning
One of the most intriguing moments in DeepSeek-R1-Zero’s evolution was the so-called "aha moment". At a certain stage in its training, the model realized that allocating more time to difficult problems led to better solutions. Instead of rushing to generate responses, it started pausing, reconsidering, and refining its reasoning process. This shift wasn’t directly taught to the model; it emerged naturally as reinforcement learning optimized for better problem-solving strategies. The researchers describe it as an "aha moment" for themselves as much as for the model: it highlighted the power of reinforcement learning to drive independent improvement and showed how AI can develop strategies beyond what was explicitly programmed.
Challenges: Readability and Language Mixing
Despite its impressive reasoning capabilities, DeepSeek-R1-Zero was not without flaws. One major issue was readability, as the model’s reasoning process was often difficult to follow. Additionally, it sometimes suffered from language mixing, blending multiple languages in its responses, which reduced clarity. These challenges showed that while RL alone can drive strong reasoning development, a balance between autonomous learning and structured human guidance is still necessary for a more practical AI system.
Refining the Model: The Introduction of DeepSeek-R1
To address these shortcomings, DeepSeek introduced DeepSeek-R1, a refined version that combines RL with a human-friendly “cold-start” dataset. This hybrid approach maintains the strong reasoning capabilities of DeepSeek-R1-Zero while improving readability and response structure. By integrating some level of human supervision, DeepSeek-R1 ensures that its reasoning remains strong while also making its output more coherent and accessible.
Upgrading to DeepSeek-R1: Reinforcement Learning with SFT and Cold Start Data
DeepSeek took a significant leap forward from DeepSeek-R1-Zero by integrating Supervised Fine-Tuning (SFT) and Cold Start Data into its training pipeline, leading to the development of DeepSeek-R1. This upgrade enhanced the model’s reasoning capabilities and alignment with human preferences, setting a new standard for open-source AI models.
The Two-Stage RL and SFT Pipeline
DeepSeek-R1 improved upon its predecessor by incorporating a refined pipeline of two reinforcement learning (RL) stages and two SFT stages (a schematic sketch follows the list below):
- Two RL Stages: Focused on refining reasoning abilities while ensuring outputs align with human expectations.
- Two SFT Stages: Built a strong foundation for both reasoning and non-reasoning tasks to improve overall model performance.
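Here is the schematic referenced above: a heavily simplified sketch of how the four stages fit together. The function names are placeholders for illustration, not DeepSeek’s actual training code:

```python
from typing import Any, Sequence

# Placeholder stage functions -- purely schematic stubs.
def sft_train(model: Any, data: Sequence[Any]) -> Any:
    """Supervised fine-tuning stage (stub)."""
    return model

def rl_train(model: Any, prompts: Sequence[Any], rewards: Sequence[str]) -> Any:
    """Reinforcement learning stage (stub)."""
    return model

def deepseek_r1_pipeline(base_model, cold_start_data, reasoning_prompts,
                         curated_sft_data, preference_prompts):
    # SFT #1: cold-start fine-tuning on curated, readable chain-of-thought examples
    model = sft_train(base_model, cold_start_data)
    # RL #1: large-scale reasoning RL (accuracy + language-consistency rewards)
    model = rl_train(model, reasoning_prompts, rewards=["accuracy", "language_consistency"])
    # SFT #2: ~600k rejection-sampled reasoning samples + ~200k general samples
    model = sft_train(model, curated_sft_data)
    # RL #2: alignment pass targeting helpfulness and harmlessness
    model = rl_train(model, preference_prompts, rewards=["helpfulness", "harmlessness"])
    return model

model = deepseek_r1_pipeline("base-model", [], [], [], [])
```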
Addressing the Cold Start Problem
One of the major challenges in AI training is the cold start problem, where models struggle in early training phases due to a lack of initial guidance. DeepSeek tackled this by carefully curating high-quality, diverse datasets for the first SFT stage. This ensured the model acquired solid foundational knowledge before reinforcement learning took over.
- Readability Improvements: Unlike DeepSeek-R1-Zero, which sometimes generated unreadable or mixed-language responses, DeepSeek-R1’s cold start data was designed with structured formatting, including a clear reasoning process and summary for each response.
- Performance Boost: By strategically crafting cold-start data with human-guided patterns, DeepSeek-R1 exhibited superior reasoning abilities compared to DeepSeek-R1-Zero.
Enhancing Reasoning with Reinforcement Learning
After establishing a solid foundation with cold-start data, DeepSeek-R1 employed large-scale reinforcement learning to further enhance its reasoning skills. This phase focused on areas requiring structured logic, such as coding, mathematics, and science.
- Language Consistency Rewards: To prevent responses from mixing multiple languages, DeepSeek introduced a reward mechanism that prioritized target-language consistency, ensuring more user-friendly outputs (see the sketch after this list).
- Optimized Reasoning Tasks: The model balanced accuracy in logic-driven tasks with human readability, refining its problem-solving approach through iterative reinforcement learning.
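As a toy illustration of the language-consistency idea, the sketch below scores the share of words written in the target language’s script; the reward described in the paper is the proportion of target-language words in the chain of thought, combined with the accuracy reward:

```python
def language_consistency_reward(text: str, target: str = "en") -> float:
    """Toy reward: fraction of words that appear to be in the target language's script."""
    words = text.split()
    if not words:
        return 0.0
    if target == "en":
        in_target = [w for w in words if all(ord(c) < 128 for c in w)]
    else:  # e.g. "zh": count words containing CJK characters
        in_target = [w for w in words if any("\u4e00" <= c <= "\u9fff" for c in w)]
    return len(in_target) / len(words)

print(language_consistency_reward("The answer is 42"))   # 1.0
print(language_consistency_reward("The 答案 is 42"))      # 0.75 -- mixed-language output scores lower
```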
Supervised Fine-Tuning for Diverse Capabilities
Once RL training reached convergence, the next step was supervised fine-tuning (SFT) to further refine the model across reasoning and non-reasoning tasks.
- Reasoning Data: Leveraging rejection sampling, DeepSeek curated a dataset of 600k reasoning-focused training samples, ensuring only high-quality responses were included (a simplified sketch follows this list).
- Non-Reasoning Data: A separate dataset of 200k samples covered diverse areas like writing, factual Q&A, self-cognition, and translation, enabling DeepSeek-R1 to perform well beyond just logic-based tasks.
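Rejection sampling here means: sample several candidate responses per prompt from the RL checkpoint, keep only the ones that pass quality checks (correct, readable, no language mixing), and reuse the survivors as SFT data. A minimal sketch with placeholder generation and acceptance functions:

```python
import random

random.seed(0)

def generate(prompt: str) -> str:
    """Stub for sampling one completion from the RL-trained checkpoint."""
    return f"candidate answer for: {prompt} ({random.random():.2f})"

def is_acceptable(prompt: str, completion: str) -> bool:
    """Stub checker standing in for correctness / readability / language filters."""
    return random.random() > 0.5

def rejection_sample(prompts: list[str], k: int = 4) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        kept = [c for c in candidates if is_acceptable(prompt, c)]
        dataset.extend((prompt, c) for c in kept)
    return dataset

pairs = rejection_sample(["prove that ...", "integrate x^2 ..."], k=4)
print(len(pairs), "accepted (prompt, response) pairs ready for SFT")
```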
Reinforcement Learning for Holistic Improvement
To ensure DeepSeek-R1 aligned with human preferences while maintaining strong reasoning, an additional reinforcement learning phase was introduced. This phase prioritized:
- Helpfulness: Ensuring responses were relevant and user-friendly, with a focus on clear and useful summaries.
- Harmlessness: Filtering out biases, harmful content, or misleading information while maintaining logical accuracy.
- Balanced Training: Integrating reasoning and general-purpose training data to create a well-rounded model capable of excelling in both structured problem-solving and open-ended tasks.
Final Outcome: A Breakthrough in Open-Source AI
By combining reinforcement learning, supervised fine-tuning, and strategically curated cold start data, DeepSeek-R1 emerged as a groundbreaking model, outperforming its predecessors. Its distilled versions (DeepSeek-R1-Distill) achieved state-of-the-art results in reasoning benchmarks, proving the effectiveness of this hybrid training approach. DeepSeek-R1 not only pushes the boundaries of AI reasoning but also ensures outputs are more user-friendly, readable, and aligned with human expectations.