One of the ultimate goals of the Large Language Models (LLMs) we use today is the ability to understand and perform any intellectual task that a human being can. This concept is commonly referred to as Artificial General Intelligence (AGI). The race towards AGI has ignited rapid development of LLMs from world-leading AI labs such as OpenAI, Meta, Google, Anthropic, and Alibaba (Qwen).
Recently, new LLMs developed by DeepSeek have generated massive hype within the AI community due to their combination of performance and low operational cost. For example, the DeepSeek R1 model is claimed to perform similarly to OpenAI's most advanced reasoning model to date, o1, at only a fraction of the training cost. Likewise, DeepSeek V3's performance is comparable to GPT-4o, again at a fraction of the training cost. Unlike OpenAI, DeepSeek has decided to fully open-source its models, giving the entire AI community access to the model weights. This will speed up progress towards AGI even more.
This article will discuss several innovative features of the DeepSeek model, specifically DeepSeek V3, that make this LLM's performance comparable to the latest state-of-the-art, closed-source models available. So, without further ado, let's explore the first key innovative feature.
Feature One: Multi-Head Latent Attention
At its core, DeepSeek V3 still adopts the classical Transformers architecture. It consists of a massive number of Transformer blocks, where each block contains several important layers: normalization, attention, and feed-forward layers, as you can see in the following visualization:
Visualization of a single Transformer block.
In this section, we're going to focus solely on the attention layer, since this is where the Multi-Head Latent Attention (MLA) of the DeepSeek V3 model resides.
In a nutshell, an attention layer expects the embedding representation of a token at a particular position as input. The first step of the attention layer is to project this input embedding into query, key, and value vectors using three learned weight matrices. The layer then uses these vectors to estimate the context of this particular token with respect to the previous tokens, a process commonly called the attention mechanism.
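To make this concrete, here is a minimal single-head sketch of that step in PyTorch. All dimensions and weight matrices are illustrative placeholders, not DeepSeek V3's actual configuration:

```python
import torch

d_model, n_ctx = 64, 10                  # illustrative sizes, not DeepSeek V3's real ones
W_q = torch.randn(d_model, d_model)      # learned projection matrices (random stand-ins here)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

x = torch.randn(n_ctx, d_model)          # embeddings of the tokens seen so far

# Project every token's embedding into query, key, and value vectors.
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Attention of the newest token (last row) over itself and all previous tokens.
scores = q[-1] @ k.T / d_model ** 0.5    # scaled dot-product scores
weights = torch.softmax(scores, dim=-1)  # how much each token contributes to the context
context = weights @ v                    # context vector for the newest token
print(context.shape)                     # torch.Size([64])
```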
However, the way the attention mechanism is calculated poses a significant drawback. As you might already know, LLMs generate one token at a time, and each new token depends on all previously generated tokens. Therefore, to compute attention for a new token, the key and value vectors of all previous tokens would need to be recomputed. For example, generating token number 50 would require recomputing the keys and values of tokens 1 through 49 at every step. This results in a very slow token generation process during inference.
KV Cache management in vLLM. Source.
To solve this issue, an approach called KV cache is normally implemented in LLMs to speed up the token generation process. As the name suggests, with KV cache, the key and value of a new token are stored in a cache during each generation process. Therefore, during the attention calculation of a new token, we use the cached key and value of previous tokens instead of recomputing everything from scratch. This effectively speeds up the token generation process.
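The sketch below illustrates the idea for a single attention head: only the newest token is projected at each step, while the keys and values of earlier tokens are read from the cache. As before, the dimensions and weights are illustrative placeholders:

```python
import torch

d_model = 64
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x_new: torch.Tensor) -> torch.Tensor:
    """Attention for one new token, reusing cached keys/values of previous tokens."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)          # only the new token is projected ...
    v_cache.append(x_new @ W_v)          # ... everything else comes from the cache
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    weights = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return weights @ V

for _ in range(5):                       # simulate five decoding steps
    out = decode_step(torch.randn(d_model))
print(out.shape)                         # torch.Size([64])
```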
DeepSeek V3 also utilizes a KV cache in its attention layer. In fact, it takes the approach further with the introduction of MLA. In essence, MLA compresses the input embedding into a low-rank latent representation, squeezing out redundant information. Because the key, value, and query vectors are derived from these much smaller latents, the KV cache memory footprint shrinks, which speeds up the token generation process.
Architecture of DeepSeek V3 in a single Transformer block. Source.
As you can see from the figure above, the approach jointly compresses key and value together into their low-rank representation. This compressed version of the key-value vector can then be cached similarly to normal KV cache.
Meanwhile, the query is compressed independently. Once compressed, the low-rank representation of the query vector is then processed by two different pipelines: one is projected directly with a layer to map it back into its high-dimensional representation, and another is processed by an approach called Rotary Positional Embedding (RoPE). The RoPE method is important for introducing positional information of the new token in a sequence. The outputs of these two pipelines are then concatenated into one final input for the multi-head attention layer.
The jointly compressed key-value vector also undergoes a similar process to the query vector. However, the input for RoPE of the key vector comes from the original input embedding instead of the compressed key-value vector.
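The snippet below is a heavily simplified, single-head sketch of this low-rank compression idea; the RoPE branch and the multi-head split are omitted, and the latent dimension and weight names are illustrative rather than DeepSeek V3's actual hyperparameters:

```python
import torch

d_model, d_latent = 64, 16               # illustrative; the latent dimension is much smaller

# Down-projections compress the input into small latent vectors ...
W_dkv = torch.randn(d_model, d_latent)   # joint key-value compression
W_dq  = torch.randn(d_model, d_latent)   # separate query compression
# ... and up-projections map the latents back to full-size keys, values, and queries.
W_uk = torch.randn(d_latent, d_model)
W_uv = torch.randn(d_latent, d_model)
W_uq = torch.randn(d_latent, d_model)

x = torch.randn(d_model)                 # embedding of the newest token

c_kv = x @ W_dkv                         # only this small latent needs to be cached
c_q  = x @ W_dq

k, v, q = c_kv @ W_uk, c_kv @ W_uv, c_q @ W_uq
print(c_kv.shape, k.shape)               # torch.Size([16]) torch.Size([64])
```

Caching the small latent c_kv instead of the full-size key and value vectors is what shrinks the KV cache.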
Feature Two: DeepSeek MoE
Another fascinating approach implemented within DeepSeek V3 is the Mixture of Experts (MoE) approach. As you can see from the image above, this method is implemented in DeepSeek V3 as a replacement for the original feed-forward network in the Transformers block.
Let's use an example to easily understand what MoE does. Imagine we're studying at a university with many professors, each an expert in a different subject (math, physics, literature). When we want to ask something about calculus, we'll be directed to the math professor. Likewise, if we want to ask something about quantum physics, we'll be directed to the physics professor.
MoE works in a similar way. It consists of many models, each with its own expertise to solve a particular problem.
During training, the gating network routes different parts of the training data to different expert models, so each expert gradually specializes in a particular kind of problem. Then, during inference, instead of relying on a single massive model to handle every domain, MoE assigns the query to the most capable expert models. This approach makes inference faster and more efficient, since only a small number of experts are activated for each prediction, depending on the task.
MoE in DeepSeek V3. Source.
An important element of an MoE approach is the gating network. This network has two main responsibilities: analyzing the input and routing it to the most appropriate expert models. However, a common problem in MoE training is load balancing, where the gating network keeps routing most of the training data to a few favored experts instead of distributing it across all of them.
A common fix is to add an auxiliary loss that forces the gating network to spread the training data across experts. The problem is that relying on an auxiliary loss alone has been shown to degrade the model's performance after training.
To strike a better trade-off between load balancing and model performance, DeepSeek V3 implements an auxiliary-loss-free load balancing strategy. It introduces a bias term for each expert that is dynamically adjusted depending on that expert's routing load. This ensures that no expert gets overloaded or under-utilized.
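The sketch below illustrates the routing idea: a per-expert bias is added to the gate scores only when selecting the top-k experts, and the bias is nudged down for overloaded experts and up for idle ones. The sizes, the sigmoid gate, and the simple sign-based update rule are illustrative, not DeepSeek V3's exact formulation:

```python
import torch

n_experts, top_k, d_model = 8, 2, 64     # illustrative sizes
W_gate = torch.randn(d_model, n_experts) # gating network (a single linear layer here)
bias = torch.zeros(n_experts)            # load-balancing bias, adjusted during training
gamma = 0.01                             # bias update speed (illustrative)

def route(tokens: torch.Tensor) -> torch.Tensor:
    """Pick top-k experts per token; the bias affects selection but not the mixing weights."""
    scores = torch.sigmoid(tokens @ W_gate)          # token-to-expert affinity scores
    chosen = torch.topk(scores + bias, top_k, dim=-1).indices
    # Adjust the bias from the observed load: penalize overloaded experts, boost idle ones.
    load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
    target = tokens.shape[0] * top_k / n_experts     # perfectly balanced load per expert
    bias.add_(gamma * torch.sign(target - load))     # in-place update of the shared bias
    return chosen

experts = route(torch.randn(32, d_model))            # route a batch of 32 token embeddings
print(experts.shape)                                 # torch.Size([32, 2])
```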
Also, as you can see in the visualization above, DeepSeek V3 designed certain experts to be "shared experts," and these experts are always active for various tasks. This implementation helps to improve the model's ability to generalize across different domains of tasks.
This MoE feature is the secret recipe behind the versatility of DeepSeek V3. As you'll see in the next section, DeepSeek V3 is highly performant in various tasks with different domains such as math, coding, language, etc. In fact, this model is currently the strongest open-source base model in several domains.
Feature Three: Multi-Token Prediction
Common LLMs are trained to predict one token at each position, but DeepSeek V3 operates differently, especially in its training phase. DeepSeek V3 implements so-called multi-token prediction (MTP) during training, which lets the model predict several future tokens at each position.
Although it adds a layer of complexity, the MTP approach is important for improving the model's performance across different tasks. By being trained to anticipate several future tokens at once, the model receives a denser training signal and learns representations that better plan for upcoming tokens.
Visualization of MTP approach in DeepSeek V3. Source.
To implement MTP, DeepSeek V3 uses more than one model: one acts as the main model, while the others act as MTP modules, each consisting of a stack of Transformer layers. The MTP module is much smaller than the main model: the total size of DeepSeek V3 on Hugging Face is 685B parameters, with 671B for the main model and 14B for the MTP module.
During the training phase, both the main model and the MTP modules take input from the same embedding layer. However, the prediction still happens sequentially: the main model predicts the token one step ahead, then the first MTP module predicts the token two steps ahead, and so on, depending on the number of MTP modules. The main model and the MTP modules also share the same output head.
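Conceptually, the training objective looks something like the sketch below. The Linear layers stand in for the real Transformer stacks, the loss weight is illustrative, and the actual MTP modules also mix in embeddings of the shifted tokens, which is omitted here for brevity:

```python
import torch
import torch.nn.functional as F

vocab, d_model, seq = 100, 64, 16              # illustrative sizes
embed = torch.nn.Embedding(vocab, d_model)     # shared embedding layer
head = torch.nn.Linear(d_model, vocab)         # shared output head
main_model = torch.nn.Linear(d_model, d_model) # stand-in for the main Transformer stack
mtp_module = torch.nn.Linear(d_model, d_model) # stand-in for one (much smaller) MTP module

tokens = torch.randint(0, vocab, (seq,))       # a toy training sequence
h = embed(tokens[:-2])                         # leave room for 1- and 2-step-ahead targets

h_main = main_model(h)                                    # main model: next-token prediction
loss_main = F.cross_entropy(head(h_main), tokens[1:-1])

h_mtp = mtp_module(h_main)                                # MTP module: token two steps ahead
loss_mtp = F.cross_entropy(head(h_mtp), tokens[2:])

loss = loss_main + 0.3 * loss_mtp              # weighted sum; 0.3 is an illustrative weight
loss.backward()
```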
We can be totally flexible with the MTP module during the inference phase. For example, we can completely discard the MTP module and use only the main model during inference, just like common LLMs. Also, we can use the MTP module to implement a speculative decoding approach to potentially speed up the generation process even more.
DeepSeek V3 Cost and Performance Compared to Other Models
All of the innovative features mentioned above enabled the DeepSeek V3 model to be trained much more cheaply than its closed-source competitors.
DeepSeek V3 was trained on a cluster of 2,048 NVIDIA H800 GPUs. According to the technical report, the pre-training phase cost around $5.328M, the context-length extension phase cost $0.238M, and the post-training phase with fine-tuning and reinforcement learning cost $0.01M, bringing the total to roughly $5.576M to train DeepSeek V3 from start to finish. This is significantly cheaper than the training of OpenAI's GPT-4, which reportedly cost around $100M.
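These figures follow directly from the GPU-hour counts in the DeepSeek V3 technical report, assuming its quoted rental price of $2 per H800 GPU hour; a quick sanity check:

```python
# GPU-hour counts from the DeepSeek V3 technical report, priced at $2 per H800 GPU hour.
gpu_hours = {"pre-training": 2_664_000, "context extension": 119_000, "post-training": 5_000}
price_per_gpu_hour = 2.0

for phase, hours in gpu_hours.items():
    print(f"{phase}: ${hours * price_per_gpu_hour / 1e6:.3f}M")
print(f"total: ${sum(gpu_hours.values()) * price_per_gpu_hour / 1e6:.3f}M")  # total: $5.576M
```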
DeepSeek V3 also showed superior performance compared to other open- and closed-source LLMs, such as Qwen2.5 72B, Llama 3.1 405B, Claude 3.5 Sonnet, and GPT-4o, across different benchmarks, as you can see in the figure below:
Comparison between DeepSeek-V3 and other state-of-the-art chat models. Source.
DeepSeek V3 outperforms other state-of-the-art models on various tasks, such as coding, math, and Chinese-language benchmarks, while its performance on English tasks is comparable to Claude 3.5 Sonnet on several benchmarks.
Additionally, DeepSeek V3 has been compared with other LLMs on open-ended generation tasks, using GPT-4-Turbo-1106 as a judge and length-controlled win rate as the metric. Here, DeepSeek V3 achieved the best performance among the compared models on both the Arena-Hard and AlpacaEval 2.0 benchmarks.
Comparison between DeepSeek-V3 and other state-of-the-art chat models on AlpacaEval 2.0 and Arena-Hard benchmarks. Source.
The superior performance of DeepSeek V3 on both Arena-Hard and AlpacaEval 2.0 benchmarks showcases its ability and robustness in handling long, complex prompts as well as writing tasks and straightforward question-answer scenarios.
How Developers Can Leverage DeepSeek V3
Apart from its performance, another main appeal of the DeepSeek V3 model is its open-source nature. DeepSeek has decided to open-source the V3 model under the MIT license, which means that developers can have free access to its weights and use it for their own purposes, even for commercial use.
We can use it for various GenAI use cases, from personalized recommendations and content generation to virtual assistants, internal chatbots, document summarization, and many more. These use cases also enable us to combine the power of DeepSeek V3 with Milvus, an open-source vector database, to store billions of context embeddings.
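As an example, embeddings produced in a DeepSeek V3-powered RAG pipeline can be stored and searched with Milvus. The sketch below uses the pymilvus MilvusClient with Milvus Lite; the collection name, the tiny vector dimension, and the random vectors are all placeholders for real embeddings:

```python
import random
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")      # Milvus Lite file; use a server URI in production
client.create_collection(collection_name="deepseek_docs", dimension=8)   # toy dimension

# In a real pipeline these vectors would come from an embedding model, not random numbers.
docs = [{"id": i, "vector": [random.random() for _ in range(8)], "text": f"chunk {i}"}
        for i in range(3)]
client.insert(collection_name="deepseek_docs", data=docs)

hits = client.search(collection_name="deepseek_docs",
                     data=[[random.random() for _ in range(8)]],
                     limit=2, output_fields=["text"])
print(hits)
```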
At the time of writing, DeepSeek V3 is not yet natively supported by the Hugging Face Transformers library. However, support is expected soon, which will make it easy to run the model locally. While we're waiting for that official integration, you can already run DeepSeek V3 in several ways.
The easiest way to try out DeepSeek V3 is through the official chat platform of DeepSeek. All you need to do is sign up and start chatting with the model.
If you'd like to run it locally on your machine, you need to first clone the official DeepSeek V3 repository with the following command:
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
Next, go to the inference folder and install all the required dependencies by running the following commands:
cd DeepSeek-V3/inference
pip install -r requirements.txt
Next, you need to download the model weights. Two versions are available on Hugging Face: the base model (after the pre-training phase only) and the chat model (after the post-training phase). Download the version you prefer and place the weights inside the /path/to/DeepSeek-V3 folder.
Now you can convert the Hugging Face model weights into the format expected by the inference script with the following command:
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
Finally, run this command to start chatting with DeepSeek V3:
torchrun --nnodes 2 --nproc-per-node 8 generate.py --node-rank $RANK --master-addr $ADDR --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
An alternative way to get up and running with DeepSeek V3 is via LLM-optimized serving frameworks such as vLLM, SGLang, LMDeploy, and TensorRT-LLM.
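With vLLM, for example, the setup could look roughly like the sketch below. The model name, parallelism degree, and sampling settings are illustrative; check the vLLM and DeepSeek documentation for the exact multi-GPU requirements of a model this size:

```python
from vllm import LLM, SamplingParams

# Illustrative: serving the full 671B model requires multiple GPUs and tensor parallelism.
llm = LLM(model="deepseek-ai/DeepSeek-V3", tensor_parallel_size=8, trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain Multi-Head Latent Attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```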
Future Outlook After DeepSeek V3
The introduction of DeepSeek V3 can be seen as a significant breakthrough in many aspects. Many innovations implemented in DeepSeek V3's training phase, such as MLA, MoE, MTP, and mixed-precision training with FP8 quantization, have opened up a pathway for us to develop an LLM that is not only performant and efficient but also significantly cheaper to train.
The implementation of MLA, MoE, and MTP contributes to speeding up the token generation process during inference in different ways:
MLA saves KV cache memory and speeds up token generation by compressing input representations into low-rank latent vectors.
MoE speeds up the token generation process and improves model scalability by activating only certain experts during inference, depending on the task. Instead of activating all 671B parameters during inference, the model activates only a small fraction of them (around 37B).
MTP can be repurposed during inference to facilitate a speculative decoding approach. With this approach, the next token prediction can start from possible future tokens predicted by MTP modules instead of predicting it from scratch.
The fact that DeepSeek chose to open-source DeepSeek V3 under the MIT license also encourages us, the global AI community, to contribute, experiment, and build on its technology. This, in turn, puts all of us in the loop for faster innovation towards the goal of reaching AGI that benefits us all.
Although its performance is already superior compared to other state-of-the-art LLMs, research suggests that the performance of DeepSeek V3 can be improved even more in the future.
Previously, the DeepSeek team conducted research on distilling the reasoning power of its most powerful model, DeepSeek R1, into the DeepSeek V2.5 model. If you're not familiar with it, distillation refers to the process of transferring the knowledge of a bigger and more performant model into a smaller one.
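In its simplest form, distillation trains the smaller (student) model to imitate the output distribution of the larger (teacher) model. The sketch below shows the classic logit-matching formulation as a generic illustration; note that DeepSeek's own approach distilled R1 into V2.5 by fine-tuning on reasoning data generated by R1, rather than by matching logits directly:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft loss (match the teacher's softened distribution) with the usual hard loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 100, requires_grad=True)   # toy logits over a 100-token vocabulary
teacher = torch.randn(4, 100)                        # teacher logits (no gradient needed)
labels = torch.randint(0, 100, (4,))
distillation_loss(student, teacher, labels).backward()
```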
DeepSeek V2.5 showed significant improvements on LiveCodeBench and MATH-500 benchmarks when presented with additional distillation data from the R1 model, although it also came with an obvious drawback: an increase in average response length.
The contribution of distillation from DeepSeek-R1 on DeepSeek V2.5. Source.
Nonetheless, this research shows that the same knowledge distillation technique can also be applied to DeepSeek V3 in the future to further optimize its performance across various data domains.
Conclusion
DeepSeek V3 represents a major step forward in the field of open-source AI. It offers performance comparable to leading closed-source models at only a fraction of the training cost. Its innovative features, including Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and Multi-Token Prediction (MTP), contribute to both efficiency and accuracy during the training and inference phases. Also, its open-source nature under the MIT license enables the AI community to build on its advancements, thus accelerating progress toward AGI.
Looking ahead, DeepSeek V3's impact could become even more powerful. The potential application of knowledge distillation techniques, as previously explored with DeepSeek R1 and DeepSeek V2.5, suggests room for further optimization and efficiency improvements. We can say that DeepSeek V3 sets a new benchmark for cost-effective, high-performance AI research.