Supercharging Deepseek-R1 with Ray + vLLM: A Distributed System Approach

Video Tutorial

Intended Audience 👤

  • Anyone who is curious and ready to explore the extra links, OR
  • Familiarity with Ray
  • Familiarity with vLLM
  • Familiarity with Kubernetes

Intro 👋

We are going to explore how to run a 32B DeepSeek-R1 model quantized to 4 bit, model_link. We will be using two Tesla T4 GPUs, each with 16GB of VRAM, and Azure for our Kubernetes setup and VMs, but the same setup can be done on any other platform or locally as well.

Setting up kubernetes ☸️

Our Kubernetes cluster will have 1 CPU node and 2 GPU nodes. Let's start by creating a resource group in Azure. Once that is done, we can create our cluster with the following command (change the name, resource group, and VM size accordingly):

az aks create --resource-group rayBlog \  
    --name rayBlogCluster \  
    --node-count 1 \  
    --enable-managed-identity \  
    --node-vm-size Standard_D8_v3 \  
    --generate-ssh-keys

Here I am using the Standard_D8_v3 VM, which has 8 vCPUs and 32GB of RAM. After the cluster creation is done, let's add two more GPU nodes using the following command:

az aks nodepool add \  
    --resource-group rayBlog \  
    --cluster-name rayBlogCluster \  
    --name gpunodepool \  
    --node-count 2 \  
    --node-vm-size Standard_NC4as_T4_v3 \  
    --labels node-type=gpu
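Optionally, you can confirm that both node pools were created (a quick check; the output columns may vary):

# List the node pools of the cluster
az aks nodepool list \
    --resource-group rayBlog \
    --cluster-name rayBlogCluster \
    --output table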

I have chosen the Standard_NC4as_T4_v3 VM for the GPU nodes and kept the count at 2, so in total we will have 32GB of VRAM (16+16). Let's now add the Kubernetes config to our system: az aks get-credentials --resource-group rayBlog --name rayBlogCluster.
We can now use k9s(want to explore k9s?) to view our nodes and check if everything is configured correctly.
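If you prefer plain kubectl over k9s, a quick check could look like this:

# List all nodes and make sure the two GPU nodes have joined
kubectl get nodes -o wide

# Show only the GPU node pool via the label we set earlier
kubectl get nodes -l node-type=gpu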

[Image: k9s node description]
As shown in the image above, GPU resources are not available on the GPU nodes. This is because we have to install the NVIDIA device plugin, so let's do that using kubectl(explore!):
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
Now let's check again:

[Image: k9s node description, GPU available]
Great! But before creating our Ray cluster we still have one step to do: apply taints to the GPU nodes so that their resources are not exhausted by other workloads: kubectl taint nodes <gpu-node-1> gpu=true:NoSchedule, and the same for the second GPU node.
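As a concrete sketch (the node names are placeholders; use the names reported by kubectl get nodes):

# Find the GPU node names via the node-type=gpu label
kubectl get nodes -l node-type=gpu

# Taint both GPU nodes so only pods with a matching toleration land on them
kubectl taint nodes <gpu-node-1> gpu=true:NoSchedule
kubectl taint nodes <gpu-node-2> gpu=true:NoSchedule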

Creating ray cluster 👨‍👨‍👦‍👦

We are going to use the KubeRay operator(🤔) and the KubeRay APIServer(❓). The KubeRay APIServer allows us to create the Ray cluster without writing native Kubernetes manifests, which is convenient, so let's install both(what is helm?):

helm repo add kuberay https://ray-project.github.io/kuberay-helm/

helm install kuberay-operator kuberay/kuberay-operator --version 1.2.2

helm install kuberay-apiserver kuberay/kuberay-apiserver --version 1.2.2
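To confirm that the operator and APIServer came up, a quick look at the pods is enough (pod names will include generated suffixes):

# Both kuberay-operator and kuberay-apiserver pods should be Running
kubectl get pods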

Let's port-forward our KubeRay APIServer using this command: kubectl port-forward <api server pod name> 8888:8888. Now let's create a common namespace where the Ray cluster related resources will live: kubectl create namespace ray-blog. Finally, we are ready to create our cluster!
We first create the compute templates that specify the resources for the head and worker groups.
Send a POST request with the payload below to http://localhost:8888/apis/v1/namespaces/ray-blog/compute_templates
For head:

{
    "name": "ray-head-cm",
    "namespace": "ray-blog",
    "cpu": 5,
    "memory": 20
}

For worker:

{
    "name": "ray-worker-cm",
    "namespace": "ray-blog",
    "cpu": 3,
    "memory": 20,
    "gpu": 1,
    "tolerations": [
    {
      "key": "gpu",
      "operator": "Equal",
      "value": "true",
      "effect": "NoSchedule"
    }
  ]
}

NOTE: We have added tolerations to our worker spec since we tainted our GPU nodes earlier.
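For reference, both templates can be created with a couple of curl calls like the sketch below (head.json and worker.json are placeholder filenames for the payloads above):

# Create the head compute template
curl -X POST http://localhost:8888/apis/v1/namespaces/ray-blog/compute_templates \
    -H "Content-Type: application/json" \
    -d @head.json

# Create the worker compute template
curl -X POST http://localhost:8888/apis/v1/namespaces/ray-blog/compute_templates \
    -H "Content-Type: application/json" \
    -d @worker.json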
Now let's create the Ray cluster. Send a POST request with the payload below to http://localhost:8888/apis/v1/namespaces/ray-blog/clusters

{
   "name":"ray-vllm-cluster",
   "namespace":"ray-blog",
   "user":"ishan",
   "version":"v1",
   "clusterSpec":{
      "headGroupSpec":{
         "computeTemplate":"ray-head-cm",
         "rayStartParams":{
            "dashboard-host":"0.0.0.0",
            "num-cpus":"0",
            "metrics-export-port":"8080"
         },
         "image":"ishanextreme74/vllm-0.6.5-ray-2.40.0.22541c-py310-cu121-serve:latest",
         "imagePullPolicy":"Always",
         "serviceType":"ClusterIP"
      },
      "workerGroupSpec":[
         {
            "groupName":"ray-vllm-worker-group",
            "computeTemplate":"ray-worker-cm",
            "replicas":2,
            "minReplicas":2,
            "maxReplicas":2,
            "rayStartParams":{
               "node-ip-address":"$MY_POD_IP"
            },
            "image":"ishanextreme74/vllm-0.6.5-ray-2.40.0.22541c-py310-cu121-serve:latest",
            "imagePullPolicy":"Always",
            "environment":{
               "values":{
                  "HUGGING_FACE_HUB_TOKEN":"<your_token>"
               }
            }
         }
      ]
   },
   "annotations":{
      "ray.io/enable-serve-service":"true"
   }
}

Things to understand here:

  • We passed the compute templates that we created above
  • The Docker image ishanextreme74/vllm-0.6.5-ray-2.40.0.22541c-py310-cu121-serve:latest sets up Ray and vLLM on both the head and the workers; refer to the code repo for a more detailed understanding. The code is an update of the existing vLLM sample in the Ray examples; I have added a few params and changed the vLLM version and code to support it
  • Replicas are set to 2 since we are going to shard our model between two workers (1 GPU each)
  • HUGGING_FACE_HUB_TOKEN is required to pull the model from Hugging Face; create one and pass it here
  • "ray.io/enable-serve-service":"true" exposes port 8000, where our FastAPI application will be running

Deploy ray serve application 🚀

Once our Ray cluster is ready (use k9s to see the status), we can create a Ray Serve application which will contain our FastAPI server for inference. First let's port-forward port 8265 of our head-svc, where the Ray dashboard and Serve REST API are exposed; a quick sketch of that follows below. Once done, send a PUT request with the JSON payload after it to http://localhost:8265/api/serve/applications/
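The head service name below is a placeholder; check the actual name with kubectl get svc -n ray-blog:

# List services in the ray-blog namespace and note the head service name
kubectl get svc -n ray-blog

# Forward the dashboard / Serve REST API port of the head service
kubectl port-forward svc/<head-svc-name> -n ray-blog 8265:8265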

{
   "applications":[
     {
         "import_path":"serve:model",
         "name":"deepseek-r1",
         "route_prefix":"/",
         "autoscaling_config":{
            "min_replicas":1,
            "initial_replicas":1,
            "max_replicas":1
         },
         "deployments":[
            {
               "name":"VLLMDeployment",
               "num_replicas":1,
               "ray_actor_options":{

               }
            }
         ],
         "runtime_env":{
            "working_dir":"file:///home/ray/serve.zip",
            "env_vars":{
               "MODEL_ID":"Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ",
               "TENSOR_PARALLELISM":"1",
               "PIPELINE_PARALLELISM":"2",
               "MODEL_NAME":"deepseek_r1"
            }
         }
      }
   ]
}

Things to understand here:

  • ray_actor_options is empty because whenever we set tensor-parallelism or pipeline-parallelism > 1, it should either be empty or have num_gpus set to zero; refer to this issue and this sample for further understanding.
  • MODEL_ID is the Hugging Face model id, i.e. which model to pull.
  • PIPELINE_PARALLELISM is set to 2, since we want to shard our model across the two worker nodes. After sending the request we can visit localhost:8265; under Serve, our application will show as deploying. It usually takes some time depending on the system.
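For reference, a minimal curl sketch of this deployment request (serve_app.json is just a placeholder filename for the payload above, and port 8265 must already be forwarded):

# Deploy (or update) the Serve application through the Serve REST API
curl -X PUT http://localhost:8265/api/serve/applications/ \
    -H "Content-Type: application/json" \
    -d @serve_app.json

# Query the same endpoint to check the application status
curl http://localhost:8265/api/serve/applications/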

Inference 🎯

After the application reaches the "healthy" state we can finally run inference against our model. To do so, first port-forward port 8000 from the same head-svc that we port-forwarded for Ray Serve, and then send a POST request with the payload below to http://localhost:8000/v1/chat/completions

{
    "model": "deepseek_r1",
    "messages": [
        {
            "role": "user",
            "content": "think and tell which shape has 6 sides?"
        }
    ]
}

NOTE: model: deepseek_r1 is the same name that we passed to Ray Serve via MODEL_NAME.
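The same request as a curl sketch (assuming port 8000 is forwarded as described above):

# Send a chat completion request to the OpenAI-compatible endpoint
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_r1",
        "messages": [
            {"role": "user", "content": "think and tell which shape has 6 sides?"}
        ]
    }'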

And done 🥳🥳!!! Congrats on running a 32B deepseek-r1 model 🥂🥂
