Run DeepSeek locally on BareMetal via K8s

Anas Alloush

What is DeepSeek?

Just as OpenAI has the ChatGPT chatbot, DeepSeek also has a similar chatbot, and it comes with two models: DeepSeek-V3 and DeepSeek-R1.

DeepSeek-V3 is the default model used when we interact with the DeepSeek app. It’s a versatile large language model (LLM) that stands out as a general-purpose tool that can handle a wide range of tasks.


DeepSeek-R1 is a powerful reasoning model built for tasks that require advanced reasoning and deep problem-solving. It works great for logic-heavy questions and coding challenges that go beyond regurgitating code that has been written thousands of times.


What gave DeepSeek its blazing fame was the set of mathematical tricks it used to boost efficiency and compensate for not having the latest Nvidia GPU and Collective Communications Library (NCCL) performance at its disposal.

In short, DeepSeek used smart math to avoid relying on expensive hardware like Nvidia’s H100 GPUs for training on large datasets.
These optimizations also affect how the model can be run and used, which is why we can run it on local servers with normal (non-GPU) compute power.


DeepSeek’s Mathematical Tricks for Computational Efficiency:

1. Low-Rank Approximations for Faster Computation
One of DeepSeek’s key optimizations is low-rank matrix approximations, which reduce the number of operations needed in matrix multiplications. Instead of performing full-rank matrix multiplications, these methods approximate matrices with lower-dimensional representations, significantly reducing computational cost.
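
As a rough illustration of the idea (not DeepSeek’s actual code), the NumPy sketch below approximates a weight matrix with two thin factors obtained from a truncated SVD; the dimensions and rank are arbitrary choices for the example.

import numpy as np

d, r = 1024, 64

# Build a matrix that is approximately low-rank, as trained weight matrices often are.
W = np.random.randn(d, r) @ np.random.randn(r, d) + 0.01 * np.random.randn(d, d)

# Truncated SVD gives the best rank-r approximation: W ~ A @ B.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]      # shape (d, r)
B = Vt[:r, :]             # shape (r, d)

x = np.random.randn(d)
y_full = W @ x            # ~d*d multiply-adds per vector
y_low = A @ (B @ x)       # ~2*d*r multiply-adds, far cheaper when r << d

print("relative error:", np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))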

2. Grouped Query Attention (GQA) for Memory Savings
GQA restructures how attention is computed in transformer models, reducing the memory bandwidth required for attention operations. Instead of giving every query head its own key and value heads, GQA lets groups of query heads share the same key-value heads (a minimal sketch follows the list below), leading to:

  • Lower memory consumption
  • Faster inference speeds
  • Reduced redundant computations
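
To make that concrete, here is a minimal NumPy sketch of grouped-query attention; the shapes and group sizes are made up for the example and are not DeepSeek’s configuration. Several query heads reuse one shared key-value head, so far fewer K/V tensors need to be stored and moved.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq, d_head = 8, 16
n_q_heads, n_kv_heads = 8, 2              # 4 query heads share each key-value head
group = n_q_heads // n_kv_heads

Q = np.random.randn(n_q_heads, seq, d_head)
K = np.random.randn(n_kv_heads, seq, d_head)   # far fewer K/V heads than query heads
V = np.random.randn(n_kv_heads, seq, d_head)

outputs = []
for h in range(n_q_heads):
    kv = h // group                            # which shared K/V head this query head uses
    scores = Q[h] @ K[kv].T / np.sqrt(d_head)
    outputs.append(softmax(scores) @ V[kv])

out = np.stack(outputs)                        # (n_q_heads, seq, d_head)
print(out.shape)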

3. Mixed-Precision Training for Speed and Efficiency
DeepSeek utilizes mixed-precision training, where computations use FP16/BF16 instead of FP32, reducing memory footprint and accelerating training. However, to maintain numerical stability, loss scaling techniques are applied, ensuring that small gradients are not lost due to precision truncation.
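
As a generic illustration of the technique (not DeepSeek’s training stack), the PyTorch sketch below combines FP16 compute with loss scaling, assuming a CUDA GPU is available; on the CPU-only setup used later in this post you would skip this entirely.

import torch
from torch import nn

device = "cuda"                               # assumes a CUDA-capable GPU
model = nn.Linear(1024, 1024).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # loss scaling keeps tiny FP16 gradients from underflowing

x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

opt.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.float16):   # matmuls run in half precision
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # backprop the scaled loss
scaler.step(opt)                # unscale gradients, skip the step if an overflow occurred
scaler.update()                 # adjust the scale factor for the next iteration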

4. Quantization for Reduced Computational Complexity
Beyond mixed-precision, DeepSeek also benefits from quantization, where tensors are represented using lower-bit precision (e.g., INT8). This allows for faster matrix multiplications and reduced memory bandwidth consumption, making training more efficient.
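
Here is a minimal NumPy sketch of symmetric INT8 quantization, just to show the arithmetic involved; real inference stacks add calibration, per-channel scales, and fused INT8 kernels.

import numpy as np

def quantize_int8(x):
    # Symmetric quantization: map the largest magnitude to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)

print("int8 weights:\n", q)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())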

5. Stochastic Rounding to Maintain Accuracy
When using lower-precision floating-point formats, stochastic rounding is employed to mitigate the accumulation of rounding errors, ensuring the model maintains high accuracy despite using reduced precision.
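
A small NumPy sketch (not DeepSeek’s kernels) shows why stochastic rounding is unbiased: each value rounds up with probability equal to its fractional part, so the rounding error cancels out in expectation instead of accumulating.

import numpy as np

def stochastic_round(x, rng=np.random.default_rng(0)):
    floor = np.floor(x)
    frac = x - floor
    # Round up with probability equal to the fractional part, so E[round(x)] = x.
    return floor + (rng.random(x.shape) < frac)

x = np.full(100_000, 0.1)
print("round-to-nearest sum:", np.round(x).sum())          # 0.0, every 0.1 is lost
print("stochastic-round sum:", stochastic_round(x).sum())  # ~10,000, unbiased on average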

DeepSeek’s mathematical optimizations allowed for cheaper training and lighter models that we can run on servers/PCs with minimal hardware.


Why would someone run an LLM locally?

Running a large language model (LLM) locally offers several advantages, depending on the use case and requirements. Here are some key reasons why someone might choose to run an LLM locally:

1. Data Privacy and Security

  • Sensitive Data: When working with confidential or sensitive information (e.g., medical, legal, or proprietary business data), running the model locally ensures that the data never leaves your environment, reducing the risk of exposure or breaches.
  • Compliance: Local deployment can help meet regulatory requirements (e.g., GDPR, HIPAA) that mandate data to remain on-premises.

2. Control and Customization

  • Full Control: Running an LLM locally gives you complete control over the model, its configuration, and the infrastructure it runs on.
  • Customization: You can fine-tune or modify the model to better suit specific needs, which might not be possible or cost-effective with cloud-based APIs.

3. Cost Efficiency

  • Reduced API Costs: Cloud-based LLM services often charge based on usage (e.g., per token or API call). Running the model locally can be more cost-effective for high-volume or continuous usage.
  • No Subscription Fees: Local deployment avoids recurring subscription costs associated with cloud-based LLM services.

4. Performance and Latency

  • Lower Latency: Local deployment eliminates network latency, which is especially important for real-time applications or when low response times are critical.
  • Predictable Performance: You can optimize the hardware and software stack to ensure consistent performance, without being affected by external factors like cloud service outages or throttling.

5. Offline Accessibility

  • No Internet Dependency: Running the model locally allows you to use it in environments without reliable internet access, such as remote locations or secure facilities.
  • Disaster Recovery: Local deployment ensures that the model remains accessible even during internet outages or cloud service disruptions.

6. Transparency and Debugging

  • Model Transparency: Running the model locally allows you to inspect its behavior, outputs, and intermediate steps, which can be crucial for debugging or understanding its decision-making process.
  • Error Analysis: You can log and analyze errors or unexpected outputs more effectively when the model is under your control.

7. Long-Term Sustainability

  • Avoid Vendor Lock-In: By running the model locally, you are not dependent on a specific cloud provider or service, reducing the risk of vendor lock-in.
  • Future-Proofing: Local deployment ensures that you can continue using the model even if the cloud service changes its pricing, terms, or discontinues the service.

8. Research and Development
Researchers and developers can experiment with the model’s architecture, training data, or fine-tuning processes without restrictions imposed by cloud providers.


Hands-On Stuff:

There are many ways you can run DeepSeek locally.
As a fan of K8s and containers, I chose the containerized way.

Here I’m running the DeepSeek-R1 model locally on a Kubernetes cluster. The cluster runs on a VM hosted on a personal laptop. One might argue that such a setup is not really bare metal, which is correct, but the same K8s configuration can be used on bare metal directly; in my case I ran it within a VM for convenience.

Setup specifications:

  • 1 VM with 32 GB RAM & 16 cores (Intel i9-9880H CPU @ 2.30GHz).
  • No GPUs used.
  • 50 GB storage allocated to the VM.
  • Ubuntu 22.04.5 LTS.
  • Minikube K8s.
  • DeepSeek-R1 with 7 billion parameters (Ollama Docker image).
  • Open WebUI.

Practical Steps:
1- Install any K8s distribution you feel comfortable working with.
Here I’m using Minikube K8s and allocating 14 vCPUs & 28GB Memory for the cluster.
minikube start --cpus=14 --memory=28672

2- Prepare Kubernetes PersistentVolumes (PVs) of any type (static or dynamic provisioning), as they will later be consumed by two PersistentVolumeClaims (PVCs).

Note: If you are using Minikube, you can simply use the storage-provisioner-gluster add-on as explained here

3- Go to https://ollama.com/ and choose the model you want. Here I’m choosing DeepSeek-R1 with 7 billion parameters. Choose a model with fewer parameters depending on your machine's resources.

(Screenshot: the DeepSeek-R1 model page on ollama.com)

4- Once you have a running K8s setup, save the YAML configuration below to a file and apply it with kubectl apply -f.
It will download and run the images:

  • DeepSeek-R1 7b model (pulled into the Ollama volume by a Job).
  • open-webui.

It will also prepare the required volumes and Services: the Ollama API is reachable inside the cluster on port 11434, and the Open WebUI GUI on port 8080 (for example via kubectl port-forward service/open-webui 8080:8080, then open http://localhost:8080) so you can interact with the model.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: open-webui-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:latest
          env:
            # Point Open WebUI at the ollama Service defined below;
            # 127.0.0.1 would only work if Ollama ran in the same pod.
            - name: OLLAMA_BASE_URL
              value: "http://ollama:11434"
          volumeMounts:
            - mountPath: /app/backend/data
              name: open-webui-storage
      volumes:
        - name: open-webui-storage
          persistentVolumeClaim:
            claimName: open-webui-storage
---
# Service exposing the Open WebUI GUI (port 8080) inside the cluster;
# reach it from the host with: kubectl port-forward service/open-webui 8080:8080
apiVersion: v1
kind: Service
metadata:
  name: open-webui
spec:
  selector:
    app: open-webui
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-storage
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-storage
---
# Service so Open WebUI and the pull Job can reach the Ollama API by name.
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-pull-llama
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ollama-pull-llama
          image: ollama/ollama:latest
          # Pull the model through the ollama Service; retry until the Ollama server is up.
          command: ["/bin/sh", "-c", "until OLLAMA_HOST=ollama:11434 ollama pull deepseek-r1:7b; do echo waiting for ollama; sleep 5; done"]
          volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-storage
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-storage


Depending on your internet connection, it will take around 2 minutes to pull everything required (~4.5 GB) and run it on your K8s cluster.

Note: Before you start using the model, the Ollama pull Job must reach Completed status and all other pods must be Running (you can check with kubectl get pods,jobs).

(Screenshot: pods Running and the pull Job Completed)


Moment of Truth: Test, Test

Now that we have a ready local setup, including a running model, its storage, and an exposed service, let's start testing:

1st Question: What is a transistor?
Thinking and writing the answer took around 50 seconds.

(Screenshot: the model's answer in Open WebUI)

2nd Question: How many "A" letters are there in the name "Anas"?
Some models struggle with such questions. In my local DeepSeek-R1 setup, it took around 80 seconds to reason through it and answer correctly.

(Screenshot: the model reasoning through the question and answering correctly)

Below is a snapshot from my Linux VM, with all the allocated CPUs going wild running the model while answering the questions.

(Screenshot: CPU usage on the VM while the model answers)
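
Besides the Open WebUI GUI, you can also query the model programmatically through Ollama's REST API. Below is a minimal Python sketch, assuming the ollama Service has been port-forwarded to your machine (for example: kubectl port-forward service/ollama 11434:11434).

import json
import urllib.request

# Assumes the Ollama API is reachable on localhost:11434 (e.g. via kubectl port-forward).
payload = {
    "model": "deepseek-r1:7b",
    "prompt": "How many A letters are there in the name Anas?",
    "stream": False,          # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=600) as resp:
    answer = json.loads(resp.read())

print(answer["response"])     # includes the model's <think> reasoning followed by the answer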

Happy Learning!
