Imagine running ChatGPT-style AI entirely on your own hardware—no cloud lock-in, no API limits, no privacy risks. Just pure, high-performance AI under your control.
Big Tech doesn’t want you to know this, but you don’t need OpenAI or cloud GPUs to build your own AI chatbot. With Kubernetes (K3s), NVIDIA GPUs, and Ollama, you can deploy a private, lightning-fast ChatGPT alternative in under 30 minutes.
This guide is perfect for:
✅ Developers & enterprises wanting full control over their AI
✅ Security-conscious teams keeping AI inside their private networks
✅ Tinkerers & AI enthusiasts looking to run custom LLMs on bare metal
Best of all? With GPU acceleration, inference is typically many times faster than CPU-bound inferencing, with the exact speedup depending on your model, quantization, and hardware. Let's dive in! 🚀
Prerequisites ✅
You’ll need:
1️⃣ NVIDIA GPU (Required) – RTX 3090+, A100, or similar (Pascal+ for CUDA support)
2️⃣ NVIDIA Drivers & NVIDIA-SMI – Verify installation:
nvidia-smi
3️⃣ Linux Distribution – Ubuntu 20.04+, Debian, Fedora
4️⃣ Kubernetes (K3s) – We install it in Step 1 below (K3s ships its own containerd, so Docker isn't strictly required)
💡 Not sure if your GPU is supported? Run:
nvidia-smi | grep "CUDA Version"
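If you want to see exactly which GPUs the driver detects, and (on recent drivers, roughly 510 and newer) their CUDA compute capability, these commands also help; note the compute_cap query field isn't available on older driver releases:

```bash
# List detected GPUs
nvidia-smi -L
# Report GPU name and CUDA compute capability (newer drivers only)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```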
Step 1: Install Kubernetes (K3s) 🏗️
We’re using K3s, a lightweight Kubernetes distribution that’s perfect for rapid deployments.
Installation Steps:
- Install K3s:
curl -sfL https://get.k3s.io | sh -
- Verify Installation:
sudo k3s kubectl get node
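One thing this guide assumes but doesn't show: for the nvidia.com/gpu limit used in Step 2 to be schedulable, the host needs the NVIDIA Container Toolkit installed and the cluster needs NVIDIA's k8s-device-plugin running (K3s auto-detects the NVIDIA container runtime once the toolkit is present; depending on your setup you may also need runtimeClassName: nvidia in the pod spec). A quick sanity check once the device plugin is in place:

```bash
# If GPU scheduling is working, this shows a non-zero nvidia.com/gpu capacity on the node
kubectl describe nodes | grep -i "nvidia.com/gpu"
```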
Step 2: Deploy Ollama as a StatefulSet 🧠
Why a StatefulSet?
- AI models require persistent storage (so you don’t redownload them on every restart).
- StatefulSets ensure model files stay intact across pod restarts.
Save the following as ollama-statefulset.yaml:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: OLLAMA_DEBUG
              value: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          resources:
            limits:
              nvidia.com/gpu: 1
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: local-path
        resources:
          requests:
            storage: 10Gi
```
Deploy the StatefulSet:
kubectl apply -f ollama-statefulset.yaml
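The StatefulSet's serviceName (and Open WebUI in the next step) both expect a Service named ollama, but its manifest isn't shown above. A minimal ClusterIP Service exposing Ollama's API port would look like this (ollama-service.yaml is just a suggested filename):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - name: api
      port: 11434
      targetPort: 11434
```

Apply it:
kubectl apply -f ollama-service.yaml
A headless Service (clusterIP: None) works just as well and is the conventional companion for a StatefulSet.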
Step 3: Deploy Open WebUI 🌐
Create the Deployment
Save the following as open-webui-deployment.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:latest
          ports:
            - containerPort: 8080   # Open WebUI listens on 8080 inside the container
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama:11434"   # the Ollama Service created in Step 2
```
Deploy Open WebUI:
kubectl apply -f open-webui-deployment.yaml
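Step 4 and the Ingress example further down both target a Service named open-webui-service, which also isn't defined above. A minimal sketch, mapping port 80 to the container's 8080 (again, the filename is just a suggestion):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
spec:
  selector:
    app: open-webui
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

Apply it:
kubectl apply -f open-webui-service.yaml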
Step 4: Access Your Private ChatGPT 🚀
Use port forwarding to access Open WebUI:
kubectl port-forward svc/open-webui-service 8080:80
Now open your browser and visit:
http://localhost:8080
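The UI starts with no models downloaded. You can pull one from Open WebUI's settings, or directly inside the Ollama pod; llama3 below is just an example, so substitute any model from the Ollama library:

```bash
# Pull a model inside the first (and only) Ollama replica
kubectl exec -it ollama-0 -- ollama pull llama3
```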
Next Steps: Real-World Scaling & Optimizations 🏆
1️⃣ Serve via DNS & Load Balancer
- Instead of using kubectl port-forward, expose your chatbot via Ingress + LoadBalancer.
- Add TLS encryption via Cert-Manager + Let's Encrypt.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chatgpt-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  rules:
    - host: chatgpt.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui-service
                port:
                  number: 80
  tls:
    - hosts:
        - chatgpt.yourdomain.com
      secretName: chatgpt-tls
```
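The cert-manager annotation above assumes a ClusterIssuer named letsencrypt-prod already exists. If you haven't created one, a minimal sketch looks like this (cert-manager must already be installed; the email is a placeholder, and the solver uses Traefik, K3s's default ingress controller):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@yourdomain.com           # replace with a real contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: traefik
```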
2️⃣ Optimize GPU Utilization
- Use NVIDIA MPS (Multi-Process Service) to split GPU resources across multiple models dynamically.
- Deploy multiple AI models in parallel (e.g., LLaMA + Mistral) by giving each deployment its own resource requests; see the sketch below for the single-instance alternative.
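If you'd rather keep a single Ollama instance and let it hold several models at once, Ollama's own environment variables control how many models stay resident and how many requests run in parallel. A sketch of the extra entries you'd add to the container's env list in ollama-statefulset.yaml (the values are examples; tune them to your GPU memory):

```yaml
- name: OLLAMA_MAX_LOADED_MODELS   # how many models may stay loaded at once
  value: "2"
- name: OLLAMA_NUM_PARALLEL        # concurrent requests served per loaded model
  value: "2"
```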
3️⃣ CI/CD Automation for Model Updates
- Automate deployment with ArgoCD or GitHub Actions for rolling AI model updates.
- Use image versioning tags (e.g., ollama/ollama:1.2.3) to avoid accidental updates breaking your chatbot.
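For example, a manual pinned-version rollout (the tag is illustrative; use whatever release you've actually tested) is a one-liner:

```bash
# Roll the StatefulSet to a specific, tested image tag instead of :latest
kubectl set image statefulset/ollama ollama=ollama/ollama:1.2.3
# Watch the rollout complete
kubectl rollout status statefulset/ollama
```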
4️⃣ Performance Monitoring (GPU Metrics & AI Response Times)
- Integrate Prometheus + Grafana to track:
  ✅ GPU memory usage
  ✅ Inference time per request
  ✅ Active model sessions
- Note: Ollama doesn't expose a Prometheus /metrics endpoint out of the box, so the ServiceMonitor below assumes you've added an exporter for the metrics you care about (NVIDIA's DCGM exporter is a common choice for GPU memory and utilization).
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
    - port: metrics
      interval: 15s
```
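If you go the DCGM exporter route for GPU metrics, a Grafana panel for overall GPU utilization can be a one-line PromQL query (the metric name is the DCGM exporter's default):

```
# Average GPU utilization across all GPUs scraped by the DCGM exporter
avg(DCGM_FI_DEV_GPU_UTIL)
```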
🎯 Final Thoughts
You’ve just built your own private, GPU-powered ChatGPT using Kubernetes, Ollama, and Open WebUI. But this is just the beginning!
🔹 Next Challenges:
1️⃣ Deploy multiple AI models (LLaMA + Mistral) with GPU partitioning.
2️⃣ Add user authentication (OAuth2 or Keycloak).
3️⃣ Fine-tune AI models on your own domain-specific dataset.
💬 What’s your next AI project? Drop a comment below and let’s build something epic together! 🚀