Ajeet Singh Raina

My Journey with Deepseek R1 on NVIDIA Jetson Orin Nano Super using Docker and Ollama

DeepSeek R1 is an innovative large language model developed by DeepSeek AI, designed specifically for efficient deployment on resource-constrained edge devices. Unlike many larger models that require substantial computational resources, DeepSeek R1 delivers impressive language capabilities in a compact form factor, making it ideal for hardware like the Jetson Orin series.

The 1.5B parameter version represents an excellent balance between performance and efficiency. It features an optimized architecture that maintains strong reasoning capabilities, code generation, and contextual understanding while significantly reducing memory and computational requirements. This smaller footprint allows it to run directly on edge devices without compromising too much on quality.

DeepSeek R1 has been quantized using various techniques (like the q4f16_ft version we'll be using) to further enhance its efficiency on GPU-accelerated platforms. These optimizations make it possible to deploy advanced AI capabilities in scenarios where cloud connectivity might be limited or where privacy and latency concerns necessitate local processing.

In this post, I'll share my journey setting up various AI tools on the Jetson Orin Nano Super, from basic system verification to running sophisticated language models.

Hardware Setup

  • NVIDIA Jetson Orin Nano
  • Power Cable
  • NVMe SSD card
  • Wireless network USB adapter
  • Jumper wire (for initial setup)
  • A Linux Laptop with SDK Manager

Flashing OS

Image1

You will need to put the board into recovery mode using a jumper. On the Jetson Orin Nano, look for the FC REC and GND pins on the GPIO header. The FC REC pin is typically marked on the board (check the Jetson Orin Nano Developer Kit pinout diagram). Place the jumper across the FC REC and GND pins; this forces the board into Force Recovery Mode (RCM mode) when powered on. Connect the power supply while keeping the jumper in place, then connect the Jetson Orin Nano to the Linux PC via a USB Type-C cable.

Image2

First, verify on the host PC that the board is detected in recovery mode. Then start the SDK Manager software on the Linux system and flash the board with JetPack 6.2, choosing the NVMe SSD as the target instead of an SD card. Once the board has powered on, the jumper can be removed if it is still in place.
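
A quick way to perform that recovery-mode check from the host (assuming lsusb is available on the Linux PC):

# An NVIDIA recovery-mode device should be listed while the board is in RCM mode
lsusb | grep -i nvidia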

Verifying NVMe Storage

First, I checked if the OS was properly installed on the NVMe storage:

df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  456G   11G  422G   3% /

This confirmed that the system was running from the fast NVMe storage with plenty of space available (456GB total with 422GB free).
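
For an extra confirmation that the root filesystem really sits on the NVMe drive rather than an SD card, a couple of standard commands will show the backing device (a minimal sketch; the device name comes from the df output above):

# Show the block device backing the root filesystem
findmnt -n -o SOURCE /
# List the NVMe partitions and their mount points
lsblk -o NAME,SIZE,MOUNTPOINT /dev/nvme0n1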

Installing Ollama

The first AI tool I installed was Ollama, which provides a convenient way to run large language models locally:

curl -fsSL https://ollama.com/install.sh | sh

The installation went smoothly, with the script automatically configuring Ollama for the NVIDIA JetPack environment:

Installing ollama to /usr/local
Downloading Linux arm64 bundle
######################################################################## 100.0%
>>> Downloading JetPack 6 components
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA JetPack ready.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
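
Before pulling any models, it is worth confirming that the service is running and the API is reachable; the address below comes straight from the install output:

# Confirm the systemd service is active
systemctl status ollama --no-pager
# The API answers at the address reported by the installer
curl http://127.0.0.1:11434/api/version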

Setting Up OpenWebUI Using Docker

To provide a user-friendly interface for interacting with Ollama, I deployed OpenWebUI using Docker:

sudo docker run -d --network=host \
  -v ${HOME}/open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

I confirmed the container was running successfully:

sudo docker ps
CONTAINER ID   IMAGE                                COMMAND           CREATED          STATUS                             PORTS     NAMES
c4de9820647e   ghcr.io/open-webui/open-webui:main   "bash start.sh"   11 seconds ago   Up 10 seconds (health: starting)             open-webui
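
Since the container uses host networking with no port override, OpenWebUI should come up on its default port (assumed here to be 8080, the image default, as the original output doesn't show it):

# With host networking, the UI should answer on port 8080 of the Jetson
curl -I http://localhost:8080

From a browser on the same network, the UI is then available at http://<jetson-ip>:8080.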

Running DeepSeek R1 with Ollama

Next, I tested Ollama with a compact but capable language model, DeepSeek R1:

ollama run deepseek-r1:1.5b --verbose

The model download proceeded smoothly:

pulling manifest
pulling aabd4debf0c8... 100% ▕████████████████▏ 1.1 GB
pulling 369ca498f347... 100% ▕████████████████▏  387 B
pulling 6e4c38e1172f...   0% ▕                ▏    0 B
pulling f4d24e9138dd...   0% ▕                ▏    0 B
pulling a85fe2a2e58e... 100% ▕████████████████▏  487 B
verifying sha256 digest
writing manifest
success
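
Once the pull finishes, the model shows up in Ollama's local model list:

# Verify the downloaded model is available locally
ollama list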

Testing the Model with a Prompt

I tested the model with a practical prompt relevant to the Jetson platform:

Write a short Python script to capture and process images using OpenCV on Jetson

The model generated a detailed Python script with explanations of key components for image capture and processing on Jetson hardware. The response included proper imports, device number retrieval, camera configuration, image capture, and processing steps.
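
The same test can also be driven through Ollama's REST API, which is handy for scripting. A minimal sketch using the /api/generate endpoint with the prompt above:

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Write a short Python script to capture and process images using OpenCV on Jetson",
  "stream": false
}'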

Analyzing Model Performance

After the completion of the response, Ollama provided performance metrics:

total duration:       30.039637931s
load duration:        60.895507ms
prompt eval count:    19 token(s)
prompt eval duration: 282ms
prompt eval rate:     67.38 tokens/s
eval count:           920 token(s)
eval duration:        29.693s
eval rate:            30.98 tokens/s

These metrics revealed:

  • Total processing time: 30.04 seconds
  • Model loading time: 60.9 milliseconds (very fast)
  • Prompt evaluation: 19 tokens processed at 67.38 tokens/second
  • Response generation: 920 tokens generated at 30.98 tokens/second

The performance is quite impressive for an edge device: the model loads in well under a second and sustains roughly 31 tokens per second during generation (920 tokens in 29.693 s ≈ 31 tokens/s).

Using Docker Compose for Advanced Deployment

For a more sophisticated setup, I used a Docker Compose configuration to manage multiple services, which I downloaded from this link.

Image5

I faced the following error message:

sudo docker compose up -d
[sudo] password for ajeetraina:
[+] Running 1/1
 ✔ llm-server Pulled                                                                                    2.8s
[+] Running 0/1
 ⠸ Container llm_server  Starting                                                                       0.3s
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.: unknown

This error indicated that the Jetson's Docker configuration was having trouble with the newer deploy resources syntax for GPU access. The message explicitly suggests using the older runtime: nvidia approach instead of the newer GPU device specification.
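
JetPack images normally ship with the NVIDIA runtime already registered with Docker, but if it isn't, the NVIDIA Container Toolkit can set it up (a hedged sketch, assuming nvidia-ctk is installed):

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Confirm that "nvidia" appears among the available runtimes
docker info | grep -i runtime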

Modified Docker Compose Configuration

I adjusted my approach to use the older runtime: nvidia syntax instead:

services:
  llm-server:
    stdin_open: true
    tty: true
    container_name: llm_server
    network_mode: host
    runtime: nvidia
    ports:
      - 9000:9000
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface

Important changes and optimizations include:

  • Using runtime: nvidia instead of the newer deploy resources syntax
  • Using host networking mode for optimal performance
  • Volume mounts for caching model weights
  • Health checks to ensure service availability
  • A reduced --prefill-chunk value to lower memory usage (covered in the memory section below)

Deploying with Docker Compose

I launched the compose setup with:

sudo docker compose up -d

The output confirmed successful deployment:

[+] Running 1/1
 ✔ llm-server Pulled                                                                                                                                         2.8s
[+] Running 1/1
 ✔ Container llm_server  Started                                                                                                                             0.6s

Checking the container status:

sudo docker compose ps
NAME         IMAGE                 COMMAND                  SERVICE      CREATED         STATUS                            PORTS
llm_server   dustynv/mlc:r36.4.0   "sudonim serve --mod…"   llm-server   6 seconds ago   Up 5 seconds (health: starting)
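
To watch the model server come up (and catch any startup errors), it helps to follow the logs for the llm-server service:

# Stream logs from the llm-server service
sudo docker compose logs -f llm-server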

Memory Challenges on the Jetson Orin

When running the LLM server, I initially encountered memory limitations with the Jetson Orin Nano. The following error appeared in the logs:

[bt] (2) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(+0x303cfc) [0xffff64853cfc]
[bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff898dc7f8]
[bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff8b7ebfc0]
File "/opt/mlc-llm/cpp/serve/threaded_engine.cc", line 287
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6476.857 MB, which is less than the sum of model weight size (876.640 MB) and temporary buffer size (10771.183 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.

This error message revealed the core challenge:

  • Available GPU memory: ~6.5GB
  • Model weight size: ~0.9GB
  • Temporary buffer needed: ~10.8GB
  • Total required: ~11.7GB (exceeding available memory)

The error suggested three potential solutions:

  • Increase GPU memory utilization (not applicable as we were already pushing the limits)
  • Enable tensor parallelism (not applicable for single-GPU setups like the Jetson)
  • Reduce the prefill chunk size

I opted for the third solution: reducing the prefill chunk size. This parameter limits how much context the model processes at once, trading off some efficiency for reduced memory usage. Here's how I modified the command in the Docker Compose file:

command: sudonim serve --model
  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
  q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
  0.0.0.0 --port 9000 --prefill-chunk 1024

I then re-ran the Compose services and followed the logs:

sudo docker compose logs -f
llm_server  |
llm_server  | Running auto update command in /opt/sudonim
llm_server  |   git pull && pip3 install --upgrade-strategy only-if-needed -e .
llm_server  | remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (7/7), done.
llm_server  | remote: Total 24 (delta 19), reused 22 (delta 17), pack-reused 0 (from 0)
Unpacking objects: 100% (24/24), 2.46 KiB | 209.00 KiB/s, done.
llm_server  | From https://github.com/dusty-nv/sudonim
llm_server  |    7f568e6..c8d2f98  main       -> origin/main
llm_server  | Updating 7f568e6..c8d2f98
llm_server  | Fast-forward
llm_server  |  pyproject.toml            |  2 +-
llm_server  |  sudonim/runners/export.py | 28 +++++++++++++++++++++++++---
llm_server  |  sudonim/runtimes/mlc.py   |  2 +-
llm_server  |  sudonim/utils/docker.py   | 20 +++++++++++++++++---
llm_server  |  4 files changed, 44 insertions(+), 8 deletions(-)
llm_server  | Looking in indexes: https://pypi.jetson-ai-lab.dev/jp6/cu126
llm_server  | Obtaining file:///opt/sudonim

So far, everything looks normal - it's pulling the latest code from the repository and updating the package.

face_hub->sudonim==0.1.7) (6.0.2)
llm_server  | Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub->sudonim==0.1.7) (4.67.1)
llm_server  | Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub->sudonim==0.1.7) (4.12.2)
llm_server  | Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (3.4.1)
llm_server  | Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (3.10)
llm_server  | Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (2024.12.14)
llm_server  | Building wheels for collected packages: sudonim
llm_server  |   Building editable for sudonim (pyproject.toml) ... done
llm_server  |   Created wheel for sudonim: filename=sudonim-0.1.7-0.editable-py3-none-any.whl size=4464 sha256=dddf389242d52ea17d83e1fb4fa2a438b10fb22a7c913c4df4695b069c8764cb
llm_server  |   Stored in directory: /tmp/pip-ephem-wheel-cache-_jbwjsiw/wheels/4a/b4/2c/516abc57fcca2f71adffc99520e49ddd6c2ee2be8fd0dd86b6
llm_server  | Successfully built sudonim
llm_server  | Installing collected packages: sudonim
llm_server  |   Attempting uninstall: sudonim
llm_server  |     Found existing installation: sudonim 0.1.6
llm_server  |     Uninstalling sudonim-0.1.6:
llm_server  |       Successfully uninstalled sudonim-0.1.6
llm_server  | Successfully installed sudonim-0.1.7
llm_server  |
llm_server  | [12:16:50] sudonim | sudonim version 0.1.7
llm_server  |
llm_server  | ┌──────────────────────────┬─────────────────────────────┬──────────────────────────────┐
llm_server  | │ CUDA_VERSION   12.6      │ GPU 0                       │ CACHE_ROOT      /root/.cache │
llm_server  | │ NVIDIA_DRIVER  540.4.0   │  ├ name      Orin Nano 8GB  │ HAS_MLC         True         │
llm_server  | │ SYSTEM_ID      orin-nano │  ├ family    Ampere         │ HAS_HF_HUB      True         │
llm_server  | │ CPU_ARCH       aarch64   │  ├ cores     1024           │ HAS_NVIDIA_SMI  True         │
llm_server  | │ GPU_ARCH       sm87      │  ├ mem_free  [5.0 / 7.6 GB] │                              │
llm_server  | └──────────────────────────┴─────────────────────────────┴──────────────────────────────┘
llm_server  |
llm_server  | [12:16:50] sudonim | Downloading model from HF Hub:  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC -> /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server  | /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
llm_server  |   warnings.warn(
Fetching 38 files: 100% 38/38 [00:00<00:00, 828.06it/s]
llm_server  | [12:16:50] sudonim | Downloaded model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to:  /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server  | [12:16:50] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/dusty-nv
llm_server  |
llm_server  |   mlc_llm serve --mode interactive --device cuda \
llm_server  |     --host 0.0.0.0 --port 9000 \
llm_server  |     --overrides='tensor_parallel_shards=1;prefill_chunk_size=1024' \
llm_server  |     --model-lib /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
llm_server  |     DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server  |

Here's the full Compose file at this stage:

services:
  llm-server:
    stdin_open: true
    tty: true
    container_name: llm_server
    network_mode: host
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    pull_policy: always
    volumes:
      - /mnt/nvme/cache:/root/.cache
    image: dustynv/mlc:r36.4.0
    command: sudonim serve --model
      dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
      q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
      0.0.0.0 --port 9000 --prefill-chunk 1024
    healthcheck:
      test: ["CMD", "curl", "-f", "http://0.0.0.0:9000/v1/models"]
      interval: 20s
      timeout: 60s
      retries: 45
      start_period: 15s

  perf-bench:
    profiles:
      - perf-bench
    depends_on:
      llm-server:
        condition: service_healthy
    stdin_open: true
    tty: true
    network_mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /mnt/nvme/cache:/root/.cache
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    pull_policy: always
    image: dustynv/mlc:r36.4.0
    command: sudonim bench stop --model
      dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
      q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
      0.0.0.0 --port 9000 --prefill-chunk 1024

  open-webui:
    profiles:
      - open-webui
    depends_on:
      llm-server:
        condition: service_healthy
    stdin_open: true
    tty: true
    container_name: open-webui
    network_mode: host
    environment:
      - ENABLE_OPENAI_API=True
      - ENABLE_OLLAMA_API=False
      - OPENAI_API_BASE_URL=http://0.0.0.0:9000/v1
      - OPENAI_API_KEY=foo
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    volumes:
      - /mnt/nvme/cache/open-webui:/app/backend/data
      - /mnt/nvme/cache:/root/.cache
    pull_policy: always
    image: ghcr.io/open-webui/open-webui:main
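
Note that the perf-bench and open-webui services sit behind Compose profiles, so they only start when a profile is explicitly requested:

# Start the LLM server together with the optional web UI
sudo docker compose --profile open-webui up -d
# Or run the benchmark service once the server is healthy
sudo docker compose --profile perf-bench up -d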

Even with this configuration, I was still hitting the same memory error: the --prefill-chunk 1024 parameter wasn't reducing the memory requirements enough, so I tried an even smaller chunk size.

I modified the compose file command once more:

command: sudonim serve --model
  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
  q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
  0.0.0.0 --port 9000 --prefill-chunk 512

After making these changes, I restarted the stack:

sudo docker compose down
sudo docker compose up -d
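
Once the health check passes, the server exposes an OpenAI-compatible API on port 9000 (the same endpoint used by the health check and by OpenWebUI above). A quick smoke test looks roughly like this; the model name is assumed to match the one from the compose command:

# List the models served on the MLC endpoint
curl http://localhost:9000/v1/models

# Send a minimal chat completion request
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC",
        "messages": [{"role": "user", "content": "Hello from the Jetson!"}],
        "max_tokens": 64
      }'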

Conclusion

The NVIDIA Jetson Orin Nano Super proves to be an impressive platform for edge AI deployment. Despite some memory constraints that required careful optimization, I was able to successfully run sophisticated language models directly on the device.

The combination of Docker for containerization, Ollama for model management, and OpenWebUI for user interaction creates a powerful, self-contained AI development environment. This setup demonstrates the potential for deploying AI capabilities in scenarios where cloud connectivity might be limited or privacy concerns necessitate local processing.

For developers interested in edge AI, the Jetson Orin Nano Super represents an excellent balance of performance, power efficiency, and affordability. Its ability to run models like DeepSeek R1 with reasonable performance opens up exciting possibilities for creative applications in robotics, computer vision, natural language processing, and more.
