Ajeet Singh Raina

My Journey with Deepseek R1 on NVIDIA Jetson Orin Nano Super using Docker and Ollama

DeepSeek R1 is an innovative large language model developed by DeepSeek AI, designed specifically for efficient deployment on resource-constrained edge devices. Unlike many larger models that require substantial computational resources, DeepSeek R1 delivers impressive language capabilities in a compact form factor, making it ideal for hardware like the Jetson Orin series.

The 1.5B parameter version represents an excellent balance between performance and efficiency. It features an optimized architecture that maintains strong reasoning capabilities, code generation, and contextual understanding while significantly reducing memory and computational requirements. This smaller footprint allows it to run directly on edge devices without compromising too much on quality.

DeepSeek R1 has been quantized using various techniques (like the q4f16_ft version we'll be using) to further enhance its efficiency on GPU-accelerated platforms. These optimizations make it possible to deploy advanced AI capabilities in scenarios where cloud connectivity might be limited or where privacy and latency concerns necessitate local processing.

In this post, I'll share my journey setting up various AI tools on the Jetson Orin Nano Super, from basic system verification to running sophisticated language models.

Hardware Setup

  • NVIDIA Jetson Orin Nano
  • Power Cable
  • NVMe SSD card
  • Wireless network USB adapter
  • Jumper wire (for initial setup)
  • A Linux Laptop with SDK Manager

Flashing OS

Image1

You will need to put the board into recovery mode using a jumper. On the Jetson Orin Nano, look for the FC REC and GND pins on the GPIO header. The FC REC pin is typically marked on the board (check the Jetson Orin Nano Developer Kit pinout diagram). Place the jumper across the FC REC and GND pins; this forces the board into Force Recovery Mode (RCM mode) when powered on. Connect the power supply while keeping the jumper in place, then connect the Jetson Orin Nano to the Linux PC via a USB Type-C cable.

Image2

First, verify on the host PC that the board is detected in recovery mode. Then start the SDK Manager software on the Linux system and flash the board with JetPack 6.2, choosing the NVMe SSD as the target instead of an SD card. Once the board has powered on, the jumper can be removed if it is still in place.
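
A quick way to perform that recovery-mode check from the host (assuming lsusb is available on the Linux PC):

# An NVIDIA recovery-mode device should be listed while the board is in RCM mode
lsusb | grep -i nvidia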

Verifying NVMe Storage

First, I checked if the OS was properly installed on the NVMe storage:

df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  456G   11G  422G   3% /

This confirmed that the system was running from the fast NVMe storage with plenty of space available (456GB total with 422GB free).
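
For an extra confirmation that the root filesystem really sits on the NVMe drive rather than an SD card, a couple of standard commands will show the backing device (a minimal sketch; the device name comes from the df output above):

# Show the block device backing the root filesystem
findmnt -n -o SOURCE /
# List the NVMe partitions and their mount points
lsblk -o NAME,SIZE,MOUNTPOINT /dev/nvme0n1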

Installing Ollama

The first AI tool I installed was Ollama, which provides a convenient way to run large language models locally:

curl -fsSL https://ollama.com/install.sh | sh

The installation went smoothly, with the script automatically configuring Ollama for the NVIDIA JetPack environment:

Installing ollama to /usr/local
Downloading Linux arm64 bundle
######################################################################## 100.0%
>>> Downloading JetPack 6 components
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA JetPack ready.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
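
Before pulling any models, it is worth confirming that the service is running and the API is reachable; the address below comes straight from the install output:

# Confirm the systemd service is active
systemctl status ollama --no-pager
# The API answers at the address reported by the installer
curl http://127.0.0.1:11434/api/version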

Setting Up OpenWebUI Using Docker

To provide a user-friendly interface for interacting with Ollama, I deployed OpenWebUI using Docker:

sudo docker run -d --network=host \
  -v ${HOME}/open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

I confirmed the container was running successfully:

sudo docker ps
CONTAINER ID   IMAGE                                COMMAND           CREATED          STATUS                             PORTS     NAMES
c4de9820647e   ghcr.io/open-webui/open-webui:main   "bash start.sh"   11 seconds ago   Up 10 seconds (health: starting)             open-webui
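
Since the container uses host networking with no port override, OpenWebUI should come up on its default port (assumed here to be 8080, the image default, as the original output doesn't show it):

# With host networking, the UI should answer on port 8080 of the Jetson
curl -I http://localhost:8080

From a browser on the same network, the UI is then available at http://<jetson-ip>:8080.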

Running DeepSeek R1 with Ollama

Next, I tested Ollama with a compact but capable language model, DeepSeek R1:

ollama run deepseek-r1:1.5b --verbose

The model download proceeded smoothly:

pulling manifest
pulling aabd4debf0c8... 100% ▕████████████████▏ 1.1 GB
pulling 369ca498f347... 100% ▕████████████████▏  387 B
pulling 6e4c38e1172f...   0% ▕                ▏    0 B
pulling f4d24e9138dd...   0% ▕                ▏    0 B
pulling a85fe2a2e58e... 100% ▕████████████████▏  487 B
verifying sha256 digest
writing manifest
success
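
Once the pull finishes, the model shows up in Ollama's local model list:

# Verify the downloaded model is available locally
ollama list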

Testing the Model with a Prompt

I tested the model with a practical prompt relevant to the Jetson platform:

Write a short Python script to capture and process images using OpenCV on Jetson

The model generated a detailed Python script with explanations of key components for image capture and processing on Jetson hardware. The response included proper imports, device number retrieval, camera configuration, image capture, and processing steps.
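
The same test can also be driven through Ollama's REST API, which is handy for scripting. A minimal sketch using the /api/generate endpoint with the prompt above:

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Write a short Python script to capture and process images using OpenCV on Jetson",
  "stream": false
}'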

Analyzing Model Performance

After the completion of the response, Ollama provided performance metrics:

total duration:       30.039637931s
load duration:        60.895507ms
prompt eval count:    19 token(s)
prompt eval duration: 282ms
prompt eval rate:     67.38 tokens/s
eval count:           920 token(s)
eval duration:        29.693s
eval rate:            30.98 tokens/s

These metrics revealed:

  • Total processing time: 30.04 seconds
  • Model loading time: 60.9 milliseconds (very fast)
  • Prompt evaluation: 19 tokens processed at 67.38 tokens/second
  • Response generation: 920 tokens generated at 30.98 tokens/second

The performance is quite impressive for an edge device: the model loads in well under a second and sustains roughly 31 tokens per second during generation (920 tokens in 29.693 s ≈ 31 tokens/s).

Using Docker Compose for Advanced Deployment

For a more sophisticated setup, I used a Docker Compose configuration to manage multiple services, which I downloaded from this link.

Image5

I faced the following error message:

sudo docker compose up -d
[sudo] password for ajeetraina:
[+] Running 1/1
 ✔ llm-server Pulled                                                                                    2.8s
[+] Running 0/1
 ⠸ Container llm_server  Starting                                                                       0.3s
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.: unknown

This error indicated that the Jetson's Docker configuration was having trouble with the newer deploy resources syntax for GPU access. The message explicitly suggests using the older runtime: nvidia approach instead of the newer GPU device specification.
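
JetPack images normally ship with the NVIDIA runtime already registered with Docker, but if it isn't, the NVIDIA Container Toolkit can set it up (a hedged sketch, assuming nvidia-ctk is installed):

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Confirm that "nvidia" appears among the available runtimes
docker info | grep -i runtime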

Modified Docker Compose Configuration

I adjusted my approach to use the older runtime: nvidia syntax instead:

services:
  llm-server:
    stdin_open: true
    tty: true
    container_name: llm_server
    network_mode: host
    runtime: nvidia
    ports:
      - 9000:9000
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface

Important changes and optimizations include:

  • Using runtime: nvidia instead of the newer deploy resources syntax
  • Using host networking mode for optimal performance
  • Volume mounts for caching model weights
  • Health checks to ensure service availability
  • A reduced --prefill-chunk value to lower memory usage (covered in the memory section below)

Deploying with Docker Compose

I launched the compose setup with:

sudo docker compose up -d

The output confirmed successful deployment:

[+] Running 1/1
 ✔ llm-server Pulled                                                                                                                                         2.8s
[+] Running 1/1
 ✔ Container llm_server  Started                                                                                                                             0.6s

Checking the container status:

sudo docker compose ps
NAME         IMAGE                 COMMAND                  SERVICE      CREATED         STATUS                            PORTS
llm_server   dustynv/mlc:r36.4.0   "sudonim serve --mod…"   llm-server   6 seconds ago   Up 5 seconds (health: starting)
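
To watch the model server come up (and catch any startup errors), it helps to follow the logs for the llm-server service:

# Stream logs from the llm-server service
sudo docker compose logs -f llm-server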

Memory Challenges on the Jetson Orin

When running the LLM server, I initially encountered memory limitations with the Jetson Orin Nano. The following error appeared in the logs:

[bt] (2) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(+0x303cfc) [0xffff64853cfc]
[bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff898dc7f8]
[bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff8b7ebfc0]
File "/opt/mlc-llm/cpp/serve/threaded_engine.cc", line 287
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6476.857 MB, which is less than the sum of model weight size (876.640 MB) and temporary buffer size (10771.183 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.

This error message revealed the core challenge:

  • Available GPU memory: ~6.5GB
  • Model weight size: ~0.9GB
  • Temporary buffer needed: ~10.8GB
  • Total required: ~11.7GB (exceeding available memory)

The error suggested three potential solutions:

  • Increase GPU memory utilization (not applicable as we were already pushing the limits)
  • Enable tensor parallelism (not applicable for single-GPU setups like the Jetson)
  • Reduce the prefill chunk size

I opted for the third solution: reducing the prefill chunk size. This parameter limits how much context the model processes at once, trading off some efficiency for reduced memory usage. Here's how I modified the command in the Docker Compose file:

command: sudonim serve --model
  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
  q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
  0.0.0.0 --port 9000 --prefill-chunk 1024

I then re-ran the Compose services and followed the logs:

sudo docker compose logs -f
llm_server  |
llm_server  | Running auto update command in /opt/sudonim
llm_server  |   git pull && pip3 install --upgrade-strategy only-if-needed -e .
llm_server  | remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (7/7), done.
llm_server  | remote: Total 24 (delta 19), reused 22 (delta 17), pack-reused 0 (from 0)
Unpacking objects: 100% (24/24), 2.46 KiB | 209.00 KiB/s, done.
llm_server  | From https://github.com/dusty-nv/sudonim
llm_server  |    7f568e6..c8d2f98  main       -> origin/main
llm_server  | Updating 7f568e6..c8d2f98
llm_server  | Fast-forward
llm_server  |  pyproject.toml            |  2 +-
llm_server  |  sudonim/runners/export.py | 28 +++++++++++++++++++++++++---
llm_server  |  sudonim/runtimes/mlc.py   |  2 +-
llm_server  |  sudonim/utils/docker.py   | 20 +++++++++++++++++---
llm_server  |  4 files changed, 44 insertions(+), 8 deletions(-)
llm_server  | Looking in indexes: https://pypi.jetson-ai-lab.dev/jp6/cu126
llm_server  | Obtaining file:///opt/sudonim

So far, everything looks normal - it's pulling the latest code from the repository and updating the package.

face_hub->sudonim==0.1.7) (6.0.2)
llm_server  | Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub->sudonim==0.1.7) (4.67.1)
llm_server  | Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub->sudonim==0.1.7) (4.12.2)
llm_server  | Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (3.4.1)
llm_server  | Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (3.10)
llm_server  | Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (2024.12.14)
llm_server  | Building wheels for collected packages: sudonim
llm_server  |   Building editable for sudonim (pyproject.toml) ... done
llm_server  |   Created wheel for sudonim: filename=sudonim-0.1.7-0.editable-py3-none-any.whl size=4464 sha256=dddf389242d52ea17d83e1fb4fa2a438b10fb22a7c913c4df4695b069c8764cb
llm_server  |   Stored in directory: /tmp/pip-ephem-wheel-cache-_jbwjsiw/wheels/4a/b4/2c/516abc57fcca2f71adffc99520e49ddd6c2ee2be8fd0dd86b6
llm_server  | Successfully built sudonim
llm_server  | Installing collected packages: sudonim
llm_server  |   Attempting uninstall: sudonim
llm_server  |     Found existing installation: sudonim 0.1.6
llm_server  |     Uninstalling sudonim-0.1.6:
llm_server  |       Successfully uninstalled sudonim-0.1.6
llm_server  | Successfully installed sudonim-0.1.7
llm_server  |
llm_server  | [12:16:50] sudonim | sudonim version 0.1.7
llm_server  |
llm_server  | ┌──────────────────────────┬─────────────────────────────┬──────────────────────────────┐
llm_server  | │ CUDA_VERSION   12.6      │ GPU 0                       │ CACHE_ROOT      /root/.cache │
llm_server  | │ NVIDIA_DRIVER  540.4.0   │  ├ name      Orin Nano 8GB  │ HAS_MLC         True         │
llm_server  | │ SYSTEM_ID      orin-nano │  ├ family    Ampere         │ HAS_HF_HUB      True         │
llm_server  | │ CPU_ARCH       aarch64   │  ├ cores     1024           │ HAS_NVIDIA_SMI  True         │
llm_server  | │ GPU_ARCH       sm87      │  ├ mem_free  [5.0 / 7.6 GB] │                              │
llm_server  | └──────────────────────────┴─────────────────────────────┴──────────────────────────────┘
llm_server  |
llm_server  | [12:16:50] sudonim | Downloading model from HF Hub:  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC -> /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server  | /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
llm_server  |   warnings.warn(
Fetching 38 files: 100% 38/38 [00:00<00:00, 828.06it/s]
llm_server  | [12:16:50] sudonim | Downloaded model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to:  /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server  | [12:16:50] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/dusty-nv
llm_server  |
llm_server  |   mlc_llm serve --mode interactive --device cuda \
llm_server  |     --host 0.0.0.0 --port 9000 \
llm_server  |     --overrides='tensor_parallel_shards=1;prefill_chunk_size=1024' \
llm_server  |     --model-lib /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
llm_server  |     DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server  |

Here's the full Compose file at this stage:

services:
  llm-server:
    stdin_open: true
    tty: true
    container_name: llm_server
    network_mode: host
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    pull_policy: always
    volumes:
      - /mnt/nvme/cache:/root/.cache
    image: dustynv/mlc:r36.4.0
    command: sudonim serve --model
      dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
      q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
      0.0.0.0 --port 9000 --prefill-chunk 1024
    healthcheck:
      test: ["CMD", "curl", "-f", "http://0.0.0.0:9000/v1/models"]
      interval: 20s
      timeout: 60s
      retries: 45
      start_period: 15s

  perf-bench:
    profiles:
      - perf-bench
    depends_on:
      llm-server:
        condition: service_healthy
    stdin_open: true
    tty: true
    network_mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /mnt/nvme/cache:/root/.cache
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    pull_policy: always
    image: dustynv/mlc:r36.4.0
    command: sudonim bench stop --model
      dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
      q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
      0.0.0.0 --port 9000 --prefill-chunk 1024

  open-webui:
    profiles:
      - open-webui
    depends_on:
      llm-server:
        condition: service_healthy
    stdin_open: true
    tty: true
    container_name: open-webui
    network_mode: host
    environment:
      - ENABLE_OPENAI_API=True
      - ENABLE_OLLAMA_API=False
      - OPENAI_API_BASE_URL=http://0.0.0.0:9000/v1
      - OPENAI_API_KEY=foo
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    volumes:
      - /mnt/nvme/cache/open-webui:/app/backend/data
      - /mnt/nvme/cache:/root/.cache
    pull_policy: always
    image: ghcr.io/open-webui/open-webui:main
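
Note that the perf-bench and open-webui services sit behind Compose profiles, so they only start when a profile is explicitly requested:

# Start the LLM server together with the optional web UI
sudo docker compose --profile open-webui up -d
# Or run the benchmark service once the server is healthy
sudo docker compose --profile perf-bench up -d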

Even with this configuration, I was still hitting the same memory error: the --prefill-chunk 1024 parameter wasn't reducing the memory requirements enough, so I tried an even smaller chunk size.

I modified the compose file command once more:

command: sudonim serve --model
  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization
  q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host
  0.0.0.0 --port 9000 --prefill-chunk 512

After making these changes, I restarted the stack:

sudo docker compose down
sudo docker compose up -d
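
Once the health check passes, the server exposes an OpenAI-compatible API on port 9000 (the same endpoint used by the health check and by OpenWebUI above). A quick smoke test looks roughly like this; the model name is assumed to match the one from the compose command:

# List the models served on the MLC endpoint
curl http://localhost:9000/v1/models

# Send a minimal chat completion request
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC",
        "messages": [{"role": "user", "content": "Hello from the Jetson!"}],
        "max_tokens": 64
      }'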

Conclusion

The NVIDIA Jetson Orin Nano Super proves to be an impressive platform for edge AI deployment. Despite some memory constraints that required careful optimization, I was able to successfully run sophisticated language models directly on the device.

The combination of Docker for containerization, Ollama for model management, and OpenWebUI for user interaction creates a powerful, self-contained AI development environment. This setup demonstrates the potential for deploying AI capabilities in scenarios where cloud connectivity might be limited or privacy concerns necessitate local processing.

For developers interested in edge AI, the Jetson Orin Nano Super represents an excellent balance of performance, power efficiency, and affordability. Its ability to run models like DeepSeek R1 with reasonable performance opens up exciting possibilities for creative applications in robotics, computer vision, natural language processing, and more.
