DeepSeek R1 is a reasoning-focused large language model from DeepSeek. While the full model is far too large for edge hardware, its distilled variants pack much of that reasoning ability into compact models that run comfortably on devices like the Jetson Orin series.
The 1.5B parameter version used here (DeepSeek-R1-Distill-Qwen-1.5B, a distillation of R1 onto the Qwen 1.5B base) strikes an excellent balance between performance and efficiency: it retains strong reasoning, code generation, and contextual understanding while dramatically reducing memory and compute requirements. That smaller footprint lets it run directly on edge devices without giving up too much quality.
DeepSeek R1 has been quantized using various techniques (like the q4f16_ft version we'll be using) to further enhance its efficiency on GPU-accelerated platforms. These optimizations make it possible to deploy advanced AI capabilities in scenarios where cloud connectivity might be limited or where privacy and latency concerns necessitate local processing.
In this post, I'll share my journey setting up various AI tools on the Jetson Orin Nano Super, from basic system verification to running sophisticated language models.
Hardware Setup
- NVIDIA Jetson Orin Nano
- Power Cable
- NVMe SSD
- USB wireless network adapter
- Jumper wire (to force recovery mode during initial setup)
- USB Type-C cable (to connect the Jetson to the host for flashing)
- A Linux laptop or PC with NVIDIA SDK Manager installed
Flashing the OS
You will need to put the board into Force Recovery Mode (RCM) using a jumper:
- On the Jetson Orin Nano, locate the FC REC and GND pins on the GPIO header. The FC REC pin is typically marked on the board (check the Jetson Orin Nano Developer Kit pinout diagram).
- Place the jumper across the FC REC and GND pins. This forces the board into recovery mode when it is powered on.
- Connect the Jetson Orin Nano to the Linux PC with the USB Type-C cable, then connect the power supply while keeping the jumper in place.
- Verify recovery mode on the host PC; the board typically shows up as an NVIDIA device in lsusb.
- Start the SDK Manager software on the Linux system and flash the board with JetPack 6.2, choosing the NVMe SSD instead of an SD card as the target.
- Once the board has powered on and been detected, you can remove the jumper if necessary.
Verifying NVMe Storage
First, I checked if the OS was properly installed on the NVMe storage:
df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 456G 11G 422G 3% /
This confirmed that the system was running from the fast NVMe storage with plenty of space available (456GB total with 422GB free).
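If you prefer to script this check, here is a minimal Python equivalent using only the standard library (nothing Jetson-specific, just a convenience):

# Report root filesystem capacity, mirroring `df -h /`
import shutil

total, used, free = shutil.disk_usage("/")
gib = 1024 ** 3
print(f"Total: {total / gib:.0f} GiB  Used: {used / gib:.0f} GiB  Free: {free / gib:.0f} GiB")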
Installing Ollama
The first AI tool I installed was Ollama, which provides a convenient way to run large language models locally:
curl -fsSL https://ollama.com/install.sh | sh
The installation went smoothly, with the script automatically configuring Ollama for the NVIDIA JetPack environment:
Installing ollama to /usr/local
Downloading Linux arm64 bundle
######################################################################## 100.0%
>>> Downloading JetPack 6 components
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA JetPack ready.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
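Before moving on, it's worth confirming the API is actually reachable. Here is a quick sanity check with the Python standard library, hitting the /api/tags endpoint (which lists the models Ollama has pulled locally):

# List locally available Ollama models via the REST API announced above
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:11434/api/tags") as resp:
    data = json.load(resp)

# Right after a fresh install this list will be empty
for model in data.get("models", []):
    print(model["name"])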
Setting Up OpenWebUI Using Docker
To provide a user-friendly interface for interacting with Ollama, I deployed OpenWebUI using Docker:
sudo docker run -d --network=host \
-v ${HOME}/open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
I confirmed the container was running successfully:
sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c4de9820647e ghcr.io/open-webui/open-webui:main "bash start.sh" 11 seconds ago Up 10 seconds (health: starting) open-webui
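With host networking, OpenWebUI should be reachable on the Jetson's IP at port 8080 (the container's default, unless you've overridden it). A quick check that the UI is serving requests, again with just the standard library:

# Confirm the OpenWebUI frontend responds (assumes the default port 8080)
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8080", timeout=10) as resp:
    print("OpenWebUI responded with HTTP", resp.status)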
Running DeepSeek R1 with Ollama
Next, I tested Ollama with a compact but capable language model, DeepSeek R1:
ollama run deepseek-r1:1.5b --verbose
The model download proceeded smoothly:
pulling manifest
pulling aabd4debf0c8... 100% ▕████████████████▏ 1.1 GB
pulling 369ca498f347... 100% ▕████████████████▏ 387 B
pulling 6e4c38e1172f... 0% ▕ ▏ 0 B
pulling f4d24e9138dd... 0% ▕ ▏ 0 B
pulling a85fe2a2e58e... 100% ▕████████████████▏ 487 B
verifying sha256 digest
writing manifest
success
Testing the Model with a Prompt
I tested the model with a practical prompt relevant to the Jetson platform:
Write a short Python script to capture and process images using OpenCV on Jetson
The model generated a detailed Python script with explanations of key components for image capture and processing on Jetson hardware. The response included proper imports, device number retrieval, camera configuration, image capture, and processing steps.
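For context, here is a minimal sketch of the kind of script it produced (my own illustrative version, not the model's verbatim output), assuming a USB camera on /dev/video0 and OpenCV installed:

# Capture a single frame and run a simple edge-detection pass with OpenCV
import cv2

cap = cv2.VideoCapture(0)  # CSI cameras on Jetson usually need a GStreamer pipeline string instead
if not cap.isOpened():
    raise RuntimeError("Could not open camera")

ret, frame = cap.read()
cap.release()
if not ret:
    raise RuntimeError("Failed to capture a frame")

gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # convert to grayscale
edges = cv2.Canny(gray, 100, 200)                # detect edges

cv2.imwrite("frame.jpg", frame)
cv2.imwrite("edges.jpg", edges)
print("Saved frame.jpg and edges.jpg")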
Analyzing Model Performance
When the response completed, Ollama printed performance metrics (thanks to the --verbose flag):
total duration: 30.039637931s
load duration: 60.895507ms
prompt eval count: 19 token(s)
prompt eval duration: 282ms
prompt eval rate: 67.38 tokens/s
eval count: 920 token(s)
eval duration: 29.693s
eval rate: 30.98 tokens/s
These metrics revealed:
- Total processing time: 30.04 seconds
- Model loading time: 60.9 milliseconds (very fast)
- Prompt evaluation: 19 tokens processed at 67.38 tokens/second
- Response generation: 920 tokens generated at 30.98 tokens/second
The performance is quite impressive for an edge device, with the model loading quickly and generating nearly 31 tokens per second.
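If you'd rather capture these numbers programmatically than read the --verbose output, Ollama's REST API returns the same counters in the final /api/generate response (durations are reported in nanoseconds). A minimal sketch:

# Measure generation throughput via Ollama's REST API
import json
import urllib.request

payload = {
    "model": "deepseek-r1:1.5b",
    "prompt": "Write a short Python script to capture and process images using OpenCV on Jetson",
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
rate = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"Generated {result['eval_count']} tokens at {rate:.2f} tokens/s")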
Using Docker Compose for Advanced Deployment
For a more sophisticated setup, I moved to a Docker Compose configuration (downloaded from this link) that manages multiple services: the LLM server itself, an optional benchmarking container, and OpenWebUI.
On the first attempt, I ran into the following error:
sudo docker compose up -d
[sudo] password for ajeetraina:
[+] Running 1/1
✔ llm-server Pulled 2.8s
[+] Running 0/1
⠸ Container llm_server Starting 0.3s
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.: unknown
This error indicated that the Jetson's Docker configuration could not handle the newer deploy.resources syntax for GPU access. The message explicitly suggests using the older runtime: nvidia approach instead of the newer GPU device specification.
Modified Docker Compose Configuration
I adjusted my approach to use the older runtime: nvidia syntax instead:
services:
  llm-server:
    stdin_open: true
    tty: true
    container_name: llm_server
    network_mode: host
    runtime: nvidia
    ports:
      - 9000:9000
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
Important changes and optimizations (see the full Compose file later in this post) include:
- Using runtime: nvidia instead of the newer deploy resources syntax
- Using host networking mode for optimal performance
- Volume mounts for caching model weights
- Health checks to ensure service availability
- A --prefill-chunk parameter to reduce memory usage (more on this below)
Deploying with Docker Compose
I launched the compose setup with:
sudo docker compose up -d
The output confirmed successful deployment:
[+] Running 1/1
✔ llm-server Pulled 2.8s
[+] Running 1/1
✔ Container llm_server Started 0.6s
Checking the container status:
sudo docker compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
llm_server dustynv/mlc:r36.4.0 "sudonim serve --mod…" llm-server 6 seconds ago Up 5 seconds (health: starting)
Memory Challenges on the Jetson Orin
When running the LLM server, I initially encountered memory limitations with the Jetson Orin Nano. The following error appeared in the logs:
[bt] (2) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(+0x303cfc) [0xffff64853cfc]
[bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff898dc7f8]
[bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff8b7ebfc0]
File "/opt/mlc-llm/cpp/serve/threaded_engine.cc", line 287
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6476.857 MB, which is less than the sum of model weight size (876.640 MB) and temporary buffer size (10771.183 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
This error message revealed the core challenge:
- Available GPU memory: ~6.5GB
- Model weight size: ~0.9GB
- Temporary buffer needed: ~10.8GB
- Total required: ~11.7GB (exceeding available memory)
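Because the Orin Nano's GPU shares system RAM with the CPU (unified memory), the "available GPU memory" figure above is essentially whatever system memory is free. A quick way to see the headroom, as a rough sanity check:

# Rough memory headroom check; on Jetson the GPU draws from the same pool as the CPU
def meminfo_gib(field: str) -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) / (1024 ** 2)  # value is in kB
    raise KeyError(field)

print(f"MemTotal:     {meminfo_gib('MemTotal'):.1f} GiB")
print(f"MemAvailable: {meminfo_gib('MemAvailable'):.1f} GiB")

Anything the desktop, Docker, and other services hold on to comes straight out of what the model can use.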
The error suggested three potential solutions:
- Increase GPU memory utilization (not applicable as we were already pushing the limits)
- Enable tensor parallelism (not applicable for single-GPU setups like the Jetson)
- Reduce the prefill chunk size
I opted for the third solution, following option #3 from the error message: shrinking the temporary buffer by passing a smaller --prefill-chunk value to the serve command. This parameter limits how much context the model processes in a single prefill pass, trading a little efficiency for a much smaller memory footprint.
Here's how I modified the command in the Docker Compose file:
command: sudonim serve --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
  --quantization q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen
  --host 0.0.0.0 --port 9000 --prefill-chunk 1024
I re-ran the Compose services and followed the logs:
sudo docker compose up -d
sudo docker compose logs -f
llm_server |
llm_server | Running auto update command in /opt/sudonim
llm_server | git pull && pip3 install --upgrade-strategy only-if-needed -e .
llm_server | remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (7/7), done.
llm_server | remote: Total 24 (delta 19), reused 22 (delta 17), pack-reused 0 (from 0)
Unpacking objects: 100% (24/24), 2.46 KiB | 209.00 KiB/s, done.
llm_server | From https://github.com/dusty-nv/sudonim
llm_server | 7f568e6..c8d2f98 main -> origin/main
llm_server | Updating 7f568e6..c8d2f98
llm_server | Fast-forward
llm_server | pyproject.toml | 2 +-
llm_server | sudonim/runners/export.py | 28 +++++++++++++++++++++++++---
llm_server | sudonim/runtimes/mlc.py | 2 +-
llm_server | sudonim/utils/docker.py | 20 +++++++++++++++++---
llm_server | 4 files changed, 44 insertions(+), 8 deletions(-)
llm_server | Looking in indexes: https://pypi.jetson-ai-lab.dev/jp6/cu126
llm_server | Obtaining file:///opt/sudonim
So far, everything looks normal - it's pulling the latest code from the repository and updating the package.
llm_server | Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub->sudonim==0.1.7) (4.67.1)
llm_server | Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub->sudonim==0.1.7) (4.12.2)
llm_server | Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (3.4.1)
llm_server | Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (3.10)
llm_server | Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->docker->sudonim==0.1.7) (2024.12.14)
llm_server | Building wheels for collected packages: sudonim
llm_server | Building editable for sudonim (pyproject.toml) ... done
llm_server | Created wheel for sudonim: filename=sudonim-0.1.7-0.editable-py3-none-any.whl size=4464 sha256=dddf389242d52ea17d83e1fb4fa2a438b10fb22a7c913c4df4695b069c8764cb
llm_server | Stored in directory: /tmp/pip-ephem-wheel-cache-_jbwjsiw/wheels/4a/b4/2c/516abc57fcca2f71adffc99520e49ddd6c2ee2be8fd0dd86b6
llm_server | Successfully built sudonim
llm_server | Installing collected packages: sudonim
llm_server | Attempting uninstall: sudonim
llm_server | Found existing installation: sudonim 0.1.6
llm_server | Uninstalling sudonim-0.1.6:
llm_server | Successfully uninstalled sudonim-0.1.6
llm_server | Successfully installed sudonim-0.1.7
llm_server |
llm_server | [12:16:50] sudonim | sudonim version 0.1.7
llm_server |
llm_server | ┌──────────────────────────┬─────────────────────────────┬──────────────────────────────┐
llm_server | │ CUDA_VERSION 12.6 │ GPU 0 │ CACHE_ROOT /root/.cache │
llm_server | │ NVIDIA_DRIVER 540.4.0 │ ├ name Orin Nano 8GB │ HAS_MLC True │
llm_server | │ SYSTEM_ID orin-nano │ ├ family Ampere │ HAS_HF_HUB True │
llm_server | │ CPU_ARCH aarch64 │ ├ cores 1024 │ HAS_NVIDIA_SMI True │
llm_server | │ GPU_ARCH sm87 │ ├ mem_free [5.0 / 7.6 GB] │ │
llm_server | └──────────────────────────┴─────────────────────────────┴──────────────────────────────┘
llm_server |
llm_server | [12:16:50] sudonim | Downloading model from HF Hub: dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC -> /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server | /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
llm_server | warnings.warn(
Fetching 38 files: 100% 38/38 [00:00<00:00, 828.06it/s]
llm_server | [12:16:50] sudonim | Downloaded model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to: /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server | [12:16:50] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/dusty-nv
llm_server |
llm_server | mlc_llm serve --mode interactive --device cuda \
llm_server | --host 0.0.0.0 --port 9000 \
llm_server | --overrides='tensor_parallel_shards=1;prefill_chunk_size=1024' \
llm_server | --model-lib /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
llm_server | DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
llm_server |
Here's the final working Compose file:
services:
  llm-server:
    stdin_open: true
    tty: true
    container_name: llm_server
    network_mode: host
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    pull_policy: always
    volumes:
      - /mnt/nvme/cache:/root/.cache
    image: dustynv/mlc:r36.4.0
    command: sudonim serve --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
      --quantization q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen
      --host 0.0.0.0 --port 9000 --prefill-chunk 1024
    healthcheck:
      test: ["CMD", "curl", "-f", "http://0.0.0.0:9000/v1/models"]
      interval: 20s
      timeout: 60s
      retries: 45
      start_period: 15s

  perf-bench:
    profiles:
      - perf-bench
    depends_on:
      llm-server:
        condition: service_healthy
    stdin_open: true
    tty: true
    network_mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /mnt/nvme/cache:/root/.cache
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    pull_policy: always
    image: dustynv/mlc:r36.4.0
    command: sudonim bench stop --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
      --quantization q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen
      --host 0.0.0.0 --port 9000 --prefill-chunk 1024

  open-webui:
    profiles:
      - open-webui
    depends_on:
      llm-server:
        condition: service_healthy
    stdin_open: true
    tty: true
    container_name: open-webui
    network_mode: host
    environment:
      - ENABLE_OPENAI_API=True
      - ENABLE_OLLAMA_API=False
      - OPENAI_API_BASE_URL=http://0.0.0.0:9000/v1
      - OPENAI_API_KEY=foo
      - DOCKER_PULL=always
      - HF_HUB_CACHE=/root/.cache/huggingface
    volumes:
      - /mnt/nvme/cache/open-webui:/app/backend/data
      - /mnt/nvme/cache:/root/.cache
    pull_policy: always
    image: ghcr.io/open-webui/open-webui:main
Even with --prefill-chunk 1024, I was still hitting the same memory error: the parameter wasn't reducing the memory requirements enough, so I tried an even smaller chunk size.
First, I modified the command in the Compose file to use an even smaller prefill chunk:
command: sudonim serve --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
  --quantization q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen
  --host 0.0.0.0 --port 9000 --prefill-chunk 512
After making these changes:
sudo docker compose down
sudo docker compose up -d
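Once the container reports healthy, you can exercise the OpenAI-compatible endpoint directly (the same one OpenWebUI and the healthcheck use). A minimal sketch; the model name should match whatever http://127.0.0.1:9000/v1/models reports:

# Send a single chat completion request to the MLC server on port 9000
import json
import urllib.request

payload = {
    "model": "DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC",
    "messages": [{"role": "user", "content": "In one sentence, what is the Jetson Orin Nano?"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:9000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])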
Conclusion
The NVIDIA Jetson Orin Nano Super proves to be an impressive platform for edge AI deployment. Despite some memory constraints that required careful optimization, I was able to successfully run sophisticated language models directly on the device.
The combination of Docker for containerization, Ollama for model management, and OpenWebUI for user interaction creates a powerful, self-contained AI development environment. This setup demonstrates the potential for deploying AI capabilities in scenarios where cloud connectivity might be limited or privacy concerns necessitate local processing.
For developers interested in edge AI, the Jetson Orin Nano Super represents an excellent balance of performance, power efficiency, and affordability. Its ability to run models like DeepSeek R1 with reasonable performance opens up exciting possibilities for creative applications in robotics, computer vision, natural language processing, and more.