Ever wondered what it would be like to have your own personal fleet of language models at your command? In this post, we'll explore how to run LLMs on cloud GPU instances, giving you more control, better performance, and greater flexibility.
Why Run Your Own LLM Instances?
There are several compelling reasons to consider this approach:
- Data Control: You have complete oversight of the data sent to and processed by the LLMs.
- Enhanced Performance: Access to powerful GPU instances means faster responses and the ability to run larger models.
- Model Ownership: Run fine-tuned models with behaviour that remains consistent over time.
- Scalability: Easily scale resources up or down based on your needs.
How does it work?
We will be using Ollama, a tool for running LLMs, along with cost-effective cloud providers like RunPod and vast.ai. Here's the basic process:
- Start a cloud instance with Ollama installed
- Serve the LLM through an API
- Access the API from your local machine (see the example just after this list)
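To make this concrete, here's roughly what talking to the API looks like once a server is up. This is a minimal sketch assuming an Ollama server is reachable on localhost:11434 and that the llama3.1:8b tag (just an example) has already been pulled:

# Ask the Ollama API for a single, non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'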
RunPod
RunPod (runpod.io) offers a streamlined approach to creating cloud instances from Docker images. This means you can quickly spin up an instance that's already configured to serve Ollama and provide API access. It's worth noting that their pricing has become more competitive recently, with instances starting at $0.22/hr for a 24GB VRAM GPU.
Here's a step-by-step guide to get you started:
- On runpod.io, navigate to "Pods"
- Click "Deploy" and select an NVIDIA instance
- Choose the "ollama" template, which is based on the ollama/ollama:latest Docker image
- Take note of the POD_ID - you'll need this for API access
- Connect to the API via HTTPS on {POD_ID}-11434.proxy.runpod.net:443
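Once the pod is running, a quick way to confirm the API is reachable is to hit the /api/tags endpoint, which lists the models available on the instance (sketch below; {POD_ID} is a placeholder for your actual pod ID, and HTTPS implies port 443):

# Check that the Ollama API answers through the RunPod proxy
curl https://{POD_ID}-11434.proxy.runpod.net/api/tags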
While this setup is straightforward, it does raise some security concerns. The API is exposed on port 11434 without any built-in authentication or access limitations. I attempted to use an SSH tunnel as a workaround (similar to the method I'll describe for vast.ai), but encountered difficulties getting it to work with RunPod. This is an area where I'd appreciate community input on best practices or alternative solutions.
Optional: SSH access
If you need direct access to the instance, SSH is available:
ssh tn0b2n8qpybgbv-644112be@ssh.runpod.io -i ~/.ssh/id_ed25519
Once connected, you can verify the GPU specifications using the nvidia-smi command.
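For reference, these are the kinds of checks I'd run once connected; this assumes the ollama CLI is available inside the container (it is in the ollama/ollama image):

# Inspect the GPU and the state of the Ollama server
nvidia-smi          # GPU model, VRAM, utilisation
ollama list         # models pulled on this instance
ollama ps           # models currently loaded into VRAM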
Vast.ai
Vast.ai operates as a marketplace where users can both offer and rent GPU instances. The pricing is generally quite competitive, often lower than RunPod's, especially for low-end GPUs with less than 24GB of VRAM. However, it also provides access to more powerful systems, like the 4xA100 setup I used to run Llama3.1-405B.
Setting up an instance on Vast.ai is straightforward. You can select an Ollama template within their interface, again leveraging Docker. Unlike RunPod, Vast.ai doesn't automatically expose a port for API access, but SSH tunnelling turns out to be a more secure solution anyway. Once you've chosen an instance that meets your requirements, simply click "Rent" and connect via SSH, adding the tunnel to the same command:
ssh -i ~/.ssh/vastai -p 31644 root@162.193.169.187 -L 11434:localhost:11434
This command creates a tunnel that forwards connections from port 11434 on your local machine to port 11434 on the remote machine, allowing you to access services on the remote machine as if they were running locally.
Important: the vast.ai image does not run the Ollama server by default. To enable it, modify the template during the instance rental process by adding ollama serve to the on-start script. Alternatively, you can connect via SSH and run the command manually. Vast.ai also offers a CLI tool to search for available GPU instances, rent them, run Ollama, and connect directly from the command line, which is quite neat.
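For clarity, here's a rough sketch of the manual route: start the server over SSH, pull a model, then verify it from your local machine through the tunnel (the model tag is just an example):

# On the vast.ai instance, over SSH
ollama serve &                                      # or add this to the template's on-start script
ollama pull mistral-nemo:12b-instruct-2407-q4_K_M   # example model tag
# Back on your local machine, with the tunnel from above still open
curl http://localhost:11434/api/tags                # should list the model you just pulled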
Note: During my initial tests on Vast.ai, I encountered issues where the Ollama server crashed, likely due to instance-specific factors. Restarting the instance resolved the problem, suggesting it might have been an isolated incident.
Checking the API and comparing models
I've created some scripts to test models and compare performance (see here). Here's how to use them.
For RunPod:
node stream_chat_completion.js -v --function ollama --hostname sbeu57aj70rdqu-11434.proxy.runpod.net --port 443
For Vast.ai (using SSH tunnel, keep that terminal open!):
node stream_chat_completion.js -v --function ollama --models mistral-nemo:12b-instruct-2407-q2_K,mistral-nemo:12b-instruct-2407-q4_K_M
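If you'd rather sanity-check the endpoint without the script, a plain request against Ollama's /api/chat endpoint works too. A minimal sketch, assuming the vast.ai SSH tunnel from above is still open and the model tag has already been pulled on the instance:

# Non-streamed chat request against the Ollama API
curl http://localhost:11434/api/chat -d '{
  "model": "mistral-nemo:12b-instruct-2407-q4_K_M",
  "messages": [{ "role": "user", "content": "In one sentence, why self-host an LLM?" }],
  "stream": false
}'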
What About AWS?
While I initially looked into AWS EC2, it proved less straightforward and more costly for this specific use case than RunPod and Vast.ai. For completeness, here are the steps I took to set up the NVIDIA drivers and Ollama on an Ubuntu instance (verification sketch after the commands):
sudo apt-get update
sudo apt install ubuntu-drivers-common -y
sudo apt install nvidia-driver-550 -y # use 535 for A10G!!
sudo apt install nvidia-cuda-toolkit -y
# verify that nvidia drivers are running
sudo nvidia-smi
# install ollama
curl -fsSL https://ollama.com/install.sh | sh
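On Ubuntu, the install script normally registers Ollama as a service, so the API should already be listening locally. A quick way to verify and try a first model (the tag is just an example):

# Confirm the server is up, then pull and run a model
curl http://localhost:11434/api/version
ollama pull llama3.1:8b
ollama run llama3.1:8b "Say hello in five words."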
Conclusion
Running LLMs on cloud GPU instances is more accessible (both in cost and effort) than I originally thought, and it offers impressive performance across a range of model sizes. The ability to run a model as large as Llama3.1-405B, quantised to fit in 320GB of VRAM, is particularly noteworthy.
However, beyond experimenting with larger models, I'm not yet sure what the compelling use cases are compared to the big LLMs available through APIs (e.g. GPT-4o, Claude 3.5, etc.).
Have you tried running your own LLMs in the cloud? What has your experience been like? I'd love to hear your thoughts and questions in the comments below!