Alright, buckle up, because we’re diving into a quick and dirty way to run a local LLM (Large Language Model) and make API requests to it, much like you would with the fancy commercial services. Why? Well, why not? In about three minutes you can have a perfectly decent setup running locally for most of your tests. And if you ever feel the need to scale up to the cloud again, switching back is practically effortless.
We’ll be loosely following the OpenAI Chat Completions API documentation, mostly so you can claim you’ve read it. In particular, we’ll focus on making a request like this:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7
  }'
So far, so good, right? Nothing groundbreaking. But here’s where it gets fun…
Enter LM Studio
There's this gem of a tool called LM Studio, which makes running local LLMs much easier. After installing it and loading a model, you’ll notice a tab with a console icon called Developer. I know, it doesn’t sound too exciting at first, but hold on, because it gets better. This tab is where you start the local server, and it comes with a handy curl example that shows you exactly how to call your model. And, wouldn't you know it, it looks pretty familiar!
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-lexi-uncensored-v2",
    "messages": [
      { "role": "system", "content": "Always answer in rhymes. Today is Thursday" },
      { "role": "user", "content": "What day is it today?" }
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": false
  }'
Looks pretty familiar, right? This is the local version of what we just saw: the same setup as the OpenAI API request, except it’s running on your own machine. The extra fields are simple enough: "max_tokens": -1 tells LM Studio not to cap the response length, and "stream": false asks for the whole answer in a single response. Plus, it's got a little flair, like the "Always answer in rhymes" system prompt. Poetry, anyone?
What About Python? We Got You.
If you prefer working with Python (and let’s be real, who doesn’t?), here’s how you’d send the same request using Python’s requests module:
import requests
import json

url = "http://localhost:1234/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "llama-3.1-8b-lexi-uncensored-v2",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": -1,
    "stream": False
}

response = requests.post(url, headers=headers, data=json.dumps(data))

if response.status_code == 200:
    result = response.json()
    print(result["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code}")
And voilà! You’re now ready to send requests to your local LLM just like you would with a commercial API. Go ahead, test it, break it, make it rhyme — the world (or at least your model) is your oyster.
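One last trick, since the whole point is that hopping between local and cloud should be painless: the official openai Python package can talk to LM Studio too, because the local server speaks the same dialect. Here’s a minimal sketch, assuming you have the openai package (v1.x) installed; the api_key value is a throwaway placeholder, since the local server doesn’t check it:

from openai import OpenAI

# Point the official client at the local LM Studio server instead of api.openai.com.
# The api_key here is a dummy value; the local server doesn't validate it.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="llama-3.1-8b-lexi-uncensored-v2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
)

print(completion.choices[0].message.content)

Swap base_url (and the model name) back to the real thing, and you’re on the cloud again.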