DEV Community

Luca Liu


Getting Responses from Local LLM Models with Python

Introduction

After setting up your local LLM with LM Studio (as covered in my previous article), the next step is to interact with it programmatically from Python. This article shows how to call the model's local HTTP API with the requests library: listing the available models, generating a completion, and holding a chat conversation.

Step 1: Start Your Local LLM System

Before running the Python code, ensure your local LLM system is up and running. Most systems expose a RESTful API or a similar interface for interaction.

For instance, LM Studio exposes a local HTTP server. You can find the server address and the supported endpoints in the LM Studio interface.

In this example, the local server address is http://127.0.0.1:1234 (equivalently, http://localhost:1234), and the supported endpoints are:

  1. GET http://localhost:1234/v1/models
  2. POST http://localhost:1234/v1/chat/completions
  3. POST http://localhost:1234/v1/completions
  4. POST http://localhost:1234/v1/embeddings
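
Before moving on, it can help to confirm that the server is actually reachable. The following is a minimal sketch, not part of LM Studio itself: the helper name `is_server_up` is hypothetical, and it assumes the server answers GET /v1/models with HTTP 200 when it is running. It uses only the standard library so it works even before requests is installed.

```python
import urllib.request


def is_server_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False
```

Calling `is_server_up("http://localhost:1234")` should return True once the LM Studio server has been started, and False if nothing is listening on that port.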

Step 2: List Available Models

The /v1/models endpoint retrieves the list of available models.
Here’s a basic Python script that calls this endpoint and prints the result.

import requests

LLM_BASE_URL = "http://localhost:1234" # replace with your server address

# Fetch available models
response = requests.get(f"{LLM_BASE_URL}/v1/models")
if response.status_code == 200:
    models = response.json()
    print("Available Models:", models)
else:
    print(f"Failed to fetch models: {response.status_code} - {response.text}")

This will display the models hosted by LM Studio.
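
The raw JSON can be noisy when all you need are the model names to use in later requests. The sketch below assumes the response follows the OpenAI-style list shape, with a "data" array whose entries each carry an "id". The helper name `extract_model_ids` and the sample body are hypothetical, so check them against what your server actually returns.

```python
def extract_model_ids(models_response: dict) -> list:
    """Pull the "id" of each entry out of a /v1/models response body."""
    entries = models_response.get("data", [])
    return [entry["id"] for entry in entries if "id" in entry]


# Hypothetical response body, shaped like an OpenAI-style model list.
sample = {
    "object": "list",
    "data": [
        {"id": "example-model", "object": "model"},
        {"id": "another-model", "object": "model"},
    ],
}
print(extract_model_ids(sample))
```

With a live server, you would pass `response.json()` from the script above instead of `sample`, then use one of the returned IDs as the "model" value in the next step.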

Step 3: Get Response from LLM

/v1/completions is for single prompts, while /v1/chat/completions is for conversations with context.

1. Generate a Completion

Use the /v1/completions endpoint to send a single text prompt and receive a response. It involves no conversation context, so use it when you need one standalone output based on your input.

Example:

# Reuses `requests` and LLM_BASE_URL from the script in Step 2
# Define the prompt and parameters
payload = {
    "model": "your-model-name",  # Replace with your desired model name
    "prompt": "What are the key benefits of local LLM systems?",
    "max_tokens": 100,
    "temperature": 0.7
}

# Send the request
response = requests.post(f"{LLM_BASE_URL}/v1/completions", json=payload)

if response.status_code == 200:
    data = response.json()
    print("Completion Response:")
    choices = data.get("choices") or [{}]  # guard against an empty list
    print(choices[0].get("text", "No response"))
else:
    print(f"Error: {response.status_code} - {response.text}")

Key Parameters:
• model: The name of the model to use, as returned by /v1/models.
• prompt: Your input text to the model.
• max_tokens: Caps the number of tokens the model generates in its response.
• temperature: Adjusts the randomness of the output; lower values give more predictable text.
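
To build some intuition for temperature, here is a small sketch of the standard temperature-scaled softmax used in token sampling. It is a textbook illustration, not LM Studio's actual sampling code, and the logits are made-up scores for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability lands on the top-scoring token, while a higher temperature spreads it more evenly across the candidates, which is why higher values produce more varied output.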

2. Use the Chat Completion Endpoint

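The /v1/chat/completions endpoint is designed for multi-turn, conversation-like interactions. Each request carries a list of messages tagged with a role (system, user, or assistant), and the model answers in light of that list. The context comes from the messages you send, so include the earlier turns in every request, as the example below does. Use this endpoint when you need interactive, dynamic conversations with the model.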

Example:

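# Reuses `requests` and LLM_BASE_URL from the script in Step 2
# Define the chat input
payload = {
    "model": "example-model",  # Replace with your desired model name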
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"},
        {"role": "assistant",
         "content": "Quantum computing is a type of computation that uses quantum-mechanical phenomena like superposition and entanglement to process information."},
        {"role": "user", "content": "Can you explain superposition in simple terms?"}
    ],
    "max_tokens": 100,
    "temperature": 0.5
}

# Send the request
response = requests.post(f"{LLM_BASE_URL}/v1/chat/completions", json=payload)

if response.status_code == 200:
    data = response.json()
    print("Chat Response:")
    choices = data.get("choices") or [{}]  # guard against an empty list
    print(choices[0].get("message", {}).get("content", "No response"))
else:
    print(f"Error: {response.status_code} - {response.text}")

When to Use:
• Interactive tasks: Provide back-and-forth dialogue with context awareness.
• Multi-step queries: Answer questions with follow-ups, maintaining the conversation thread.
• Context-sensitive tasks: Adjust responses based on prior inputs or the user’s context.
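
The chat example above sends the whole history in "messages", so a multi-turn script has to keep that list up to date itself. Below is a minimal sketch of one way to do that. The `Conversation` class and the `send` callable are hypothetical names, and `fake_send` stands in for the real HTTP call so the sketch runs without a server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history that each chat request resends."""

    def __init__(self, system_prompt: str, send: Callable[[List[Message]], str]):
        # `send` is a hypothetical callable: it takes the full message list
        # and returns the assistant's reply text.
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]
        self._send = send

    def ask(self, user_text: str) -> str:
        pending = self.messages + [{"role": "user", "content": user_text}]
        reply = self._send(pending)
        # Commit both turns only after the call succeeds.
        self.messages = pending + [{"role": "assistant", "content": reply}]
        return reply


def fake_send(messages: List[Message]) -> str:
    """Stand-in for the real request; reports how many messages it received."""
    return f"(reply to message {len(messages)})"


chat = Conversation("You are a helpful assistant.", fake_send)
chat.ask("What is quantum computing?")
chat.ask("Can you explain superposition in simple terms?")
print([m["role"] for m in chat.messages])
```

To talk to the real server, replace `fake_send` with a function that posts the model name and the message list to /v1/chat/completions and returns the content of the first choice, as in the example above.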

Conclusion

By following this guide, you can use Python to list the models on your local LLM server, generate single completions, and hold multi-turn chats. A few HTTP requests are all it takes to integrate a local LLM into your applications.

Feel free to expand these scripts for more complex applications, such as automation or integration with other tools!

