Marko Arnauto for cortecs


LLMs for Big Data

We all love our chatbots, but when it comes to heavy loads, they just don’t cut it. If you need to analyze thousands of documents at once, serverless inference — the go-to for chat applications — quickly shows its (rate) limits.

One Model — Many Users 

Imagine working in a shared co-working space: it’s convenient, but your productivity depends on how crowded the space is. Similarly, serverless providers like OpenAI, Anthropic, or Groq rely on shared infrastructure, where performance fluctuates based on how many users are competing for resources. Strict rate limits, like Groq’s 7,000 tokens per minute, can grind progress to a halt.
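
To make the throttling concrete, here is a minimal sketch of the retry-with-backoff loop that shared endpoints often force on you, using the OpenAI Python client. The model name and prompt are placeholders, not a recommendation.

import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat_with_backoff(prompt: str, retries: int = 5):
    # Retry with exponential backoff whenever the shared endpoint throttles us
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("Still rate-limited after all retries")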

Dedicated Compute — One Model per User

In contrast, dedicated inference allocates compute resources exclusively to a single user or application. This ensures predictable and consistent performance, as the only limiting factor is the computational capacity of the allocated GPUs. According to Fireworks.ai, a leading inference provider,

Graduating from serverless to on-demand deployments starts to make sense economically when you are running ~100k+ tokens per minute.

There are typically no rate limits on throughput. Billing for dedicated inference is time-based, calculated per hour or minute depending on the platform. While dedicated inference is well suited for high-throughput workloads, it involves a tedious setup process as well as the risk of overpaying for idle time.
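
To make that threshold tangible, here is a back-of-envelope comparison in Python. The per-token and per-hour prices are made-up placeholders, not quotes from any provider; plug in real rates to run the numbers for your own workload.

# Rough break-even sketch: serverless (pay per token) vs. dedicated (pay per hour).
# All prices are hypothetical placeholders; substitute your provider's actual rates.
tokens_per_minute = 100_000            # the ~100k tokens/min threshold quoted above
serverless_price_per_1m_tokens = 0.50  # USD, hypothetical
dedicated_price_per_hour = 2.00        # USD, hypothetical

tokens_per_hour = tokens_per_minute * 60
serverless_cost_per_hour = tokens_per_hour / 1_000_000 * serverless_price_per_1m_tokens
print(f"serverless: ${serverless_cost_per_hour:.2f}/h vs. dedicated: ${dedicated_price_per_hour:.2f}/h")
# At sustained high throughput the flat hourly rate wins; at low utilization it does not.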

Tedious Setup

Deploying dedicated inference requires careful preparation. First, you need to rent suitable hardware to support your chosen model. Next, an inference engine such as vLLM must be configured to match the model’s requirements. Finally, secure access must be established via TLS so that all communication is encrypted. According to Philipp Schmid of Hugging Face, you need one full-time developer to set up and maintain such a system.

Dedicated deployments require a tedious setup.
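
To give a feel for the middle step, here is a minimal sketch of serving a model yourself with vLLM's offline Python API. It covers only the inference-engine part; renting the GPU and putting a TLS-terminating proxy in front are still on you, and the model name is just an example.

# Minimal self-hosted inference sketch with vLLM (engine part only).
# Hardware rental and TLS termination are not shown.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/phi-4")  # example model; must fit on the rented GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Write a joke about LLMs."], params)
print(outputs[0].outputs[0].text)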

Idle Times

Time-based billing makes cost projections easier, but idle resources can quickly become a cost overhead. Dedicated inference is cost-effective only when the GPUs are busy. To avoid unnecessary expenses, the system should be turned off when not in use. Managing this manually can be tedious and error-prone.
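
As an illustration of what that manual management involves, here is a sketch of an idle-time watchdog; stop_deployment() is a hypothetical placeholder for whatever teardown call your platform actually provides.

import time

IDLE_LIMIT_SECONDS = 15 * 60   # shut down after 15 idle minutes (arbitrary choice)
last_request_at = time.time()  # update this whenever a request is served

def stop_deployment():
    ...  # hypothetical: call your provider's API to release the GPU

def watchdog():
    # Poll the last-activity timestamp and tear everything down once idle too long
    while True:
        if time.time() - last_request_at > IDLE_LIMIT_SECONDS:
            stop_deployment()
            break
        time.sleep(60)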

LLM Workers to the Rescue

To address the downsides of dedicated inference, providers like Google and cortecs offer dedicated LLM workers. Without any additional configuration, these workers are started and stopped on demand, avoiding setup overhead and idle times. The required hardware is allocated, the inference engine is configured, and API connections are established, all in the background. Once the workload is completed, the worker shuts down automatically.

Example

As I’m involved in the cortecs project, I’m going to showcase it using our library. It can be installed with pip.

pip install cortecs-py

We will use the OpenAI Python library to access the model.

pip install openai

Next, register at cortecs.ai and create your access credentials on your profile page. Then set them as environment variables.

export OPENAI_API_KEY="Your cortecs api key"
export CORTECS_CLIENT_ID="Your cortecs id"
export CORTECS_CLIENT_SECRET="Your cortecs secret"

It’s time to choose a model. We selected phi-4-FP8-Dynamic, a model that supports 🔵 instant provisioning. Models with instant provisioning enable a warm start, eliminating provisioning latency, which is perfect for this demonstration.

from openai import OpenAI
from cortecs_py import Cortecs

cortecs = Cortecs()
my_model = 'cortecs/phi-4-FP8-Dynamic'

# Start a new instance
my_instance = cortecs.ensure_instance(my_model)
client = OpenAI(base_url=my_instance.base_url)

completion = client.chat.completions.create(
  model=my_model,
  messages=[
    {"role": "user", "content": "Write a joke about LLMs."}
  ]
)
print(completion.choices[0].message.content)
# Stop the instance
cortecs.stop(my_instance.instance_id)

All provisioning complexity is abstracted by cortecs.ensure_instance(my_model) and cortecs.stop(my_instance.instance_id). Between these two lines, you can execute arbitrary inference tasks—whether it's generating a simple joke about LLMs or producing billions of words.
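
For the big-data scenario from the introduction, here is a sketch of what that in-between section could look like: summarizing a batch of documents, with try/finally making sure the worker is stopped even if a request fails. The documents list is a placeholder for your actual corpus.

from openai import OpenAI
from cortecs_py import Cortecs

cortecs = Cortecs()
my_model = 'cortecs/phi-4-FP8-Dynamic'
documents = ["First report ...", "Second report ..."]  # placeholder corpus

my_instance = cortecs.ensure_instance(my_model)
try:
    client = OpenAI(base_url=my_instance.base_url)
    for doc in documents:
        completion = client.chat.completions.create(
            model=my_model,
            messages=[{"role": "user", "content": f"Summarize this document:\n{doc}"}],
        )
        print(completion.choices[0].message.content)
finally:
    # Stop the worker even if a request above raised, so no idle time is billed
    cortecs.stop(my_instance.instance_id)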

LLM Workers are a game-changer for large-scale data analysis. With no need to manage complex compute clusters, they enable seamless big data analysis and generation without the typical concerns of rate limits or exploding inference costs.
Imagine a future where LLM Workers handle highly complex tasks, such as proving mathematical theorems or executing reasoning-intensive operations. You could launch a worker, let it run at full GPU utilization to tackle the problem, and have it shut itself down automatically upon completion. The potential is enormous, and this tutorial demonstrates how to dynamically provision LLM Workers for high-performance AI tasks.
