Performance testing of OpenAI-compatible APIs (K6+Grafana)

I think many of you have needed to profile the performance of OpenAI-compatible APIs at some point, and so did I. We had a project where I needed to compare how Ollama and vLLM scale under high concurrent use (no surprises about the winner, but we wanted to measure the numbers in detail).

As a result, I ended up building a generic K6 and Grafana setup specifically for this purpose, which I'm happy to share.

Here's what the end result looks like:

Screenshot of inference API performance in Grafana dashboard

It consists of a set of pre-configured components, as well as helpers to easily query the APIs, track completion request metrics, and create scenarios for permutation testing.

The setup is based on the following components:

  • K6 - a modern and extremely flexible load testing tool
  • Grafana - for visualizing the results
  • InfluxDB - for storing and querying the results (non-persistent, but can be made so)

Most notably, the setup includes:

K6 helpers

If you've worked with K6 before, you know that it isn't plain JavaScript or Node.js: the whole HTTP stack is a wrapper around the underlying Go backend (for efficiency and metric collection). So the setup comes with helpers to easily connect to OpenAI-compatible APIs from the tests. For example:

import * as oai from './helpers/openaiGeneric.js';

const client = oai.createClient({
  // URL of the API, note that
  // "/v1" is added by the helper
  url: 'http://ollama:11434',
  options: {
    // A subset of the request body for the /completions endpoints
    model: 'qwen2.5-coder:1.5b-base-q8_0',
  },
});

// /v1/completions endpoint
const completion = client.complete({
  prompt: 'The meaning of life is',
  max_tokens: 10,
  // You can specify anything else supported by the
  // downstream service endpoint here; these values
  // will also override the "options" from the client.
});

// /v1/chat/completions endpoint
const chatCompletion = client.chatComplete({
  messages: [
    { role: 'user', content: 'Answer in one word. Where is the moon?' },
  ],
  // You can specify anything else supported by the
  // downstream service endpoint here; these values
  // will also override the "options" from the client.
});

This client will also automatically collect a few metrics for all performed requests: prompt_tokens, completion_tokens, total_tokens, and tokens_per_second (completion tokens divided by request duration). Of course, all of the native HTTP metrics from K6 are also there.
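For reference, here's a minimal sketch of how such per-request metrics can be recorded with K6's custom Trend metrics and a raw HTTP call. The endpoint, model, and metric names mirror the examples above; the actual helper in the setup may implement this differently.

import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Custom per-request metrics (a simplified version of what the helper tracks)
const completionTokens = new Trend('completion_tokens');
const tokensPerSecond = new Trend('tokens_per_second');

export default function () {
  const res = http.post(
    'http://ollama:11434/v1/completions',
    JSON.stringify({
      model: 'qwen2.5-coder:1.5b-base-q8_0',
      prompt: 'The meaning of life is',
      max_tokens: 10,
    }),
    { headers: { 'Content-Type': 'application/json' } },
  );

  // OpenAI-compatible APIs report token usage in the response body
  const usage = res.json('usage');
  if (usage) {
    completionTokens.add(usage.completion_tokens);
    // res.timings.duration is in milliseconds
    tokensPerSecond.add(usage.completion_tokens / (res.timings.duration / 1000));
  }
}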

K6 sequence orchestration

When running performance tests, it's often about finding either a scalability limit or an optimal combination of parameters for a projected scale: for example, the optimal temperature, max concurrency, or any other dimension of the payloads sent to the downstream API.

So, the setup includes a permutation helper:

import * as oai from './helpers/openaiGeneric.js';
import { scenariosForVariations } from './helpers/utils.js';

// All possible parameters to permute
const variations = {
  temperature: [0, 0.5, 1],
  // Variation values have to be serializable.
  // Here, we're listing the indices of
  // the clients to use
  client: [0, 1],
  // Variations can be any set of discrete values
  animal: ['cats', 'dogs'],
}

// Clients to use in the tests, matching
// the indices from the variations above
const clients = [
  oai.createClient({
    url: 'http://ollama:11434',
    options: {
      model: 'qwen2.5-coder:1.5b-base-q8_0',
    },
  }),
  oai.createClient({
    url: 'http://vllm:11434',
    options: {
      model: 'Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ',
    },
  }),
]

export const options = {
  // Pre-configure a set of tests for all possible
  // permutations of the parameters
  scenarios: scenariosForVariations(variations, 60),
};

export default function () {
  // The actual test code; variation parameters
  // are passed via __ENV (always as strings)
  const client = clients[Number(__ENV.client)];
  const animal = __ENV.animal;
  const response = client.complete({
    prompt: `I love ${animal} because`,
    max_tokens: 10,
    temperature: Number(__ENV.temperature),
  });

  // ...
}
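The helper itself doesn't need to do much: it enumerates all permutations and turns each one into a K6 scenario, exposing the chosen values through the scenario's env. Here's a minimal sketch of that idea, assuming the second argument is a per-scenario duration in seconds; the actual helper in the repo may differ in naming, executor choice, and scheduling.

// A hypothetical, simplified implementation of scenariosForVariations:
// one constant-VUs scenario per permutation, run back to back, with the
// variation values passed to the test code via each scenario's `env`.
export function scenariosForVariations(variations, durationSeconds) {
  const keys = Object.keys(variations);

  // Build the Cartesian product of all variation values
  let combos = [{}];
  for (const key of keys) {
    const next = [];
    for (const combo of combos) {
      for (const value of variations[key]) {
        next.push(Object.assign({}, combo, { [key]: value }));
      }
    }
    combos = next;
  }

  const scenarios = {};
  combos.forEach((combo, i) => {
    const name = keys.map((k) => k + '_' + combo[k]).join('__');
    const env = {};
    for (const key of keys) {
      // __ENV values are always strings inside the test function
      env[key] = String(combo[key]);
    }
    scenarios[name] = {
      executor: 'constant-vus',
      vus: 1,
      duration: durationSeconds + 's',
      startTime: i * durationSeconds + 's',
      env: env,
    };
  });
  return scenarios;
}

Since K6 tags every sample with the scenario name, the permutations can then be filtered and compared against each other in Grafana.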

Grafana dashboard

To easily get the gist of the results, the setup includes a pre-configured Grafana dashboard. It's a simple one, but it's easy to extend and modify to your needs. Out of the box, you can see tokens per second (on a per-request basis), completion and prompt token stats, as well as metrics related to concurrency and performance at the HTTP level.

Installation

The setup is part of a larger project, but you can use it fully standalone. You can find the guide on GitHub.
