Performance testing of OpenAI-compatible APIs (K6+Grafana)

I think many of you have needed to profile the performance of OpenAI-compatible APIs at some point, and so did I. We had a project where I needed to compare how Ollama and vLLM scale under high concurrent use (no surprises about the winner, but we wanted to measure the numbers in detail).

As a result, I ended up building a generic K6 and Grafana setup specifically for this purpose, which I'm happy to share.

Here's what the end result looks like:

Screenshot of inference API performance in Grafana dashboard

It consists of a set of pre-configured components, as well as helpers to easily query the APIs, track completion request metrics, and create scenarios for permutation testing.

The setup is based on the following components:

  • K6 - a modern and extremely flexible load testing tool
  • Grafana - for visualizing the results
  • InfluxDB - for storing and querying the results (non-persistent, but can be made so)

Most notably, the setup includes:

K6 helpers

If you've worked with K6 before, you know that it isn't plain JavaScript or Node.js: the whole HTTP stack is a wrapper around the underlying Go backend (for efficiency and metric collection). So the setup comes with helpers to easily connect to OpenAI-compatible APIs from the tests. For example:

import * as oai from './helpers/openaiGeneric.js';

const client = oai.createClient({
  // URL of the API, note that
  // "/v1" is added by the helper
  url: 'http://ollama:11434',
  options: {
    // A subset of the request body for the /completions endpoints
    model: 'qwen2.5-coder:1.5b-base-q8_0',
  },
});

// /v1/completions endpoint
const completion = client.complete({
  prompt: 'The meaning of life is',
  max_tokens: 10,
  // You can specify anything else supported by the
  // downstream service endpoint here; these values
  // will also override the "options" from the client.
});

// /v1/chat/completions endpoint
const chatCompletion = client.chatComplete({
  messages: [
    { role: 'user', content: 'Answer in one word. Where is the moon?' },
  ],
  // You can specify anything else supported by the
  // downstream service endpoint here; these values
  // will also override the "options" from the client.
});

This client will also automatically collect a few metrics for all performed requests: prompt_tokens, completion_tokens, total_tokens, and tokens_per_second (completion tokens divided by request duration). Of course, all of the native HTTP metrics from K6 are also there.
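For reference, here's a minimal sketch of how such per-request metrics can be recorded with K6's custom Trend metrics and a raw HTTP call. The endpoint, model, and metric names mirror the examples above; the actual helper in the setup may implement this differently.

import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Custom per-request metrics (a simplified version of what the helper tracks)
const completionTokens = new Trend('completion_tokens');
const tokensPerSecond = new Trend('tokens_per_second');

export default function () {
  const res = http.post(
    'http://ollama:11434/v1/completions',
    JSON.stringify({
      model: 'qwen2.5-coder:1.5b-base-q8_0',
      prompt: 'The meaning of life is',
      max_tokens: 10,
    }),
    { headers: { 'Content-Type': 'application/json' } },
  );

  // OpenAI-compatible APIs report token usage in the response body
  const usage = res.json('usage');
  if (usage) {
    completionTokens.add(usage.completion_tokens);
    // res.timings.duration is in milliseconds
    tokensPerSecond.add(usage.completion_tokens / (res.timings.duration / 1000));
  }
}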

K6 sequence orchestration

When running performance tests, it's often about finding either a scalability limit or an optimal combination of parameters for a projected scale: for example, the optimal temperature, max concurrency, or any other dimension of the payloads sent to the downstream API.

So, the setup includes a permutation helper:

import * as oai from './helpers/openaiGeneric.js';
import { scenariosForVariations } from './helpers/utils.js';

// All possible parameters to permute
const variations = {
  temperature: [0, 0.5, 1],
  // Variation values have to be serializable.
  // Here, we're listing the indices of
  // the clients to use
  client: [0, 1],
  // Variations can be any set of discrete values
  animal: ['cats', 'dogs'],
}

// Clients to use in the tests, matching
// the indices from the variations above
const clients = [
  oai.createClient({
    url: 'http://ollama:11434',
    options: {
      model: 'qwen2.5-coder:1.5b-base-q8_0',
    },
  }),
  oai.createClient({
    url: 'http://vllm:11434',
    options: {
      model: 'Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ',
    },
  }),
]

export const options = {
  // Pre-configure a set of tests for all possible
  // permutations of the parameters
  scenarios: scenariosForVariations(variations, 60),
};

export default function () {
  // The actual test code; variation parameters
  // are passed via __ENV (always as strings)
  const client = clients[Number(__ENV.client)];
  const animal = __ENV.animal;
  const response = client.complete({
    prompt: `I love ${animal} because`,
    max_tokens: 10,
    temperature: Number(__ENV.temperature),
  });

  // ...
}
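The helper itself doesn't need to do much: it enumerates all permutations and turns each one into a K6 scenario, exposing the chosen values through the scenario's env. Here's a minimal sketch of that idea, assuming the second argument is a per-scenario duration in seconds; the actual helper in the repo may differ in naming, executor choice, and scheduling.

// A hypothetical, simplified implementation of scenariosForVariations:
// one constant-VUs scenario per permutation, run back to back, with the
// variation values passed to the test code via each scenario's `env`.
export function scenariosForVariations(variations, durationSeconds) {
  const keys = Object.keys(variations);

  // Build the Cartesian product of all variation values
  let combos = [{}];
  for (const key of keys) {
    const next = [];
    for (const combo of combos) {
      for (const value of variations[key]) {
        next.push(Object.assign({}, combo, { [key]: value }));
      }
    }
    combos = next;
  }

  const scenarios = {};
  combos.forEach((combo, i) => {
    const name = keys.map((k) => k + '_' + combo[k]).join('__');
    const env = {};
    for (const key of keys) {
      // __ENV values are always strings inside the test function
      env[key] = String(combo[key]);
    }
    scenarios[name] = {
      executor: 'constant-vus',
      vus: 1,
      duration: durationSeconds + 's',
      startTime: i * durationSeconds + 's',
      env: env,
    };
  });
  return scenarios;
}

Since K6 tags every sample with the scenario name, the permutations can then be filtered and compared against each other in Grafana.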

Grafana dashboard

To easily get the gist of the results, the setup includes a pre-configured Grafana dashboard. It's a simple one, but it's easy to extend and modify to your needs. Out of the box, you can see tokens per second (on a per-request basis), completion and prompt token stats, as well as metrics related to concurrency and performance at the HTTP level.

Installation

The setup is part of a larger project, but you can use it fully standalone. You can find the guide on GitHub.
