Maxim Saplin

SGLang vs llama.cpp - A Quick Speed Test

Recently, I stumbled upon a post about SGLang, an open-source LLM inference engine that boasts 2-5x higher throughput compared to other solutions and a 1.76x speedup for DeepSeek R1 models!

"I'd be super happy even with a modest 1.5x speed-up over my LM Studio/llama.cpp setup!" was my first reaction...

A Closer Look

Just like llama.cpp, SGLang turned out to be a pretty low-level thing... I typically use LM Studio (and have spent some time with Ollama) for running models locally. They are very convenient and require minimal setup: in just minutes and a few clicks you can discover, download, and run models. Both provide an easy way to chat and can run local OpenAI Chat Completions endpoints, which is handy for integrating with various tools (e.g. using a local Web UI or experimenting with AI agents).
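
As a point of reference, this is roughly what talking to such a local endpoint looks like. A minimal sketch, assuming LM Studio's default server port of 1234 and the official openai Python package; the model name and prompt are placeholders:

```python
# Minimal sketch: chat with a local OpenAI-compatible server.
# LM Studio's local server defaults to http://localhost:1234/v1;
# swap the base_url for Ollama, SGLang, or any other compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is currently loaded
    messages=[{"role": "user", "content": "Explain GGUF in one paragraph."}],
)
print(response.choices[0].message.content)
```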

SGLang is different: it was not created for LLM enthusiasts to run models on their home rigs. I started my research by looking for Ollama/Jan-like solutions, ideally with a GUI (e.g. LM Studio), that could integrate SGLang as a runtime, but I didn't find any.

Hence I spent a couple of hours configuring WSL2 and installing SGLang before I got my first generated tokens:

  • I didn't find an explicit mention of supported platforms; it seems to be Linux-only. I used WSL (Ubuntu 24) on Windows.
  • There's no chat UI (not even through the CLI), only an OpenAI-compatible inference server.
  • It supports downloading .safetensors models from Hugging Face (though you need to configure huggingface-cli and log in first to get gated models like Llama or Gemma).
  • Besides the HF model format, it has limited support for GGUF, i.e. models you might have already downloaded and used with llama.cpp. For me, Llama 3.1 8B loaded fine, while Gemma 2 9B failed to load (at Q8 and Q4).
  • It supports online quantization when loading a model.
  • I tested it via a custom Web UI that I had to run separately; it has a tokens-per-second counter (a rough way to measure this yourself is sketched right after this list).
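
Once the server is up (I started it with the documented python -m sglang.launch_server --model-path ... --port 30000 entry point), it exposes the same kind of OpenAI-compatible endpoint, so a ballpark speed check can be scripted against either runtime. A minimal sketch, assuming SGLang's default port 30000 (use 1234 for LM Studio) and approximating the token count by the number of streamed chunks:

```python
# Rough tokens-per-second estimate against an OpenAI-compatible endpoint.
# Port 30000 is SGLang's default; use 1234 for LM Studio's local server.
# Counting streamed chunks only approximates the true token count,
# but it is good enough for a ballpark comparison between runtimes.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

start = time.time()
received = 0
stream = client.chat.completions.create(
    model="google/gemma-2-9b-it",  # whichever model the server has loaded
    messages=[{"role": "user", "content": "Write a 300-word story about a robot."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        received += 1

elapsed = time.time() - start
print(f"~{received / elapsed:.1f} tok/s ({received} chunks in {elapsed:.1f} s)")
```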

If you want to try SGLang yourself, I've compiled the notes I took while setting it up and benchmarking here.

Results

I tested inference speed with Gemma 2 9B. SGLang used the model from Google's Hugging Face hub in Safetensors format; LM Studio used a GGUF model from the LM Studio hub on Hugging Face. Both models were tested in 16-bit and 8-bit variants, both on the CUDA backend with an RTX 4090 and 100% GPU offload.

Runtime     Quantization   VRAM      Load Time   Speed
SGLang      fp8            21.1 GB   4-5 min     ~70 tok/s
LM Studio   Q8             12.6 GB   ~10 sec     ~65 tok/s
SGLang      bf16           20.7 GB   4-5 min     ~47 tok/s
LM Studio   f16            20.7 GB   ~20 sec     ~44 tok/s

SGLang's generation speed is roughly 7% higher in tokens per second. Yet SGLang is very slow at loading models, taking minutes compared to seconds with llama.cpp. There is also some odd behavior in VRAM consumption: when loading the model in fp8 (online quantization), SGLang's memory use actually went up compared to bf16, so loading larger models might be a challenge.

Sticking to llama.cpp

IMO, for local LLM tinkering the marginal difference in generation speed is not worth the hassle: painful installation, troubled model discovery and downloading, longer load times, and odd VRAM consumption. SGLang might, however, be a good option for multi-user production environments serving multiple requests at a time.
