This is the first part of my investigation of local LLM inference speed. Here are the second and third parts.
May 12 Update
I've put together a table with all the results from the comments. At the top are my own measurements, where I had control over the environment and have more confidence in consistency (using the right model, similar-sized prompts, consistent settings, etc.).
Spec | Result |
---|---|
Apple M1 Pro CPU | 14.8 tok/s |
Apple M1 Pro GPU | 19.4 tok/s |
AMD Ryzen 7 7840U CPU | 7.3 tok/s |
AMD Radeon 780M iGPU | 5.0 tok/s |
AMD Ryzen 5 7535HS CPU | 7.4 tok/s |
GeForce RTX 4060 Mobile OC GPU | 37.9 tok/s |
GeForce RTX 4060 Mobile OC FA GPU | 39.7 tok/s |
GeForce RTX 4090 OC (+180 Core, +1500 Mem) GPU | 108.5 tok/s |
GeForce RTX 4090 OC FA (+180 Core, +1500 Mem) GPU | 119.1 tok/s |
--- Contributed by commenters --- | --- |
M3 Pro 12-core CPU 18GB CPU | 17.9 tok/s |
M3 Pro 12-core CPU 18GB GPU | 21.1 tok/s |
iPad Pro M1 256GB, using LLM Farm | 12.1 tok/s |
AMD Ryzen 7 7800x3d CPU | 9.7 tok/s |
Intel i7 14700k CPU | 9.8 tok/s |
ROG Ally Ryzen Z1 Extreme, 25W, CPU | 5.3 tok/s |
ROG Ally Ryzen Z1 Extreme, 15W, CPU | 5.05 tok/s |
GeForce RTX 4080 OC GPU | 78.1 tok/s |
Zotac Trinity non-OC 4080 Super GPU | 71.6 tok/s |
RTX 4070 TI Super GPU | 62 tok/s |
RTX 4070 Super GPU | 58.2 tok/s |
AMD 7900 XTX GPU | 70.1 tok/s |
AMD RX 6800XT 16GB GPU | 52.9 tok/s |
Razer Blade 2021, RTX 3070 TI GPU | 41.8 tok/s |
Razer Blade 2021, Ryzen 5900HX CPU | 7.0 tok/s |
A small observation: while overclocking the RTX 4060 and 4090, I noticed that LM Studio/llama.cpp barely benefits from higher core clocks yet clearly gains from higher memory frequency.
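That observation fits the usual rule of thumb for single-stream decoding: every generated token streams the full set of weights from memory, so memory bandwidth, not compute, sets the ceiling. Here is a rough sketch of the implied upper bound, using approximate spec-sheet bandwidths (assumed figures, not measurements) and the 5.94GB model file from this post:

```python
# Rough upper bound for decode speed: each generated token reads
# all model weights from memory once, so tok/s <= bandwidth / model size.
MODEL_GB = 5.94  # the quantized Mistral 7B file used in this post

def max_tok_s(bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / MODEL_GB

# Approximate spec-sheet memory bandwidths (assumptions, not measured):
print(round(max_tok_s(1008), 1))  # RTX 4090, ~1008 GB/s -> 169.7
print(round(max_tok_s(272), 1))   # RTX 4060 Mobile, ~272 GB/s -> 45.8
print(round(max_tok_s(200), 1))   # Apple M1 Pro, ~200 GB/s -> 33.7
```

The measured numbers above (119.1, 39.7, and 19.4 tok/s) all land below these ceilings, which is consistent with decoding being memory-bound.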
Updated on March 14, more configs tested
Today, tools like LM Studio make it easy to find, download, and run large language models on consumer-grade hardware. A typical quantized 7B model (a model with 7 billion parameters, each squeezed into 8 bits or fewer) requires 4-7GB of RAM/VRAM, which is something an average laptop has.
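That 4-7GB range falls out of simple arithmetic: parameters times bits per weight. The sketch below ignores the extra room needed for the KV cache, embeddings, and runtime overhead, so real files run a bit larger:

```python
# Approximate file size of a quantized model:
# parameters (billions) x bits per weight / 8 bits per byte.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

print(model_size_gb(7, 8))  # 8-bit quantization: 7.0 GB
print(model_size_gb(7, 4))  # 4-bit quantization: 3.5 GB
```

Mixed-precision quant formats sit between these points, which is why the 7B file used in this post weighs in at 5.94GB.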
LM Studio lets you pick whether to run the model on the CPU with RAM or on the GPU with VRAM. It also shows the tok/s metric at the bottom of the chat dialog.
I used this 5.94GB version of fine-tuned Mistral 7B and did a quick test of both options (CPU vs GPU); here are the results. I have also added a few cases with Flash Attention (FA) enabled (added in recent versions of LM Studio under the "Model initialisation" category).
Tokens/second
Spec | Result |
---|---|
Apple M1 Pro CPU | 14.8 tok/s |
Apple M1 Pro GPU | 19.4 tok/s |
AMD Ryzen 7 7840U CPU | 7.3 tok/s |
AMD Radeon 780M iGPU | 5.0 tok/s |
AMD Ryzen 5 7535HS CPU | 7.4 tok/s |
GeForce RTX 4060 Mobile OC GPU | 37.9 tok/s |
AMD Ryzen 7 7800x3d CPU | 9.7 tok/s |
GeForce RTX 4080 OC GPU | 78.1 tok/s |
Hardware Specs
2021 M1 MacBook Pro, 10-core CPU (8 performance and 2 efficiency), 16-core iGPU, 16GB of RAM
2023 AOKZEO A1 Pro gaming handheld, AMD Ryzen 7 7840U CPU (8 cores, 16 threads), 32 GB LPDDR5X RAM, Radeon 780M iGPU (using system RAM as VRAM), TDP at 30W
- 3D Mark TimeSpy GPU Score 3000
- 3D Mark TimeSpy CPU Score 7300
2023 MSI Bravo C7VF-039XRU laptop, AMD Ryzen 5 7535HS CPU (6 cores, 12 threads, 54W), 16GB DDR RAM, GeForce RTX 4060 (8GB VRAM, 105W)
- GPU was slightly undervolted/overclocked, 3D Mark TimeSpy GPU Score 11300
- 3D Mark TimeSpy CPU Score 7600
Desktop PC, AMD Ryzen 7 7800x3d (8 cores, 16 threads, 78W during the test), 6200 DDR5, GeForce RTX 4080 16GB VRAM (slightly overclocked, 228W during the test)
Screenshots
Mac
AOKZOE
MSI
Desktop PC
P.S.
It just hit me that while an average person types 30-40 words per minute, the RTX 4060 at 38 tokens/second (roughly 30 words per second) achieves about 1800 WPM.
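The conversion is simple arithmetic; the ~0.75 words-per-token ratio for English used below is a common rule of thumb and an assumption here, not a figure from the post:

```python
# Convert a generation speed in tokens/second to words per minute.
tok_s = 38              # measured RTX 4060 Mobile speed
words_per_token = 0.75  # rough English average (assumption)

wpm = tok_s * words_per_token * 60
print(wpm)  # 1710.0, i.e. roughly 1800 WPM
```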
P.P.S.
Thanks to Sergey Zinchenko for adding the 4th config (7800x3d + GeForce RTX 4080).
Top comments (40)
This depends a lot on the settings. I tried the same model and example query, "tell me about Mars". I have a Ryzen 9 PRO 3900 CPU (12 cores, 24 threads; I got it for less than half the price of a 3900X) and an AMD RX 6700 (non-XT), which I also got cheap. RAM is pretty cheap as well, so 128GB is within reach for most. Using koboldcpp-rocm: 14 layers on the GPU with 14 CPU threads gave 6 tokens per second; (28, 14) gave 15 T/s; (30, 24) gave 4.43 T/s. Finally, 35 layers and 24 CPU threads consumed 7.3GB of VRAM in total, giving 34.61 T/s.
I'm writing to show that results depend very much on the settings.
JIC, I tested the pure cases: 100% CPU and 100% offloading to GPU.
How did you get to use 100% of the CPU? Which config or settings did you use?
You can offload all layers to the GPU (CUDA, ROCm) or use a CPU implementation (e.g. HIPS). Just run LM Studio for your first steps. Try koboldcpp or koboldcpp-ROCm second. Then try Python and transformers. From there you should know enough about the basics to choose your direction. And remember that offloading everything to the GPU still consumes CPU.
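The layer/thread sweep described in this thread can be scripted rather than clicked through; here is a minimal sketch using llama.cpp's `llama-bench` tool, where the binary location and model filename are assumptions:

```shell
# Hypothetical sweep over GPU layer counts with a fixed thread count;
# llama-bench prints tokens/second for each configuration.
for ngl in 0 14 28 35; do
  ./llama-bench -m ./mistral-7b-instruct.Q5_K_M.gguf -ngl "$ngl" -t 24
done
```

`-ngl` controls how many layers are offloaded to the GPU and `-t` the CPU thread count, mirroring the (layers, threads) pairs the commenter tested by hand.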
This is the peak when using full ROCm (GPU) offloading. See the CPU usage on the left: the initial CPU load is from starting the tools, and the LLM ran during the peak at the end (there is GPU usage, but the CPU is used too).
And this is Windows; ROCm is still very limited on other operating systems :/
Just for fun, here are some additional results:
iPad Pro M1 256GB, using LLM Farm to load the model: 12.05 tok/s
Asus ROG Ally Z1 Extreme (CPU): 5.25 tok/s using the 25W preset, 5.05 tok/s using the 15W preset
Update:
Asked a friend with an M3 Pro (12-core CPU, 18GB). Running on CPU: 17.93 tok/s; GPU: 21.1 tok/s.
The CPU result for the ROG is close to the one from the 7840U; after all, they are almost identical CPUs.
The ROG Ally has a Ryzen Z1 Extreme, which appears to be nearly identical to the 7840U, but from what I can discern, the NPU is disabled. So if/when LM Studio gets around to implementing support for that AI accelerator, the 7840U should be faster at inference workloads.
AMD GPUs seem to be the underdog in the ML world compared to Nvidia... I doubt that AMD's NPU will see better ML-stack compatibility than its GPUs.
If you let me know what settings / template you used for this test, I'll run a similar test on my M4 iPad with 16GB Ram. I get wildly different tok/s depending on which LLM and which template I'm using now.
As of right now, with the fine-tuned LLM and the "TinyLLaMa 1B" template being used I get the following:
M4 iPad with 16GB RAM / 2TB storage: 15.52 tok/s
I came across your benchmark. It's very useful. Here is a result from my machine:
Ryzen 5 7600 128GB + MSI RX 7900 XTX 70.1 tok/s
The total system power draw was 478 watts; idle, 95 watts.
using Mistral Orca Dpo V2 Instruct v0.2 Slerp 7B Q6_K
Best,
PS I've been thinking of getting the M4 Pro 96GB when it's available, just to run 70B models.
This benchmark shows a difference.
twitter.com/ronaldmannak/status/17...
Intel i7 14700k - 9.82 tok/s with no GPU offloading (peaked at 35% CPU usage in LM Studio; guessing an issue with multithreading)
Zotac Trinity non-OC 4080 Super - 71.61 tok/s with max GPU offloading
All numbers measured on a non-overclocked, factory-default setup.
Thanks for sharing the numbers!
Indeed there’s something odd with the multithreading of the CPUs
Adding some info here:
Running on a Razer Blade 2021 with a Ryzen 5900HX, a GeForce RTX 3070 Ti and 16GB RAM, I got 41.75 tok/s. I ran the same test as you, asking about Mars with the same model.
Hope that adds information to this very interesting topic.
Thanks for the contribution! I assume you used 100% GPU off-loading, right? Just checking :)
Indeed, 100% GPU off-loading.
I also tested a Ryzen 7950X with 0% offloading, but there's something odd: I set 32 threads, yet CPU use doesn't go beyond 60% and I only get 7 tok/s. Any thoughts on the possible cause?
Just for fun, I’ll check with an Asus ROG Ally later (Z1 Extreme version).
Seems the threads param is ignored; I saw the same behaviour when testing CPU inference.
Just a quick update: an RTX 4070 Super gets 58.2 tok/s.
And an RTX 4070 Ti Super gets 62 tok/s.
Is that a desktop card?
On my rtx 3050 the speed was 28.6 tok/s.
Based on the comments above, I made a table.

Spec | Result |
---|---|
RTX 3050 8GB | 28.6 tok/s |
RTX 3070 Ti 8GB | 41.75 tok/s |
RTX 4060 8GB | 37.9 tok/s |
RTX 4070 Super 12GB | 58.2 tok/s |
RTX 4080 16GB | 78.1 tok/s |
Are all those video cards desktop ones?
Thank you for testing! Helped me a lot! AMD RX 7900 XTX is doing good..!
Anybody with an AMD W7900?
78.51 tok/s with an AMD 7900 XTX on the ROCm-supported version of LM Studio with Llama 3, 33 GPU layers (all while sharing the card with screen rendering).
Thanks so much for keeping this post up to date 🙏