This is the 3rd part of my investigation into local LLM inference speed. Here are the 1st and 2nd parts.
The speed of LLM inference is memory-bound. But what exactly does this mean? Is there a difference between standard JEDEC 4800MT/s and faster 6000MT/s XMP DDR5 sticks? Let's find out.
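Here's a quick bit of intuition for "memory-bound" (my own back-of-envelope sketch, not a rigorous model): with a dense GGUF model, generating each token requires streaming essentially all of the model's weights from RAM, so tokens per second is roughly capped at memory bandwidth divided by model size. A minimal Python sketch, using numbers from the measurements later in this post:

```python
# Rough ceiling on CPU token generation speed if inference is purely
# memory-bound: every new token reads ~all model weights from RAM once.
def tps_upper_bound(read_bandwidth_gb_s: float, model_size_gb: float) -> float:
    return read_bandwidth_gb_s / model_size_gb

# Mistral 7B Q6_K is ~5.94 GB; AIDA read bandwidth taken from the tables below.
print(tps_upper_bound(87.3, 5.94))  # DDR5-6000: ~14.7 t/s ceiling vs ~11.3 t/s measured
print(tps_upper_bound(71.0, 5.94))  # DDR5-4800: ~12.0 t/s ceiling vs ~9.7 t/s measured
```

The measured TPS sits below that ceiling, as expected for a real workload, but it moves in the same direction as bandwidth.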
Test Environment
Component | Details |
---|---|
OS | Windows 11 23H2 (22631.4371) |
LLM Inference | LM Studio 0.3.4 (Build 3); 12 threads were used for the 100% CPU off-load tests, Flash Attention was enabled for the 100% GPU off-load tests |
CPU | Intel Core i5-13600KF, overclocked (performance-core multipliers 57x, 56x, 54x, 53x and 2 cores at 54x vs. stock 51x) |
RAM | DDR5 G.Skill 6000MT/s 36-36-36-96, 2x32GB and 2x16GB* |
Motherboard | Z790 PG Lightning |
GPU | RTX 4090 24GB VRAM, overclocked (+1440MHz memory frequency, +150MHz core) and power limited to 84% (~390W) |

*I also ran a few tests with the 2x16GB and 2x32GB kits installed together (96GB total). Due to CPU/motherboard limitations, XMP frequencies could not be reached with all 4 slots occupied; the maximum stable frequency was 4800MT/s at 29-30-30-76 timings. Most of the tests used the 2x32GB config.
Models
- Mistral 7B: 6-bit Q6_K, 5.94GB, mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF, used with 32K context
- Llama 3.1 8B: 16-bit, 16.07GB, meta-llama-3.1-8b-instruct.f16.gguf, used with 32K context (instead of the supported 128K) to avoid VRAM overflows when measuring the GPU for comparison
Results
Bumping DDR5 speed from 4800MT/s to 6000MT/s brought a +20.3% and +23.0% generation speedup (Mistral and Llama respectively).
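As a quick sanity check (my own arithmetic, using the rounded TPS averages from the tables below and taking the 4-stick 4800MT/s run as the baseline; small deviations come from rounding of the displayed averages):

```python
# Speedup from 4800 MT/s (4 sticks) to 6000 MT/s (2 sticks), rounded table values
print(f"Mistral 7B:   {11.34 / 9.42 - 1:+.1%}")  # ~ +20%
print(f"Llama 3.1 8B: {4.74 / 3.86 - 1:+.1%}")   # ~ +23%
```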
Mistral 7B
DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
---|---|---|---|---|---|---|---|
4800 (4 sticks, 96GB) | 0.89 | 0.11 | 9.42 | 69019.00 | 68482.00 | 69815.67 | 76.93 |
4800 (2 sticks, 64GB) | 0.88 | 0.11 | 9.66 | 71032.67 | 71582.67 | 72058.00 | 77.70 |
6000 (2 sticks, 64GB) | 0.66 | 0.09 | 11.34 | 87342.67 | 85591.00 | 85535.33 | 70.43 |
6200 (2 sticks, 64GB) | 0.84 | 0.09 | 11.93 | 90268.00 | 88714.00 | 88178.67 | 68.57 |
Correl (vs TPS) | | | | 0.99600 | 0.99640 | 0.99644 | -0.98861 |
R^2 | | | | 0.99202 | 0.99282 | 0.99290 | 0.97734 |
Llama 3.1
DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
---|---|---|---|---|---|---|---|
4800 (4 sticks, 96GB) | 2.46 | 0.30 | 3.86 | 69019.00 | 68482.00 | 69815.67 | 76.93 |
4800 (2 sticks, 64GB) | 2.38 | 0.26 | 4.00 | 71032.67 | 71582.67 | 72058.00 | 77.70 |
6000 (2 sticks, 64GB) | 2.78 | 0.22 | 4.74 | 87342.67 | 85591.00 | 85535.33 | 70.43 |
6200 (2 sticks, 64GB) | 2.73 | 0.21 | 4.87 | 90268.00 | 88714.00 | 88178.67 | 68.57 |
Correl (vs TPS) | | | | 0.99924 | 0.99969 | 0.99983 | -0.98161 |
R^2 | | | | 0.99849 | 0.99939 | 0.99966 | 0.96356 |
- Faster DDR5 means faster generation speed
- There's a STRONG linear correlation between tokens per second and AIDA-reported memory speeds (in my case read, write, and copy speeds were themselves highly correlated with each other, so the data can't say which particular metric matters most); see the sketch below
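For anyone who wants to reproduce the Correl / R^2 rows, here's a minimal sketch of how such numbers can be computed (my own illustration with numpy; the exact spreadsheet formulas behind the tables may differ slightly). The arrays are the Mistral 7B TPS and AIDA read bandwidth values from the table above:

```python
import numpy as np

# Mistral 7B rows: tokens/s and AIDA read bandwidth (MB/s), same order as the table.
tps  = np.array([9.42, 9.66, 11.34, 11.93])
read = np.array([69019.00, 71032.67, 87342.67, 90268.00])

r = np.corrcoef(tps, read)[0, 1]  # Pearson correlation coefficient
print(f"Correl = {r:.5f}, R^2 = {r * r:.5f}")  # ~0.996 / ~0.992, matches the table within rounding
```

Running the same calculation against the latency column gives a strong negative correlation, which is what you'd expect: lower latency, higher TPS.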
Do Cores/Threads Matter?
Not that much. You might be better off with fewer/slower cores yet faster memory:
Threads | TPS | vs 6 threads |
---|---|---|
1 | 3.18 | |
2 | 5.46 | |
3 | 7.70 | 73.0% |
4 | 9.42 | |
5 | 10.3 | |
6 | 10.55 | |
8 | 10.83 | |
10 | 11.04 | |
12 | 11.35 | 107.58% |
3 cores/threads delivered 73% of the 6-core/thread result. 12 threads (the extra ones relying on hyper-threading rather than on more physical cores) brought an additional ~7.6% boost over the 6-thread baseline.
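To make the diminishing returns explicit, here's a tiny sketch (again my own illustration, using the TPS values from the table above) comparing each thread count to ideal linear scaling from the single-thread result:

```python
# TPS from the thread-count table above (Mistral 7B); an efficiency of 1.0
# would mean perfect linear scaling from the 1-thread result.
tps_by_threads = {1: 3.18, 2: 5.46, 3: 7.70, 4: 9.42, 6: 10.55, 12: 11.35}

base = tps_by_threads[1]
for n, tps in tps_by_threads.items():
    print(f"{n:>2} threads: {tps:5.2f} t/s, scaling efficiency {tps / (base * n):.0%}")
# 1 -> 100%, 2 -> 86%, 3 -> 81%, 4 -> 74%, 6 -> 55%, 12 -> 30%:
# the memory bus saturates long before all the cores have useful work to do.
```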
CPU vs GPU
For reference, here's a comparison of the 6200MT/s CPU results to the RTX 4090 GPU:
Model | CPU TPS | GPU TPS |
---|---|---|
Mistral 7B | 11.93 | 112.23 |
Llama 3.1 8B | 4.87 | 55.46 |
Approach and Notes
- After changing the memory config, I ran the AIDA memory tests 3 times and averaged the results for the final tables
- For each model I used the same dialog every time, regenerating the last message ("Tell me about Mars")
- Recorded 4 results for each model and averaged them
- TTFT Cold - time to first token during the first generation right after the model was loaded
- TTFT Warm - time to the first token in subsequent generations
- I actually did only 2 measurements of Llama 3.1 at 6200 and got exhausted waiting for the results; anyway, they barely fluctuated
- The 4-stick configuration is slower than the 2-stick configuration even with the same speed and timings. Additionally, on consumer hardware you are unlikely to get any speeds above 4800MT/s with 4 sticks due to motherboard and CPU memory-controller limitations. Always try to use just 2 slots.
- 6200MT/s was an unstable overclock; it failed the OCCT stress test
Top comments (5)
Well, there are just three steps remaining to be done:
Jan.AI supports TensorRT; any benchmarks putting it up against llama.cpp?
jan.ai/post/benchmarking-nvidia-te...
The benchmarks look impressive. I installed Jan and the TensorRT dependencies and discovered that there's only a handful of prebuilt models available in Jan; there are also no Llama 3 models in there, which was a surprise. I also tried to figure out how to get Nvidia's Nemotron 70B running (supposedly it should be easy) and gave up after 10 minutes of research. With that kind of model support it's no competitor to the llama.cpp ecosystem :(
To be honest, I haven't had a chance to play with it myself yet - I just lost access to a good rig with a powerful Nvidia GPU :( I've read that models can somehow be "compiled" or processed to be used with TensorRT. Of course the list of supported models is not that wide compared to GGUF, but imho it needs to be tested in real life, and the speed bump may be worth it.
Will rent a machine with a GPU to test it out myself.