Maxim Saplin

DDR5 Speed, CPU and LLM Inference

This is the 3rd part of my investigation of local LLM inference speed. Here are the 1st and 2nd parts.

The speed of LLM inference is memory-bound. But what exactly does this mean? Is there a difference between standard JEDEC 4800MT/s and faster 6000MT/s XMP DDR5 sticks? Let's find out.
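"Memory-bound" means that generating each token requires streaming essentially all of the model weights from RAM, so tokens per second is capped at roughly memory bandwidth divided by model size. A minimal back-of-the-envelope sketch (the bandwidth figures are rounded from the AIDA results below, the 5.94GB size is the Q6_K Mistral file used later; this is an upper bound, not a prediction):

```python
# Rough memory-bound ceiling: each generated token reads ~all weights from RAM once.
def max_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if generation is limited purely by RAM bandwidth."""
    return bandwidth_gb_s / model_size_gb

# ~70 GB/s for dual-channel DDR5-4800 vs ~87 GB/s for DDR5-6000 (AIDA-style read speeds)
for bw in (70.0, 87.0):
    print(f"{bw:.0f} GB/s -> at most ~{max_tps(bw, 5.94):.1f} tok/s for a 5.94GB Q6_K model")
```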

Test Environment

  • OS: Windows 11 23H2 (22631.4371)
  • LLM inference: LM Studio 0.3.4 (Build 3); 12 threads were used when testing 100% CPU offload, and Flash Attention was enabled when testing 100% GPU offload
  • CPU: Intel Core i5 13600KF, overclocked (performance core multipliers 57x, 56x, 54x, 53x and 2 cores at 54x vs stock multipliers of 51x)
  • RAM: DDR5 G.Skill 6000MT/s 36-36-36-96, 2x32GB and 2x16GB*
  • Motherboard: Z790 PG Lightning
  • GPU: RTX 4090 24GB VRAM, overclocked (+1440MHz memory frequency, +150MHz core) and power-limited to 84% (~390W)

*I also made a few tests with 2x16GB + 2x32GB for a total of 96GB - due to CPU/motherboard limitations, XMP frequencies could not be reached with all 4 slots occupied. The maximum stable frequency was 4800MT/s with 29-30-30-76 timings. Most of the tests used the 2x32GB config.

Models

  • Mistral 7B: 6-bit Q6_K, 5.94GB, mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF, used with 32K context
  • Llama 3.1 8B: 16-bit, 16.07GB, meta-llama-3.1-8b-instruct.f16.gguf, used with 32K context (instead of the supported 128K) to avoid VRAM overflows when measuring the GPU for comparison

Results

Bumping DDR5 speed from 4800MT/s to 6000MT/s brought a +20.3% and +23.0% generation speedup (Mistral and Llama respectively).

Mistral 7B

| DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4800 (4 sticks, 96GB) | 0.89 | 0.11 | 9.42 | 69019.00 | 68482.00 | 69815.67 | 76.93 |
| 4800 (2 sticks, 64GB) | 0.88 | 0.11 | 9.66 | 71032.67 | 71582.67 | 72058.00 | 77.70 |
| 6000 (2 sticks, 64GB) | 0.66 | 0.09 | 11.34 | 87342.67 | 85591.00 | 85535.33 | 70.43 |
| 6200 (2 sticks, 64GB) | 0.84 | 0.09 | 11.93 | 90268.00 | 88714.00 | 88178.67 | 68.57 |
| Correl (vs TPS) | | | | 0.99600 | 0.99640 | 0.99644 | -0.98861 |
| R^2 (vs TPS) | | | | 0.99202 | 0.99282 | 0.99290 | 0.97734 |

Llama 3.1

| DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4800 (4 sticks, 96GB) | 2.46 | 0.30 | 3.86 | 69019.00 | 68482.00 | 69815.67 | 76.93 |
| 4800 (2 sticks, 64GB) | 2.38 | 0.26 | 4.00 | 71032.67 | 71582.67 | 72058.00 | 77.70 |
| 6000 (2 sticks, 64GB) | 2.78 | 0.22 | 4.74 | 87342.67 | 85591.00 | 85535.33 | 70.43 |
| 6200 (2 sticks, 64GB) | 2.73 | 0.21 | 4.87 | 90268.00 | 88714.00 | 88178.67 | 68.57 |
| Correl (vs TPS) | | | | 0.99924 | 0.99969 | 0.99983 | -0.98161 |
| R^2 (vs TPS) | | | | 0.99849 | 0.99939 | 0.99966 | 0.96356 |

  • Faster DDR5 means faster generation speed
  • There's a STRONG linear correlation between tokens per second and AIDA-reported memory speeds (in my case read, write, and copy speeds were themselves correlated, so the data can't say whether any particular metric matters more - see the sketch below)
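For reference, here is roughly how the Correl and R^2 rows can be reproduced from the table data (a minimal sketch; the arrays are copied from the Mistral 7B table above):

```python
import numpy as np

# Mistral 7B data from the table above
tps  = np.array([9.42, 9.66, 11.34, 11.93])
read = np.array([69019.00, 71032.67, 87342.67, 90268.00])  # AIDA READ, MB/s
lat  = np.array([76.93, 77.70, 70.43, 68.57])               # latency, ns

for name, x in (("READ", read), ("Latency", lat)):
    r = np.corrcoef(tps, x)[0, 1]  # Pearson correlation with TPS
    print(f"{name}: correl={r:.5f}, R^2={r * r:.5f}")
```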

AIDA Memory Tests

Do Cores/Threads Matter?

Not that much. You might be better off with fewer/slower cores yet faster memory:

| Threads | TPS | vs 6-thread baseline |
| --- | --- | --- |
| 1 | 3.18 | |
| 2 | 5.46 | |
| 3 | 7.70 | 73.0% |
| 4 | 9.42 | |
| 5 | 10.3 | |
| 6 | 10.55 (baseline) | |
| 8 | 10.83 | |
| 10 | 11.04 | |
| 12 | 11.35 | 107.58% |

3 cores/threads delivered 73% of the 6-thread result. 12 threads (which relied on hyper-threading rather than more physical cores) brought an additional 7.6% boost over the 6-thread baseline.
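LM Studio exposes the thread count in its settings; if you want to sweep thread counts programmatically, here is a rough sketch using llama-cpp-python (not the setup used for the numbers above - the model path, context size, and prompt are placeholders):

```python
# Hypothetical thread-count sweep with llama-cpp-python (not the LM Studio setup
# used for the article); adjust the model path and parameters to your system.
import time
from llama_cpp import Llama

PROMPT = "Tell me about Mars"

for n_threads in (1, 2, 3, 4, 6, 8, 12):
    llm = Llama(model_path="mistral-7b-q6_k.gguf", n_ctx=4096,
                n_threads=n_threads, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_threads:>2} threads: {n_tokens / elapsed:.2f} tok/s")
```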

CPU vs GPU

For reference, here's a comparison of the 6200MT/s CPU results to the RTX 4090 GPU:

| Model | CPU TPS | GPU TPS |
| --- | --- | --- |
| Mistral 7B | 11.93 | 112.23 |
| Llama 3.1 8B | 4.87 | 55.46 |
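The roughly 10x gap is again consistent with memory bandwidth: a stock RTX 4090 has about 1008 GB/s of VRAM bandwidth (a bit more with the memory overclock) versus ~90 GB/s of system RAM read speed at 6200MT/s. A quick sanity check (the 1008 GB/s figure is the stock spec, taken as an assumption here):

```python
# Sanity check: is the CPU->GPU speedup in the same ballpark as the bandwidth ratio?
GPU_BW_GB_S = 1008.0  # stock RTX 4090 spec; the overclocked card is somewhat higher
RAM_BW_GB_S = 90.0    # ~AIDA read speed at DDR5-6200 from the tables above

print(f"bandwidth ratio: {GPU_BW_GB_S / RAM_BW_GB_S:.1f}x")    # ~11.2x
print(f"observed Mistral 7B speedup: {112.23 / 11.93:.1f}x")   # ~9.4x
print(f"observed Llama 3.1 8B speedup: {55.46 / 4.87:.1f}x")   # ~11.4x
```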

Approach, Notes

  • After changing the memory config I ran the AIDA memory test 3 times and averaged the results in the final table
  • For each model I used the same dialog every time, regenerating the last message "Tell me about Mars"
  • Recorded 4 results for each model and averaged them (see the measurement sketch after this list)
    • TTFT Cold - time to first token during the first generation right after the model was loaded
    • TTFT Warm - time to first token in subsequent generations
    • For Llama 3.1 at 6200 I actually did only 2 measurements and got exhausted waiting for the results; they barely fluctuated anyway
  • The 4-stick configuration is slower than the 2-stick configuration even with the same speed and timings. Additionally, on consumer hardware you are unlikely to get any speeds above 4800MT/s with 4 sticks due to motherboard and CPU memory controller limitations. Always try to use 2 slots.
  • 6200 was an unstable overclock and failed the OCCT stress test
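The numbers above come straight out of LM Studio's stats panel, but if you want to reproduce a TTFT/TPS measurement yourself, here is a minimal sketch against an OpenAI-compatible streaming endpoint (LM Studio can serve one locally; the port and model id below are assumptions - adjust to your setup). The first call after loading a model gives the "cold" TTFT, subsequent calls give the "warm" one.

```python
# Minimal TTFT / tokens-per-second measurement against an OpenAI-compatible
# streaming endpoint. Port and model id are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def measure(model: str, prompt: str = "Tell me about Mars"):
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            chunks += 1
    end = time.perf_counter()
    # chunk count is a rough proxy for generated tokens (usually one token per chunk)
    return first_token_at - start, chunks / (end - first_token_at)

ttft, tps = measure("mistral-7b-instruct-v0.2")  # hypothetical model id
print(f"TTFT: {ttft:.2f}s, ~{tps:.2f} tok/s")
```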

LM Studio Screenshot

Top comments (5)

kha84

Well, there are just three steps remaining:

  • switch from Windows to Linux
  • abandon proprietary LMStudio in favor of free and open source Jan.AI (or cortex.cpp)
  • give it another try with Tensor-RT as an inference backend to get +50% TPS boost to your 4090 figures
Maxim Saplin

Jan.AI supports TensorRT - are there any benchmarks putting it up against llama.cpp?

Maxim Saplin

Benchmarks look impressive. I installed Jan and the TensorRT dependencies and discovered that there's only a handful of prebuilt models available in Jan; there are also no Llama 3 models in there, which was a surprise. I also tried to figure out how to get Nvidia's Nemotron 70B running (supposedly it should be easy) - failed after 10 minutes of research. With that kind of model support it's no competitor to the llama.cpp ecosystem :(

kha84

To be honest, I haven't had a chance to play with it myself yet - I just lost access to a good rig with a powerful Nvidia GPU :( I've read that models can be somehow "compiled" or processed to be used with Tensor-RT. Of course the list of supported models is not that wide compared to GGUF, but IMHO it needs to be tested in real life, and the speed bump is worth it.
I'll rent a machine with a GPU to test it out myself.