Cerebras has finally scraped together enough CS-3 systems to launch Llama 405B. They applied speculative decoding, the same technique they used to push Llama 70B to around 2k tokens per second, and beat SambaNova by almost 6x. Pricing is $6 per million input tokens and $12 per million output tokens, the API is already available in beta, and all users are promised access in the first quarter of 2025.
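Speculative decoding, for context, is the standard trick of letting a cheap draft model propose a run of tokens that the big model then verifies in a single pass, so the 405B-scale cost is paid once per run of tokens rather than once per token. Cerebras hasn't published their implementation; the sketch below is a generic toy version in pure Python (toy vocabulary, stand-in models, all names hypothetical) showing the usual accept/reject rule, which preserves the large model's output distribution.

```python
import random

VOCAB = list(range(8))  # toy vocabulary

def _toy_dist(context, salt):
    # Deterministic pseudo-random distribution over VOCAB; a stand-in for a real LM.
    rng = random.Random((hash(tuple(context)) ^ salt) & 0xFFFFFFFF)
    w = [rng.random() + 0.1 for _ in VOCAB]
    s = sum(w)
    return [x / s for x in w]

def draft_probs(context):   # cheap "small" model (hypothetical stand-in)
    return _toy_dist(context, 0x1234)

def target_probs(context):  # expensive "large" model (hypothetical stand-in)
    return _toy_dist(context, 0xBEEF)

def sample(probs):
    return random.choices(VOCAB, weights=probs, k=1)[0]

def speculative_step(context, k=4):
    """One round: draft k tokens cheaply, then verify them with the target model.

    Accept a drafted token with probability min(1, p_target/p_draft); on
    rejection, resample from the normalized residual max(0, p - q). This rule
    keeps the output distribution identical to sampling the target directly.
    """
    # 1) Draft model proposes k tokens autoregressively (cheap).
    drafted, ctx = [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q)
        drafted.append((tok, q))
        ctx.append(tok)

    # 2) Target model scores the drafted positions (in a real system: one batched pass).
    emitted, ctx = [], list(context)
    for tok, q in drafted:
        p = target_probs(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)          # accepted: keep the cheap token
            ctx.append(tok)
        else:
            # rejected: resample from the residual distribution, discard the rest
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual) or 1.0
            emitted.append(sample([r / z for r in residual]))
            break
    else:
        # all k accepted: the target pass yields one bonus token for free
        emitted.append(sample(target_probs(ctx)))
    return emitted

print(speculative_step([0, 1, 2]))  # up to k+1 tokens per target-model pass
```

With a well-matched draft model most proposals get accepted, which is how a speed trick proven on 70B carries over to 405B.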
The long wait comes down to extremely tight hardware availability: running Llama 405B takes 20-30 CS-3 systems. For comparison, Condor Galaxy, a supercomputer built on Cerebras chips, has only 64 CS-3s, and it cost more than a hundred million dollars. If they manage to move to mass production, hopefully the cost of these systems will drop significantly; otherwise the profitability of such an API is questionable.
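A quick back-of-envelope on those numbers (taking the midpoint of the 20-30 estimate and the quoted figures at face value):

```python
# Rough capacity math using the figures above (all approximate).
cs3_per_405b = 25          # midpoint of the 20-30 CS-3 estimate
condor_galaxy_cs3 = 64     # total CS-3 systems in Condor Galaxy
system_cost_usd = 100e6    # quoted cost of the whole supercomputer, lower bound

instances = condor_galaxy_cs3 // cs3_per_405b
print(f"405B instances per Condor Galaxy: {instances}")                      # -> 2
print(f"Hardware cost per instance: >${system_cost_usd / instances / 1e6:.0f}M")  # -> >$50M
```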
Cerebras isn't the only one with availability problems. Groq has been promising a 405B API for more than three months, but apparently there just aren't enough chips (running 405B takes about four thousand Groq chips). In the meantime, they have nearly caught up with Cerebras on Llama 70B inference at 1,669 tokens per second, while promising that the next generation of chips will be much faster.
Unfortunately, chat access wasn't opened to all users this time. And the context length is only 8k for now, though they promise 128k at release. Speed does sag at that context length, but it still stays above 500 tokens per second. Hopefully they'll dig up another supercomputer for the full R1 release, and we'll get a model that thinks in seconds instead of minutes.