Boris B.

Originally published at Medium

Why you shouldn't Train your LLM from Scratch

Being the smart and curious person you are, you likely find the prospect of creating an LLM from scratch exciting.

Or at least, you're interested in knowing what it takes to create one from the ground up. That's completely understandable - who wouldn't?
However, you probably already know you can't, but want to know regardless. To be blunt, training an LLM from scratch is impractical for most individuals and organisations.

But knowledge is free, so let's see what it takes to build an LLM from scratch 😊.

Think It's Expensive? You Have No Idea

Let's use GPT-4 as an example since that's the AI model with the most public information on its associated training costs. It took 25,000 Nvidia A100 GPUs running for 90–100 days non-stop to train the model. Considering that each A100 GPU costs around $15K, the total GPU expense alone was about $375M.

To put that into perspective, these costs surpass the annual budgets of many mid-sized companies.

If buying the hardware seems too steep, renting might appear more accessible. However, renting A100 GPUs on cloud platforms like AWS costs about $3 per GPU-hour, which puts the cost of GPT-4's training at roughly $180M - cheaper than buying the training hardware, but not cheap either.

Similarly, Llama 3 was trained on 24,000 Nvidia H100 GPUs, putting the estimated GPU training costs at around $720M. These two examples give a good idea of the main cost driver when it comes to training.
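
For a rough sense of where these numbers come from, here is the back-of-the-envelope arithmetic. The per-GPU purchase price and hourly rental rate are the approximate figures quoted above, not official pricing:

# Back-of-the-envelope GPU cost estimates (rough public figures, not quotes)
A100_PRICE = 15_000      # ~$15K to buy one A100
A100_RENTAL = 3          # ~$3 per GPU-hour to rent one in the cloud
GPT4_GPUS = 25_000
GPT4_DAYS = 100          # upper end of the 90-100 day estimate

buy_cost = GPT4_GPUS * A100_PRICE                       # 25,000 * 15,000
rent_cost = GPT4_GPUS * GPT4_DAYS * 24 * A100_RENTAL    # GPU-hours * hourly rate

print(f"Buying the GPUs:  ${buy_cost / 1e6:.0f}M")      # ~$375M
print(f"Renting the GPUs: ${rent_cost / 1e6:.0f}M")     # ~$180M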

Funnily enough, when people consider the costs of training an LLM, they often focus solely on the GPU expenses above, forgetting other, less talked-about costs like electricity, additional hardware, and personnel.

But enough on the GPUs, let's now talk data.

Data - Feeding the Beast

My simple definition of an AI model, one I've used since before ChatGPT, has always been that a model is an algorithm combined with data. The interesting thing about LLMs is that they take the data piece to a whole new level. We are talking hundreds of gigabytes of text.

Artificial Intelligence Model = Algorithm + Data

Not only do you need large amounts of data to feed this beast, you need diversity as well. The data has to be varied enough to help the model understand language in all its forms.

That means ingesting everything from classic literature and code to the latest blog posts - a nice blend of wonderful Shakespearean writing and a bunch of people like myself on the internet with chaotic writing styles, all in one mix. Plus, all of it has to be of high quality and representative of the world we live in today - languages, cultures, you name it.

Sticking with our GPT-4 example, the model was trained on about 10 trillion words. To give you an idea, it would take all Twitter users over 3.5 years to generate 10 trillion words at current rates.
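
To sanity-check that comparison: if we assume something like 500 million tweets a day at roughly 15 words each (both figures are my own assumptions, not official stats), the arithmetic lands in the same ballpark:

# Rough sanity check of the "3.5 years of Twitter" comparison
TWEETS_PER_DAY = 500_000_000    # assumed tweet volume
WORDS_PER_TWEET = 15            # assumed average tweet length in words

words_per_day = TWEETS_PER_DAY * WORDS_PER_TWEET    # ~7.5 billion words/day
days_needed = 10e12 / words_per_day                 # ~1,333 days
print(f"{days_needed / 365:.1f} years")             # ~3.7 years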

That said, one could argue that more and more free datasets for training LLMs are becoming available (The Pile - 825 GiB, Common Crawl), making this "easier". True, but most of them still require extensive cleaning and formatting. Moreover, handling data at that scale requires robust infrastructure for storage and quick access during training.
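
As a taste of what that cleaning involves, here is a minimal sketch of the kind of filtering and deduplication pass you would run over raw documents before training. The thresholds and the raw_documents iterable are illustrative assumptions, not a production pipeline:

import hashlib

def clean_corpus(raw_documents, min_words=50, max_non_ascii_ratio=0.3):
    """Tiny example of filtering + exact deduplication over raw text documents."""
    seen_hashes = set()
    for text in raw_documents:
        text = text.strip()
        words = text.split()

        # Drop documents too short to be useful training text.
        if len(words) < min_words:
            continue

        # Drop documents that look like binary junk or heavy markup.
        non_ascii = sum(1 for ch in text if ord(ch) > 127)
        if non_ascii / max(len(text), 1) > max_non_ascii_ratio:
            continue

        # Exact deduplication via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)

        yield text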

Not Just Another Neural Network

There's a good reason why only a few people can DIY an LLM. In fact, Mistral AI - a French company - managed to raise an astounding $113 million in seed funding without a product, simply by stating to investors that they had five employees with the expertise to create an LLM from scratch.


Large Language Models use an advanced type of neural network called the Transformer. Transformers are especially good at predicting the next word in a sequence, which is key for generating coherent text. And while bigger isn't always better in most fields, with LLMs it often is. The challenge then becomes figuring out how to distill your large collected dataset into this sophisticated architecture - the process known as training.
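
To make "predicting the next word" concrete, here is a minimal sketch of the training objective: cross-entropy between the model's predictions and the tokens shifted one position to the left. The toy vocabulary size and random logits are placeholders, not real model output:

import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 1000, 2, 16    # toy sizes for illustration
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# In a real model these logits come from the Transformer; here they are random.
logits = torch.randn(batch, seq_len, vocab_size)

# Predict token t+1 from everything up to token t: shift inputs and targets by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

loss = F.cross_entropy(pred, target)        # the next-token prediction loss
print(loss.item())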

Because training can be incredibly time-consuming, optimization becomes a must. This usually involves techniques like distributed training or parallelization to handle computations more efficiently, mixed precision (16- and 32-bit floats) to reduce memory usage, and checkpointing to save your progress over time.
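
Here is a minimal sketch of what mixed precision and checkpointing look like in PyTorch. The model, data and optimizer below are toy stand-ins of my own choosing, not a real training setup:

import torch
import torch.nn as nn

# Toy stand-ins so the loop runs; a real setup would use your own model and data.
model = nn.Linear(128, 10).cuda()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data_loader = [(torch.randn(32, 128).cuda(), torch.randint(0, 10, (32,)).cuda())] * 5

scaler = torch.cuda.amp.GradScaler()    # handles loss scaling for 16-bit training

for step, (inputs, targets) in enumerate(data_loader):
    optimizer.zero_grad()

    # Forward pass in mixed precision: 16-bit where safe, 32-bit elsewhere.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    scaler.scale(loss).backward()       # backward pass on the scaled loss
    scaler.step(optimizer)
    scaler.update()

    # Periodically checkpoint so a crash doesn't wipe out days of progress.
    if step % 1000 == 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
            f"checkpoint_{step}.pt",
        )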

Not to bore you with the details, but my point here is that training an LLM is not for the weak. In fact, at times, a Python implementation of a transformer architecture can look less like Python and more like C.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleDotProductAttention(nn.Module):
    """Minimal standard scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""

    def forward(self, q, k, v, mask=None):
        d_tensor = k.size(-1)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_tensor)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attention = F.softmax(scores, dim=-1)
        return attention @ v, attention


class MultiHeadAttention(nn.Module):
    """Multi-head attention: project q/k/v, split into heads, attend, then re-combine."""

    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_concat = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)
        q, k, v = self.split(q), self.split(k), self.split(v)
        out, attention = self.attention(q, k, v, mask=mask)
        out = self.concat(out)
        out = self.w_concat(out)
        return out

    def split(self, tensor):
        """
        split tensor by number of head

        :param tensor: [batch_size, length, d_model]
        :return: [batch_size, head, length, d_tensor]
        """
        batch_size, length, d_model = tensor.size()

        d_tensor = d_model // self.n_head
        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
        return tensor

    def concat(self, tensor):
        """
        inverse function of self.split(tensor : torch.Tensor)

        :param tensor: [batch_size, head, length, d_tensor]
        :return: [batch_size, length, d_model]
        """
        batch_size, head, length, d_tensor = tensor.size()
        d_model = head * d_tensor

        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
        return tensor
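
For context, here is a quick usage example with made-up dimensions (512-dimensional embeddings, 8 heads) showing the expected tensor shapes:

attention = MultiHeadAttention(d_model=512, n_head=8)
x = torch.randn(4, 128, 512)    # [batch_size, sequence_length, d_model]

out = attention(x, x, x)        # self-attention: q, k and v are the same tensor
print(out.shape)                # torch.Size([4, 128, 512])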

Beyond Training

After training, you need to make sure you have something useful. And, fortunately or unfortunately, the bar for a useful model is quite high these days.

Without proper evaluation, your model might spit out nonsense or even harmful content. The best way to assess it is by benchmarking against existing LLMs. But sometimes, only a human can catch the subtle details, so human evaluation is imperative at this stage.
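
One common automatic check, alongside benchmark suites and human review, is perplexity on a held-out text set. A minimal sketch, assuming model and tokenizer follow the Hugging Face causal-LM interface (my assumption; the post doesn't prescribe a framework):

import math
import torch

def perplexity(model, tokenizer, texts):
    """Average perplexity of a causal language model over held-out texts."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            # For causal LMs, passing labels=input_ids yields the next-token loss.
            out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))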

If you're lucky enough to have a model that performs well, you can move on to post-training techniques like fine-tuning and prompt engineering - methods you're probably more familiar with, and which let you adjust your model based on the evaluation results to improve its performance.
And if you're feeling up for it, why not release it to the world with built-in feedback loops to further refine your model over time?
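
By way of contrast with training from scratch, here is roughly what fine-tuning an existing model looks like. The choice of GPT-2 and the Hugging Face transformers library is mine for illustration, not something this post prescribes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from existing pretrained weights instead of random ones.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A couple of toy domain examples; a real run would use a proper dataset.
examples = [
    "Q: What does the refund policy cover? A: Purchases made in the last 30 days.",
    "Q: How do I reset my password? A: Use the 'Forgot password' link on the login page.",
]

model.train()
for epoch in range(3):
    for text in examples:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()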


Still thinking about creating an LLM from scratch? Go ahead - be my guest!

Personally, I'll be sticking to fine-tuning or prompt engineering my way through existing Large Language Models. And though I'm a Data Scientist, I'll only consider training Machine Learning models for specific use cases where LLMs are too costly or fail altogether.

Thanks for reading!
Like this story? Subscribe below, or connect with me on LinkedIn & Twitter!
