Below are my notes from the course "Generative AI with Large Language Models" offered by deeplearning.ai
Disclaimer: All the images in this post are from deeplearning.ai
Transformers
The transformer architecture was a breakthrough in language modeling. Before transformers, language models (such as RNNs) processed data sequentially and could predict the next word based only on the data processed so far; because there was a limit on how much context they could hold, these models lost the context from older data.
Transformers on the other hand process all elements in a sequence simultaneously. This is made possible by the self-attention mechanism, which enables Transformers to capture long-range dependencies more effectively than RNNs. This is because the network can directly access information from any other element in the sequence, regardless of its position.
Transformer models can learn the relationships between all words:
Transformer models also assign weights to each relationship, learning how words are related to each other. Here is an example of such an "attention map":
In this example the word "book" is strongly related to the words "teacher" and "student".
Generating text
The original transformer model was an encoder-decoder sequence-to-sequence model, used for text translation.
Here is a general overview:
- The input sequence (French) is tokenized (converting words to numbers)
- The tokens are fed into an Embedding layer, where each token is mapped to an embedding vector that represents the meaning of the word.
- Not shown here, but a positional vector, representing the word's position in the sequence, is added to the embedding vector.
- The combined vector (for each word) is fed to the encoder, where the self-attention layers are. There are several attention layers (multi-headed attention), and each attention layer learns its own attention map, focusing on different words of the sentence.
- Text generation happens at the decoder. A special <START> token is fed into the model, and the decoder, using the context understanding passed from the encoder plus its own multi-headed self-attention layers, generates the next token. This token is then fed back into the decoder to generate the following token, and so on until a special <END> token is predicted (see the sketch after this list).
- Note that the decoder output to the softmax layer is actually a vector that holds a probability for every token in the model's vocabulary. There are different strategies to pick the "correct" token from this vector.
- Finally the predicted tokens are converted back to words and we get the result "I love machine learning".
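To make the decoding loop above concrete, here is a minimal sketch (mine, not from the course); `tokenize`, `encoder`, `decoder_step` and `detokenize` are hypothetical placeholders for the real model components, not a specific library API:

```python
# Minimal sketch of the encoder-decoder generation loop described above.
# `tokenize`, `encoder`, `decoder_step` and `detokenize` are hypothetical
# placeholders for the real model components, not a specific library API.
def translate(sentence, tokenize, encoder, decoder_step, detokenize,
              start_id, end_id, max_len=50):
    input_ids = tokenize(sentence)           # words -> token ids
    encoder_context = encoder(input_ids)     # contextual representation of the input

    output_ids = [start_id]                  # generation starts from the <START> token
    for _ in range(max_len):
        # The decoder sees the encoder context plus everything generated so far
        # and returns a probability for every token in the vocabulary.
        probs = decoder_step(encoder_context, output_ids)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        if next_id == end_id:                # stop when <END> is predicted
            break
        output_ids.append(next_id)

    return detokenize(output_ids[1:])        # drop <START>, map ids back to words
```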
Generative configuration
As explained above, there are different strategies to pick the generated token in the softmax layer.
- Greedy - The token with the highest probability is selected. This method is susceptible to repeated words and less natural text.
- Random-weighted sampling - Here the next token is sampled at random, with each token's probability acting as its weight. It's like a biased coin toss, where some outcomes are more likely than others. This way we don't always choose the word with the highest probability, but other (high-probability) words also get a chance to be selected.
We can control the sampling process by specifying top-k and top-p:
- top-k - selects the k most probable tokens from the model's output and samples from that reduced set.
- top-p - selects the smallest set of tokens whose cumulative probability exceeds a threshold p. This method is more flexible than top-k, as it dynamically adjusts the number of tokens based on their probabilities.
In practice, top-p and top-k can be used together or independently. For example, you could use top-k to select the top 50 tokens, and then use top-p to further refine that set based on a cumulative probability threshold.
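Here is a rough sketch (mine, not from the course) of how top-k and top-p could be combined over a vector of token probabilities:

```python
import numpy as np

def top_k_top_p_filter(probs, k=50, p=0.9):
    """Sketch: keep the k most probable tokens, then keep the smallest prefix
    of that set whose cumulative probability exceeds p, renormalize, and
    random-weighted sample from what remains."""
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    order = order[:k]                             # top-k cut
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest set exceeding p
    keep = order[:cutoff]
    filtered = probs[keep] / probs[keep].sum()    # renormalize
    return np.random.choice(keep, p=filtered)     # random-weighted sampling
```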
We can also adjust the probabilities of each token before sampling by using the temperature parameter to scale them.
A value lower than 1 sharpens the distribution, skewing the selection towards the higher-probability tokens and making it less random.
A value higher than 1 "equalizes" the probabilities, giving lower-probability tokens a higher chance of being selected.
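As an illustration (mine, not from the course), temperature is usually applied to the logits before the softmax:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Sketch of temperature scaling: logits are divided by the temperature
    before the softmax, so T < 1 sharpens the distribution and T > 1 flattens it."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                 # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, temperature=0.5))  # more peaked
print(softmax_with_temperature(logits, temperature=2.0))  # more uniform
```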
Generative AI Project Lifecycle
Define use case
The most important step in any project is to define the scope as accurately and narrowly as you can (text generation, classification etc.).
Select
There are plenty of foundation models to choose from; select the model that best fits the project's needs.
Each model usually comes with a "model card" that explains the model's use cases, how it was trained, its biases and risks, etc.
Adapt and align model
First develop an evaluation framework, then start an iterative process in which you try to get the desired outcome by improving the prompt (e.g. multi-shot prompting); if that's not enough, try fine-tuning or RLHF.
Application integration
At this stage, an important step is to optimize your model for deployment. This ensures that you're making the best use of your compute resources and providing the best possible experience for the users of your application.
How LLMs are trained
Model Types
There are 3 types of LLMs: encoder-only, encoder-decoder, and decoder-only.
Encoder Only (autoencoding)
They are pre-trained using masked language modeling. Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence.
Autoencoding models build bi-directional representations of the input sequence, meaning that the model has an understanding of the full context of a token and not just of the words that come before it.
These models are ideally suited to tasks that benefit from this bi-directional context, like sentiment analysis, named entity recognition and word classification.
BERT is a famous encoder only model.
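For example, a masked-language-modeling query can be run with the Hugging Face transformers library (my example, not from the course; the model name is just one common choice):

```python
# Hedged example (not from the course): querying an encoder-only model
# with masked language modeling via Hugging Face `transformers`.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The teacher [MASK] the student."):
    print(candidate["token_str"], round(candidate["score"], 3))
```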
Decoder Only (autoregressive)
Here, the training objective is to predict the next token based on the previous sequence of tokens.
Decoder-based autoregressive models mask the input sequence so that the model can only see the input tokens leading up to the token in question; it has no knowledge of the end of the sentence. The model then iterates over the input sequence one token at a time to predict the following token. In contrast to the encoder architecture, this means that the context is unidirectional. By learning to predict the next token from a vast number of examples, the model builds up a statistical representation of language.
These models are ideal for text generation.
GPT is a decoder only model.
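As a quick illustration (mine, not from the course), decoder-only generation with the sampling parameters discussed earlier looks roughly like this with the Hugging Face transformers library:

```python
# Hedged example (not from the course): autoregressive text generation
# with a decoder-only model, using the sampling knobs from above.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Machine learning is", max_new_tokens=20,
                   do_sample=True, top_k=50, top_p=0.9, temperature=0.8)
print(result[0]["generated_text"])
```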
Encoder-Decoder (sequence-to-sequence)
The exact details of the pre-training objective vary from model to model. A popular sequence-to-sequence model, T5, pre-trains the encoder using span corruption, which masks random sequences of input tokens and replaces them with a special sentinel token (e.g. <X>).
The decoder is then tasked with reconstructing the masked token sequences auto-regressively. The output is the sentinel token followed by the predicted tokens.
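For illustration (the exact masking recipe varies by model, so treat this as a sketch), span corruption turns one sentence into an input/target pair like this:

```python
# Illustrative T5-style span corruption example (sketch, not the exact recipe).
original        = "Thank you for inviting me to your party last week"
corrupted_input = "Thank you <X> me to your party <Y> week"  # spans replaced by sentinel tokens
decoder_target  = "<X> for inviting <Y> last"                # sentinel followed by the hidden tokens
```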
They are generally useful in cases where you have a body of texts as both input and output.
These models are ideally suited for translation, text summarization and question answering.
T5 and BART are famous encoder-decoder models.
Summary
Researchers found that the larger the model, the better its performance, which led to larger and larger models. How large can they grow?
Computational Challenges
In order to train a 1B-parameter model, a GPU with 24GB of RAM is needed (each parameter is a 32-bit floating point number -> 4GB for the model weights alone, and an additional ~20GB is needed for the optimizer states, gradients, etc.).
Now imagine the amount of RAM needed to train a 500B-parameter model!
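Using the notes' rough numbers (4 bytes per FP32 weight plus ~20 extra bytes per parameter for gradients, optimizer states, etc.), the back-of-the-envelope math looks like this:

```python
# Rough memory estimate, following the notes' numbers: 4 bytes per FP32
# weight plus roughly 20 extra bytes per parameter for gradients,
# optimizer states and other training overhead.
def training_memory_gb(num_params, bytes_per_weight=4, overhead_bytes_per_param=20):
    return num_params * (bytes_per_weight + overhead_bytes_per_param) / 1e9

print(training_memory_gb(1e9))    # ~24 GB for a 1B-parameter model
print(training_memory_gb(500e9))  # ~12,000 GB for a 500B-parameter model
```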
Quantization
Instead of using 32-bit floating point (FP32), a 16-bit floating point (FP16) or even an 8-bit integer can be used. This of course comes at the expense of accuracy.
Google developed a new datatype, BFLOAT16 (BF16). It is a 16-bit floating point format that allocates more bits to the exponent at the expense of the fraction. This gives the datatype the same range as FP32, just with lower fraction precision.
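A tiny PyTorch example (mine, not from the course) showing the memory saving from casting to BF16:

```python
# Hedged example: casting weights to BF16 halves their memory footprint while
# keeping FP32's exponent range (1 sign / 8 exponent / 7 fraction bits,
# versus FP32's 1 / 8 / 23).
import torch

weights_fp32 = torch.randn(1000, 1000)            # 4 bytes per element
weights_bf16 = weights_fp32.to(torch.bfloat16)    # 2 bytes per element
print(weights_fp32.element_size(), weights_bf16.element_size())  # 4 2
```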
Distributed Data Parallel (multi-GPU training)
In order to speed up training, multi-GPU training is possible. In this case a data loader sends different chunks of the training data to the available GPUs, each GPU computes a partial gradient update, and then a synchronizer synchronizes all the gradients and updates the model weights.
Important: In this method each GPU must hold the entire model weights + gradients + optimizer in its memory; if the model is too big, this is not possible.
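A minimal sketch (not from the course) of how this looks with PyTorch's DistributedDataParallel, assuming the script is launched with one process per GPU (e.g. via torchrun):

```python
# Hedged sketch: with DDP every process/GPU holds a full copy of the model,
# and gradients are synchronized across processes after each backward pass.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model, local_rank):
    dist.init_process_group(backend="nccl")       # one process per GPU
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])    # wraps a full model replica
```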
Fully Sharded Data Parallel
In this method the model's weights + gradients + optimizer states are sharded and distributed among the GPUs, so a large model can be trained.
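And a similarly minimal sketch with PyTorch's FullyShardedDataParallel (again assuming a process group has already been initialized, as in the DDP sketch above):

```python
# Hedged sketch: FSDP shards weights, gradients and optimizer states across
# GPUs, so no single GPU has to hold the whole model.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def setup_fsdp(model, local_rank):
    torch.cuda.set_device(local_rank)
    return FSDP(model.cuda(local_rank))   # each rank keeps only its shard of the parameters
```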
Scaling Laws
In order to improve the model performance (reduce loss), there are 3 factors that can be tweaked, increasing any one of them can improve the model performance (reduce its loss further during training):
1) Compute - Measured in petaflop/s-days (1 petaflop/s-day = 10^15 floating-point operations per second, sustained for one day).
2) Dataset Size - Amount of training data.
3) Model Size - Number of model parameters.
More petaflop/s-days lead to lower loss.
Larger datasets / more parameters lead to lower loss.
A research paper that studied the relationship between the three came up with a recipe for training a compute-optimal model, named "Chinchilla"; they found the ideal balance between compute, dataset size and parameters.
Their finding was that many large models are over-parameterized and under-trained (not enough training data).
They found that in order to achieve compute-optimal training, the number of training tokens should be ~20x the number of model parameters.
The Chinchilla and LLaMA models are compute-optimal, whereas GPT-3 does not have enough training data (and hence needs more compute). This is how LLaMA can perform as well as GPT-3 despite being smaller.
The bottom line: given the Chinchilla guidelines and a compute budget, one can estimate how much training data is needed per model size.
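Applying the ~20 tokens-per-parameter rule of thumb as a quick sanity check:

```python
# Quick check of the Chinchilla rule of thumb from the notes:
# compute-optimal training uses roughly 20 tokens per model parameter.
def chinchilla_tokens(num_params, tokens_per_param=20):
    return num_params * tokens_per_param

print(f"{chinchilla_tokens(70e9):.2e}")    # ~1.4e12 tokens for a 70B-parameter model
print(f"{chinchilla_tokens(175e9):.2e}")   # ~3.5e12 tokens for a 175B-parameter (GPT-3-sized) model
```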
BloombergGPT
There are specialized domains where the pre-trained models are not good enough and it is necessary to pre-train a model from scratch, for example for medicine, law or finance.
The Bloomberg team decided to train a financial model from scratch; they split the training data between financial and general data (so the model is capable of both financial tasks and general-purpose language tasks).
The team had a budget constraint on compute so they chose the model size carefully.
The dashed vertical line is the compute budget, the pink shaded area is the optimal number of parameters/training tokens per FLOPs (compute).
BloombergGPT is slightly above optimal with respect to the number of parameters, and slightly below optimal with respect to the amount of training data.