Intro
You might have heard that Large Language Models such as DeepSeek, Mistral and GPT are similar to text autocorrect mechanisms. This is not too far off, although they are of course vastly more complex. Here's a condensed overview of how LLMs work:
1.0 N-Gram Language Models
These are the simpler, primitive ancestors of LLMs. Examples include predictive text and plagiarism detectors.
- N-gram language models predict the next word by assigning probabilities to possible next words; the context window is N-1 words.
- Bigram model: uses the "Markov assumption"; it predicts the next word based on the last word only. Context window is 1 word.
- Trigram model: the context window to predict the next word is 2 words.
- N-gram limitations: in a 6-gram model, for instance, you have to type 5 words in the exact order the model has seen before for it to attempt a prediction of the 6th word. In a terms-and-conditions model, say, you would have to type "please read these terms and" for it to predict "conditions". The appropriate 'n' depends on the task: a plagiarism detection model would use a higher 'n' than a mobile phone's predictive text mechanism.
1.1 How Do N-Gram Models Assign Probability?
- A bigram model takes an input text and divides it into adjacent word pairs.
- It then counts pair frequencies and assigns probabilities accordingly (see the sketch after this list). For example:
Input: 'The quick brown fox jumped over the lazy hog; the quick fox then disappeared.'
Pair breakdown: ('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'hog'), ('hog', 'the'), ('the', 'quick'), ('quick', 'fox'), ('fox', 'then'), ('then', 'disappeared')
Pair frequency: ('the', 'quick') appears twice; all other pairs appear once.
So if you sent 'the' as an input to this model, it would predict 'quick'; sending 'quick' would leave 'brown' and 'fox' equally likely, since each follows 'quick' exactly once.
- This type of model has pretty obvious limitations: it cannot infer meaning from earlier context, and it cannot assign a probability to a word pair it never saw during training.
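Here's a minimal sketch of such a bigram predictor in Python (variable names are mine, and punctuation is stripped for simplicity):

from collections import Counter, defaultdict

text = "the quick brown fox jumped over the lazy hog the quick fox then disappeared"
words = text.split()

# Count how often each word follows each other word
followers = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    followers[prev][nxt] += 1

def predict(word):
    # Return the most frequent follower of `word`, if the word was ever seen
    counts = followers.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict("the"))    # 'quick' (follows 'the' twice, vs. 'lazy' once)
print(predict("swift"))  # None; the model has no data for unseen words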
2.0 Tokenization
- Tokens are smaller and more manageable units of text;
2.1 Whitespace Tokenization
- This is how a text corpus is broken into units that a model can train on.
- Tokenization has many methods; the most obvious is whitespace tokenization, which treats every whitespace as a boundary between tokens (sketched below).
- In the phrase "my robot ate my homework", the word "my" appears twice, but an LLM would only store it once as a token in its vocabulary.
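In Python, a whitespace tokenizer is essentially a one-liner; a minimal sketch using the phrase from above:

text = "my robot ate my homework"
tokens = text.split()              # every whitespace marks a token boundary
vocabulary = sorted(set(tokens))   # duplicates like 'my' are stored only once
print(tokens)       # ['my', 'robot', 'ate', 'my', 'homework']
print(vocabulary)   # ['ate', 'homework', 'my', 'robot']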
2.2 Affix Tokenization
- Affixes are prefixes and suffixes, parts of words added to word stems in order to specify meaning.
- Ex: 'halt' is a stem; 'halting', 'halts' and 'halted' have more specific meanings derived from it.
- In the input "man, my model really sucked." we get the following tokenization (whitespace dropped for readability): man|,|my|model|real|ly|suck|ed|.; it treats punctuation, word stems and affixes as individual tokens (a toy sketch follows this list).
- This approach's main drawback is that it is grammar-dependent: what counts as an affix varies across languages. English puts more of its grammatical work on suffixes while Swahili puts more on prefixes; Spanish is an analytic language that specifies meaning through additional words, while Hungarian is an agglutinative language that specifies meaning by chaining affixes together; and so on.
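As a toy illustration only (the suffix list below is made up for the example and is nowhere near a real morphological analyzer), an affix tokenizer for English might strip known suffixes and split off punctuation:

import re

# Hypothetical, tiny suffix list for illustration only
SUFFIXES = ("ing", "ed", "ly", "s")

def affix_tokenize(text):
    tokens = []
    # \w+ grabs words, [^\w\s] grabs punctuation marks
    for word in re.findall(r"\w+|[^\w\s]", text.lower()):
        for suffix in SUFFIXES:
            # Only split if a reasonably long stem remains
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                tokens += [word[:-len(suffix)], suffix]
                break
        else:
            tokens.append(word)
    return tokens

print(affix_tokenize("man, my model really sucked."))
# ['man', ',', 'my', 'model', 'real', 'ly', 'suck', 'ed', '.']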
2.3 Byte-Pair Encoding Tokenization
- Of the approaches here, this one most resembles the n-gram model's probability assignment, and yet it is the most complex.
- In this approach, tokens are not pre-identified and processed; they are built up from the characters.
- This is how most modern (as of 2025) LLMs operate: common character combinations are identified and merged into word stems, stems grow into full words, and words become associated with related words through the contexts (combinations of other words) they appear in.
The Byte-Pair Encoding Cycle
Quick note: a token is a unit identified by frequency of use; a character/letter is just a character/letter. A character might be a token, but a token is not necessarily a single character. A token corresponds roughly to three or four characters of English on average (a commonly cited rule of thumb).
- 1) The algorithm splits the input into characters, identifies the unique characters, and then maps character frequencies:
# input
"low lower lowest"
# letter split (notice spaces are considered letters)
l|o|w|(space)|l|o|w|e|r|(space)|l|o|w|e|s|t
# identification of individual tokens (also called the 'vocabulary'):
Initial_Vocabulary: l, o, w, e, r, s, t, (space)
# Most frequent letters in order
Token_Frequencies: l: 3, o: 3, w: 3, (space): 2, e: 2, r: 1, s: 1, t: 1
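In Python, this first step is just a character split plus a frequency count; a sketch of step 1 alone:

from collections import Counter

corpus = "low lower lowest"
chars = list(corpus)                 # spaces are characters too
vocabulary = sorted(set(chars))      # the initial vocabulary
frequencies = Counter(chars)
print(frequencies.most_common())
# [('l', 3), ('o', 3), ('w', 3), (' ', 2), ('e', 2), ('r', 1), ('s', 1), ('t', 1)]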
- 2) It will then count the frequencies of pairs of adjacent characters:
# Most frequent letter pairs in order
Paired_Tokens_Frequencies_0: lo: 3, ow: 3, (space)l: 2, we: 2, w(space): 1, er: 1, r(space): 1, es: 1, st: 1
- 3) Merge the most frequent pair (lo; the tie with ow is broken by taking the pair seen first) into a new token, remaking the vocabulary:
New_Vocabulary_0: lo, w, (space), e, r, s, t
Transformed_Input_0: lo|w|(space)|lo|w|e|r|(space)|lo|w|e|s|t
- 4) The process then continues recursively: the transformed input is used as the new input and adjacent pairs are counted again:
Paired_Tokens_Frequencies_1: lo w: 3, (space)lo: 2, we: 2, w(space): 1, er: 1, r(space): 1, es: 1, st: 1
- Note how the pair lo + w now merges into low; that's how LLMs rebuild language from the frequency with which characters clump together:
New_Vocabulary_1: low, (space), e, r, s, t
Transformed_Input_1: low|(space)|low|e|r|(space)|low|e|s|t
- 5) After n merges, there will be a final vocabulary that might look like this:
Final_Vocabulary: l, o, w, e, r, s, t, (space), low, er, est
- If the process kept going, the vocabulary would eventually contain every full word of the original text. The exact sequence of merges depends on the corpus and on how frequency ties are broken.
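Putting the whole cycle together, here's a minimal BPE trainer sketched in Python. This is a simplification (function names are mine; real implementations operate on bytes and on word counts, and greedy tie-breaking may merge pairs in a slightly different order than the hand-worked trace above):

from collections import Counter

def merge_word(word, pair):
    # Replace every adjacent occurrence of `pair` in `word` with one merged token
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

def bpe_train(text, num_merges):
    # Real BPE trains per word: spaces act as hard token boundaries here
    words = [list(w) for w in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in words:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        words = [merge_word(w, best) for w in words]
    return words, merges

words, merges = bpe_train("low lower lowest", num_merges=2)
print(words)   # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
print(merges)  # [('l', 'o'), ('lo', 'w')]

With num_merges=6 on this input, every word merges back into a single token, mirroring the observation above that the vocabulary eventually becomes the words of the original text.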
Advantages of Byte-Pair Encoding (BPE)
- Notice that the main advantages of Byte-Pair Encoding (BPE) tokenization are that it reduces vocabulary size compared to word- and affix-level tokenization, and that it generalizes well to otherwise unknown terms (demonstrated in the sketch after this list).
- Also notice that this strategy helps to avoid bizarre, meaningless pairs such as 'lw'.
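To make the generalization advantage concrete, the merges learned in the sketch above can be replayed on a word the training text never contained; the unseen word still decomposes into a known token plus leftover characters instead of failing outright (this reuses merge_word and merges from the previous sketch):

def bpe_encode(word, merges):
    # Replay the learned merges, in order, on an arbitrary new word
    tokens = list(word)
    for pair in merges:
        tokens = merge_word(tokens, pair)
    return tokens

print(bpe_encode("lowering", merges))  # ['low', 'e', 'r', 'i', 'n', 'g']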
Note: needless to say, this is a very brief overview. Actual modern LLMs use strategies that, although building on these principles, are a lot more sophisticated and involve a training complexity available only to the most resourceful institutions and brightest minds of the era. Strategies such as operating on bytes instead of characters and associating multiple tokens through complex networks of related nodes (see this article on neural networks) are closer to the actual methods being applied, and beyond this humble article's scope.