Param Ahuja

How Positional Encoding & Multi-Head Attention Power Transformers

Remember those jumbled sentences from school that you had to unscramble? You’d have a set of words in random order and your task would be to rearrange them to make a meaningful sentence. Now, imagine if you had to do that with an entire book. This is the challenge that Positional Encoding aims to solve in Transformer models!

This is part two of a multi-post series on Transformers. In the previous post, we looked at the self-attention mechanism. In this blog, we’ll break down how Positional Encoding works in the self-attention block.

Furthermore, even though the self-attention mechanism in transformers helps capture the context of words, simply knowing the context isn’t enough either: some words have multiple meanings, some sentences contain ambiguous references, and others rely on long-range dependencies that a single self-attention pass struggles to capture. This is where Multi-Head Attention comes in.

Later in this blog, we’ll also take a deep dive into how Multi-Head Attention unlocks the true potential of Transformers.

Introduction

As an AI enthusiast diving into the fascinating realm of Generative AI, you’ve likely wondered at some point how modern large language models (LLMs) like GPT understand the intricate meaning of your prompts.

We know that the answer lies in the self-attention block of transformers, which forms the base for modern large language models like GPT, BERT, and T5.

Before Transformers, earlier models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) struggled to maintain context over long-range dependencies and were computationally inefficient, as they processed each word sequentially. And while RNNs and LSTMs did their best, they couldn’t fully grasp the complexity of language.

Transformers solved these problems by processing all words in parallel using self-attention, but doing so introduced a new and critical problem: since each word is processed independently, the positional information is lost along the way.
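To make this concrete, here’s a tiny numpy sketch (with made-up 4-dimensional embeddings and random weight matrices, purely for illustration) showing that plain self-attention gives each word the same context vector no matter how the sentence is shuffled — position plays no role:

```python
import numpy as np

np.random.seed(0)

d = 4                                     # toy embedding size (just for illustration)
E = np.random.randn(3, d)                 # made-up embeddings for "Tom", "chases", "Jerry"
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                                         # scaled dot-product scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                                                    # context vectors

out_ordered  = self_attention(E)             # "Tom chases Jerry"
out_reversed = self_attention(E[[2, 1, 0]])  # "Jerry chases Tom"

# Each word ends up with exactly the same context vector either way:
print(np.allclose(out_ordered[[2, 1, 0]], out_reversed))  # True
```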

The context vectors for each word could now be generated easily, but they were jumbled. This is where Positional Encodings come into play!

Positional Encodings: Giving Order to the Words

Method 1: Why Not Just Use Integer Indices?

One could simply capture word positions by appending each word’s position in the phrase to its embedding.

For example, take the sentence “Tom chases Jerry”.

If the word embeddings of “Tom”, “chases” and “Jerry” are Et, Ec and Ej respectively, then:

“Tom” → concat(Et, 1)

“chases” → concat(Ec, 2)

“Jerry” → concat(Ej, 3)
where Et, Ec and Ej could look like:

“Tom” → [-0.82802757 -0.96047069 -0.41985336 -0.62008336 0.58877143 -1.08165606 0.04304579 0.00271311]

“chases” → [ 1.08075773 -0.37670473 0.32323258 -1.0866603 -0.045058 1.16836346 -0.17799415 -0.58963795]

“Jerry” → [-1.20919747 -0.2386588 0.88268251 -0.74223663 -0.9577803 -0.50713604 -0.01114394 0.6699456 ]
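As a rough sketch of this idea (using numpy, with random 8-dimensional embeddings like the ones above), Method 1 would simply tack the raw index onto each vector:

```python
import numpy as np

np.random.seed(42)

# Hypothetical 8-dimensional embeddings for "Tom", "chases", "Jerry"
embeddings = np.random.randn(3, 8)

# Method 1: concatenate the (1-based) integer position to each embedding
positions = np.arange(1, 4).reshape(-1, 1)            # [[1], [2], [3]]
encoded = np.concatenate([embeddings, positions], axis=1)

print(encoded.shape)    # (3, 9) – every vector now carries its position as a raw integer
print(encoded[0][-1])   # 1.0 – harmless here, but position 10000 would dwarf the embedding values
```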

But this method has its own limitations:

1. Unbounded Integers: Large integer values would dominate the embedding space, making it difficult for the model to balance learning between token meaning and positional information.
2. Discrete Nature: Integer positions are discrete, leading to poor gradient-based learning during training.
3. Relative Positioning: Integer positions don’t capture the relative distances between tokens, which are crucial for understanding context.

Method 2: Sinusoidal Positional Encoding.

To overcome these limitations, sinusoidal functions can be used: they are bounded, continuous, and make it possible to encode relative positions.

In this method, the integer indices of the words are passed through a sine function to obtain the positional encoding of each word.


For example, in the sentence “Tom chases Jerry”:

“Tom” → concat(Et, sin(1))

“chases” → concat(Ec, sin(2))

“Jerry” → concat(Ej, sin(3))

This method isn’t perfectly clean either; it has two problems:

Problem 1: Periodicity of the function

The obvious problem is that the sine function is also periodic, so the positional encodings would repeat after each cycle.

This can be solved by representing the position of each word with a combination of multiple sine and cosine functions instead of a single sine wave, all appended to the corresponding word embedding, greatly increasing the effective period of the encoding.

Hence, we would have multiple pairs of sine and cosine functions, with each pair having a lower frequency than the last. Keep in mind that the encoding is still periodic, just over a very long range.

[Figure: sine and cosine waves of decreasing frequency]

Each pair of sine and cosine functions uses a different frequency. The positional encoding for position pos and dimension index i is:

PE(pos, 2i) = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i + 1) = cos( pos / 10000^(2i / d_model) )

Where,

pos is the position of the word, i is the dimension index, and d_model is the embedding size (usually 512). 

The periodic nature of sine and cosine ensures that the model can distinguish between positions, while keeping the distances meaningful.
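Here is a minimal numpy sketch of this formula (the function name is mine, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int = 512) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # the 2i values: 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50)
print(pe.shape)      # (50, 512)
print(pe[0, :4])     # position 0 → [0. 1. 0. 1.]
```

Each row is one position, and the columns sweep through the sine/cosine pairs from high frequency to very low frequency, so nearby positions get similar encodings while far-apart positions get very different ones.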

Problem 2: Adding vs. Concatenating Positional Encodings

Another problem: if we simply concatenate the positional encodings to the word embeddings, the dimensionality of the embeddings changes, making them incompatible with the self-attention block.

Hence, rather than concatenating positional encodings with word embeddings, they are added to the embeddings. This addition ensures that each word retains both its semantic meaning (from embeddings) and positional information (from positional encodings). This helps the model understand both the context and position of words in a sentence.
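Reusing the sinusoidal_positional_encoding sketch from above, the addition is a single element-wise operation, and the shape of the embeddings stays exactly the same:

```python
import numpy as np

np.random.seed(0)

d_model, seq_len = 512, 3
word_embeddings = np.random.randn(seq_len, d_model)    # hypothetical embeddings for "Tom chases Jerry"

pe = sinusoidal_positional_encoding(seq_len, d_model)  # from the sketch above
encoded = word_embeddings + pe                         # addition, not concatenation

print(encoded.shape)   # (3, 512) – same shape as before, so the self-attention block needs no changes
```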

The matrix of all the positional encodings, would look something like this:

[Figure: the positional encoding matrix]

Hence, this would solve all the problems in capturing the positional information for any sequence.

The following visualization compares the word embeddings before and after adding the positional encodings. This looks so cool!

[Figure: word embeddings before vs. after adding positional encodings]

With positional encoding in place, Transformers can differentiate between positions while maintaining the relationships between words.

Let’s now see how Multi-Head Attention helps to unlock the true potential of Transformers.

Why a Single Self-Attention Head Won’t Suffice

For complex sentences, a single self-attention process — where each word has a single contextual embedding — simply doesn’t capture the full range of meaning. This is especially true when sentences are ambiguous, involve multiple meanings for a single word, or contain long-range dependencies.

Examples:

  • Polysemy: “Bank” could mean a financial institution or a riverbank.
  • Ambiguous Pronouns: “John gave Mark his book.” — “his” could refer to John or Mark.
  • Long-Range Dependencies: “The dog that chased the cat ran away.” — “dog” and “ran” are far apart but related.
  • Negation: “I didn’t see John at the park.” — Negation flips the meaning.
  • Implicit Meanings: “He didn’t study, so he failed.” — The cause-effect relationship is unstated but understood.

How Multi-Head Attention Works: Capturing Multiple Contexts

To solve this issue, Transformers use multiple self-attention heads for each word — each with its own Key, Query, and Value vectors — capturing different contexts within the same set of words and allowing the model to handle complex linguistic phenomena effectively.

How It Rolls:

[Figure: the multi-head attention block]

Step 1: The input sequence starts with a 1x512 word embedding for each word (the embedding size in the original paper is 512).

Step 2: Transformers use 8 attention heads instead of one. Each head has its own weight matrices (Wq, Wk and Wv) of size 512x64, so there are 24 weight matrices in total.

Step 3: For each word and each head, the Q, K and V vectors of size 1x64 are computed (the 1x512 embedding multiplied by the 512x64 weight matrices). These transform the embeddings into context vectors that reflect different relationships between words.

Step 4: Each word is passed through every head’s self-attention computation, and the resulting context vectors are concatenated (8 x 64 = 512 values) and passed through a linear transformation matrix (called the W₀ matrix) of size 512x512, which combines the information from all the heads.

Final Output: The final output is an n x 512 matrix, representing the contextual embeddings of the n words in the sequence.
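Here is a compact numpy sketch of steps 1–4 (random weights, a single three-word sentence, no masking or batching), just to make the shapes concrete:

```python
import numpy as np

np.random.seed(0)

d_model, n_heads, d_head = 512, 8, 64        # as in the original paper: 512 = 8 * 64
n = 3                                        # "Tom chases Jerry"
X = np.random.randn(n, d_model)              # Step 1: n word embeddings (positional encoding already added)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Step 2: each of the 8 heads gets its own 512x64 Wq, Wk, Wv (24 weight matrices in total)
head_outputs = []
for _ in range(n_heads):
    Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # Step 3: n x 64 vectors per head
    attn = softmax(Q @ K.T / np.sqrt(d_head))  # n x n attention weights
    head_outputs.append(attn @ V)              # n x 64 context vectors

# Step 4: concatenate the heads and mix them with the 512x512 output projection W0
W0 = np.random.randn(n_heads * d_head, d_model)
output = np.concatenate(head_outputs, axis=1) @ W0

print(output.shape)   # (3, 512) – an n x 512 matrix of contextual embeddings
```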

To visualize the attention heads and see how each word attends to the others, tools like BertViz can be used, showing how the model focuses on different aspects of the sentence.
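If you want to try this yourself, the snippet below shows roughly how BertViz’s head_view is used together with a Hugging Face model (assuming the transformers and bertviz packages are installed, and run inside a Jupyter notebook):

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

sentence = "The dog that chased the cat ran away."
inputs = tokenizer.encode(sentence, return_tensors="pt")
outputs = model(inputs)

attention = outputs[-1]                              # attention weights from every layer and head
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)                         # interactive view of what each head attends to
```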

[Figure: BertViz visualization of attention heads]

Understanding Multi-Head Attention and Positional Encoding forms a strong foundation for understanding Transformers in their full power.

Whether it’s for machine translation, chatbot development, or content generation, understanding these mechanisms is key to harnessing the true potential of modern AI. Transformers have redefined the entire field of NLP, and as we continue to explore their capabilities, we will see even greater advancements in how machines understand and generate human language.

If you enjoyed this post, feel free to share it and follow me for more deep dives into AI and machine learning!
