Muhammad Saim

Posted on Jun 25

Introduction to Word Embeddings

#ai #llm #nlp #datascience

Introduction

Word embedding is a technique in which words and sentences are converted into numbers. Computers operate only on numbers, so text must be represented numerically before a model can be trained on it. Dense embeddings also have far fewer dimensions than sparse, vocabulary-sized vectors, which makes the data more efficient to process. There are both traditional and modern techniques, so we'll discuss the traditional techniques first and then the modern ones.

Traditional techniques for Word Embedding

One-Hot Encoding
Bag of Words
TF-IDF vectorizer

One-Hot Encoding

In this scheme, each word is represented by a vector as long as the vocabulary, in which every value is set to 0 except the position of the current word, which is set to 1. Suppose the vocabulary is ['apple', 'Mango', 'Peach']:
apple: [1,0,0]
Mango: [0,1,0]
Peach: [0,0,1]
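
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the helper name `one_hot_encode` is made up for this example, and it assumes the vocabulary has no repeated words.

```python
def one_hot_encode(vocabulary):
    """Map each word to a vector with a single 1 at the word's index."""
    vectors = {}
    for index, word in enumerate(vocabulary):
        vector = [0] * len(vocabulary)  # start with all zeros
        vector[index] = 1               # mark the position of this word
        vectors[word] = vector
    return vectors


vocabulary = ["apple", "Mango", "Peach"]
encoded = one_hot_encode(vocabulary)

for word, vector in encoded.items():
    print(word, vector)
```

Note that the vector length equals the vocabulary size, so one-hot vectors grow with every new word.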

Bag of Words

In Bag of Words, a text is treated as an unordered collection of words and their frequencies. The count of each word in a sentence can then be divided by the total number of words in that sentence to get a normalized frequency. Below, there is an example.

Example:

Consider the following two sentences:
"The cat sat on the mat."
"The cat played with the cat."

Step-by-Step Process:

Tokenization:

Sentence 1: ["The", "cat", "sat", "on", "the", "mat"]
Sentence 2: ["The", "cat", "played", "with", "the", "cat"]

Case Normalization (optional):

Sentence 1: ["the", "cat", "sat", "on", "the", "mat"]
Sentence 2: ["the", "cat", "played", "with", "the", "cat"]

Build Vocabulary:

Unique words: ["the", "cat", "sat", "on", "mat", "played", "with"]

Count Frequencies:

Sentence 1:
"the": 2
"cat": 1
"sat": 1
"on": 1
"mat": 1
"played": 0
"with": 0
Sentence 2:
"the": 2
"cat": 2
"sat": 0
"on": 0
"mat": 0
"played": 1
"with": 1

Calculate Total Word Counts:

Sentence 1: 6 words
Sentence 2: 6 words

Normalize Frequencies:

Sentence 1:
"the": 2/6 = 0.333
"cat": 1/6 = 0.167
"sat": 1/6 = 0.167
"on": 1/6 = 0.167
"mat": 1/6 = 0.167
"played": 0/6 = 0.000
"with": 0/6 = 0.000
Sentence 2:
"the": 2/6 = 0.333
"cat": 2/6 = 0.333
"sat": 0/6 = 0.000
"on": 0/6 = 0.000
"mat": 0/6 = 0.000
"played": 1/6 = 0.167
"with": 1/6 = 0.167

Representation:

Sentence 1: [0.333, 0.167, 0.167, 0.167, 0.167, 0.000, 0.000]
Sentence 2: [0.333, 0.333, 0.000, 0.000, 0.000, 0.167, 0.167]
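
The steps above can be sketched in plain Python. The helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) are hypothetical, and the tokenizer is a deliberate simplification that assumes whitespace-separated words with simple trailing punctuation.

```python
from collections import Counter


def tokenize(sentence):
    # Simplified tokenizer (an assumption): split on whitespace,
    # strip basic punctuation, and lowercase every token.
    return [token.strip(".,!?").lower() for token in sentence.split()]


def build_vocabulary(tokenized_sentences):
    # Keep unique words in order of first appearance.
    vocabulary = []
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def bag_of_words(tokens, vocabulary):
    # Count each vocabulary word, then divide by the sentence length.
    counts = Counter(tokens)
    total = len(tokens)
    return [round(counts[word] / total, 3) for word in vocabulary]


sentences = ["The cat sat on the mat.", "The cat played with the cat."]
tokenized = [tokenize(sentence) for sentence in sentences]
vocabulary = build_vocabulary(tokenized)
vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```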

Term Frequency and Inverse Document Frequency (TF-IDF)

TF-IDF is a numerical statistic that reflects the importance of a term in a document relative to a collection of documents. This method is widely used in text mining and information retrieval. It consists of two components, Term Frequency (TF) and Inverse Document Frequency (IDF):
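Term Frequency (TF): Term Frequency measures how often a term (word) appears in a document.
Inverse Document Frequency (IDF): Inverse Document Frequency measures how rare a term is across the whole collection, so terms that appear in many documents receive a lower weight.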
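
To make the idea concrete, here is a minimal sketch of TF-IDF in plain Python. The article does not give exact formulas, so this assumes one common variant: TF is the term count divided by the document length, and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed versions, so their numbers may differ. All function names here are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    # TF: how often each term appears, relative to the document length.
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def inverse_document_frequency(documents):
    # IDF (assumed variant): log(N / df). Terms found in every
    # document get log(1) = 0, so they carry no weight.
    n_documents = len(documents)
    vocabulary = {term for document in documents for term in document}
    idf = {}
    for term in vocabulary:
        document_count = sum(1 for document in documents if term in document)
        idf[term] = math.log(n_documents / document_count)
    return idf


def tf_idf(documents):
    # Multiply TF by IDF for every term in every document.
    idf = inverse_document_frequency(documents)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(document).items()}
        for document in documents
    ]


documents = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "played", "with", "the", "cat"],
]
scores = tf_idf(documents)

for document_scores in scores:
    print(document_scores)
```

With these two documents, "the" and "cat" appear in both, so their score is 0, while words unique to one document such as "sat" or "played" get a positive score.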

Neural Networks:

The traditional techniques work well, but they cannot capture the semantics of words. In 2013, researchers at Google published a paper that addressed this problem. They introduced word2vec, a new way of learning word embeddings that tries to capture the semantic relationships between words. Word2vec has two variants: CBOW (Continuous Bag of Words) and skip-gram.

CBOW

Before understanding the concept of CBOW, we need to understand the concept of windowing. A context window refers to the surrounding words around the target word. For example, if I have the sentence “Pakistan is a great country for tourism”, and I select a context window size of 2 with my target word being ‘great’, the 2 words before ‘great’ (Pakistan is) and the two words after ‘great’ (country for) are in the context window. A sliding window refers to a fixed size window that, after processing one context window, moves to the next window. This allows the model to pass through all the text.
Now, in the neural network, the context words are fed into the input layer, the target word is the expected result at the output layer, and between them is a hidden layer. Dimensionality reduction occurs in the hidden layer, whose size (the embedding dimension) is much smaller than the vocabulary size.

Sentence: "Data science is transforming industries."

Training Examples:

Context Words: ["science", "is"]
Target Word: "Data"
Context Words: ["Data", "is", "transforming"]
Target Word: "science"
Context Words: ["Data", "science", "transforming", "industries"]
Target Word: "is"
Context Words: ["science", "is", "industries"]
Target Word: "transforming"
Context Words: ["is", "transforming"]
Target Word: "industries"
In this example, for each target word, the context words within a window size of 2 (up to two words on each side) are used to create the training data for the CBOW model.
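
Generating CBOW training examples with a sliding window can be sketched as follows. This is only the data-preparation step, not the neural network itself. It assumes a symmetric window (up to `window_size` words on each side, cut off at the sentence boundaries), and the function name `cbow_training_pairs` is hypothetical.

```python
def cbow_training_pairs(tokens, window_size=2):
    """Return (context_words, target_word) pairs for every position."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to window_size words before and after the target.
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        pairs.append((left + right, target))
    return pairs


tokens = ["Data", "science", "is", "transforming", "industries"]
pairs = cbow_training_pairs(tokens, window_size=2)

for context, target in pairs:
    print("Context Words:", context, "-> Target Word:", target)
```

Words near the start or end of the sentence simply get a shorter context, because the window is truncated there.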

Skip-gram:

Skip-gram is a technique that predicts the surrounding words from a specific word. It is essentially the inverse of CBOW: rather than predicting a word from its context, it predicts the context from the word. If the sentence is "king wore a golden crown" and the window size is 2, skip-gram will take the word "a" and predict "king", "wore", "golden", and "crown".
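
The training data for skip-gram can be sketched in the same way as for CBOW, with the direction reversed: each example pairs one center word with one of its neighbors. This is a minimal illustration of the data-preparation step only. The function name `skipgram_training_pairs` is hypothetical, and a window size of 1 is assumed here just to keep the output short.

```python
def skipgram_training_pairs(tokens, window_size=2):
    """Return (center_word, context_word) pairs for every position."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["king", "wore", "a", "golden", "crown"]
pairs = skipgram_training_pairs(tokens, window_size=1)

for center, context in pairs:
    print(center, "->", context)
```

Each pair becomes one training example in which the model sees the center word as input and is asked to predict the context word.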
