Many of us have heard of NLP, or Natural Language Processing, and you are probably here because you want to explore this domain. In simple terms, NLP is a technology by which computers can understand human language. But what is Naive Bayes? Naive Bayes is an algorithm that is often used for Natural Language Processing tasks. If you have some prior experience with Machine Learning, you may have heard of algorithms like KNN, Logistic Regression and Linear Regression. In this blog we will use Naive Bayes because it works particularly well for text classification, and if you have not heard of the algorithms above, that doesn’t matter either, as we will hardly need them here.
So I will be dividing this whole blog into two parts-
Text pre-processing
Naive Bayes
1. Text pre-processing
Okay, so you have made it till here? Good. You may not have any idea of what text pre-processing is, but let’s first understand why we need it.
Let’s take a sentence: “I want to be a Computer Scientist”. How can we perform any mathematical operations on this piece of text without any modification? At the end of the day, machine learning is just a set of complex mathematical operations, and on a raw piece of text, with no further modification, we will not be able to perform any of them. You will soon understand what I mean by modification here.
The way to modify a piece of text so that we can perform mathematical operations on it is to convert that text into a vector. Why a vector? Because if we can somehow convert our text into a vector, then we can apply all the operations of linear algebra to it. Sounds interesting, right?
There are several ways by which we can convert our text into vectors:
i. Bag of Words
ii. Tf-Idf
iii. Word2Vec
iv. Average Word2Vec
Here in this blog I will be discussing only the first method, i.e. Bag of Words.
Bag of Words
Bag of Words is one of the most widely used methods to convert a given text into a vector. So let’s see how Bag of Words actually works.
Step 1- We are given the following text data; for simplicity we will use four simple sentences.
d1: “Ubuntu is a great operating system for beginners”
d2: “Ubuntu is not a good operating system for beginners”
d3: “Ubuntu is an amazing operating system for beginners”
d4: “Ubuntu is the worst operating system for beginners”
In the domain of machine learning, these pieces of text are called documents.
Step 2- Next we will create a dictionary containing all the unique words of the four documents
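Collecting the unique words from d1-d4 gives us 14 words in total. Let’s assume we list them in the following order (any fixed order works; this particular order is just an assumption for this walkthrough):
Ubuntu, is, a, an, not, good, great, amazing, the, worst, operating, system, for, beginners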
Step 3- Now we will create a vector for each of the documents above. If the dictionary has ‘d’ unique words, then we will create a d-dimensional vector for each document. We can build this vector using a hashmap or a dictionary-like structure of the same length as the dictionary we created above, where each value is the number of times the corresponding dictionary word occurs in the document we are processing. Okay, I know it’s a bit complex! Let’s simplify the concept using an example.
The vector for the first document will be -
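[1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1] (using the dictionary ordering assumed above)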
Let’s understand why the resulting vector looks this way. The first value of the vector is 1 because the first word of the dictionary, “Ubuntu”, is present in document d1; the same goes for indices 1, 2, 3, 7, 11, 12, 13 and 14, assuming that our index starts at 1. The fourth value of the vector is 0 because the fourth word of the dictionary, “an”, is absent from d1, and the same goes for indices 4, 5, 6, 8, 9 and 10.
The following piece of code shows an implementation of BOW in Python.
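A minimal sketch with scikit-learn's CountVectorizer could look like this (the variable names documents, vectorizer and bow are just illustrative):

from sklearn.feature_extraction.text import CountVectorizer

documents = ["Ubuntu is a great operating system for beginners",
             "Ubuntu is not a good operating system for beginners",
             "Ubuntu is an amazing operating system for beginners",
             "Ubuntu is the worst operating system for beginners"]

vectorizer = CountVectorizer()              # default settings
bow = vectorizer.fit_transform(documents)   # sparse matrix of word counts

print(vectorizer.get_feature_names_out())   # the learned dictionary of words (get_feature_names() in older versions)
print(bow.toarray())                        # one count vector per document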
Now if you observe carefully, you may wonder why there are 13 features when there are actually 14 unique words. The simple answer is that CountVectorizer, with its default settings, ignores one-letter words, so it has dropped the word ‘a’ and therefore reports 13 features and not 14.
Along with this standard BOW there is another simple variant called binary BOW, which only marks in the vector whether a particular word is present or not, using 1 or 0 respectively; it does not record the number of times a word occurs. Now we need to process these vectors further and simplify them as much as possible, and for this purpose we have some more text-processing techniques:
a) Stop-word removal
Stop words are words which, when removed from a document, don’t really change the nature of that document. For example, if a document is a positive one, like ‘d1’ in our example, then after removing the stop words it will still be positive, and the same idea applies to a negative document. (One caveat: words like ‘not’ appear in many stop-word lists, and removing them can flip the meaning of a sentence such as d2, so be careful about which words you treat as stop words.)
Examples: is, a, for, this, etc.
The following piece of code prints all the stop words for the English language.
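A minimal sketch using NLTK's built-in stop-word list (the corpus has to be downloaded once before use):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')          # one-time download of the stop-word lists
print(stopwords.words('english'))   # prints all English stop words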
b) Stemming
For some words the underlying meaning is the same, like the words ‘bore’ and ‘boring’; both refer to essentially the same idea and so should be replaced by a single root word. Stemming does exactly this: it chops a word down to its stem.
The following piece of code shows how to perform stemming in Python.
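A minimal sketch using NLTK's PorterStemmer (one common choice; other stemmers such as SnowballStemmer work similarly):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('bore'))     # bore
print(stemmer.stem('boring'))   # bore, both words reduce to the same stem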
c) Tokenization and n-grams
Tokenization is basically breaking a document into individual words, or into groups of two and sometimes more. If a document is divided into individual words, these are called uni-grams; if it is divided into groups of two words, these are called bi-grams; and this goes up to n-grams. (Lemmatization, by the way, is a different technique: like stemming, it reduces a word to its base dictionary form.)
Example: if we split d1 in the following way, these are called uni-grams:
‘Ubuntu’, ‘is’, ‘a’, ‘great’, ‘operating’, ‘system’, ‘for’, ‘beginners’
If we split d1 like this, these are called bi-grams:
‘Ubuntu is’, ‘is a’, ‘a great’, ‘great operating’, ‘operating system’, ‘system for’, ‘for beginners’
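A minimal sketch of extracting bi-grams with scikit-learn's CountVectorizer, via its ngram_range parameter (variable names are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

d1 = ["Ubuntu is a great operating system for beginners"]

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))   # (2, 2) keeps only bi-grams
bigram_vectorizer.fit(d1)
print(bigram_vectorizer.get_feature_names_out())          # pairs of consecutive words
# note: the default tokenizer drops one-letter words like 'a',
# so the output will differ slightly from the hand-written list above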
So these are the basic text-processing techniques that we need to perform before applying NLP algorithms.
Now that we have performed all the text processing, we can jump into the actual Machine Learning part and understand how Naive Bayes can be used for NLP.
2. Naive Bayes
Before you go further, it is very important to have your concepts of probability cleared. You need a clear understanding of conditional probability, independent events, the multiplication theorem, Bayes’ theorem and the other basic concepts of probability. If you have these concepts cleared, then you are ready to move on; if not, I strongly suggest you learn them first and then come back.
Okay, instead of directly showing you the formula, I feel it is more important to understand how it is derived, as the derivation tells us about the assumptions we make along the way. So let’s start with the derivation.
If you have gone through the whole derivation, then you have probably understood that its most important part is the assumption of conditional independence; without this assumption, the derivation cannot be completed.
So the Naive Bayes formula is -
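For a data point x = (x1, x2, ..., xd) and a class label y, the standard form of the result is:

P(y | x1, x2, ..., xd) ∝ P(y) * P(x1 | y) * P(x2 | y) * ... * P(xd | y)

Here the P(xi | y) terms are the likelihoods, P(y) is the prior probability of the class, and the xi are the features of the data point (in our case, the entries of the BOW vector).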
So, given any dataset, we just need to calculate the values of the likelihoods and the prior probabilities of the classes, and we get the probability of each class label given a data point. The data point x belongs to whichever class label has the greatest probability value.
Implementing Naive Bayes in Python is extremely easy and can be done using the scikit-learn library:
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
We can then pass the data to the model using the .fit() function and finally predict values using .predict().
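Putting it together, a minimal sketch on our four documents could look like this (the labels 1 and 0 are made up here to mean positive and negative, and GaussianNB needs a dense array, hence the .toarray() call):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

documents = ["Ubuntu is a great operating system for beginners",
             "Ubuntu is not a good operating system for beginners",
             "Ubuntu is an amazing operating system for beginners",
             "Ubuntu is the worst operating system for beginners"]
labels = [1, 0, 1, 0]   # assumed labels: 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(documents).toarray()   # BOW vectors as a dense array
NB = GaussianNB()
NB.fit(X, labels)          # learn the priors and likelihoods
print(NB.predict(X))       # predicted class for each document

For count features like BOW, MultinomialNB is usually a more natural choice than GaussianNB (and it accepts the sparse matrix directly), but the .fit()/.predict() interface is the same.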
BONUS Content
This part is not entirely about machine learning; it is an intersection of Software Engineering and Machine Learning, and I feel it will help you appreciate how the concepts of data structures and algorithms are applied in machine learning. In the scikit-learn library, specifically for Naive Bayes, there is a special way of running the algorithm called “Out of Core Naive Bayes”. It is used in cases where the data is huge and our RAM is comparatively much smaller, i.e. the whole dataset will not fit in RAM. If you are familiar with the concept of External Merge Sort, then the idea used here is almost the same: in External Merge Sort the data is divided into pieces that fit into RAM, each piece is sorted individually, and the sorted pieces are finally merged; out-of-core Naive Bayes similarly works through the data one RAM-sized chunk at a time. Now, there is a pretty good chance that the actual “Out of Core Naive Bayes” doesn’t work exactly this way, or that it uses a more advanced or more optimised approach; I just want to motivate you to appreciate how these algorithms may be implemented in such scenarios.
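As a rough sketch of the idea, scikit-learn exposes out-of-core learning through the partial_fit method, which updates the model one batch at a time (here iter_chunks is a hypothetical helper that yields one RAM-sized batch of vectorized documents and labels at a time):

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
for X_chunk, y_chunk in iter_chunks():                   # iter_chunks is hypothetical: yields RAM-sized batches
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])    # all possible classes must be given on the first call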