Apify for Apify

Posted on Jan 18 • Originally published at blog.apify.com on Sep 25, 2023

Emotion detection through multinomial logistic regression

#ai #machinelearning #data

Were Apify , and our mission is to make the web more programmable. This tutorial on multinomial logistic regression was inspired by our work on getting better data for AI. Check us out .

What is emotion detection in AI?

Emotion detection using multinomial logistic regression is a method of predicting emotions from input features in machine learning. Teaching machines this skill is challenging. They need practice with diverse examples to predict emotions accurately due to their complexity.

Emotion detection from text is crucial in natural language processing. It serves various purposes, from analyzing sentiment to understanding customer feedback.

A significant application of emotion detection is recommendation systems. By comprehending user emotions through their feedback, we can suggest similar items to users. This enhances user experience and personalization, making recommendations more relevant and aligned with user preferences.

In order to train machines to detect emotions indeed requires labeled data for training purposes. In this tutorial, we'll utilize reviews from the IMDb website to train our model. IMDb is an online website containing comprehensive details regarding films, television programs, actors, and individuals within the entertainment industry. It's a popular source for film details, ratings, reviews, and other related content.

What is multinomial logistic regression?

Multinomial logistic regression is a statistical method in the field of AI used for classification to solve multiclass possible outcome problems. Here's an example:

Imagine you're working at a wildlife research center; your task is to classify animals into three categories: Birds , Mammals, and Reptiles , based on their characteristics. Data is collected on various animals, including two independent variables: body temperature , and number of legs . These variables serve as the basis for classifying animals into the three categories.

The classification problem revolves around predicting whether an animal belongs to the Birds , Mammals, or Reptiles category, depending on its body temperature and the number of legs. Multinomial logistic regression is helpful because it allows us to group animals into different categories based on their traits.

In this tutorial, however, we're going to do something quite different. We'll delve deep into multinomial logistic regression using real-world IMDb movie data that we'll collect by developing a web scraping tool to extract reviews.

Step 1: Data collection

We must install the following packages that facilitate web scraping IMDb data.

pip install beautifulsoup4 selenium webdriver-manager pandas nltk

Next, we need to import several libraries into our Python file.

import csv # Handles CSV file operations for reading and writing.import time # Manages delays and waits for specific conditions.import numpy as np # Supports numerical computations.from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium import webdriver # Automates web interactions and testingfrom webdriver_manager.chrome import ChromeDriverManager # Facilitate Chrome WebDriver download and management.from selenium.webdriver.common.by import By #Locates HTML elements on web pages.from selenium.webdriver.chrome.service import Service # Facilitate Chrome WebDriver download and management.

In this article, we'll focus on scraping user reviews for the movie Captain Marvel from the IMDb website. Once the environment is configured, it's important to identify where the desired reviews are situated on the IMDb website for scraping. To accomplish this, right-click on the page and choose the inspect element to examine the IMDb review page structure.

Inspect element opens your browsers developer tools, letting you examine a webpage's HTML and CSS code. This page shows that all user reviews are consistently present in the <div> element having "text show-more__control" class.

Not all user reviews on the IMDb user review page are displayed immediately. You need to click the " Load More" button to reveal additional reviews. You can locate the " Load More" button at the page's lower section. On inspecting the page, you'll notice that the ID load-more-trigger identifies this button. With this ID, you can employ a Python script to programmatically automate the process of clicking this button.

🗒 Note: This algorithm involves three primary steps:

1) repeatedly clicking a button to load all reviews,

2) extracting reviews and ratings from the HTML (a potentially time-consuming task), and finally,

3) saving the extracted data to a file.

Executing this process can be quite time-consuming, so running the code requires patience.

Below is the full code for extracting reviews from the IMDb website:

# Import libraries import csvimport timeimport numpy as npfrom selenium import webdriver from selenium.webdriver.chrome.service import Servicefrom webdriver_manager.chrome import ChromeDriverManagerfrom selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as ECfrom selenium.webdriver.common.by import By# Declare some variablessmall_delay = 5 review_button_exists = Truecsv_file_path = 'imdb_user_reviews.csv'imdb_movie_link = "https://www.imdb.com/title/tt4154664/reviews/" # Install chrome driverdriver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))driver.get(imdb_movie_link) # Load all the reviews of the movie by clicking the buttonwhile review_button_exists: try: load_more_reviews = WebDriverWait(driver, small_delay).until(EC.element_to_be_clickable((By.ID, 'load-more-trigger'))) load_more_reviews.click() time.sleep(2) except: review_button_exists = False all_movie_reviews = WebDriverWait(driver,small_delay).until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="lister-item-content"]'))) all_reviews = [] # Extract IMDB reviews and rating for review in all_movie_reviews: try: imdb_rating = review.find_element(By.XPATH,'.//div[@class="ipl-ratings-bar"]').text imdb_rating = imdb_rating.replace("/10", "") except: imdb_rating = np.nan try: review = review.find_element(By.XPATH,'.//div[@class="text show-more__control"]').text except: review = review.find_element(By.XPATH,'.//div[@class="content"]').text full_review = { "Rating" : imdb_rating, "Review" : review } all_reviews.append(full_review) # Write the reviews and rating in csv filewith open(csv_file_path, mode='w', newline='', encoding='utf-8') as file: writer = csv.DictWriter(file, fieldnames=["Rating", "Review"]) writer.writeheader() for full in all_reviews: writer.writerow(full)driver.quit()

In the above, we extracted reviews and ratings from the IMDb website using selenium and XPath and saved them in imdb_user_reviews.csv.

🗒 Note: XPath, or the XML Path Language, is a query tool for selecting nodes from structured documents like XML and HTML. It's vital for tasks such as web scraping, parsing XML data, and querying XML databases, enabling precise extraction and manipulation of structured information.

Labeling techniques

At this stage, the dataset is unlabeled, but we need labeled data for the machine learning algorithm. There are several ways to label a dataset, depending on the nature of the data and the goals of your analysis. Here are some common approaches:

Manual labeling: Assign labels to data points manually.
Rule-based labeling: Use predefined rules or criteria for assigning labels to data.
Domain expert labeling: Involve domain experts with deep knowledge of the subject matter to label the data.
Active learning: Combine manual labeling with machine learning.
Weak supervision: Generates noisy labels using rules or domain knowledge to be iteratively refined and improved through various techniques.

Labeling the dataset

The dataset extracted from the IMDb website comprises 6,374 reviews. We'll employ a combination of weak supervision and manual labeling techniques for assigning labels. In the weak supervision phase, we create two rules based on the rating:

Rule 1: Assign the Happy emotion to reviews with a rating greater than 8.

Rule 2: Assign the Angry emotion to reviews with a rating less than 4.

The remaining reviews remain unlabeled. Subsequently, we perform manual labeling by reading and assigning Happy and Angry labels to the reviews.

Step 2: Data preprocessing

The data extracted from the IMDb website requires preprocessing due to irrelevant elements such as HTML tags, stop words, null values, duplicate reviews, special characters, and stop words.

Importing the dataset

import pandas as pdurl = "./imdb_user_reviews.csv"imdb_data=pd.read_csv(url)imdb_data.head()# Output# Rating Reviews Emotion# 0 8.0 I had low expectations going to see this movie... Happy# 1 9.0 Movie was really well done, been collecting co... Happy# 2 9.0 Great movie. You don't need a strong villain. ... Happy# 3 9.0 After finally discovering who she is Vers/ Car... Happy# 4 9.0 I am not one to usually write reviews, but I f... Happy

Dataset descriptive statistics

imdb_data.info()# Output# Data columns (total 3 columns):# # Column Non-Null Count Dtype # --- ------ -------------- ----- # 0 Rating 6269 float64# 1 Reviews 6374 object # 2 Emotion 6374 object

From these statistics, it's evident that the movie has a total of 6,374 reviews, each associated with specific emotions (Happy, Angry).

Let's calculate the frequency of emotions within the dataset.

imdb_data['Emotion'].value_counts()# Output# Happy 3903# Angry 2471

The provided output indicates that 3,903 users were happy after watching this movie, while 2,471 users were angry.

Identifying and removing duplicate reviews

Our dataset may contain duplicate reviews. To ensure effective machine training, it's crucial that we identify and eliminate these duplicate entries from the dataset.

num_duplicates = imdb_data.duplicated().sum() #identify duplicatesprint('Duplicate reviews = {}'.format(num_duplicates))# Duplicate reviews = 7# Removing Duplicate reviewsreview = imdb_data['Reviews']duplicated_review = imdb_data[review.isin(review[review.duplicated()])].sort_values("Reviews")duplicated_review.head()imdb_data.drop_duplicates(inplace = True)

Dataset cleaning

The dataset we scraped includes a substantial amount of irrelevant content, such as stopwords, HTML tags, emoticons, parentheses, and special characters. Our objective is to eliminate these elements from the dataset.

import nltkimport refrom bs4 import BeautifulSoupfrom nltk.corpus import stopwordsfrom nltk.tokenize import ToktokTokenizerstopWords = set(stopwords.words('english'))tokenizer=ToktokTokenizer()listOfStopword=nltk.corpus.stopwords.words('english')# Printing the set of English stopwordsprint(stopWords)# Function to remove square brackets and their contentsdef removeSquareBrackets(htmlText): return re.sub(r'\\[[^]]*\\]', '', htmlText)# Function to remove special characters (with an option to remove digits)def removeSpecialCharacters(htmlText, removeDigits=True): specialCharacterPattern = r'[^a-zA-Z0-9\\s]' cleanedText = re.sub(specialCharacterPattern, '', htmlText) return cleanedText# Function to remove HTML tags from textdef removeHtmlTags(htmlText): htmlFreeWord = BeautifulSoup(htmlText, "html.parser") return htmlFreeWord.get_text()# Text stemming using Porter Stemmerdef performTextStemming(htmlText): ps = nltk.porter.PorterStemmer() stemmedText = ' '.join([ps.stem(word) for word in htmlText.split()]) return stemmedText# Function to remove stopwordsdef removeStopwords(htmlText, isLowerCase=False): wordTokens = tokenizer.tokenize(htmlText) wordTokens = [token.strip() for token in wordTokens] if isLowerCase: token_filterization = [token for token in wordTokens if token not in listOfStopword] else: token_filterization = [token for token in wordTokens if token.lower() not in listOfStopword] textAfterRemovingStopWord = ' '.join(token_filterization) return textAfterRemovingStopWord# Function to remove emoticonsdef removeEmoticons(text): emoticonPattern = r'(:-?[\\)\\(DPO])|(;\\'-?[\\)\\(DPO])' textWithoutEmoticons = re.sub(emoticonPattern, '', text) return textWithoutEmoticons# Function for data cleaningdef performDataCleaning(text): text = removeHtmlTags(text) text = removeSquareBrackets(text) text = removeSpecialCharacters(text) text = performTextStemming(text) text = removeStopwords(text) text = removeEmoticons(text) return text# Applying data cleaning function to the 'Reviews' column in the 'imdbData' DataFrameimdb_data['Reviews'] = imdb_data['Reviews'].apply(performDataCleaning)

Step 3: Dataset modeling

At this point, our dataset has been thoroughly cleaned and is prepared for the modeling phase. In this phase, we split the data into testing and training. We have partitioned the dataset, allocating 80% for training and 20% for testing, with a random division facilitated by the random_state = 42.

from sklearn.model_selection import train_test_splitreviews = imdb_data['Reviews'] emotion = imdb_data['Emotion'] review_training_set, review_testing_set, emotion_training_set, emotion_testing_set = train_test_split(reviews, emotion, test_size=0.2, random_state=42)print(review_training_set.shape,emotion_training_set.shape)print(review_testing_set.shape,emotion_testing_set.shape)# Output# (5099,) (5099,)# (1275,) (1275,)

Step 4: Model training

In order to train the model, we need to convert the train and test dataset into Bag of Words and TF-IDF. Both approaches convert text datasets into numerical vectors.

Bag of Words model (BOW)

Bag of Words is a way to represent a text by counting the frequency of each word in it without considering the word order. Its called a bagbecause it's like throwing all the words into a bag. You only care about what words are in the bag and how many times each word appears, not their arrangement.

from sklearn.feature_extraction.text import CountVectorizer#Bag of words vectorizer vector_for_bag_of_word=CountVectorizer(min_df=0.0,max_df=1.0,binary=False,ngram_range=(1,3))# Train reviews transformationtrain_reviews_bag_of_word_vector=vector_for_bag_of_word.fit_transform(review_training_set)# Test reviews transformationtest_reviews_bag_of_word_vector=vector_for_bag_of_word.transform(review_testing_set)print('train_reviews_bag_of_word_vector:',train_reviews_bag_of_word_vector.shape)print('test_reviews_bag_of_word_vector:',test_reviews_bag_of_word_vector.shape)#output# train_reviews_bag_of_word_vector: (5093, 407268)# test_reviews_bag_of_word_vector: (1274, 407268)

Term Frequency-Inverse Document Frequency(TF-IDF)

Term Frequency means how often a word appears in the document, and Inverse Document Frequency indicates whether the word is rare or common across all the documents. TF-IDF is therefore a method of determining the importance of a word in a document relative to an entire collection of documents. In this context, we refer to each review as a document.

from sklearn.feature_extraction.text import TfidfVectorizer#Tfidf vectorizertfidf_vector=TfidfVectorizer(min_df=0.0,max_df=1.0,use_idf=True,ngram_range=(1,3))#train reviews transformationtrain_reviews_tfidf_vector=tfidf_vector.fit_transform(review_training_set)#test reviews transformationtest_reviews_tfidf_vector=tfidf_vector.transform(review_testing_set)print('Tfidf_train:',train_reviews_tfidf_vector.shape)print('Tfidf_test:',test_reviews_tfidf_vector.shape)# Output# train_reviews_tfidf_vector: (5093, 407268)# test_reviews_bag_of_word_vector: (1274, 407268)

Dataset modelling

Let's detect word mentions and test our dataset.

from sklearn.linear_model import LogisticRegression#Setting up the modelmultinomial=LogisticRegression(C=15, penalty='l2',solver='newton-cg',multi_class='multinomial', max_iter=10000)#Training the model using Bag of words features.multinomial_bow=multinomial.fit(train_reviews_bag_of_word_vector,emotion_training_set)print(multinomial_bow)#Predicting the model for Bag of words.multinomial_bow_predict=multinomial.predict(test_reviews_bag_of_word_vector)print(multinomial_bow_predict)#Training the model using TF-IDF features.multinomial_tfidf=multinomial.fit(train_reviews_tfidf_vector,emotion_training_set)print(multinomial_tfidf)#Predicting the model for tfidf featuresmultinomial_tfidf_predict=multinomial.predict(test_reviews_tfidf_vector)print(multinomial_tfidf_predict)

Multinomial logistic regression parameters

Let's delve into a detailed examination of the parameters employed in multinomial logistic regression.

1. multi_class: Specifies the strategy for handling multiclass problems. Common options include:

'ovr' (One-vs-Rest): Treats each class as binary and builds a separate binary classification model for each class.
'multinomial': Applies multinomial logistic regression, which directly models all classes together.
'auto': Automatically selects either 'ovr' or 'multinomial' based on the nature of the problem.

2. solver: Selects the algorithm used to optimize the logistic regression model. Common options include:

'newton-cg': Newton-Conjugate Gradient algorithm.
'lbfgs': Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm.
'liblinear': Library for large linear classification (works well for small datasets).
'sag': Stochastic Average Gradient Descent.
'saga': Improved version of 'sag' with L1 regularization support.

3. C: The 'C' parameter represents the inverse of regularization strength.

Smaller 'C'values mean stronger regularization, preventing overfitting.
Larger 'C' values allow the model to closely fit the training data.

4. penalty: Specifies the type of regularization to apply:

'l1': L1 regularization (Lasso) encourages sparsity.
'l2': L2 regularization (Ridge) controls the size of coefficients.
'elasticnet': Elastic-Net regularization - a combination of L1 and L2.

5. max_iter: Determines the solver's maximum iterations to reach a solution. Increase it if convergence isn't achieved within the default limit.

🗒 Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function, discouraging excessively complex models, and promoting generalization.

Step 5: Evaluate model performance

We use accuracy as a performance parameter. Accuracy is a metric that measures the overall correctness of a predictive model. It represents the ratio of correctly predicted instances to the total number of instances in the dataset. In mathematical terms, accuracy is calculated as:

Accuracy=Number of Correct PredictionsTotal/Number of Predictions100

Accuracy score for Bag of Words:

from sklearn.metrics import accuracy_scoremultinomial_bow_score=accuracy_score(emotion_testing_set,multinomial_bow_predict)print("multinomial_bow_score :",multinomial_bow_score)# multinomial_bow_score : 0.8618524332810047

Accuracy score for TF-IDF:

from sklearn.metrics import accuracy_scoremultinomial_tfidf_score=accuracy_score(emotion_testing_set,multinomial_tfidf_predict)print("multinomial_tfidf_score :",multinomial_tfidf_score)# multinomial_tfidf_score : 0.8712715855572999

In terms of percentage, Bag of Words achieves an accuracy rate of 86% , while TF-IDF achieves a slightly higher accuracy of 87%.

There are many other factors that can improve the performance of the model, such as:

High-quality data
Model selection
Regularization
Handling missing data :
Domain knowledge
Error analysis

The effectiveness of these factors can vary depending on the specific problem and dataset, so it's essential to experiment and iterate to achieve the best results.

Summary

We began with data collection by web scraping IMDb movie reviews and ratings.
We then applied weak supervision and manual labeling techniques to assign labels to the data.
After labeling, we performed data cleaning to remove irrelevant information.
We then split the dataset into two categories: training and testing.
After that, we transformed the dataset into vectors using Bag of Words and TF-IDF methods.
Finally, we employed multinomial logistic regression on these vectors, fitting the model and assessing its performance using the accuracy metric.

DEV Community