Samagra Shrivastava

Sarcasm Detection using Machine Learning.

I’ll walk you through the task of detecting sarcasm with machine learning using the Python programming language.

The script reads a dataset of headlines labeled as sarcastic or non-sarcastic, maps the numeric labels to human-readable strings, and converts the headline text into a matrix of token counts with CountVectorizer.

The data is then split into training and test sets, and a Bernoulli Naive Bayes classifier is trained on the training set. The model's accuracy is evaluated on the test set, and the trained model can then predict whether new text entered by the user is sarcastic.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

These lines import the necessary libraries:

  • pandas (pd) for data manipulation.
  • numpy (np) for numerical operations.
  • CountVectorizer from sklearn for converting text data into a matrix of token counts.
  • BernoulliNB from sklearn for implementing the Bernoulli Naive Bayes classifier.
  • train_test_split from sklearn for splitting data into training and testing sets.

data = pd.read_json("https://raw.githubusercontent.com/amankharwal/Website-data/master/Sarcasm.json", lines=True)

This line reads JSON data from the given URL into a pandas DataFrame. The lines=True argument specifies that each line in the file is a separate JSON object.
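
To see what lines=True does without downloading anything, here is a minimal sketch that parses two made-up records from an in-memory string. The headlines and the name sample are hypothetical and are not taken from the real dataset.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
raw = (
    '{"headline": "local man wins lottery twice", "is_sarcastic": 0}\n'
    '{"headline": "area cat declares itself mayor", "is_sarcastic": 1}\n'
)

# lines=True tells pandas to parse each line as its own record (row).
sample = pd.read_json(io.StringIO(raw), lines=True)
print(sample)
```

Each line becomes one row, and the JSON keys become the column names.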

data.head()

Displays the first five rows of the DataFrame (the default for head()) to give an overview of the data.

data.tail()

Displays the last five rows of the DataFrame (the default for tail()) to check how the data ends.

data.columns

Shows the column names of the DataFrame.

data.shape

Displays the dimensions (number of rows and columns) of the DataFrame.

data['is_sarcastic'] = data['is_sarcastic'].map({0:'No Sarcasm', 1: 'Sarcasm'})

Maps the values in the is_sarcastic column from 0 and 1 to 'No Sarcasm' and 'Sarcasm' respectively.
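
One thing worth knowing about map with a dictionary: any value that is not a key in the dictionary becomes NaN. The small sketch below uses a made-up Series with a deliberate stray value to show this; the names labels and mapped are only for illustration.

```python
import pandas as pd

# Hypothetical label column; the 2 is a deliberate stray value.
labels = pd.Series([0, 1, 1, 0, 2])

# Values found in the dictionary are replaced; anything else becomes NaN.
mapped = labels.map({0: "No Sarcasm", 1: "Sarcasm"})

print(mapped.tolist())
print(mapped.isna().sum())  # number of values that had no mapping
```

If the count of missing values is not zero after mapping real data, the label column probably contains something other than 0 and 1.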

data.head()

Displays the first few rows of the DataFrame again to show the updated is_sarcastic column.

data = data[['headline', 'is_sarcastic']]

Selects only the headline and is_sarcastic columns from the DataFrame for further analysis.

x = np.array(data['headline'])
y = np.array(data['is_sarcastic'])

Converts the headline and is_sarcastic columns to numpy arrays, assigning them to x and y respectively.

cv = CountVectorizer()

Creates an instance of CountVectorizer to transform the text data into a matrix of token counts.

X = cv.fit_transform(x)

Fits the CountVectorizer to the headlines and transforms them into a sparse matrix of token counts, assigned to X.
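
To make the "matrix of token counts" concrete, here is a small sketch on two made-up sentences. The sentences and the names docs, vec, and counts are assumptions for illustration; the default CountVectorizer settings (lowercasing, tokens of two or more characters) are used.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
docs = ["the cat sat", "the cat sat on the mat"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index (alphabetical order).
print(sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]))

# Dense view of the counts, only practical for tiny examples like this one.
print(counts.toarray())
```

Each column corresponds to one word in the vocabulary, and each cell holds how many times that word appears in the document.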

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Splits the data into training and test sets: 80% for training and 20% for testing. Setting random_state=42 fixes the seed used to shuffle the data, so the same split is produced on every run.
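
The effect of test_size and random_state can be checked on a tiny made-up array. This is only a sketch; features, targets, and the other variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, and alternating labels.
features = np.arange(20).reshape(10, 2)
targets = np.array(["Sarcasm", "No Sarcasm"] * 5)

# Split twice with the same seed.
a_train, a_test, b_train, b_test = train_test_split(
    features, targets, test_size=0.2, random_state=42
)
a_train2, a_test2, _, _ = train_test_split(
    features, targets, test_size=0.2, random_state=42
)

print(a_train.shape, a_test.shape)       # 8 training rows, 2 test rows
print(np.array_equal(a_test, a_test2))   # same seed, same split
```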

model = BernoulliNB()

Creates an instance of the Bernoulli Naive Bayes classifier.
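
A detail that is easy to miss: with its default binarize=0.0 setting, BernoulliNB treats every feature as present or absent, so a word that appears three times counts the same as a word that appears once. The sketch below, on a made-up count matrix, fits one model on raw counts and one on 0/1 values and compares what they learned.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical token counts: 4 documents, 3 vocabulary words.
counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 0], [0, 1, 2]])
labels = np.array(["Sarcasm", "No Sarcasm", "Sarcasm", "No Sarcasm"])

# One model sees raw counts, the other sees presence/absence only.
on_counts = BernoulliNB().fit(counts, labels)
on_binary = BernoulliNB().fit((counts > 0).astype(int), labels)

# The learned per-class feature probabilities are identical.
print(np.allclose(on_counts.feature_log_prob_, on_binary.feature_log_prob_))
```

So the token counts from CountVectorizer are effectively reduced to "does this word occur in the headline or not" before the model uses them.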

model.fit(X_train, y_train)

Trains the model using the training data (X_train and y_train).

print(model.score(X_test, y_test))

Prints the accuracy of the model on the test data.
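
For a classifier, score returns the fraction of samples whose predicted label matches the true label. The sketch below checks that against a manual calculation on made-up data. It scores on the same rows it trained on only to keep the example short, which is not how a model should be evaluated in practice.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features and labels.
X_demo = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
y_demo = np.array(
    ["Sarcasm", "Sarcasm", "No Sarcasm", "No Sarcasm", "Sarcasm", "No Sarcasm"]
)

clf = BernoulliNB().fit(X_demo, y_demo)

# Accuracy by hand: share of predictions equal to the true labels.
predictions = clf.predict(X_demo)
manual_accuracy = np.mean(predictions == y_demo)

print(clf.score(X_demo, y_demo), manual_accuracy)
```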

user = input("Enter the text here")

Prompts the user to enter a piece of text for sarcasm detection.

data = cv.transform([user]).toarray()

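Transforms the user's text with the already fitted CountVectorizer so that it has the same columns as the training data. The call to .toarray() converts the sparse result into a dense NumPy array. Note that this reuses the name data, which overwrites the DataFrame loaded earlier.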
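
It helps to see what transform does with words the vectorizer never saw during fitting: they are silently ignored. Here is a sketch with a made-up vocabulary and input; the names vec and row are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on two hypothetical documents; vocabulary: cat, mat, on, sat, the.
vec = CountVectorizer()
vec.fit(["the cat sat", "the cat sat on the mat"])

# "dog" and "sofa" are not in the vocabulary, so they are dropped.
row = vec.transform(["The dog sat on the sofa"])

print(row.shape)      # one row, one column per known word
print(row.toarray())
```

This means input made up mostly of unseen words turns into a nearly empty row, and the prediction for it carries little information.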

output = model.predict(data)

Uses the trained model to predict whether the user input text is sarcastic or not.

print(output)

Prints the prediction, a one-element array containing either 'Sarcasm' or 'No Sarcasm'.
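
As an optional variation, not part of the walkthrough above, the vectorizer and the classifier can be chained with scikit-learn's make_pipeline so that raw strings go in and labels come out. The sketch below uses a handful of made-up headlines in place of the real dataset, so the prediction itself means nothing; it only shows the shape of the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real headlines.
headlines = [
    "scientists confirm water is still wet",
    "man heroically finishes entire pizza alone",
    "nation shocked that monday follows sunday",
    "city council approves new budget",
    "local school wins regional science fair",
    "heavy rain expected across the region",
]
labels = ["Sarcasm", "Sarcasm", "Sarcasm", "No Sarcasm", "No Sarcasm", "No Sarcasm"]

# The pipeline vectorizes the text and then fits the classifier.
pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
pipeline.fit(headlines, labels)

# Raw text can be passed directly; no manual transform call is needed.
result = pipeline.predict(["man shocked to learn coffee is hot"])
print(result)
```

Keeping both steps in one object avoids accidentally transforming new text with a different vectorizer than the one used for training.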

You can find the dataset here and the Colab notebook here. You can also follow me on GitHub.

Happy Coding!