Steven Mathew
Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Part 1

I have trained a Sarcasm Detection AI model using Reddit comments. This is how you can do it too.

Requirements:
- Google Colab
- Reddit API credentials
- Lots of time
- Coffee

1. First, we import the necessary libraries.

```python
import asyncio  # For asynchronous programming in Python.
import asyncpraw  # Python Reddit API Wrapper for asynchronous Reddit API interactions.
import pandas as pd  # Data manipulation and analysis tool.
import nest_asyncio  # Necessary for allowing nested asyncio run loops.
import re  # Regular expressions for pattern matching and text manipulation.
from sklearn.model_selection import train_test_split  # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text data into TF-IDF feature vectors.
from sklearn.ensemble import RandomForestClassifier  # Random Forest classifier for machine learning.
from sklearn.metrics import accuracy_score, classification_report  # Metrics for evaluating model performance.
from imblearn.over_sampling import SMOTE  # Oversampling technique for handling class imbalance.
from sklearn.pipeline import Pipeline  # Constructs a pipeline of transformations and estimators.
from sklearn.model_selection import GridSearchCV  # Performs grid search over specified parameter values.
```
2. Connecting to the Reddit API. Get your API credentials from https://www.reddit.com/prefs/apps

```python
client_id = 'your_client_id'
client_secret = 'your_client_secret'
user_agent = 'MyRedditApp/0.1 by your_username'

reddit = asyncpraw.Reddit(client_id=client_id,
                          client_secret=client_secret,
                          user_agent=user_agent)
```

This code sets up the authentication credentials (client_id, client_secret, user_agent) and creates a Reddit API connection using asyncpraw (note: only asyncpraw is imported above, so we use it rather than the synchronous praw). The Reddit object initializes a connection to Reddit's API, allowing the script to retrieve data and interact with the platform programmatically.

3. Initialization and Setup

```python
nest_asyncio.apply()
```

This line ensures that asyncio can be used in a nested manner, which is necessary when using asynchronous operations in environments that already have an event loop running.
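To see the problem nest_asyncio solves, here is a stdlib-only sketch: calling asyncio.run() from inside an already-running event loop (which is exactly the situation in Colab or Jupyter) raises a RuntimeError. nest_asyncio.apply() patches the loop so such nested calls succeed.

```python
import asyncio

async def outer():
    # Inside a running event loop, a nested asyncio.run() is rejected;
    # nest_asyncio.apply() patches the loop so nested calls work.
    try:
        asyncio.run(asyncio.sleep(0))
    except RuntimeError as e:
        return str(e)
    return "no error"

print(asyncio.run(outer()))  # → asyncio.run() cannot be called from a running event loop
```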

Asynchronous Function Definition

```python
async def collect_reddit_comments(subreddit_name, keyword, limit=1000):
    reddit = asyncpraw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent
    )
```

Defines an asynchronous function collect_reddit_comments to retrieve comments from Reddit. It initializes a Reddit instance using asyncpraw, passing in credentials (client_id, client_secret, user_agent) for API authentication.

Fetching Subreddit and Comments

```python
    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    count = 0
    after = None
```

Asynchronously fetches the subreddit object based on subreddit_name. Initializes an empty list comments to store comment data, and sets counters (count) and pagination marker (after) for comment retrieval.

Looping Through Submissions and Comments

```python
    while len(comments) < limit:
        try:
            async for submission in subreddit.search(keyword, limit=None, params={'after': after}):
                await submission.load()
                submission.comment_limit = 0
                await submission.comments.replace_more(limit=0)  # Coroutine in asyncpraw, so it must be awaited
```

Enters a loop that fetches submissions matching keyword within the specified subreddit. Each submission is loaded asynchronously, and replace_more(limit=0) strips the "load more comments" placeholders so only already-fetched comments are processed.
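The 'after' cursor pattern used here for pagination can be illustrated with a toy, self-contained pager (fetch_page and the fake item IDs below are invented for illustration; they stand in for the Reddit API):

```python
def fetch_page(items, after=None, page_size=3):
    """Toy stand-in for a paginated API: returns up to page_size items after the cursor."""
    start = 0 if after is None else items.index(after) + 1
    return items[start:start + page_size]

data = ["t3_a", "t3_b", "t3_c", "t3_d", "t3_e", "t3_f", "t3_g"]
collected = []
after = None
while len(collected) < 5:
    page = fetch_page(data, after)
    if not page:
        break  # No more pages
    collected.extend(page)
    after = page[-1]  # Last item ID becomes the cursor for the next request

print(collected[:5])  # → ['t3_a', 't3_b', 't3_c', 't3_d', 't3_e']
```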

Collecting and Storing Comments

```python
            for comment in submission.comments.list():
                if isinstance(comment, asyncpraw.models.Comment):
                    author_name = comment.author.name if comment.author else '[deleted]'
                    comments.append([comment.body, author_name, comment.created_utc])
                    count += 1

                    if count >= limit:
                        break

            after = submission.id  # Sets the 'after' parameter for pagination

            if count >= limit:
                break
```

Iterates through each comment in the submission, checking if it's a valid comment. Collects comment details such as body, author name, and creation time (created_utc). Controls the loop with count and limit to ensure the specified number of comments (limit) is collected.
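The per-comment extraction, including the deleted-author fallback, can be checked in isolation. In this sketch, to_row and the SimpleNamespace stand-in are hypothetical; they just mirror the fields collected above:

```python
from types import SimpleNamespace

def to_row(comment):
    """Mirror of the fields collected above: body, author name, creation timestamp."""
    author_name = comment.author.name if comment.author else '[deleted]'
    return [comment.body, author_name, comment.created_utc]

# A fake comment whose author account was deleted (author is None)
fake = SimpleNamespace(body="sure, great idea", author=None, created_utc=1700000000.0)
print(to_row(fake))  # → ['sure, great idea', '[deleted]', 1700000000.0]
```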

Handling API Exceptions

```python
        except asyncpraw.exceptions.RedditAPIException as e:  # APIException was renamed in PRAW 7 / Async PRAW
            print(f"API exception occurred: {e}")
            wait_time = 60  # Wait for 1 minute before retrying
            print(f"Waiting for {wait_time} seconds before retrying...")
            await asyncio.sleep(wait_time)
```

Catches API exceptions that may occur during Reddit API interactions (rate limits, for example). Prints the exception message, waits a minute (wait_time), and then resumes fetching comments.

Returning Results

```python
    return comments[:limit]  # Returns up to 'limit' comments
```

Returns the collected comments, sliced to limit so that no more than the requested number of comments are returned.

Main Function to Execute Collection

```python
async def main():
    comments = await collect_reddit_comments('sarcasm', 'sarcastic', limit=5000)  # Adjust limit as needed
    df = pd.DataFrame(comments, columns=['comment', 'author', 'created_utc'])
    df.to_csv('reddit_comments.csv', index=False)
    print(f"Total comments collected: {len(df)}")
    print(df.head())
```

Defines an asynchronous main function to orchestrate the comment collection process. Calls collect_reddit_comments with parameters subreddit_name='sarcasm', keyword='sarcastic', and limit=5000 (can be adjusted). Converts collected comments into a Pandas DataFrame (df), stores it as a CSV file (reddit_comments.csv), and prints summary information about the collected data.
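For reference, the file that df.to_csv('reddit_comments.csv', index=False) writes is a plain CSV with a header row. A pandas-free sketch of the same format, using made-up rows shaped like the collected comments:

```python
import csv

# Hypothetical rows shaped like the collected data: [comment, author, created_utc]
rows = [
    ["oh, fantastic", "user_a", 1700000000.0],
    ["wow. just wow.", "[deleted]", 1700000100.0],
]

# Equivalent of df.to_csv('reddit_comments.csv', index=False), without pandas
with open("reddit_comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["comment", "author", "created_utc"])  # Header row
    writer.writerows(rows)

with open("reddit_comments.csv", newline="", encoding="utf-8") as f:
    print(next(csv.reader(f)))  # → ['comment', 'author', 'created_utc']
```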

Running the Main Function

```python
await main()  # Works in Colab/Jupyter; in a plain script, use asyncio.run(main()) instead
```

Executes the main function asynchronously, initiating the process of collecting Reddit comments, processing them into a DataFrame, saving them to a CSV file, and providing feedback on the number of comments collected and a preview of the data.

Read Part 2 - Sarcasm Detection From Reddit Comments: Cleaning & Saving the Data

GitHub: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments

Author: Steven Mathew
