Kaushal

Posted on Jul 5, 2020 • Edited on Jul 12, 2020

Scraping memes from reddit using node.js

#webdev #javascript #beginners #node

We all like memes, don't we? If you've thought of making an application that serves memes from the internet but didn't know how, you've come to the right post! Here I will show you how to scrape memes from Reddit yourself, without relying on any other APIs. So let's get started!

We'll be using axios and cheerio for web scraping.

tl;dr

All the code demonstrated in this post is up on GitHub

Prerequisites:

Node.js installed.
npm installed. This should come with Node.js.

To check if they are installed, type

node --version

and

npm --version

Now that everything is installed, we can start.

Start

We will start in an empty folder. Run

npm init -y

to generate a package.json file. Now we can install the required dependencies. Run

npm install axios cheerio

Now let's actually start coding some JavaScript!

Make a file named index.js in the root directory of the project, and open it in your preferred text editor.

Now import the required libraries into your project.

const axios = require("axios");
const cheerio = require("cheerio");

Now we will choose a site to scrape from. For the sake of this guide I will be scraping memes from r/dankmemes.

const mainUrl = `https://reddit.com/r/dankmemes`;

Following the axios documentation, we will set up the initial code.

axios
    .get(mainUrl)
    .then((response) => {
        console.log(response.data);
    })
    .catch((err) => {
        console.log(err);
    });

The .get() method takes in the URL of the site. Because the request is asynchronous, .get() returns a promise, so you chain a .then() method to do something with the response once it arrives. For now we will just take the data and log it in the console.
If something goes wrong along the way, the .catch() method catches the error and logs it, so a failed request doesn't go unnoticed.
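If you prefer async/await, the same request could be sketched roughly like this. Note that fetchHtml is a name I made up for illustration, and the client parameter is assumed to be any object with an axios-style get() method, so you can pass axios itself.

```javascript
// Hypothetical helper: fetch a page and return its HTML, or null on failure.
// `client` is assumed to expose an axios-style get(url) that resolves to { data }.
const fetchHtml = async (url, client) => {
    try {
        const response = await client.get(url);
        return response.data;
    } catch (err) {
        // Same role as the .catch() in the promise chain
        console.log(err.message);
        return null;
    }
};
```

With axios required as before, a call like fetchHtml(mainUrl, axios).then((html) => console.log(html)) should behave much like the promise chain.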
So now let's finally run the code and see what we get!

node index.js

And whew! We get loads of text. But this "text" is actually the HTML code that is served at the URL we specified.

But we only need the image source, right? So now we should parse and filter the HTML we got, using an amazing library called cheerio. Cheerio helps us parse HTML in a jQuery-like fashion, which makes it amazingly easy to do our job. And it's fast too!

But we need to see what to filter, right? To know that, we have to visit the URL we specified, which is https://reddit.com/r/dankmemes. So head over to the site in another tab.

When everything has finished loading, right-click on any image post and choose Inspect. This should open Chrome's developer tools. When the image element is highlighted, you should see some other attributes inside the <img /> tag.

Below I have taken a random post on the subreddit, and you can see a src="" attribute on the right side of the screen. That is the data we need to scrape! But how exactly do we locate that image? Simple, we look at the other attributes of the same HTML element.

Here in our case we can see that the image has the following classes:

<img alt="Post image" class="_2_tDEnGMLxpM6uOa2kaDB3 ImageBox-image media-element _1XWObl-3b9tPy64oaG6fax" src="https://preview.redd.it/g64fe51e6z851.jpg?width=640&crop=smart&auto=webp&s=c5917f6…" style="max-height: 512px;">

Bingo! We got hold of the different classes in the image tag. So now let's continue with parsing this HTML data.

So instead of console logging the HTML, we will pass it into another function to parse this data.
So this should be your axios part so far.

axios
    .get(mainUrl)
    .then((response) => {
        dealWithData(response.data);
    })
    .catch((err) => {
        console.log(err);
    });

Now create a function called dealWithData() or any other name you have given in the .then() method.

Now we will add some code inside that new function.

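const dealWithData = (html) => {
    // Parse the HTML so we can query it with jQuery-style selectors
    const $ = cheerio.load(html);
    const urlMeme = $("._2_tDEnGMLxpM6uOa2kaDB3.ImageBox-image.media-element._1XWObl-3b9tPy64oaG6fax");
    // Stop early if the selector matched nothing
    if (urlMeme.length === 0) {
        return console.log("No images found.");
    }
    const indexValue = 0;
    console.log(`Source is:\n${urlMeme[indexValue].attribs.src}`);
};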

Here, we load the HTML with cheerio and assign the result to $, just to make it more jQuery-like.

Now that we know our image element's attributes, we select the matching image elements and assign them to a variable. Note that all images share the same class names, so you get an array-like collection of image elements back. You can console log it to see for yourself. So we will use an index value of 0, which gets us the first image. And then we log the src of the image element at that index. This will take some time to run but you should eventually get the result.

But there is a problem here. Notice that if you run this multiple times, you keep getting the same image source for as long as the first post on the page stays the same. So instead of hard-coding the index value, we will generate a random value.

const randNo = (limit) => {
    const thatNo = Math.floor(Math.random() * limit);
    return thatNo;
};
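As a quick standalone sketch of how this random index could be used, here it is picking from a plain array. pickRandom and the sample file names are made up for illustration, and randNo is repeated only so the snippet runs on its own.

```javascript
// Same idea as randNo() above: a random integer from 0 up to (but not including) limit.
const randNo = (limit) => Math.floor(Math.random() * limit);

// Hypothetical helper: pick a random item, or undefined when the list is empty.
const pickRandom = (items) => {
    if (items.length === 0) {
        return undefined;
    }
    return items[randNo(items.length)];
};

// Hypothetical stand-ins for scraped image sources.
const sampleSources = ["first.jpg", "second.jpg", "third.jpg"];
console.log(pickRandom(sampleSources)); // one of the three entries
console.log(pickRandom([])); // undefined
```

Because Math.random() never returns 1, the index always stays below the array length.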

Also don't forget to update the index value in your dealWithData() function, or whatever name you have given it.

const indexValue = randNo(urlMeme.length);

We will pass the length of the array to get a random index number. Now putting together all this code, we will run the full program.

And wow! You should get an output like this.

Source is:
https://preview.redd.it/gnmgdb09q0951.jpg?width=640&crop=smart&auto=webp&s=8175c12e8aaa356af8f7cc78fe4e0b83d37341e2

And done. You can visit the link and check. You now have your very own meme scraper!

The same code can be used to scrape from different subreddits. Just specify the required URL when starting out.
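For example, a tiny helper along these lines could build the URL for you. subredditUrl is a hypothetical name, and it simply assumes other subreddits follow the same URL pattern as mainUrl.

```javascript
// Hypothetical helper: build the listing URL for a given subreddit name.
const subredditUrl = (name) => `https://reddit.com/r/${encodeURIComponent(name)}`;

console.log(subredditUrl("dankmemes")); // https://reddit.com/r/dankmemes
console.log(subredditUrl("memes")); // https://reddit.com/r/memes
```

You would then pass the result to axios.get() in place of mainUrl.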

Happy coding! Cheers!

Top comments (1)

twoBn • Jan 6 '21

Is there a way to use Cheerio to find the post title too? I've tried to do it but with no luck.
