Have you ever encountered a web page requiring actions like “clicking a button” to reveal more content? Such pages are called "dynamic webpages," as they load more content based on user interaction. In contrast, static webpages display all their content at once without requiring user actions.
Scraping content from dynamic pages can be daunting as it requires simulating user interactions, such as clicking a button to access additional hidden content. In this tutorial, you'll learn how to scrape data from a webpage with infinite scrolling via a "Load more" button.
Prerequisites
To follow along with this tutorial, you need:
- Node.js: Install the version tagged "LTS" (Long-Term Support), which is more stable than the latest release.
- npm: The package manager you'll use to install packages. It ships with Node.js, so installing Node.js gives you npm as well.
- Cheerio: For parsing HTML
- Puppeteer: You’ll use this to control a headless browser.
- An IDE for building the scraper: You can use any code editor, like Visual Studio Code.
In addition, you’ll need to have a basic understanding of HTML, CSS, and JavaScript. You’ll also need a web browser like Chrome.
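If you're not sure whether Node.js and npm are already installed, you can check their versions from any terminal (the exact version numbers will vary):
$ node -v
$ npm -v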
Initialize the Project
Create a new folder, then open it in your code editor. Open a new terminal from the editor's "Terminal" tab; in Visual Studio Code, for example, you can use Terminal > New Terminal from the menu bar.
Next, run the following command in the terminal to install the packages needed for this build (json2csv is included now because you'll use it later to export the scraped data to CSV):
$ npm install cheerio puppeteer json2csv
Create a new file inside your project folder in the code editor and name it dynamicScraper.js.
Excellent work, buddy!
Accessing the Content of the Page
Puppeteer is a powerful Node.js library that allows you to control headless Chrome browsers, making it ideal for interacting with webpages. With Puppeteer, you can target a webpage using the URL, access the contents, and easily extract data from that page.
In this section, you’ll learn how to open a page using a headless browser, access the content, and retrieve the HTML content of that page. You can find the target website for this tutorial here.
Note: Write all the code for this tutorial inside the dynamicScraper.js file.
Start by importing Puppeteer with the require() function, a Node.js built-in that loads modules: core modules, third-party libraries (like Puppeteer), or custom modules (like your local JS files).
const puppeteer = require('puppeteer');
Next, define a variable for storing your target URL. Doing this isn’t mandatory, but it makes your code cleaner, as you just have to reference this global variable from anywhere in your code.
const url = 'https://www.scrapingcourse.com/button-click';
The next step is to create the function that launches the headless browser and retrieves the HTML content of the target page. An async Immediately Invoked Function Expression (IIFE) is a convenient pattern here: it runs as soon as it's defined and lets you use await without declaring and calling a separate named function.
Define an asynchronous IIFE with a try/catch block:
(async () => {
  try {
    // Code goes here
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
Note: Write all the remaining code for this tutorial segment inside the try block.
Right inside the IIFE, create a new browser instance and open a page for the interaction.
Launch the browser with Puppeteer's launch() method and pass it the headless option, which can be set to true or false. Setting headless to true runs the browser without a visible window, while setting it to false opens a window you can watch.
After launching the browser, call the newPage() method, which opens a new tab in that browser.
// Launch Puppeteer
const browser = await puppeteer.launch({ headless: false }); // Headless mode
const page = await browser.newPage(); // Open a new page
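Optionally, you can give the page a fixed viewport size right after opening it, so the visible (non-headless) window renders the page at a predictable size. This isn't required for the tutorial; it's just a convenience using Puppeteer's setViewport() method:
// Optional: set a fixed viewport size for the visible browser window
await page.setViewport({ width: 1280, height: 800 });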
Next, use the page object to open the target URL in this new tab with the page.goto() method. You also want Puppeteer to consider the page ready for interaction and data extraction only once all of its essential resources (like images and JS) have loaded.
To control this, page.goto() accepts an option called waitUntil, which can take several values that define when navigation is considered complete:
- load: Waits for the load event, which fires after the HTML document and its resources (e.g., images, CSS, JS) have been loaded. However, it may not account for additional JavaScript-rendered content that loads after the load event.
- domcontentloaded: Waits for the DOMContentLoaded event, which is triggered once the initial HTML is parsed. This fires before external resources (like images or additional JS) finish loading.
- networkidle2: Waits until there are no more than two active network requests (ongoing HTTP requests, e.g., for images, scripts, or other resources) for 500 milliseconds. This value is preferred for pages that make small, continuous requests that don't affect the main content.
// Navigate to the target URL
await page.goto(url, {
  waitUntil: 'networkidle2', // Ensure the page is fully loaded
});
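If the page is slow to load on your connection, page.goto() may throw a navigation timeout error. The call also accepts a timeout option (in milliseconds) that you can raise if needed; the value below is just an illustration with a 60-second limit:
// Optional: allow up to 60 seconds for slow page loads
await page.goto(url, {
  waitUntil: 'networkidle2',
  timeout: 60000,
});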
Finally, retrieve the full HTML content of the current page using the page.content() method. Just as importantly, close the browser instance with browser.close() at the end of your script to avoid unnecessary memory usage, which can slow your system down.
// Get the full HTML content of the page
const html = await page.content();
// Log the entire HTML content
console.log(html);
// Close the browser
await browser.close();
With the code as it is, the browser opens and closes very quickly, so you may barely see the page at all. You can hold the browser open for a few seconds by adding a delay just before the browser.close() call. Older Puppeteer versions offered a page.waitForTimeout() method for this, but it has since been deprecated and removed, so a plain Promise-based delay (the same approach used later in this tutorial) is the safer choice.
// Delay for 10 seconds to allow you to see the browser
await new Promise(resolve => setTimeout(resolve, 10000));
Here’s the entire code for this section:
const puppeteer = require('puppeteer');

const url = 'https://www.scrapingcourse.com/button-click';

(async () => {
  try {
    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: false }); // Headless mode
    const page = await browser.newPage(); // Open a new page

    // Navigate to the target URL
    await page.goto(url, {
      waitUntil: 'networkidle2', // Ensure the page is fully loaded
    });

    // Get the entire HTML content of the page
    const html = await page.content();

    // Log the entire HTML content
    console.log(html);

    // Delay for 10 seconds to allow you to see the browser
    await new Promise(resolve => setTimeout(resolve, 10000));

    // Close the browser
    await browser.close();
  } catch (error) {
    console.error('Error fetching the page:', error.message);
  }
})();
Save your file and run the script in your terminal using the command below:
$ node dynamicScraper.js
The script opens a visible browser window (because headless is set to false), loads the target page, fetches its full HTML content, and logs that content to the terminal.
Here’s the output you should get in your terminal:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Load More Button Challenge - ScrapingCourse.com</title>
</head>
<body>
<header>
<!-- Navigation Bar -->
<nav>
<a href="/">
<img src="logo.svg" alt="Logo">
<span>Scraping Course</span>
</a>
</nav>
</header>
<main>
<!-- Product Grid -->
<div id="product-grid">
<div class="product-item">
<a href="/ecommerce/product/chaz-kangeroo-hoodie">
<img src="mh01-gray_main.jpg" alt="Chaz Kangeroo Hoodie">
<span class="product-name">Chaz Kangeroo Hoodie</span>
<span class="product-price">$52</span>
</a>
</div>
<div class="product-item">
<a href="/ecommerce/product/teton-pullover-hoodie">
<img src="mh02-black_main.jpg" alt="Teton Pullover Hoodie">
<span class="product-name">Teton Pullover Hoodie</span>
<span class="product-price">$70</span>
</a>
</div>
<!-- Additional products (3-12) follow the same structure -->
</div>
<!-- Load More Button -->
<div id="load-more-container">
<button id="load-more-btn">Load more</button>
</div>
</main>
</body>
</html>
Note that the output above is trimmed for brevity; your actual output will contain the full markup, but it should follow the same structure.
Wow! You should be proud of yourself for getting this far. You’ve just completed your first attempt at scraping the contents of a webpage.
Simulate the Load More Products Process
Here, you want to access more products. To do that, you need to click the “Load more” button repeatedly until you’ve either exhausted the product list or loaded as many products as you need.
To access this button and click on it, you must first locate the element using any CSS selectors (the class, id, attribute of the element, or tag name).
This tutorial aims to get 48 products from the target website. The page starts with 12 products and each click loads 12 more, so you’ll have to click the “Load more” button three times.
Start by locating the “Load more” button using any of the CSS selectors on it. Go to the target website, find the “Load more” button, right-click it, and select the Inspect option.
Selecting the Inspect option opens the browser’s developer tools with the button element highlighted. You’ll see that the “Load more” button element has an id attribute with the value "load-more-btn". You can use this id selector to locate the button during the simulation and click it multiple times.
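Before wiring the selector into your script, you can optionally sanity-check it in the DevTools Console tab. If the selector is correct, the query below should report exactly one element:
// Run this in the browser's DevTools console (not in your script)
document.querySelectorAll('#load-more-btn').length; // should print 1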
Back in the code, still inside the try block, continue after the line that logs the HTML content for the default 12 products on the page.
Define the number of times you want to click the button. Recall that each click loads an additional 12 products. For 48 products, three clicks are required to load the remaining 36.
// Number of times to click "Load More"
const clicks = 3;
Next, set up a loop to simulate the clicks. Use a for loop that runs clicks times, with i as the loop counter:
for (let i = 0; i < clicks; i++) {
  try {
  } catch (error) {
    console.log('No more "Load More" button or an error occurred:', error.message);
    break; // Exit the loop if no button is found or an error occurs
  }
}
Note: Your remaining code for this section should be written inside the try block in the for loop.
To help with debugging and tracking the output, log the current click attempt.
console.log(`Clicking the 'Load More' button - Attempt ${i + 1}`);
Next, you need to locate the “Load more” button and click it. Before simulating the click, though, make sure the button is actually present and visible.
Puppeteer provides the waitForSelector() method to wait for an element to appear (and optionally become visible) before you interact with it.
For the “Load more” button, locate it by its id selector and wait for it to become visible, like this:
// Wait for the "Load More" button to be visible and click it
await page.waitForSelector('#load-more-btn', { visible: true });
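By default, waitForSelector() waits up to 30 seconds before throwing an error. If you’d rather fail faster when the button is gone (for example, once every product has been loaded), you can pass a shorter timeout; this is optional and the value below is only an example:
// Optional: wait at most 5 seconds for the button before giving up
await page.waitForSelector('#load-more-btn', { visible: true, timeout: 5000 });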
Now that you know the “Load more” button is available, you can click it using the Puppeteer click() method.
// Click the Load more button once it is available
await page.click('#load-more-btn');
After you simulate a click on the “Load more” button, wait for the new content to load before simulating another click, since the data may depend on a server request. Introduce a delay between clicks by wrapping setTimeout() in a Promise and awaiting it.
The code below makes the script wait two seconds before simulating the next click on the “Load more” button:
// Wait 2 seconds for the new content to load using setTimeout
await new Promise(resolve => setTimeout(resolve, 2000));
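A fixed two-second delay works for this tutorial, but it’s a guess: it may be longer than necessary, or too short on a slow connection. As an alternative sketch, you could replace the click-and-fixed-delay pair with a wait that only returns once the number of .product-item elements has actually increased, using Puppeteer’s waitForFunction() method (the variable name previousCount here is illustrative, not part of the tutorial’s code):
// Sketch: wait until more products are present than before the click
const previousCount = await page.$$eval('.product-item', (items) => items.length);
await page.click('#load-more-btn');
await page.waitForFunction(
  (count) => document.querySelectorAll('.product-item').length > count,
  {},
  previousCount
);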
To wrap up this section, fetch the current HTML content after each click using the content() method and log the output to the terminal.
// Get and log the updated full HTML after each click
html = await page.content();
console.log(`Full HTML after ${12 * (i + 2)} products loaded:`);
console.log(html);
Your complete code up until now:
const puppeteer = require('puppeteer');

(async () => {
  try {
    const browser = await puppeteer.launch({ headless: false }); // Launch the browser
    const page = await browser.newPage(); // Open a new page

    // Navigate to the target website
    await page.goto('https://www.scrapingcourse.com/button-click', {
      waitUntil: 'networkidle2', // Wait until the network is idle
    });

    console.log("Initial page loaded with 12 products");

    // Get full HTML of the initial page
    let html = await page.content();

    // Log the full HTML (first 12 products)
    console.log(html);

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      try {
        console.log(`Clicking the 'Load More' button - Attempt ${i + 1}`);

        // Wait for the "Load More" button to be visible and click it
        await page.waitForSelector('#load-more-btn', { visible: true });
        await page.click('#load-more-btn');

        // Wait 2 seconds for the new content to load using setTimeout
        await new Promise(resolve => setTimeout(resolve, 2000));

        // Get and log the updated full HTML after each click
        html = await page.content();
        console.log(`Full HTML after ${12 * (i + 2)} products loaded:`);
        console.log(html);
      } catch (error) {
        console.log('No more "Load More" button or an error occurred:', error.message);
        break; // Exit the loop if no button is found or an error occurs
      }
    }

    // Delay for 10 seconds to allow you to see the browser
    await new Promise(resolve => setTimeout(resolve, 10000));

    await browser.close(); // Close the browser
  } catch (error) {
    console.error('Error fetching the page:', error.message);
  }
})();
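As a side note, the loop above assumes you know in advance how many clicks you need. If you wanted every product regardless of count, one possible variation (a rough sketch, assuming the site removes the button from the DOM once everything is loaded) is to keep clicking while the button is still present; page.$() resolves to null when nothing matches the selector:
// Sketch: click "Load more" until the button disappears from the page
while (await page.$('#load-more-btn') !== null) {
  await page.click('#load-more-btn');
  await new Promise(resolve => setTimeout(resolve, 2000)); // let new products load
}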
Here’s the output of simulating the button click three times to get 48 products:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Load More Button Challenge</title>
</head>
<body>
<header>
<nav>
<!-- Navigation and Logo Section -->
</nav>
</header>
<main>
<h1>Load More Products</h1>
<!-- Products Section -->
<div id="product-grid">
<!-- Product 1 -->
<div class="product-item">
<a href="https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie">
<img src="https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg" alt="Chaz Kangeroo Hoodie">
<div>
<span>Chaz Kangeroo Hoodie</span>
<span>$52</span>
</div>
</a>
</div>
<!-- Product 2 -->
<div class="product-item">
<a href="https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie">
<img src="https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg" alt="Teton Pullover Hoodie">
<div>
<span>Teton Pullover Hoodie</span>
<span>$70</span>
</div>
</a>
</div>
<!-- Product 3 -->
<div class="product-item">
<a href="https://scrapingcourse.com/ecommerce/product/bruno-compete-hoodie">
<img src="https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh03-black_main.jpg" alt="Bruno Compete Hoodie">
<div>
<span>Bruno Compete Hoodie</span>
<span>$63</span>
</div>
</a>
</div>
<!-- ... -->
<!-- Additional products follow the same structure -->
<!-- Total of 48 products loaded -->
</div>
<!-- Load More Button -->
<div id="load-more-container">
<button id="load-more-btn">Load More</button>
</div>
</main>
<!-- Bootstrap and jQuery libraries -->
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.5.2/dist/umd/popper.min.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body>
</html>
Parse Product Information
It’s essential that you parse the current output (the entire HTML of the 48 products) to make it readable and well-structured. To make the current output meaningful, you’ll need to extract specific information for each product, such as its name, price, image URL, and link.
Visit the target website, load more products, and look through a few of them to see how each product is structured and which class selectors to use to get the information you need.
The code snippet below is what a product structure looks like:
<div class="product-item flex flex-col items-center rounded-lg">
<a href="https://scrapingcourse.com/ecommerce/product/bruno-compete-hoodie">
<img class="product-image rounded-lg" width="200" height="240" decoding="async" fetchpriority="high" src="https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh03-black_main.jpg" alt="Bruno Compete Hoodie">
<div class="product-info self-start text-left w-full">
<span class="product-name">Bruno Compete Hoodie</span>
<br>
<span class="product-price text-slate-600">$63</span>
</div>
</a>
</div>
While inspecting each product, you should notice they all share a common class: product-item. You’ll use this class to select each product in the list of products.
These are the CSS selectors of the product details you need to parse: product name (.product-name), price (.product-price), image (.product-image), and link (the href of the <a> tag).
To get started, import the Cheerio library, which will be used to parse the HTML content.
const cheerio = require('cheerio');
From here on, you only need the final output containing all 48 products, so clean up the code from the previous section: drop the intermediate logging, and move the html variable below the for loop so you capture the page content just once, after all the clicks have finished.
Your clean-up code should be identical to this code snippet:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  try {
    const browser = await puppeteer.launch({ headless: false }); // Launch the browser
    const page = await browser.newPage(); // Open a new page

    // Navigate to the target website
    await page.goto('https://www.scrapingcourse.com/button-click', {
      waitUntil: 'networkidle2', // Wait until the network is idle
    });

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      try {
        // Wait for the "Load More" button to be visible and click it
        await page.waitForSelector('#load-more-btn', { visible: true });
        await page.click('#load-more-btn');

        // Wait 2 seconds for the new content to load using setTimeout
        await new Promise(resolve => setTimeout(resolve, 2000));
      } catch (error) {
        console.log('No more "Load More" button or an error occurred:', error.message);
        break; // Exit the loop if no button is found or an error occurs
      }
    }

    // Get the final HTML content
    const html = await page.content();

    await browser.close(); // Close the browser
  } catch (error) {
    console.error('Error fetching the page:', error.message);
  }
})();
Now, let’s get into the HTML parsing using Cheerio.
First of all, Cheerio needs access to the HTML content it’s going to parse. For that, it provides a load() method that takes in the HTML content and makes it accessible through jQuery-like syntax.
Create an instance of the Cheerio library with the HTML content:
// Load HTML into Cheerio
const $ = cheerio.load(html);
You can now use $ to query and manipulate elements in the loaded HTML.
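As a quick sanity check, you can confirm that Cheerio sees all the products before extracting anything:
// Quick check: how many products did Cheerio find in the HTML?
console.log($('.product-item').length); // should print 48 if all clicks succeeded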
Next, initialize an array to store the product information. This array will hold the extracted data, with each product stored as an object containing its name, price, image, and link.
// Array to store product information
const products = [];
Recall that each product has the class product-item. You’ll use this class with the Cheerio instance ($) to select each product and then work with it.
The .each() method iterates through every element that matches the .product-item class selector.
$('.product-item').each((_, element) => {
});
Let’s retrieve each product detail using the class selector of that particular detail. For instance, to get the product name, find the child element inside each product with the class selector .product-name, retrieve its text content, and trim it to remove any surrounding whitespace.
$('.product-item').each((_, product) => {
  const name = $(product).find('.product-name').text().trim();
});
- $(product).find('.product-name'): Searches within the current .product-item for the child element with the class .product-name.
- .text(): Retrieves the text content inside the element.
- .trim(): Removes unnecessary whitespace from the text.
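To see what the .text() and .trim() calls from the list above do in isolation, here’s a tiny standalone snippet; the markup in it is made up purely for illustration:
const cheerio = require('cheerio');

// Hypothetical markup with extra whitespace around the name
const $ = cheerio.load('<span class="product-name">  Chaz Kangeroo Hoodie \n</span>');

console.log($('.product-name').text());        // "  Chaz Kangeroo Hoodie \n"
console.log($('.product-name').text().trim()); // "Chaz Kangeroo Hoodie"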
Following the same approach, get the price, image URL, and link using their respective selectors:
$('.product-item').each((_, product) => {
  const name = $(product).find('.product-name').text().trim();
  const price = $(product).find('.product-price').text().trim();
  const image = $(product).find('.product-image').attr('src');
  const link = $(product).find('a').attr('href');
});
Now that you have all the expected information, the next step is to push each parsed product as an individual object into the products array:
$('.product-item').each((_, product) => {
  const name = $(product).find('.product-name').text().trim();
  const price = $(product).find('.product-price').text().trim();
  const image = $(product).find('.product-image').attr('src');
  const link = $(product).find('a').attr('href');

  products.push({
    name,
    price,
    image,
    link,
  });
});
Finally, log the products array to see the parsed output in the terminal:
console.log(`Total products parsed: ${products.length}`);
console.log(products);
Your entire code should look like this code snippet:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  try {
    const browser = await puppeteer.launch({ headless: false }); // Launch the browser
    const page = await browser.newPage(); // Open a new page

    // Navigate to the target website
    await page.goto('https://www.scrapingcourse.com/button-click', {
      waitUntil: 'networkidle2', // Wait until the network is idle
    });

    // Number of times to click "Load More"
    const clicks = 3;

    for (let i = 0; i < clicks; i++) {
      try {
        // Wait for the "Load More" button to be visible and click it
        await page.waitForSelector('#load-more-btn', { visible: true });
        await page.click('#load-more-btn');

        // Wait 2 seconds for the new content to load using setTimeout
        await new Promise(resolve => setTimeout(resolve, 2000));
      } catch (error) {
        console.log('No more "Load More" button or an error occurred:', error.message);
        break; // Exit the loop if no button is found or an error occurs
      }
    }

    // Get the final HTML content
    const html = await page.content();

    // Load HTML into Cheerio
    const $ = cheerio.load(html);

    // Array to store product information
    const products = [];

    $('.product-item').each((_, product) => {
      const name = $(product).find('.product-name').text().trim();
      const price = $(product).find('.product-price').text().trim();
      const image = $(product).find('.product-image').attr('src');
      const link = $(product).find('a').attr('href');

      products.push({
        name,
        price,
        image,
        link,
      });
    });

    console.log(`Total products parsed: ${products.length}`);
    console.log(products); // Output all parsed product information

    await browser.close(); // Close the browser
  } catch (error) {
    console.error('Error fetching the page:', error.message);
  }
})();
Here’s what your output should look like when you save and run the script:
Total products parsed: 48
[
{
name: 'Chaz Kangeroo Hoodie',
price: '$52',
image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg',
link: 'https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie'
},
{
name: 'Teton Pullover Hoodie',
price: '$70',
image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg',
link: 'https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie'
},
{
name: 'Bruno Compete Hoodie',
price: '$63',
image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh03-black_main.jpg',
link: 'https://scrapingcourse.com/ecommerce/product/bruno-compete-hoodie'
},
{
name: 'Frankie Sweatshirt',
price: '$60',
image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh04-green_main.jpg',
link: 'https://scrapingcourse.com/ecommerce/product/frankie--sweatshirt'
},
// Every other product goes here, reduced to make things brief and concise
{
name: 'Zoltan Gym Tee',
price: '$29',
image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main.jpg',
link: 'https://scrapingcourse.com/ecommerce/product/zoltan-gym-tee'
}
]
Export Product Information to CSV
The next step is to export the parsed product information, which is currently an array of JavaScript objects (JSON-like data), into the Comma-Separated Values (CSV) format. You’ll use the json2csv library to convert the parsed data into its corresponding CSV format.
Start by importing the required modules.
Node.js provides the file system (fs) module for file handling, such as writing data to a file. After importing the fs module, destructure the parse() method from the json2csv library.
const fs = require('fs');
const { parse } = require('json2csv');
CSV files usually require column headers; write them in the same order as your parsed information. Here, the parsed data is the products array, where each element is an object with four keys (name, price, image, and link). Use these object keys as your column headers for proper mapping.
Define the fields (column headers) for your CSV file:
// Define CSV fields
const fields = ['name', 'price', 'image', 'link'];
Now that you’ve defined your fields, the next step is to convert the parsed information to CSV format. The parse() method works in this format: parse(WHAT_YOU_WANT_TO_CONVERT, { YOUR_COLUMN_HEADERS }).
// Convert JSON to CSV
const csv = parse(products, { fields });
Now save this CSV data to a new file with the .csv extension. In Node.js, you can handle file creation using the writeFileSync() method on the fs module. This method takes two parameters: the file name and the data to write.
// Save CSV to a file
fs.writeFileSync('products.csv', csv);
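If you’d also like to keep a raw JSON copy of the data alongside the CSV (optional, not required for this tutorial), the same fs module can write it in one extra line:
// Optional: also save the parsed products as formatted JSON
fs.writeFileSync('products.json', JSON.stringify(products, null, 2));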
Your complete code for this section should look like this:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const fs = require('fs');
const { parse } = require('json2csv');

(async () => {
  const browser = await puppeteer.launch({ headless: false }); // Launch Puppeteer
  const page = await browser.newPage(); // Open a new page

  // Navigate to the website
  await page.goto('https://www.scrapingcourse.com/button-click', {
    waitUntil: 'networkidle2',
  });

  // Click "Load More" 3 times to load all products
  for (let i = 0; i < 3; i++) {
    try {
      await page.waitForSelector('#load-more-btn', { visible: true });
      await page.click('#load-more-btn');
      await new Promise(resolve => setTimeout(resolve, 2000)); // Wait for 2 seconds
    } catch (error) {
      console.log('No more "Load More" button or an error occurred:', error.message);
      break;
    }
  }

  // Get the final HTML content
  const html = await page.content();

  // Use Cheerio to parse the product data
  const $ = cheerio.load(html);
  const products = [];

  $('.product-item').each((_, element) => {
    const name = $(element).find('.product-name').text().trim();
    const price = $(element).find('.product-price').text().trim();
    const image = $(element).find('.product-image').attr('src');
    const link = $(element).find('a').attr('href');

    products.push({
      name,
      price,
      image,
      link,
    });
  });

  console.log(`Total products parsed: ${products.length}`);

  // Convert product information to CSV
  try {
    // Define CSV fields
    const fields = ['name', 'price', 'image', 'link'];

    // Convert JSON to CSV
    const csv = parse(products, { fields });

    // Save CSV to a file
    fs.writeFileSync('products.csv', csv);
    console.log('Product information exported to products.csv');
  } catch (error) {
    console.error('Error exporting to CSV:', error.message);
  }

  await browser.close(); // Close the browser
})();
Once you save and run the script, you should see a new file named products.csv appear in your project folder.
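Opening products.csv, the first few rows should look roughly like the sample below (json2csv wraps values in double quotes by default, and the exact rows depend on the products parsed):
"name","price","image","link"
"Chaz Kangeroo Hoodie","$52","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg","https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie"
"Teton Pullover Hoodie","$70","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg","https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie"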
Conclusion
This tutorial delved into the intricacies of scraping data from a page that requires simulated interaction to access its hidden content. You learned how to scrape dynamic pages using Node.js, Puppeteer, and Cheerio, parse the scraped data into a more organized format, and export it to a CSV file.