What will be scraped
If you don't need an explanation, have a look at the full code example in the online IDE.
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const searchQuery = "Honolulu";
const URL = `https://www.google.com/travel/things-to-do`;

async function getPlaces(page) {
  const height = await page.evaluate(() => document.querySelector(".zQTmif").scrollHeight);
  const scrollIterationCount = 10;
  for (let i = 0; i < scrollIterationCount; i++) {
    await page.mouse.wheel({ deltaY: height / scrollIterationCount });
    await page.waitForTimeout(2000);
  }
  return await page.evaluate(() =>
    Array.from(document.querySelectorAll(".f4hh3d")).map((el) => ({
      thumbnail: el.querySelector(".kXlUEb img")?.getAttribute("src") || "No thumbnail",
      title: el.querySelector(".GwjAi .skFvHc")?.textContent.trim(),
      description: el.querySelector(".GwjAi .nFoFM")?.textContent.trim() || "No description",
      rating: parseFloat(el.querySelector(".GwjAi .KFi5wf")?.textContent.trim()) || "No rating",
      reviews:
        parseInt(
          el
            .querySelector(".GwjAi .jdzyld")
            ?.textContent.trim()
            .replace(/[\(|\)|\s]/gm, "")
        ) || "No reviews",
    }))
  );
}

async function getThingsToDoResults() {
  const browser = await puppeteer.launch({
    headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  await page.waitForSelector("[type='text']");
  await page.click("[type='text']");
  await page.waitForTimeout(1000);
  await page.keyboard.type(searchQuery);
  await page.waitForTimeout(1000);
  await page.keyboard.press("Enter");
  await page.waitForSelector(".GtiGue button");
  await page.click(".GtiGue button");
  await page.waitForSelector(".f4hh3d");
  await page.waitForTimeout(2000);

  const options = Array.from(await page.$$(".iydyUc"));

  const places = {
    all: await getPlaces(page),
  };

  for (const option of options) {
    await option.click();
    await page.waitForSelector(".f4hh3d");
    await page.waitForTimeout(2000);
    const optionName = await option.$eval(".m1GHmf", (node) => node.textContent.trim());
    places[`${optionName}`] = await getPlaces(page);
  }

  await browser.close();

  return places;
}

getThingsToDoResults().then(console.log);
Preparation
First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.
To do this, in the directory with our project, open the command line and enter:
$ npm init -y
And then:
$ npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.
Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to prevent websites from detecting that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.
Now that the Node.js environment for our project is set up, we can move on to the step-by-step code explanation.
Process
We need to extract data from HTML elements. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.
We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.
The GIF below illustrates the approach of selecting different parts of the results using SelectorGadget.
Code explanation
Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
Next, we tell puppeteer to use StealthPlugin, and define the search query and the search URL:
puppeteer.use(StealthPlugin());
const searchQuery = "Honolulu";
const URL = `https://www.google.com/travel/things-to-do`;
Next, we write a function to get places from the page:
async function getPlaces(page) {
...
}
In this function we'll use the following methods and properties to get the necessary information:
- evaluate();
- querySelector();
- scrollHeight;
- mouse.wheel();
- waitForTimeout();
- querySelectorAll();
- getAttribute();
- textContent;
- Array.from();
- parseFloat();
- parseInt();
- replace();
First, we need to scroll the page to load all the thumbnails. To do this, we get the scrollHeight of the ".zQTmif" element, define scrollIterationCount (increase this value if not all thumbnails are loaded), and then scroll the page in a for loop:
const height = await page.evaluate(() => document.querySelector(".zQTmif").scrollHeight);
const scrollIterationCount = 10;
for (let i = 0; i < scrollIterationCount; i++) {
  await page.mouse.wheel({ deltaY: height / scrollIterationCount });
  await page.waitForTimeout(2000);
}
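The loop above relies on page.waitForTimeout(), which newer Puppeteer releases have deprecated and later removed, so the call may be unavailable after a fresh npm i. A minimal sketch of the same stepwise scroll with a plain Promise-based pause could look like this; sleep and scrollInSteps are hypothetical helper names that are not part of the original script, and the only Puppeteer call assumed on page is mouse.wheel():

```javascript
// Hypothetical helper: a timer-based pause that does not depend on Puppeteer.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: scrolls `totalHeight` pixels in `steps` equal wheel
// movements, pausing after each one so lazy-loaded thumbnails can appear.
async function scrollInSteps(page, totalHeight, steps = 10, pauseMs = 2000) {
  for (let i = 0; i < steps; i++) {
    await page.mouse.wheel({ deltaY: totalHeight / steps });
    await sleep(pauseMs);
  }
}
```

Inside getPlaces() this would be called as await scrollInSteps(page, height, scrollIterationCount), and the other waitForTimeout() calls could be swapped for await sleep(...) in the same way.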
Then, we get and return the information about all places on the page (using the evaluate() method):
return await page.evaluate(() =>
  Array.from(document.querySelectorAll(".f4hh3d")).map((el) => ({
    thumbnail: el.querySelector(".kXlUEb img")?.getAttribute("src") || "No thumbnail",
    title: el.querySelector(".GwjAi .skFvHc")?.textContent.trim(),
    description: el.querySelector(".GwjAi .nFoFM")?.textContent.trim() || "No description",
    rating: parseFloat(el.querySelector(".GwjAi .KFi5wf")?.textContent.trim()) || "No rating",
    reviews:
      parseInt(
        el
          .querySelector(".GwjAi .jdzyld")
          ?.textContent.trim()
          .replace(/[\(|\)|\s]/gm, "") // this RegEx matches "(", or ")", or any white space
      ) || "No reviews",
  }))
);
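One thing to keep in mind about the reviews field: parseInt() stops at the first character that is not a digit, so a string such as "1,234" is parsed as 1 (and, inside a character class, the "|" in the RegEx is a literal pipe rather than an "or"). If the page shows counts with thousands separators or abbreviations, a stricter parser may help. The sketch below is a hypothetical helper, not part of the original script, and the input formats it handles ("(68)", "(1,234)", "(4.5K)") are assumptions about what the element may contain:

```javascript
// Hypothetical helper: converts review-count text into a number.
// Assumed input formats: "(68)", "(1,234)", "(4.5K)", "(1.2M)".
function parseReviewCount(text) {
  if (!text) return null;
  // Drop parentheses, white space and thousands separators.
  const cleaned = text.replace(/[()\s,]/g, "");
  // A number with an optional decimal part and an optional K/M suffix.
  const match = cleaned.match(/^(\d+(?:\.\d+)?)([KkMm])?$/);
  if (!match) return null;
  const value = parseFloat(match[1]);
  const suffix = (match[2] || "").toUpperCase();
  const multiplier = suffix === "K" ? 1e3 : suffix === "M" ? 1e6 : 1;
  return Math.round(value * multiplier);
}
```

For example, parseReviewCount("(1,234)") returns 1234, while text that does not look like a count returns null. Because page.evaluate() runs its callback in the browser context, the function would have to be declared inside that callback to be used there.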
Next, we write a function to control the browser, and get information from each category:
async function getThingsToDoResults() {
...
}
In this function, we first define browser using the puppeteer.launch({options}) method with the current options: headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"]. These options mean that we use headless mode and pass an array of arguments that allow the browser process to launch in the online IDE. Then we open a new page:
const browser = await puppeteer.launch({
  headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const page = await browser.newPage();
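If you switch between headless and visible mode often, the launch options can be built from an environment variable instead of editing the file each time. This is only a sketch: buildLaunchOptions is a hypothetical helper and the variable name HEADLESS is an assumption made for this example, while headless and args are the same options used above:

```javascript
// Hypothetical helper: builds puppeteer.launch() options from environment
// variables. The variable name HEADLESS is an assumption made for this sketch.
function buildLaunchOptions(env = process.env) {
  return {
    // Headless unless HEADLESS=false is set explicitly.
    headless: env.HEADLESS !== "false",
    // Same sandbox flags as in the original script.
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  };
}
```

The browser would then be started with puppeteer.launch(buildLaunchOptions()), and running HEADLESS=false node YOUR_FILE_NAME would show the browser window.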
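Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) for slow internet connections with the .setDefaultNavigationTimeout() method (it applies to navigation calls such as .goto(); selector waits keep their own default unless .setDefaultTimeout() is also called), and go to URL with the .goto() method: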
await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
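Puppeteer keeps two separate defaults here: setDefaultNavigationTimeout() covers navigation calls such as goto(), while setDefaultTimeout() covers waiting calls such as waitForSelector(). If the selector waits should also get 60 seconds, both can be set together. A minimal sketch, where applyTimeouts is a hypothetical helper name:

```javascript
// Hypothetical helper: raises both Puppeteer timeouts on a page.
function applyTimeouts(page, timeoutMs = 60000) {
  // Affects navigation calls such as page.goto().
  page.setDefaultNavigationTimeout(timeoutMs);
  // Affects waiting calls such as page.waitForSelector().
  page.setDefaultTimeout(timeoutMs);
}
```

It would be called once, right after browser.newPage(), as applyTimeouts(page).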
Then we wait until the "[type='text']" selector is loaded (waitForSelector() method), click on this input field and type searchQuery (keyboard.type() method), press the "Enter" key (keyboard.press() method), and then click on the "See all top sights" button:
await page.waitForSelector("[type='text']");
await page.click("[type='text']");
await page.waitForTimeout(1000);
await page.keyboard.type(searchQuery);
await page.waitForTimeout(1000);
await page.keyboard.press("Enter");
await page.waitForSelector(".GtiGue button");
await page.click(".GtiGue button");
await page.waitForSelector(".f4hh3d");
await page.waitForTimeout(2000);
Then we define the places object and add the places information from the page to the all key:
const places = {
  all: await getPlaces(page),
};
Next, we get all categories from the page, click on each one to collect its places information, and store the result in the places object under a key with the category name:
const categories = Array.from(await page.$$(".iydyUc"));
for (const category of categories) {
  await category.click();
  await page.waitForSelector(".f4hh3d");
  await page.waitForTimeout(2000);
  const categoryName = await category.$eval(".m1GHmf", (node) => node.textContent.trim());
  places[`${categoryName}`] = await getPlaces(page);
}
And finally, we close the browser, and return the received data:
await browser.close();
return places;
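As written, browser.close() is only reached when every step before it succeeds; if a waitForSelector() call times out, the Chromium process keeps running. One way to guard against that is a try/finally wrapper. The sketch below uses a hypothetical helper name, withBrowser, and takes the launch function as a parameter so it does not depend on any particular Puppeteer setup:

```javascript
// Hypothetical helper: runs `task(browser)` and always closes the browser,
// even when the task throws. `launch` is any function returning a browser,
// for example () => puppeteer.launch({ headless: true }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}
```

With this in place, the body of getThingsToDoResults() could run inside withBrowser(() => puppeteer.launch({...}), async (browser) => { ... return places; }).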
Now we can launch our parser:
$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file
Output
{
  "all": [
    {
      "thumbnail": "https://encrypted-tbn1.gstatic.com/licensed-image?q=tbn:ANd9GcSNARkYcqi7DBwaNx9w-qMSlFVL_nYNTuu0bX8zgIswYAjlyIx9oIpilLInYWdr7xWXGdy2zSTyhYnO_GjbBYhOJQ",
      "title": "Tonggs Beach",
      "description": "Surfing and beach",
      "rating": 4.4,
      "reviews": 68
    },
    {
      "thumbnail": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRMnzB_-HjKVPLtoD-QSTeLiLbxb87JCCaKmiI_179MO1zj1uRo30CQ41icaJrOEihrQQYwXFpvojMpEg",
      "title": "Kaluahole Beach",
      "description": "Beach",
      "rating": 2.5,
      "reviews": 2
    },
    ...and other places
  ],
  "History": [
    {
      "thumbnail": "https://encrypted-tbn2.gstatic.com/licensed-image?q=tbn:ANd9GcRlsOO0zJJJhXHxJdoms3a0VSDHdTSOlARXlcyBI7THZ64LnuaSAuBdlvYYxliXdo8fO666Fu3QSisgG-cWt9pt-Q",
      "title": "Pearl Harbor Aviation Museum",
      "description": "Exhibits on WWII aviation in the Pacific",
      "rating": 4.6,
      "reviews": 4
    },
    {
      "thumbnail": "https://encrypted-tbn1.gstatic.com/licensed-image?q=tbn:ANd9GcSgKRnVx6y-cH0Jq-h64UDAc50iwHHMOaARxnQN8xH2n_CBGIMSgQM0QGTs_qZWY65VS0sOtmgLEN9rI87k03MQiA",
      "title": "Bishop Museum",
      "description": "Polynesian culture & natural history",
      "rating": 4.6,
      "reviews": 3
    },
    ...and other places
  ],
  ...and other categories
}
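Since the same place can appear under "all" and under one or more categories, it can be handy to merge the result into a flat list with a categories field per place. The sketch below is a hypothetical post-processing helper, not part of the original script, and it assumes that title identifies a place uniquely:

```javascript
// Hypothetical helper: merges the { all: [...], History: [...], ... } object
// into one array of unique places, each with a list of its categories.
// Assumption: `title` identifies a place uniquely.
function mergeByTitle(places) {
  const byTitle = new Map();
  for (const [category, items] of Object.entries(places)) {
    for (const item of items) {
      if (!byTitle.has(item.title)) {
        byTitle.set(item.title, { ...item, categories: [] });
      }
      // "all" is the unfiltered list, so it is not recorded as a category.
      if (category !== "all") {
        byTitle.get(item.title).categories.push(category);
      }
    }
  }
  return Array.from(byTitle.values());
}
```

It could be chained to the parser as getThingsToDoResults().then(mergeByTitle).then(console.log).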
If you want other functionality added to this blog post or if you want to see some projects made with SerpApi, write me a message.
Add a Feature Request or a Bug