
Mikhail Zub for SerpApi

Web scraping Google Flights with Nodejs

We don't currently have an API that supports extracting data from the Google Flights page. This blog post shows how you can do it yourself with the DIY solution provided below.

What will be scraped

📌Note: The solution I'm showing you only gets flight results with the "One Way", "1 person" and "Economy" options applied.

If you don't need an explanation, have a look at the full code example in the online IDE:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const from = "Seattle";
const to = "Las Vegas";
const leaveDate = "5-15-2023"; // mm-dd-yyyy format

const URL = `https://www.google.com/travel/flights?hl=en-US&curr=USD`;

async function getFlightsFromPage(page) {
  return await page.evaluate(() =>
    Array.from(document.querySelectorAll(".pIav2d")).map((el) => {
      const thumbnailString = el.querySelector(".EbY4Pc")?.getAttribute("style");
      const startIndex = thumbnailString?.indexOf("url(");
      const endIndex = thumbnailString?.indexOf(";");
      const thumbnail = thumbnailString?.slice(startIndex + 4, endIndex - 1).replaceAll("\\", "") || "No thumbnail";
      const layover = el.querySelector(".BbR8Ec .sSHqwe")?.getAttribute("aria-label");
      return {
        thumbnail,
        companyName: el.querySelector(".Ir0Voe .sSHqwe")?.textContent.trim(),
        description: el.querySelector(".mv1WYe")?.getAttribute("aria-label"),
        duration: el.querySelector(".gvkrdb")?.textContent.trim(),
        airportLeave: el.querySelectorAll(".Ak5kof .sSHqwe .eoY5cb")[0]?.textContent.trim(),
        airportArive: el.querySelectorAll(".Ak5kof .sSHqwe .eoY5cb")[1]?.textContent.trim(),
        layover: layover || "Nonstop",
        emisions: el.querySelector(".V1iAHe > div")?.getAttribute("aria-label")?.replace(". Learn more about this emissions estimate", " "), // "?." guards against a missing aria-label
        price: el.querySelector(".U3gSDe .YMlIz > span")?.textContent.trim(),
        priceDescription: el.querySelector(".U3gSDe .JMnxgf > span > span > span")?.getAttribute("aria-label"),
      };
    })
  );
}

async function getFlightsResults() {
  const browser = await puppeteer.launch({
    headless: true, // set this option to false if you want to see what the browser is doing
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();
  await page.setViewport({
    width: 1280,
    height: 720,
  });

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  await page.waitForSelector(".e5F5td");
  const inputs = await page.$$(".e5F5td");
  // type "from"
  await inputs[0].click();
  await page.waitForTimeout(1000);
  await page.keyboard.type(from);
  await page.keyboard.press("Enter");
  // type "to"
  await inputs[1].click();
  await page.waitForTimeout(1000);
  await page.keyboard.type(to);
  await page.waitForTimeout(1000);
  await page.keyboard.press("Enter");
  await page.waitForTimeout(1000);
  // type "Leave date"
  await page.click(".rIZzse .d5wCYc");
  await page.waitForTimeout(1000);
  await page.keyboard.type(leaveDate);
  await page.waitForTimeout(1000);
  await page.keyboard.press("Enter");
  await page.waitForTimeout(1000);
  // choose "One way"
  await page.click(".UGrfjc .VfPpkd-RLmnJb");
  await page.waitForTimeout(1000);
  await page.click(".VfPpkd-qPzbhe-JNdkSc > li:last-child");
  await page.waitForTimeout(1000);
  // press "Done"
  await page.click(".A8nfpe .akjk5c  .VfPpkd-vQzf8d");
  await page.waitForTimeout(1000);
  await page.keyboard.press("Enter");
  // press "Search"
  await page.waitForTimeout(1000);
  await page.keyboard.press("Enter");

  await page.waitForSelector(".pIav2d");

  const moreButton = await page.$(".XsapA");
  if (moreButton) {
    await moreButton.click();
    await page.waitForTimeout(2000);
  }

  const flights = await getFlightsFromPage(page);

  await browser.close();

  return flights;
}

getFlightsResults().then(console.log);

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter:

$ npm init -y

And then:

$ npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to keep websites from detecting that you are using headless Chromium or a web driver. You can check it on the Chrome headless tests website. The screenshot below shows the difference.

stealth

That completes the setup of the Node.js environment for our project, so we can move on to the step-by-step code explanation.

Process

We need to extract data from HTML elements. Getting the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it doesn't always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.

The GIF below illustrates how to select different parts of the results using SelectorGadget.

how

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to keep websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, and define the departure city (from), the destination (to), the departure date (leaveDate) and the search URL:

puppeteer.use(StealthPlugin());

const from = "Seattle";
const to = "Las Vegas";
const leaveDate = "5-15-2023"; // mm-dd-yyyy format

const URL = `https://www.google.com/travel/flights?hl=en-US&curr=USD`;
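
If you would rather build leaveDate from a Date object than type it by hand, a small helper can produce the same month-day-year string. This is only a sketch: formatLeaveDate is a hypothetical name that is not part of the original script, and it assumes the date input accepts the unpadded "5-15-2023" style used above.

```javascript
// Hypothetical helper (not in the original script): formats a Date as
// month-day-year without zero padding, matching the "5-15-2023" example.
function formatLeaveDate(date) {
  const month = date.getMonth() + 1; // getMonth() is zero-based
  const day = date.getDate();
  const year = date.getFullYear();
  return `${month}-${day}-${year}`;
}

// May 15, 2023 in local time (the month argument is zero-based, so 4 = May)
console.log(formatLeaveDate(new Date(2023, 4, 15))); // prints "5-15-2023"
```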

Next, we write a function to get the flights from the page:

async function getFlightsFromPage(page) {
  ...
}

In this function we'll use the following methods and properties to get the necessary information:

First, we need to get the thumbnail URL from the page. To do this, we read the style attribute of the ".EbY4Pc" element, which contains the URL. Then we find startIndex (the position of "url(") and endIndex (the position of ";") and slice the thumbnail URL out of the string:

const thumbnailString = el.querySelector(".EbY4Pc")?.getAttribute("style");
const startIndex = thumbnailString?.indexOf("url(");
const endIndex = thumbnailString?.indexOf(";");
const thumbnail = thumbnailString?.slice(startIndex + 4, endIndex - 1).replaceAll("\\", "") || "No thumbnail";
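
The index arithmetic above depends on the style value ending in ");". As an alternative sketch, the same extraction can be done with a regular expression that also tolerates optional quotes around the URL. extractThumbnailUrl is a hypothetical helper, not part of the original code, and the style strings in the example are assumed shapes rather than values captured from the live page.

```javascript
// Hypothetical helper: pulls the URL out of a CSS style value such as
// "background-image: url(https://example.com/logo.png);".
function extractThumbnailUrl(styleValue) {
  // Capture everything between "url(" and ")", ignoring optional quotes.
  const match = styleValue?.match(/url\(\s*["']?(.*?)["']?\s*\)/);
  if (!match || !match[1]) return "No thumbnail";
  // Drop CSS escape backslashes, as the original code does.
  return match[1].replaceAll("\\", "");
}

console.log(extractThumbnailUrl("background-image: url(https://www.gstatic.com/flights/airline_logos/70px/AA.png);"));
console.log(extractThumbnailUrl(null)); // prints "No thumbnail"
```

Inside page.evaluate() the helper would have to be defined in the page context (for example, declared inside the callback), because functions from the Node.js side are not visible there.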

Next, we need to get layover information if it exists:

const layover = el.querySelector(".BbR8Ec .sSHqwe")?.getAttribute("aria-label");

Then, we get and return all flights info from the page:

return {
  thumbnail,
  companyName: el.querySelector(".Ir0Voe .sSHqwe")?.textContent.trim(),
  description: el.querySelector(".mv1WYe")?.getAttribute("aria-label"),
  duration: el.querySelector(".gvkrdb")?.textContent.trim(),
  airportLeave: el.querySelectorAll(".Ak5kof .sSHqwe .eoY5cb")[0]?.textContent.trim(),
  airportArive: el.querySelectorAll(".Ak5kof .sSHqwe .eoY5cb")[1]?.textContent.trim(),
  layover: layover || "Nonstop",
  emisions: el.querySelector(".V1iAHe > div")?.getAttribute("aria-label")?.replace(". Learn more about this emissions estimate", " "), // "?." guards against a missing aria-label
  price: el.querySelector(".U3gSDe .YMlIz > span")?.textContent.trim(),
  priceDescription: el.querySelector(".U3gSDe .JMnxgf > span > span > span")?.getAttribute("aria-label"),
};
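
getAttribute() returns null when the attribute is missing, so any string cleanup applied to an aria-label is safest when it tolerates null. A minimal sketch of such a cleanup for the emissions text follows; cleanEmissionsLabel is a hypothetical helper, and the sample label is an assumption based on the output shown later in this post.

```javascript
// Suffix that the original code strips from the emissions aria-label.
const EMISSIONS_SUFFIX = ". Learn more about this emissions estimate";

// Hypothetical helper: returns the label without the suffix,
// or undefined when no label was found on the page.
function cleanEmissionsLabel(label) {
  if (typeof label !== "string") return undefined;
  // Unlike the original, this trims instead of leaving a trailing space.
  return label.replace(EMISSIONS_SUFFIX, "").trim();
}

console.log(cleanEmissionsLabel("Carbon emissions estimate: 315 kilograms. +184% emissions. Learn more about this emissions estimate"));
console.log(cleanEmissionsLabel(null)); // prints undefined
```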

Next, we write a function to control the browser and get the flight results:

async function getFlightsResults() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that we use headless mode, and the args array allows the browser process to launch in the online IDE. Then we open a new page and set the viewport size to 1280x720 pixels:

const browser = await puppeteer.launch({
  headless: true, // set this option to false if you want to see what the browser is doing
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();
await page.setViewport({
  width: 1280,
  height: 720,
});

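Next, we raise the default navigation timeout from 30 seconds to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to allow for a slow internet connection, and go to URL with the .goto() method. Note that this setting applies to navigation such as .goto(); the timeout for selector waits is controlled separately by .setDefaultTimeout():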

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);

Then we wait for the ".e5F5td" selector to load (waitForSelector() method) and get the "from" and "to" input fields. We click each of them, type the from and to cities (keyboard.type() method) and press "Enter" after typing (keyboard.press() method). After that, we enter leaveDate and choose the "One way" ticket option:

await page.waitForSelector(".e5F5td");
const inputs = await page.$$(".e5F5td");
// type "from"
await inputs[0].click();
await page.waitForTimeout(1000);
await page.keyboard.type(from);
await page.keyboard.press("Enter");
// type "to"
await inputs[1].click();
await page.waitForTimeout(1000);
await page.keyboard.type(to);
await page.waitForTimeout(1000);
await page.keyboard.press("Enter");
await page.waitForTimeout(1000);
// type "Leave date"
await page.click(".rIZzse .d5wCYc");
await page.waitForTimeout(1000);
await page.keyboard.type(leaveDate);
await page.waitForTimeout(1000);
await page.keyboard.press("Enter");
await page.waitForTimeout(1000);
// choose "One way"
await page.click(".UGrfjc .VfPpkd-RLmnJb");
await page.waitForTimeout(1000);
await page.click(".VfPpkd-qPzbhe-JNdkSc > li:last-child");
await page.waitForTimeout(1000);
// press "Done"
await page.click(".A8nfpe .akjk5c  .VfPpkd-vQzf8d");
await page.waitForTimeout(1000);
await page.keyboard.press("Enter");
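
The steps above rely on page.waitForTimeout() for their fixed pauses. As far as I know, that method was deprecated and then removed in newer Puppeteer releases (v22 and later, I believe), so on a recent version those calls may fail. A minimal sketch of a drop-in replacement, assuming a plain Promise-based delay is all you need (sleep is a hypothetical helper name):

```javascript
// Generic delay helper: resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside an async function, in place of `await page.waitForTimeout(1000)`:
//   await sleep(1000);
(async () => {
  const started = Date.now();
  await sleep(50);
  console.log(`waited about ${Date.now() - started} ms`);
})();
```

Fixed pauses are simple but fragile; where a specific element signals that the page is ready, waiting for it with page.waitForSelector(), as the script already does for ".e5F5td" and ".pIav2d", is usually more reliable.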

After filling in all the fields, we press "Enter" to start the search and wait for the flight results to load. If the "Show more" button is present, we click it, and then save the flight results to the flights constant:

// press "Search"
await page.waitForTimeout(1000);
await page.keyboard.press("Enter");

await page.waitForSelector(".pIav2d");

const moreButton = await page.$(".XsapA");
if (moreButton) {
  await moreButton.click();
  await page.waitForTimeout(2000);
}

const flights = await getFlightsFromPage(page);

And finally, we close the browser, and return the received data:

await browser.close();

return flights;
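
If any step before browser.close() throws (for example, a selector that no longer exists), the browser process is left running. One way to guard against that is a try/finally wrapper. The sketch below uses a hypothetical withBrowser helper and a fake browser object so it runs without Puppeteer; in the real script you would pass something like () => puppeteer.launch({ ... }) as the launch argument.

```javascript
// Hypothetical helper: `launch` is any function that returns a
// browser-like object exposing an async close() method.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs whether the task succeeds or throws
  }
}

// Demo with a stand-in browser object (an assumption for illustration only).
const demoLaunch = async () => ({
  async close() {
    console.log("browser closed");
  },
});

withBrowser(demoLaunch, async () => {
  throw new Error("selector not found");
}).catch((err) => console.log("scrape failed:", err.message));
```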

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "thumbnail":"https://www.gstatic.com/flights/airline_logos/70px/AA.png",
      "companyName":"American",
      "description":"Leaves Seattle-Tacoma International Airport at 11:55 PM on Monday, May 15 and arrives at Harry Reid International Airport at 5:54 PM on Tuesday, May 16.",
      "duration":"17 hr 59 min",
      "airportLeave":"Seattle-Tacoma International Airport",
      "airportArive":"Harry Reid International Airport",
      "layover":"Layover (1 of 1) is a 11 hr 4 min layover at Dallas/Fort Worth International Airport in Dallas.",
      "emisions":"Carbon emissions estimate: 315 kilograms. +184% emissions ",
      "price":"$318"
   },
   {
      "thumbnail":"https://www.gstatic.com/flights/airline_logos/70px/AS.png",
      "companyName":"Alaska",
      "description":"Leaves Seattle-Tacoma International Airport at 7:10 PM on Monday, May 15 and arrives at Harry Reid International Airport at 4:36 PM on Tuesday, May 16.",
      "duration":"21 hr 26 min",
      "airportLeave":"Seattle-Tacoma International Airport",
      "airportArive":"Harry Reid International Airport",
      "layover":"Layover (1 of 1) is a 17 hr 42 min overnight layover at San Francisco International Airport in San Francisco.",
      "emisions":"Carbon emissions estimate: 176 kilograms. +59% emissions ",
      "price":"$323"
   }
   ... and other flights results
]
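
The price and duration fields come back as display strings. If you want to sort or filter the results, you may want numbers instead. Here is a rough sketch of two hypothetical helpers (parsePrice and parseDurationMinutes are not part of the original script), assuming the "$318" and "17 hr 59 min" formats shown above:

```javascript
// Hypothetical helper: "$318" -> 318, "$1,234" -> 1234, no digits -> null.
function parsePrice(price) {
  const digits = String(price ?? "").replace(/[^0-9.]/g, "");
  return digits ? Number(digits) : null;
}

// Hypothetical helper: "17 hr 59 min" -> 1079 (total minutes), no match -> null.
function parseDurationMinutes(duration) {
  const hours = /(\d+)\s*hr/.exec(duration ?? "");
  const minutes = /(\d+)\s*min/.exec(duration ?? "");
  if (!hours && !minutes) return null;
  return (hours ? Number(hours[1]) * 60 : 0) + (minutes ? Number(minutes[1]) : 0);
}

console.log(parsePrice("$318")); // prints 318
console.log(parseDurationMinutes("17 hr 59 min")); // prints 1079
```

With these in place, something like flights.sort((a, b) => parsePrice(a.price) - parsePrice(b.price)) would order the results by price, provided every flight has a price.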

If you want other functionality added to this blog post or if you want to see some projects made with SerpApi, write me a message.


Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞

Top comments (2)

Smyja

a github repository would be great. The replit doesn't work.

Mikhail Zub

Thanks for your attention to my post. You can use the step by step instructions from this blog to replicate my code. It often happens that selectors change on the site and the parser no longer works, you need to constantly maintain its performance. In any case, this is not a commercial product, but a way of thinking and a way of parsing a particular site.