DEV Community

Scrapfly
Web Scraping with Playwright and JavaScript

Web scraping lets us extract valuable data from websites, and Playwright is a strong choice for automating this process. It runs on several JavaScript runtimes, including Node.js, Deno, and Bun, which makes it approachable for beginners and capable enough for experienced developers.

In this blog, we'll explore how to use Playwright for web scraping in each of these environments.

What is Playwright?

Playwright is a powerful tool developed by Microsoft for automating browser actions across different browsers such as Chromium, Firefox, and WebKit. It allows developers to interact with web pages programmatically, making it an ideal choice for web scraping tasks.

Key Features of Playwright

Playwright is packed with features that make browser automation seamless and efficient. Here’s a closer look at its standout features:

  • Cross-Browser Support: Automates Chromium, Firefox, and WebKit for seamless compatibility across platforms.
  • Headless and Headful Modes: Offers fast headless mode and visual debugging in headful mode.
  • Multi-Language Support: Provides libraries for JavaScript, Python, Java, and .NET.
  • Network Interception: Inspects, modifies, or blocks requests and responses for advanced control.
  • Screenshots and Videos: Captures visuals of browser sessions for debugging and documentation.
  • Proxy and Session Support: Supports proxies and saving or restoring browser storage state such as cookies.

With its extensive feature set and robust design, Playwright stands out as a versatile and developer-friendly tool for automating browsers and handling complex web scraping or testing requirements.

Setup

Getting started with Playwright is straightforward. Before diving into web scraping, you’ll need to set it up in your preferred development environment. Playwright supports multiple JavaScript runtimes, making it flexible and accessible. Here's how to set it up for Node.js, Deno, and Bun:

Node.js

# Install Playwright using your favorite package manager, such as npm or Yarn.
# This is the most common setup and works seamlessly with Node.js projects.
npm install playwright
# Recent Playwright releases may not download browsers on install.
# If launching a browser fails, fetch the browser binaries explicitly:
npx playwright install

Deno

// Deno allows you to import Playwright directly from the Deno registry.
// There's no need for additional installation steps, making setup quick and easy.
import * as playwright from "https://deno.land/x/playwright@1.22.1/mod.ts";

Bun

# Bun, known for its performance and speed, also supports Playwright.
# Use Bun’s package manager to add it to your project effortlessly.
bun add playwright

With clear and straightforward commands tailored to each runtime, Playwright ensures that developers can get started quickly, regardless of their chosen environment. Choose your runtime, follow the steps above, and you’ll be ready to scrape the web in no time!

Next, we’ll explore how to use Playwright in a REPL (Read-Evaluate-Print Loop) environment.

Tip: Playwright in REPL

Playwright can also be used in a REPL (Read-Eval-Print Loop), which is great for quick testing. Here's how to set it up in Node.js, Deno, and Bun:

Node.js

To use Playwright in a Node.js environment, you can take advantage of its REPL (Read-Evaluate-Print Loop) capabilities for quick testing and prototyping. Follow the steps below to set up and explore Playwright interactively:

  1. Start the REPL with top-level await support (recent Node.js versions enable it by default, so the flag is only needed on older ones):

    $ node --experimental-repl-await
    
  2. In the REPL:

    const { chromium } = require("playwright");
    const browser = await chromium.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://twitch.tv");
    

This snippet launches a visible Chromium browser, opens Twitch's homepage, and lets you test and debug scripts interactively in a REPL without a separate script file.

Deno

In Deno, you can seamlessly use Playwright by importing it directly from the Deno registry. The REPL (Read-Evaluate-Print Loop) allows you to test and execute Playwright commands interactively with the required permissions. Follow these steps to get started:

  1. Start the Deno REPL with necessary permissions:

    $ deno repl --allow-net --allow-env --allow-run
    
  2. In the REPL:

    const { chromium } = await import(
      "https://deno.land/x/playwright@1.22.1/mod.ts"
    );
    const browser = await chromium.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://twitch.tv");
    

    In Deno, permissions like --allow-net, --allow-env, and --allow-run are necessary for Playwright to function correctly.

Bun

Bun provides a fast and efficient environment for using Playwright. By leveraging its REPL (Read-Evaluate-Print Loop), you can quickly test and execute Playwright commands interactively. Follow these steps to get started:

  1. Start the Bun REPL:

    $ bun repl
    
  2. In the REPL:

    const { chromium } = await import("playwright");
    const browser = await chromium.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://twitch.tv");
    

Using Playwright in a REPL environment allows you to experiment and test automation scripts interactively across different runtimes like Node.js, Deno, and Bun. This flexibility makes it an excellent choice for rapid prototyping and debugging.

Basics

Before diving into Playwright’s features, let’s start by launching a browser, creating a new context, and opening a browser tab (referred to as a "page"). These are fundamental steps for any web scraping or automation task in Playwright.

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({
    // Choose headless mode for speed or headful mode for debugging
    headless: false,
  });

  // Create a new browser context with custom settings
  const context = await browser.newContext({
    // Set viewport dimensions to match a common desktop resolution
    viewport: { width: 1920, height: 1080 },
  });

  // Open a new page (tab) in the browser
  const page = await context.newPage();

  // Now, we can use the page object for all our automation tasks
})();

Once we have a browser and page ready, we can explore Playwright’s core features, which cover everything needed for efficient web scraping:

  • Navigation: Load a webpage using page.goto().
  • Button Clicking: Simulate clicks on buttons or links.
  • Text Input: Input text into forms or search boxes.
  • JavaScript Execution: Run custom JavaScript within the page.
  • Waiting for Content: Ensure elements are fully loaded before interacting.

Let’s use Playwright to scrape dynamic video data from Twitch's Art section. Here’s what we’ll accomplish:

  1. Start a browser instance, create a context, and open a page.
  2. Navigate to https://twitch.tv/directory/game/Art.
  3. Wait for the page to load completely.
  4. Parse and extract dynamic data, such as stream titles, viewer counts, and creator details.

Let’s break it down step by step, starting with navigation and waiting.

Navigation and Waiting

To navigate, we can use the page.goto() function, which directs the browser to any URL:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  try {
    // Navigate to the URL
    await page.goto("https://twitch.tv/directory/game/Art");

    const content = await page.content();

    console.log(content);
  } catch (error) {
    console.error("An error occurred:", error);
  } finally {
    // Close the browser
    await browser.close();
  }
})();

However, for JavaScript-heavy websites like twitch.tv, page.content() might return the HTML prematurely, before everything has loaded.

To ensure that doesn't happen we can wait for a particular element to appear on the page. In other words, if the list of videos is present on the page then we can safely assume the page has loaded:

await page.goto("https://twitch.tv/directory/game/Art");
// wait for first result to appear
await page.waitForSelector("div[data-target=directory-first-item]");
// retrieve final HTML content
console.log(await page.content());

Above, we used the page.waitForSelector() function to wait for an element matching our CSS selector to appear on the page.
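The navigate-then-wait pattern can be wrapped in a small reusable function. The sketch below is only an illustration: loadRenderedHtml and its timeoutMs default are hypothetical names, and it assumes Playwright reports an expired wait as an error named "TimeoutError". It returns null instead of throwing when the selector never shows up, which can be convenient when scraping many pages in a loop.

```javascript
// Hypothetical helper (not part of Playwright): navigate, wait for `selector`,
// and return the rendered HTML, or null if the selector never appears.
async function loadRenderedHtml(page, url, selector, timeoutMs = 15000) {
  // "domcontentloaded" resolves once the initial HTML is parsed;
  // the selector wait below covers the JavaScript-rendered part
  await page.goto(url, { waitUntil: "domcontentloaded" });
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (error) {
    // Assumption: an expired wait surfaces as an error named "TimeoutError"
    if (error.name === "TimeoutError") return null;
    throw error;
  }
  return page.content();
}

// Usage inside an async function with an open `page`:
// const html = await loadRenderedHtml(
//   page,
//   "https://twitch.tv/directory/game/Art",
//   'div[data-target="directory-first-item"]'
// );
```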

Parsing Data

Since Playwright uses a real web browser with a JavaScript environment, we can use the browser's own HTML parsing capabilities. In Playwright, this is exposed through the locators feature:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  try {
    await page.goto("https://twitch.tv/directory/game/Art"); // Go to the URL

    await page.waitForSelector('div[data-target="directory-first-item"]'); // Wait for the content to load

    // Locate all stream boxes
    const streamBoxes = await page
      .locator('//div[contains(@class,"tw-tower")]/div[@data-target]')
      .elementHandles();

    // Parse data from each stream box
    const parsed = [];
    for (const box of streamBoxes) {
      const title = await box.$eval("h3", (el) => el.innerText);
      const url = await box.$eval(".tw-link", (el) => el.getAttribute("href"));
      const username = await box.$eval(".tw-link", (el) => el.innerText);
      const viewers = await box.$eval(
        ".tw-media-card-stat",
        (el) => el.innerText
      );
      const tagsElement = await box.$(".tw-tag");
      // tags are not always present:
      const tags = tagsElement ? await tagsElement.innerText() : null;

      parsed.push({
        title,
        url,
        username,
        viewers,
        tags,
      });
    }

    for (const video of parsed) {
      console.log(video);
    }
  } catch (error) {
    console.error("An error occurred:", error);
  } finally {
    await browser.close(); // Close the browser
  }
})();

Example Output

[
  {
    title: '✖ first stream of the new year YIPPIE ( •̀ᴗ•́  ̑ | !kofi !merch',
    url: '/littlemisstina',
    username: '✖ first stream of the new year YIPPIE ( •̀ᴗ•́  ̑ | !kofi !merch\n' +
      '\n' +
      'LittleMissTina',
    viewers: '751 viewers',
    tags: 'ENVtuber'
  },
  {
    title: '♡ Short early stream | ( ͡° ͜ʖ ͡°))| !socials !discord !boosty !domestika',
    url: '/dzikawa',
    username: '♡ Short early stream | ( ͡° ͜ʖ ͡°))| !socials !discord !boosty !domestika\n' +
      '\n' +
      'Dzikawa',
    viewers: '122 viewers',
    tags: 'digital'
  },
  ...
]

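In the code above, we selected each result box using an XPath selector and extracted details from within it using CSS selectors. Note that in the sample output the username field also contains the stream title, since the first .tw-link element in each box includes both; a narrower selector would be needed to isolate the channel name.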
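The raw strings in the sample output usually need a little cleanup before they are useful. The sketch below is one possible post-processing step, not part of the original scraper: normalizeStream and BASE_URL are hypothetical names, and the handling of a "K" suffix in viewer counts is an assumption about how larger numbers might be displayed.

```javascript
// Hypothetical post-processing helper for the records built in the loop above.
const BASE_URL = "https://www.twitch.tv";

function normalizeStream(raw) {
  // Keep only the last non-empty line of the username text, since the
  // sample output shows the title repeated before the channel name
  const lines = raw.username
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);
  const username = lines.length ? lines[lines.length - 1] : null;

  // "751 viewers" -> 751; the "K" suffix (e.g. "1.2K viewers") is an assumption
  const match = /(\d[\d.,]*)\s*(K)?/i.exec(raw.viewers || "");
  let viewers = null;
  if (match) {
    const number = parseFloat(match[1].replace(/,/g, ""));
    viewers = Math.round(match[2] ? number * 1000 : number);
  }

  return {
    title: raw.title.trim(),
    // Resolve the relative href against the site root
    url: new URL(raw.url, BASE_URL).href,
    username,
    viewers,
    tags: raw.tags,
  };
}
```

It can be applied to the scraped list with parsed.map(normalizeStream).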

Clicking Buttons and Text Input

To explore click and text input let's extend our twitch.tv scraper with search functionality:

  1. We'll go to twitch.tv
  2. Select the search box and input a search query
  3. Click the search button or press Enter
  4. Wait for the content to load
  5. Parse results

In Playwright, we can interact with web components using the same locator functionality we used for parsing:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  try {
    // Go to the Twitch Art directory
    await page.goto("https://www.twitch.tv/directory/game/Art");

    // Find the search box and enter the query
    const searchBox = page.locator('input[autocomplete="twitch-nav-search"]');
    await searchBox.type("Painting", { delay: 100 });

    // Press Enter to submit the search
    await searchBox.press("Enter");

    // Alternatively, click the search button explicitly
    // const searchButton = page.locator('button[aria-label="Search Button"]');
    // await searchButton.click();

    // Click on the "Tagged Channels" link
    await page.locator('.search-results .tw-link[href*="all/tags"]').click();

    // Wait for the results to load
    await page.waitForSelector("div[data-target]");

    // Parse the results
    const parsed = [];
    const streamBoxes = await page
      .locator('//div[contains(@class,"tw-tower")]/div[@data-target]')
      .elementHandles();

    for (const box of streamBoxes) {
      const title = await box.$eval("h3", (el) => el.innerText.trim());
      const url = await box.$eval(".tw-link", (el) => el.getAttribute("href"));
      const username = await box.$eval(".tw-link", (el) => el.innerText.trim());
      const viewers = await box.$eval(".tw-media-card-stat", (el) =>
        el.innerText.trim()
      );
      const tagsElement = await box.$(".tw-tag");
      const tags = tagsElement ? await tagsElement.innerText() : null;

      parsed.push({
        title,
        url,
        username,
        viewers,
        tags,
      });
    }

    // Print the parsed data
    console.log(parsed);
  } catch (error) {
    console.error("An error occurred:", error);
  } finally {
    // Close the browser
    await browser.close();
  }
})();

Note: Playwright locators are strict when performing actions. If a selector matches more than one element, an action such as click() fails with an error because Playwright wouldn't know which element to click. Our selectors must therefore resolve to the single element we want to interact with.
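When a selector cannot easily be made unique, the locator itself can be narrowed to a single element. The sketch below shows one way to do that with count() and first(); clickFirstMatch is a hypothetical helper name, and picking the first match is an arbitrary choice made for illustration.

```javascript
// Hypothetical helper: click the first element matching `selector` and
// return how many elements matched (0 means nothing was clicked).
async function clickFirstMatch(page, selector) {
  const matches = page.locator(selector);
  const total = await matches.count();
  if (total === 0) return 0;
  // first() narrows the locator to one element, so click() does not fail
  // on multiple matches; nth(i) or last() would pick a different one
  await matches.first().click();
  return total;
}

// Usage inside an async function with an open `page`:
// const matched = await clickFirstMatch(page, ".search-results .tw-link");
```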

We got search functionality working and extracted the first page of results, but how do we get the rest of the pages? For this we'll need scrolling functionality, so let's take a look at it.

Scrolling and Infinite Pagination

The stream results section of twitch.tv uses infinite scrolling pagination. To retrieve the rest of the results in our Playwright scraper, we need to keep scrolling to the last result visible on the page to trigger new page loads.

We could do this by scrolling to the bottom of the entire page, but that doesn't always work in headless browsers. A better way is to find all result elements and scroll the last one into view explicitly.

In Playwright, this can be done using locators and the scrollIntoViewIfNeeded() function. We'll keep scrolling the last result into view to trigger the next page load until no new results appear:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({ viewport: { width: 1920, height: 1080 } });
  const page = await context.newPage();

  try {
    // Go to the Twitch Art directory
    await page.goto('https://www.twitch.tv/directory/game/Art');

    // Wait for the content to fully load
    await page.waitForSelector('div[data-target="directory-first-item"]');

    // Loop scrolling until no more new elements are loaded
    let streamBoxes;
    while (true) {
      streamBoxes = await page.locator('//div[contains(@class,"tw-tower")]/div[@data-target]').elementHandles();
      await streamBoxes[streamBoxes.length - 1].scrollIntoViewIfNeeded();

      const itemsOnPage = streamBoxes.length;
      await page.waitForTimeout(2000); // Wait for new items to load

      const itemsOnPageAfterScroll = (await page.locator('//div[contains(@class,"tw-tower")]/div[@data-target]').elementHandles()).length;

      if (itemsOnPageAfterScroll > itemsOnPage) {
        continue; // More items loaded - keep scrolling
      } else {
        break; // No more items - break scrolling loop
      }
    }

    // Parse the data
    const parsed = [];
    for (const box of streamBoxes) {
        ...

In the example code above, we will continuously trigger new result loading until the pagination end is reached. In this case, our code should generate hundreds of parsed results.
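The same scrolling loop can be written as a reusable function with a safety cap, so that a feed which never stops loading cannot keep the scraper running forever. This is a sketch under assumptions: scrollUntilStable, pauseMs and maxRounds are hypothetical names, and the fixed pause is a simplification that may need tuning per site.

```javascript
// Hypothetical helper: scroll the last matching item into view until the
// number of items stops growing or `maxRounds` is reached.
// Returns the final item count.
async function scrollUntilStable(page, itemSelector, { pauseMs = 2000, maxRounds = 50 } = {}) {
  // Locators are evaluated lazily, so count() reflects the current page state
  const items = page.locator(itemSelector);
  let count = await items.count();
  for (let round = 0; round < maxRounds && count > 0; round++) {
    await items.last().scrollIntoViewIfNeeded();
    await page.waitForTimeout(pauseMs); // give new items time to load
    const newCount = await items.count();
    if (newCount <= count) break; // nothing new appeared, stop scrolling
    count = newCount;
  }
  return count;
}

// Usage inside an async function with an open `page`:
// const total = await scrollUntilStable(
//   page,
//   '//div[contains(@class,"tw-tower")]/div[@data-target]'
// );
```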

Advanced Functions

We've covered the most common Playwright features used in web scraping: navigation, waiting, clicking, typing and scrolling. However, there are a few advanced features that come in handy when scraping more complex targets.

Evaluating Javascript

Playwright can evaluate any JavaScript code in the context of the current page. Using JavaScript we can do everything we did before, like navigating, clicking and scrolling, and even more! In fact, many of these Playwright functions are implemented through JavaScript evaluation.

For example, if the built-in scrolling is failing us we can define our own scrolling javascript function and submit it to Playwright:

await page.evaluate(() => {
  const items = document.querySelectorAll(".tw-tower > div");
  if (items.length > 0) {
    items[items.length - 1].scrollIntoView({
      behavior: "smooth",
      block: "end",
      inline: "end",
    });
  }
});

The above code will scroll the last result into view just like previously but it'll scroll smoothly and to the very edge of the object. This approach is more likely to trigger next page loading compared to Playwright's scrollIntoViewIfNeeded function.

Javascript evaluation is a powerful feature that can be used to scrape complex web apps as it gives us full control of the browser's capabilities through javascript.
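Evaluation can also return data, which makes it possible to parse a whole result list in a single round trip to the browser. The sketch below reuses the CSS classes from the parsing example earlier; extractStreams is a hypothetical name, and the selectors are assumed to still match the page markup.

```javascript
// Hypothetical helper: extract all stream boxes in one page.evaluate() call.
async function extractStreams(page, boxSelector) {
  // The callback runs inside the page, so it can only use browser globals
  // and the argument passed in; its return value must be serializable
  return page.evaluate((selector) => {
    const text = (root, css) => {
      const el = root.querySelector(css);
      return el ? el.innerText.trim() : null;
    };
    return Array.from(document.querySelectorAll(selector), (box) => {
      const link = box.querySelector(".tw-link");
      return {
        title: text(box, "h3"),
        url: link ? link.getAttribute("href") : null,
        viewers: text(box, ".tw-media-card-stat"),
        tags: text(box, ".tw-tag"), // null when a box has no tags
      };
    });
  }, boxSelector);
}

// Usage inside an async function with an open `page`:
// const streams = await extractStreams(page, ".tw-tower > div");
```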

Request and Response Intercepting

Playwright tracks all of the background requests and responses the browser sends and receives. In web scraping, we can use this to modify background requests or collect secret data from background responses:

const { chromium } = require("playwright");

// Route handler that modifies outgoing requests.
// Request objects are read-only, so changes are passed to route.continue()
const interceptRoute = async (route) => {
  const request = route.request();
  const overrides = {};
  // Update requests with custom headers
  if (request.url().includes("secret")) {
    overrides.headers = { ...request.headers(), "x-secret-token": "123" };
    console.log("patched headers of a secret request");
  }
  // Adjust sent data for POST requests
  if (request.method() === "POST") {
    overrides.postData = "patched";
    console.log("patched POST request");
  }
  await route.continue(overrides);
};

// Function to inspect responses
const interceptResponse = (response) => {
  // Extract details from background requests
  const resourceType = response.request().resourceType();
  if (resourceType === "xhr" || resourceType === "fetch") {
    // Cookies set by the server arrive in the "set-cookie" response header
    console.log(response.headers()["set-cookie"]);
  }
};

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  // Modify requests through routing and observe responses through events
  await page.route("**/*", interceptRoute);
  page.on("response", interceptResponse);

  // Navigate to the Twitch Art directory
  await page.goto("https://www.twitch.tv/directory/game/Art");
  await page.waitForSelector('div[data-target="directory-first-item"]');

  // Close the browser
  await browser.close();
})();

In the example above, we define a route handler and a response listener and attach them to our Playwright page. This allows us to modify every request the browser makes, background or foreground, and to inspect every response.
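Response listeners are also a convenient way to collect the JSON that a page loads in the background, which is often easier to work with than the rendered HTML. The sketch below is illustrative only: captureJson is a hypothetical name, and the URL fragment used for filtering is a placeholder that would need to match the target site's actual API endpoints.

```javascript
// Hypothetical helper: collect JSON bodies of background (xhr/fetch)
// responses whose URL contains `urlPart`. Returns an array that fills up
// as responses arrive.
function captureJson(page, urlPart) {
  const captured = [];
  page.on("response", async (response) => {
    const type = response.request().resourceType();
    if (type !== "xhr" && type !== "fetch") return;
    if (!response.url().includes(urlPart)) return;
    try {
      captured.push({ url: response.url(), data: await response.json() });
    } catch {
      // The body was not valid JSON or could not be read; skip it
    }
  });
  return captured;
}

// Usage inside an async function, before navigating:
// const captured = captureJson(page, "/api/"); // "/api/" is a placeholder
// await page.goto("https://www.twitch.tv/directory/game/Art");
// console.log(captured.length);
```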

Blocking Resources

Web scraping using headless browsers is really bandwidth intensive. The browser is downloading all of the images, fonts and other expensive resources our web scraper doesn't care about. To optimize this we can configure our Playwright instance to block these unnecessary resources:

const { chromium } = require("playwright");

// Block requests by resource type (e.g., image, font)
const BLOCK_RESOURCE_TYPES = [
  "beacon",
  "csp_report",
  "font",
  "image",
  "imageset",
  "media",
  "object",
  "texttrack",
  // We can even block stylesheets and scripts, though it's not recommended:
  // 'stylesheet',
  // 'script',
  // 'xhr',
];

// Block popular third-party resources like tracking
const BLOCK_RESOURCE_NAMES = [
  "adzerk",
  "analytics",
  "cdn.api.twitter",
  "doubleclick",
  "exelator",
  "facebook",
  "fontawesome",
  "google",
  "google-analytics",
  "googletagmanager",
];

// Function to intercept and block requests
const interceptRoute = (route) => {
  const request = route.request();

  // Block by resource type
  if (BLOCK_RESOURCE_TYPES.includes(request.resourceType())) {
    console.log(
      `Blocking background resource: ${request.url()} (blocked type: ${request.resourceType()})`
    );
    return route.abort();
  }

  // Block by resource name (URL)
  if (BLOCK_RESOURCE_NAMES.some((key) => request.url().includes(key))) {
    console.log(
      `Blocking background resource: ${request.url()} (blocked name)`
    );
    return route.abort();
  }

  // Continue all other requests
  return route.continue();
};

(async () => {
  const browser = await chromium.launch({
    headless: false,
    // Enable devtools to see total resource usage
    devtools: true,
  });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
  });
  const page = await context.newPage();

  // Enable intercepting for all requests
  await page.route("**/*", interceptRoute);

  // Navigate to the Twitch Art directory
  await page.goto("https://www.twitch.tv/directory/game/Art");
  await page.waitForSelector('div[data-target="directory-first-item"]');

  // Close the browser
  await browser.close();
})();

In the example above, we are defining an interception rule which tells Playwright to drop any unwanted background resource requests that are either of ignored type or contain ignored phrases in the URL (like google analytics).
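To check how much a blocking rule actually saves, the route handler can keep simple counters. The sketch below wraps the same type and name checks in a factory; createBlocker and its stats object are hypothetical names introduced for illustration.

```javascript
// Hypothetical factory: builds a route handler that applies the blocking
// rules and counts how many requests were blocked or allowed.
function createBlocker(blockedTypes, blockedNames) {
  const stats = { blocked: 0, allowed: 0 };
  const handler = (route) => {
    const request = route.request();
    const blocked =
      blockedTypes.includes(request.resourceType()) ||
      blockedNames.some((name) => request.url().includes(name));
    if (blocked) {
      stats.blocked += 1;
      return route.abort();
    }
    stats.allowed += 1;
    return route.continue();
  };
  return { handler, stats };
}

// Usage inside an async function with an open `page`:
// const { handler, stats } = createBlocker(BLOCK_RESOURCE_TYPES, BLOCK_RESOURCE_NAMES);
// await page.route("**/*", handler);
// await page.goto("https://www.twitch.tv/directory/game/Art");
// console.log(stats);
```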

Avoiding Blocking

Although Playwright uses a real browser, websites can still detect automated behavior through techniques like JavaScript fingerprinting and variable monitoring. These methods can reveal whether a browser is controlled by a human or an automation toolkit.

For more on this see our extensive article covering javascript fingerprinting and variable leaking:

[How Javascript is Used to Block Web Scrapers? In-Depth Guide](https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-javascript/#fortifying-browsers): Introduction to javascript fingerprinting and how to fortify automated web browsers against it.

ScrapFly's Alternative

Playwright is a powerful web scraping tool. However, it can be difficult to scale and to manage in some web scraping scenarios, and this is where Scrapfly can be of assistance!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Using ScrapFly SDK we can replicate the same actions we did in Playwright:

import { ScrapflyClient, ScrapeConfig } from "scrapfly-sdk";

const client = new ScrapflyClient({ key: "YOUR SCRAPFLY KEY" });

// We can use a browser to render the page, screenshot it and return final HTML
const result = await client.scrape(
  new ScrapeConfig({
    url: "https://www.twitch.tv/directory/game/Art",
    // enable browser rendering
    render_js: true,
    // we can wait for specific part to load just like with Playwright:
    wait_for_selector: "div[data-target=directory-first-item]",
    // we can capture screenshots
    screenshots: { "everything": "fullpage" },
    // for targets that block scrapers we can enable block bypass:
    asp: true,
  })
);

// It's also possible to execute complex javascript scenarios like button clicking
// and text typing:
const searchResult = await client.scrape(
  new ScrapeConfig({
    url: "https://www.twitch.tv/directory/game/Art",
    // enable browser rendering
    render_js: true,
    wait_for_selector: "div[data-target=directory-first-item]",
    js_scenario: [
      // wait to load
      { "wait_for_selector": { "selector": 'input[autocomplete="twitch-nav-search"]' } },
      // input search
      { "fill": { "value": "watercolor", "selector": 'input[autocomplete="twitch-nav-search"]' } },
      // click search button
      { "click": { "selector": 'button[aria-label="Search Button"]' } },
      // wait for the resulting navigation, up to the given timeout
      { "wait_for_navigation": { "timeout": 2000 } },
    ],
  })
);

Just like with Playwright we can control a web browser to navigate the website, click buttons, input text and return the final rendered HTML to us for parsing.

FAQ

To wrap this introduction up let's take a look at some frequently asked questions regarding web scraping with Playwright in JavaScript:

How to Use a Proxy with Playwright in JavaScript?

You can assign a proxy server per browser instance in Playwright using JavaScript. This is useful for web scraping when you need to rotate IPs or access region-specific content:


const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({
    headless: true, // Set to false if you want a visible browser
    proxy: { server: "11.11.11.1:9000" }, // Proxy server configuration
    // Optional: Add authentication
    // proxy: { server: '11.11.11.1:9000', username: 'user', password: 'pass' },
  });

  const page = await browser.newPage();
  await page.goto("https://example.com");

  // Scrape or interact with the page here

  await browser.close();
})();


How Can I Use a Proxy with Playwright?

You can assign a proxy for your browser instance when using Playwright. This is particularly useful for scraping region-specific content or avoiding IP bans. Proxies can be configured with or without authentication, depending on your needs.
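Both proxy shapes, with and without authentication, can be kept in a small pool and rotated between browser contexts. The sketch below is a minimal round-robin example under assumptions: the addresses and credentials are placeholders, createProxyRotator and newProxiedContext are hypothetical names, and per-context proxy behavior may vary between browsers and Playwright versions.

```javascript
// Placeholder proxy addresses and credentials; replace with real ones
const PROXIES = [
  { server: "http://11.11.11.1:9000" },
  { server: "http://11.11.11.2:9000", username: "user", password: "pass" },
];

// Hypothetical helper: returns a function that cycles through the pool
function createProxyRotator(proxies) {
  let index = 0;
  return () => {
    const proxy = proxies[index % proxies.length];
    index += 1;
    return proxy;
  };
}

// Hypothetical helper: open a context that uses the next proxy in the pool
async function newProxiedContext(browser, nextProxy) {
  return browser.newContext({ proxy: nextProxy() });
}

// Usage inside an async function with a launched `browser`:
// const nextProxy = createProxyRotator(PROXIES);
// const context = await newProxiedContext(browser, nextProxy);
// const page = await context.newPage();
```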

Which Headless Browser Is Best for JavaScript Playwright Scraping?

Chromium is the most widely used browser for JavaScript-based scraping and generally performs well, making it a sensible default for most use cases.

On the other hand, Firefox may be useful for avoiding detection in some cases, as it is less commonly used in web scraping and may therefore be less likely to trigger anti-scraping measures.
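Because all three engines share the same API, switching between them can be reduced to a single name lookup. The sketch below assumes the Playwright module is passed in as an argument; launchEngine and ENGINES are hypothetical names used only for this illustration.

```javascript
// Engine names exposed by the Playwright module
const ENGINES = ["chromium", "firefox", "webkit"];

// Hypothetical helper: launch a browser engine by name
async function launchEngine(playwright, name, options = {}) {
  if (!ENGINES.includes(name)) {
    throw new Error(`Unknown browser engine: ${name}`);
  }
  // Headless by default; caller-supplied options take precedence
  return playwright[name].launch({ headless: true, ...options });
}

// Usage inside an async function:
// const browser = await launchEngine(require("playwright"), "firefox");
// const page = await browser.newPage();
```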

Summary

This guide explored Playwright with JavaScript, a browser automation toolkit for web scraping. Key highlights include:

  • Core Features: Navigation, button clicking, text input, scrolling, and data parsing.
  • Real-Life Example: Scraping Twitch.tv's Art category for titles, viewers, and creator details.
  • Advanced Tools: Resource blocking, request interception, and JavaScript evaluation.
  • Infinite Scrolling: Techniques for handling dynamic content loading.
  • Proxy Use: Bypassing geographical restrictions and anti-scraping measures.
  • Alternative: ScrapFly for large-scale scraping with anti-bot protection and JavaScript rendering.

With Playwright’s feature-rich toolkit, you can efficiently scrape modern, dynamic websites while handling common challenges like content loading delays and infinite pagination.
