Alex Aslam

Posted on Mar 7

Web Scraping with Playwright and Python: A Developer’s Guide

#webdev #programming #python #learning

Introduction

Dynamic, JavaScript-heavy websites have long been a challenge for web scraping. Playwright, a browser automation library from Microsoft, drives a real browser, so content rendered by JavaScript is available to scrape. Its Python bindings let you script navigation, interaction, and extraction in a few lines. In this guide, you’ll learn how to use Playwright for efficient and reliable web scraping.

Why Playwright?

Playwright stands out for its ability to automate Chromium, Firefox, and WebKit browsers with a single API. Compared with older tools like Selenium, Playwright:

Handles dynamic content (SPAs, lazy-loaded pages) by driving a real browser engine.
Auto-waits for elements to be ready before acting on them.
Supports headless and headful modes.
Provides built-in network interception and multi-tab browsing.

For developers, that makes it well suited to scraping JavaScript-heavy sites such as React or Angular apps.

Getting Started

1. Install Playwright

First, install Playwright’s Python package and browser binaries:

pip install playwright
playwright install

2. Launch a Browser

Start by initializing a browser instance:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # Set headless=True for background
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()

Basic Web Scraping Workflow

Let’s scrape product data from a demo e-commerce site.

Step 1: Navigate and Extract Data

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://webscraper.io/test-sites/e-commerce/allinone")

    # Extract product titles and prices
    products = page.query_selector_all(".thumbnail")
    for product in products:
        title = product.query_selector(".title").text_content()
        price = product.query_selector(".price").text_content()
        print(f"{title}: {price}")

    browser.close()
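Scraped prices arrive as text, so a common next step is to normalize them before storing. The following is a minimal sketch, not part of the original workflow: `parse_price` and `rows_to_csv` are hypothetical helper names, and the parser assumes prices use "." as the decimal separator and "," only for thousands.

```python
import csv
import io
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Turn a scraped price string such as "$1,295.99" into a Decimal, or None."""
    if text is None:
        return None
    # Keep digits and the decimal point; drop currency symbols, commas, spaces.
    cleaned = re.sub(r"[^\d.]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Nothing numeric was left (for example "N/A").
        return None


def rows_to_csv(rows):
    """Serialize (title, price) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "price"])
    for title, price in rows:
        writer.writerow([title.strip(), "" if price is None else str(price)])
    return buffer.getvalue()
```

Inside the loop above, you could collect `(title, parse_price(price))` pairs into a list and write `rows_to_csv(pairs)` to a file once the browser is closed.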

Step 2: Handle Dynamic Content

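Actions such as click() auto-wait for their target element, but query_selector_all() only returns what is in the DOM at that moment. Wait explicitly for content that loads after the initial render:

# Wait until at least one product card is visible
page.wait_for_selector(".thumbnail", state="visible")

# Click a "Load More" button only if the page has one
load_more = page.locator("button:has-text('Load More')")
if load_more.count() > 0:
    load_more.first.click()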
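When a listing needs several clicks on a "Load More" button, a bounded loop avoids both stopping too early and clicking forever. This is a rough sketch under assumptions: `click_until_gone` is a hypothetical helper, `page` is expected to be a Playwright sync `Page`, and the fixed pause is a simplification (waiting for a new item to appear is usually more reliable).

```python
def click_until_gone(page, selector, max_clicks=20, pause_ms=500):
    """Click `selector` until it is no longer visible or max_clicks is reached.

    Returns the number of clicks performed.
    """
    clicks = 0
    button = page.locator(selector)
    while clicks < max_clicks and button.count() > 0 and button.first.is_visible():
        button.first.click()
        clicks += 1
        # Give the page a moment to append the new items before re-checking.
        page.wait_for_timeout(pause_ms)
    return clicks
```

A call might look like `click_until_gone(page, "button:has-text('Load More')")`, followed by the extraction step once the full list is on the page.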

Advanced Techniques

1. Handle Login Forms

Automate authenticated sessions:

page.goto("https://example.com/login")
page.fill("#username", "your_username")
page.fill("#password", "your_password")
page.click("#submit-button")

# Verify login success
page.wait_for_selector(".dashboard")
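Hard-coded credentials tend to leak into version control, so one option is to read them from the environment and save the logged-in session for reuse. The sketch below is illustrative only: the environment variable names and both helper functions are assumptions, and the selectors are the placeholder ones from the snippet above.

```python
import os


def get_credentials(env=None):
    """Read login details from environment variables instead of source code."""
    env = os.environ if env is None else env
    username = env.get("SCRAPER_USERNAME")  # hypothetical variable name
    password = env.get("SCRAPER_PASSWORD")  # hypothetical variable name
    if not username or not password:
        raise RuntimeError("Set SCRAPER_USERNAME and SCRAPER_PASSWORD first")
    return username, password


def log_in(page, login_url, username, password, state_path=None):
    """Fill the login form on a Playwright sync Page and wait for the dashboard."""
    page.goto(login_url)
    page.fill("#username", username)
    page.fill("#password", password)
    page.click("#submit-button")
    page.wait_for_selector(".dashboard")
    if state_path is not None:
        # Save cookies and local storage so later runs can skip the form.
        page.context.storage_state(path=state_path)
```

A later run can then start from the saved session with `browser.new_context(storage_state="state.json")` instead of submitting the form again.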

2. Intercept Network Requests

Capture API responses (e.g., XHR/fetch requests):

def handle_response(response):
    if "/api/products" in response.url:
        print(response.json())

page.on("response", handle_response)
page.goto("https://example.com/products")
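Printing inside the handler is fine for exploration, but a scraper usually wants to keep the payloads. Here is a small sketch of one way to do that: `make_json_collector` is a hypothetical helper that returns a handler plus the list it fills, skipping failed responses and bodies that are not JSON.

```python
def make_json_collector(url_fragment):
    """Return (handler, collected) for use with page.on("response", handler)."""
    collected = []

    def handle_response(response):
        # Ignore unrelated URLs and non-2xx responses.
        if url_fragment not in response.url or not response.ok:
            return
        try:
            collected.append(response.json())
        except Exception:
            # The body was not valid JSON or could not be read; skip it.
            pass

    return handle_response, collected
```

Typical use would be `handler, items = make_json_collector("/api/products")`, then `page.on("response", handler)` before calling `page.goto(...)`, and reading `items` afterwards.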

3. Download Files

Automate file downloads:

with page.expect_download() as download_info:
    page.click("a.download-csv")
download = download_info.value
download.save_as("data.csv")

4. Handle IFrames

Access elements inside iframes:

iframe = page.frame_locator("iframe#content")
text = iframe.locator(".text").text_content()

Best Practices

Use Headless Mode for Speed:

browser = p.chromium.launch(headless=True)

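Avoid Detection:

Set the user agent when creating a browser context, so the request header and the browser’s JavaScript environment agree (rotate by giving each context a different string):

context = browser.new_context(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

Use browser contexts for isolated sessions with separate cookies and storage:

context = browser.new_context()
page = context.new_page()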
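Rotation itself can be as simple as cycling through a list and creating each context with the next entry. This is a sketch under assumptions: `context_factory` is a hypothetical helper, `browser` is expected to be a Playwright sync `Browser`, and the user-agent strings are illustrative samples rather than a vetted list.

```python
from itertools import cycle

# Illustrative values only; substitute strings that match the browsers you run.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def context_factory(browser, user_agents=USER_AGENTS):
    """Return a function that opens a new context with the next user agent."""
    pool = cycle(user_agents)

    def new_context():
        # Each context is isolated and carries its own user agent.
        return browser.new_context(user_agent=next(pool))

    return new_context
```

Calling the returned function once per scraping session gives each session its own cookies, storage, and user agent.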

Rate Limiting: Add delays to mimic human behavior:

page.wait_for_timeout(2000)  # 2-second delay
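A fixed delay produces a perfectly regular request pattern, so adding some random jitter is a common refinement. The helper below is a minimal sketch; `jittered_delay_ms` and its default values are assumptions, not recommendations from the original.

```python
import random


def jittered_delay_ms(base_ms=2000, jitter_ms=1000, rng=random):
    """Return a delay between base_ms and base_ms + jitter_ms, in milliseconds."""
    return base_ms + rng.uniform(0, jitter_ms)
```

It slots into the existing call as `page.wait_for_timeout(jittered_delay_ms())`.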

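Error Handling: Catch Playwright’s public Error class rather than importing from its private modules:

from playwright.sync_api import Error as PlaywrightError

try:
    page.goto("https://unstable-site.com")
except PlaywrightError as e:
    print(f"Navigation failed: {e}")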
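For flaky sites, catching the error once is often not enough, and retrying with a growing delay is a common pattern. The following is a generic sketch, not a Playwright feature: `with_retries` is a hypothetical helper, and `retry_on` defaults to `Exception` only so the example runs without Playwright installed.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call action() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With Playwright, a call might look like `with_retries(lambda: page.goto(url), retry_on=(PlaywrightError,))`, where `PlaywrightError` is `Error` imported from `playwright.sync_api`.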

Real-World Use Cases

Price Monitoring: Track e-commerce sites for price changes.
Social Media Scraping: Extract public posts from platforms like Twitter/X (without violating ToS).
Automated Testing: Validate UI elements during development.
News Aggregation: Scrape real-time articles from news portals.

Conclusion

Playwright with Python is a robust combination for scraping modern websites. Its ability to handle dynamic content, automate interactions, and run isolated browser sessions makes it a strong choice for developers tackling complex scraping projects.

Next Steps:

Explore Playwright’s official documentation.
Experiment with parallel scraping using browser contexts.
Integrate proxies for large-scale scraping.

Pro Tip: Always respect robots.txt and a website’s terms of service. When in doubt, reach out for permission!
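Checking robots.txt can be automated with Python’s standard library. This is a minimal sketch: `build_robots_checker` is a hypothetical helper that takes the robots.txt text you have already fetched, so the example itself makes no network requests.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def is_allowed(url):
        return parser.can_fetch(user_agent, url)

    return is_allowed
```

A scraper could call `is_allowed(url)` before each `page.goto(url)` and skip anything the site disallows.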

Happy scraping! 🚀
