DEV Community

Alex Aslam
Alex Aslam

Posted on

Web Scraping with Playwright and Python: A Developer’s Guide

Introduction

In the world of web scraping, dynamic websites loaded with JavaScript have always posed a challenge. Enter Playwright—a powerful browser automation library by Microsoft that simplifies scraping modern, interactive websites. Combined with Python, Playwright offers a seamless way to handle even the most complex scraping tasks. In this guide, you’ll learn how to leverage Playwright for efficient and reliable web scraping.


Why Playwright?

Playwright stands out for its ability to automate Chromium, Firefox, and WebKit browsers with a single API. Unlike traditional tools like Selenium, Playwright:

  • Handles dynamic content effortlessly (SPAs, lazy-loaded pages).
  • Offers auto-waiting for elements to be ready.
  • Supports headless and headful modes.
  • Provides network interception and multi-tab browsing.

For developers, it’s a game-changer for scraping JavaScript-heavy sites like React or Angular apps.


Getting Started

1. Install Playwright

First, install Playwright’s Python package and browser binaries:

pip install playwright
playwright install
Enter fullscreen mode Exit fullscreen mode

2. Launch a Browser

Start by initializing a browser instance:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # Set headless=True for background
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
Enter fullscreen mode Exit fullscreen mode

Basic Web Scraping Workflow

Let’s scrape product data from a demo e-commerce site.

Step 1: Navigate and Extract Data

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://webscraper.io/test-sites/e-commerce/allinone")

    # Extract product titles and prices
    products = page.query_selector_all(".thumbnail")
    for product in products:
        title = product.query_selector(".title").text_content()
        price = product.query_selector(".price").text_content()
        print(f"{title}: {price}")

    browser.close()
Enter fullscreen mode Exit fullscreen mode

Step 2: Handle Dynamic Content

Use Playwright’s auto-waiting to ensure elements load:

# Wait for a selector to appear
page.wait_for_selector(".product", state="visible")

# Click a "Load More" button (if present)
page.click("button:has-text('Load More')")
Enter fullscreen mode Exit fullscreen mode

Advanced Techniques

1. Handle Login Forms

Automate authenticated sessions:

page.goto("https://example.com/login")
page.fill("#username", "your_username")
page.fill("#password", "your_password")
page.click("#submit-button")

# Verify login success
page.wait_for_selector(".dashboard")
Enter fullscreen mode Exit fullscreen mode

2. Intercept Network Requests

Capture API responses (e.g., XHR/fetch requests):

def handle_response(response):
    if "/api/products" in response.url:
        print(response.json())

page.on("response", handle_response)
page.goto("https://example.com/products")
Enter fullscreen mode Exit fullscreen mode

3. Download Files

Automate file downloads:

with page.expect_download() as download_info:
    page.click("a.download-csv")
download = download_info.value
download.save_as("data.csv")
Enter fullscreen mode Exit fullscreen mode

4. Handle IFrames

Access elements inside iframes:

iframe = page.frame_locator("iframe#content")
text = iframe.locator(".text").text_content()
Enter fullscreen mode Exit fullscreen mode

Best Practices

  1. Use Headless Mode for Speed:
   browser = p.chromium.launch(headless=True)
Enter fullscreen mode Exit fullscreen mode
  1. Avoid Detection:

    • Rotate user agents:
     page.set_extra_http_headers({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
    
  • Use browser contexts for isolated sessions:

     context = browser.new_context()
     page = context.new_page()
    
  1. Rate Limiting: Add delays to mimic human behavior:
   page.wait_for_timeout(2000)  # 2-second delay
Enter fullscreen mode Exit fullscreen mode
  1. Error Handling:
   try:
       page.goto("https://unstable-site.com")
   except playwright._impl._api_types.Error as e:
       print(f"Navigation failed: {e}")
Enter fullscreen mode Exit fullscreen mode

Real-World Use Cases

  1. Price Monitoring: Track e-commerce sites for price changes.
  2. Social Media Scraping: Extract public posts from platforms like Twitter/X (without violating ToS).
  3. Automated Testing: Validate UI elements during development.
  4. News Aggregation: Scrape real-time articles from news portals.

Conclusion

Playwright with Python is a robust combination for scraping modern websites. Its ability to handle dynamic content, automate interactions, and avoid detection makes it ideal for developers tackling complex scraping projects.

Next Steps:


Pro Tip: Always respect robots.txt and a website’s terms of service. When in doubt, reach out for permission!

Happy scraping! 🚀

Top comments (0)