
How to Overcome Scraping Dynamic Website Challenges

Web scraping is one of those skills that can revolutionize how we access and utilize data from the web. But here's the catch—scraping dynamic websites, like Pinterest and Instagram, isn't as simple as it seems. Unlike static sites, these platforms load content on demand via JavaScript, which means traditional scraping methods often fall short.
To scrape these dynamic sites successfully, you need a solid strategy and the right tools. In this post, we’ll dive into using Playwright—a browser automation tool—and lxml for data extraction, helping you scrape Instagram profiles like a pro. And yes, we’ll even cover how to avoid detection (because Instagram will notice you). So, let’s break this down step by step.

Tools We’ll Use:

Playwright: For automating browser actions.
lxml: For parsing and extracting data with XPath.
Python: To glue it all together.

Why Playwright

When scraping dynamic sites, simulating user behavior is essential: scrolling, waiting for content to load, and sometimes clicking buttons to reveal hidden data. Playwright allows you to automate these actions and handle JavaScript-heavy websites efficiently, helping you gather the data quickly. Let’s go through the process of scraping Instagram posts, from setup to extracting URLs, using Playwright and lxml.

Step 1: Get the Libraries Ready

First, we need to install the libraries. Fire up your terminal and run:

pip install playwright
pip install lxml

Then, install Playwright browsers with:

playwright install

Now, we’re set up and ready to automate the browser.

Step 2: Set Up Playwright for Scraping Dynamic Websites

Instagram’s content doesn’t just appear when you visit the page—it loads as you interact with the site. We need to simulate those interactions (scrolling, clicking, waiting) to load posts. Here’s a basic automation script that does the job:

import asyncio
from playwright.async_api import async_playwright

async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # Run in headless mode
        page = await browser.new_page()

        # Visit the Instagram profile
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")

        # Click the button to load more posts, if it's present
        try:
            await page.get_by_role("button", name="Show more posts").click(timeout=5000)
        except Exception:
            pass  # Not every profile shows this button

        # Scroll the page to load dynamic content
        for _ in range(5):  # Adjust the number of scrolls based on how many posts you want
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Wait for posts to load

        # Get the page content
        content = await page.content()
        await browser.close()

        return content

# Run the asynchronous function and keep the HTML for parsing
page_content = asyncio.run(scrape_instagram())

Step 3: Parse HTML Content Using lxml and XPath

Once you have the page content, the next step is extracting the data you want. We’ll use lxml to parse the HTML and XPath to find the URLs of all Instagram posts.

from lxml import html

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)

    # XPath to extract post URLs
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'

    post_urls = tree.xpath(post_urls_xpath)

    # Convert relative URLs to absolute
    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]

    return post_urls
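Before wiring this into the full scraper, it's worth sanity-checking the XPath against a known snippet. The HTML below is a made-up fragment just to exercise the selector, not real Instagram markup:

```python
from lxml import html

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    # Anchors whose href contains "/p/" are post links
    post_urls = tree.xpath('//a[contains(@href, "/p/")]/@href')
    base_url = "https://www.instagram.com"
    return [f"{base_url}{url}" for url in post_urls]

# Hypothetical markup just to exercise the XPath
sample = """
<html><body>
  <a href="/p/abc123/">post one</a>
  <a href="/explore/">not a post</a>
  <a href="/p/def456/">post two</a>
</body></html>
"""
print(extract_post_urls(sample))
# ['https://www.instagram.com/p/abc123/', 'https://www.instagram.com/p/def456/']
```

If the selector breaks after a site redesign, a quick test like this tells you whether the problem is the XPath or the page load.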

Step 4: Manage Infinite Scroll with Playwright

Instagram uses infinite scrolling, which means you need to keep loading new content as you scroll. Here's how we handle that:

await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000)  # Wait for new content to load
await page.wait_for_load_state("networkidle")

This snippet scrolls the page, waits a fixed interval for new posts to render, then waits for network activity to settle before continuing.

Step 5: Use Proxies for Stealthy Scraping

Instagram doesn’t take kindly to scrapers. To avoid getting blocked, you can use proxies to rotate IP addresses. Playwright supports proxy integration easily:

async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True, 
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
        # Continue scraping as usual

Proxies help bypass IP bans and CAPTCHA challenges. For more advanced proxy setups, you can even pass a username and password:

async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://your-proxy-server:port", "username": "username", "password": "password"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
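To actually rotate IPs rather than reuse one proxy, a simple round-robin pool works. Everything below is a sketch: the endpoints and credentials are placeholders, and `next_proxy` is a hypothetical helper you'd call once per browser launch:

```python
import itertools

# Placeholder endpoints -- substitute your own proxy servers and credentials
PROXIES = [
    {"server": "http://proxy1.example.com:8000", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8000", "username": "user", "password": "pass"},
]
_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy config in round-robin order."""
    return next(_pool)

# Usage inside the scraper, one fresh browser per profile:
#     browser = await p.chromium.launch(headless=True, proxy=next_proxy())
```

Launching a fresh browser per profile with `proxy=next_proxy()` spreads requests across the pool instead of hammering Instagram from a single IP.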

Wrapping Up

Playwright, combined with lxml, provides an effective toolkit for scraping dynamic websites like Instagram. Automating the browser lets you handle challenges like infinite scrolling and interactive elements, and routing traffic through proxies reduces the chance of detection and IP bans.
