
Efficient Techniques for Scraping Dynamic Websites

Web scraping is a skill that opens up endless opportunities for data collection. But on sites like Instagram or Pinterest, where content loads dynamically as you scroll, it's not as simple as fetching the page's HTML. A plain HTTP request returns only the initial shell; the posts arrive later via JavaScript, so regular scraping methods don't cut it. Enter Playwright and lxml. With Playwright handling browser automation and lxml handling the HTML parsing, you can scrape dynamic content like a pro. Ready to get started?

Tools We’ll Use:

Playwright: for automating browser actions.
lxml: for efficient data extraction using XPath.
Python: the engine that makes it all run.

Let's dive into how to scrape Instagram posts, simulate user interactions like scrolling, and handle JavaScript-loaded content.

Step 1: Set Up the Required Libraries

Before we begin, you need to install a few key packages. In your terminal, run the following commands:

pip install playwright
pip install lxml

Next, let’s install the necessary browsers for Playwright:

playwright install

With these libraries set up, you are ready to scrape.

Step 2: Configure Playwright for Scraping Dynamic Websites

We'll automate the browser to interact with Instagram’s dynamic content. The goal? To scroll, wait for content to load, and then extract data. Here's the code to set the wheels in motion.

import asyncio
from playwright.async_api import async_playwright

async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # No UI, fast execution
        page = await browser.new_page()

        # Open the profile page
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")

        # Click "Show more posts" if it's available
        await page.get_by_role("button", name="Show more posts from").click()

        # Simulate scrolling to load content
        for _ in range(5):  # Adjust scroll count as needed
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Allow time for content to load
            await page.wait_for_load_state("networkidle")

        # Grab the content
        content = await page.content()
        await browser.close()

        return content

# Run the scraper
asyncio.run(scrape_instagram())
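
While developing, it helps to watch what the browser is actually doing. A minimal debugging variant, using Playwright's standard headless and slow_mo launch options (the profile URL is a placeholder):

import asyncio
from playwright.async_api import async_playwright

async def debug_session():
    async with async_playwright() as p:
        # Visible browser window, with a 250 ms pause between actions
        browser = await p.chromium.launch(headless=False, slow_mo=250)
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
        await page.wait_for_timeout(5000)  # Keep the window open briefly to inspect
        await browser.close()

asyncio.run(debug_session())

Once things look right, switch back to headless=True for speed.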

Step 3: Use lxml and XPath for HTML Parsing

Once you've grabbed the page content, it’s time to parse the HTML and extract the post URLs. Here’s how we do it using lxml and XPath.

from lxml import html

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'

    post_urls = tree.xpath(post_urls_xpath)

    # Convert relative URLs to absolute ones
    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]

    return post_urls
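
Note that the same post link may appear more than once as the grid re-renders during scrolling, so it's worth deduplicating while preserving order. A small helper (the name dedupe_urls is my own, not part of the original script):

def dedupe_urls(urls):
    # dict.fromkeys keeps the first occurrence of each URL and preserves insertion order
    return list(dict.fromkeys(urls))

Calling dedupe_urls(extract_post_urls(content)) then yields each post URL exactly once.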

Now that we’ve got the URLs, let's save them in a nice JSON file.

import json

def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)

Step 4: Control Infinite Scrolling with Playwright

Instagram, like many dynamic websites, uses infinite scrolling. This means content loads as you scroll, which is great for user experience but a bit tricky for scraping. We already have the scroll logic in place:

await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000)  # Wait for posts to load
await page.wait_for_load_state("networkidle")

By waiting for the page to settle after every scroll, we avoid capturing incomplete data. A fixed scroll count works, but it can stop too early on long profiles or waste time on short ones; a more adaptive approach is sketched below.
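
You can instead keep scrolling until the page height stops growing, which signals the feed has no more posts to load. A minimal sketch, assuming the page object from the scraper above (scroll_to_end is a hypothetical helper, not part of the original code):

async def scroll_to_end(page, max_scrolls=20):
    # Track the document height; stop once a scroll no longer adds content
    last_height = await page.evaluate("document.body.scrollHeight")
    for _ in range(max_scrolls):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        await page.wait_for_timeout(3000)  # Give new posts time to render
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # Height unchanged: assume the feed is exhausted
        last_height = new_height

This replaces the fixed-count loop and adapts to profiles of any size.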

Step 5: Leverage Proxies to Stay Undetected

Instagram has powerful anti-bot measures in place. To stay under the radar, you’ll need to use proxies. Playwright makes this easy.
Here’s how you can set up a proxy:

async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True, 
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
        # Continue scraping...

Rotating proxies spread your requests across multiple IP addresses, which makes IP-based blocking far less likely. Playwright also accepts authentication credentials for proxy servers, as shown below.
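
For an authenticated proxy, add username and password to the dictionary passed to chromium.launch() inside the scrape_with_proxy() function above (the values here are placeholders for your provider's details):

proxy_settings = {
    "server": "http://your-proxy-server:port",
    "username": "your-username",  # Proxy credentials, if your provider requires them
    "password": "your-password",
}
browser = await p.chromium.launch(headless=True, proxy=proxy_settings)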

Putting It All Together

Here’s the complete code to scrape Instagram, handle infinite scrolling, and use proxies:

import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json

async def scrape_instagram(profile_url, proxy=None):
    async with async_playwright() as p:
        browser_options = {'headless': True}
        if proxy:
            browser_options['proxy'] = proxy

        browser = await p.chromium.launch(**browser_options)  # Unpack options as keyword arguments
        page = await browser.new_page()

        await page.goto(profile_url, wait_until="networkidle")

        # Click "Show more posts" (if available)
        try:
            await page.click('button:has-text("Show more posts from")', timeout=5000)
        except Exception as e:
            print(f"Button not found: {e}")

        # Scroll to load posts
        for _ in range(5):  # Adjust the number of scrolls
            await page.evaluate('window.scrollBy(0, 500);')
            await page.wait_for_timeout(3000)
            await page.wait_for_load_state("networkidle")

        content = await page.content()
        await browser.close()

        return content

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    post_urls = tree.xpath(post_urls_xpath)

    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]

    return post_urls

def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data saved to instagram_posts.json")

async def main():
    profile_url = "https://www.instagram.com/profile_name/"
    proxy = {"server": "http://proxy-server:port", "username": "username", "password": "password"}

    page_content = await scrape_instagram(profile_url, proxy=proxy)
    post_urls = extract_post_urls(page_content)
    save_data(profile_url, post_urls)

if __name__ == '__main__':
    asyncio.run(main())

Final Thoughts

Scraping dynamic websites can be challenging. With Playwright, you can automate browser actions and extract data from websites that use JavaScript and AJAX. By combining this with lxml for parsing and proxies to avoid detection, you can scrape Instagram and other content-heavy platforms without being blocked. Start building your scraper and automate your data extraction.
