Web scraping is a skill that opens up endless opportunities for data collection. But when dealing with sites like Instagram or Pinterest—where content loads dynamically as you scroll—it’s not as simple as pulling HTML from the page. Regular scraping methods don’t cut it with JavaScript-heavy content. Enter Playwright and lxml. With Playwright handling browser automation and lxml assisting with HTML parsing, you’ll be able to scrape dynamic content like a pro. Ready to get started?
Tools We’ll Use:
Playwright: for automating browser actions.
lxml: for efficient data extraction using XPath.
Python: the engine that makes it all run.
Let's dive into how to scrape Instagram posts, simulate user interactions like scrolling, and handle complex JavaScript-loaded content.
Step 1: Set Up the Required Libraries
Before we begin, you need to install a few key packages. In your terminal, run the following commands:
pip install playwright
pip install lxml
Next, let’s install the necessary browsers for Playwright:
playwright install
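This downloads Chromium, Firefox, and WebKit. If you only need Chromium (the only browser used in this guide), you can save time and disk space with:

playwright install chromium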
With these libraries set up, you are ready to scrape.
Step 2: Configure Playwright for Scraping Dynamic Websites
We'll automate the browser to interact with Instagram’s dynamic content. The goal? To scroll, wait for content to load, and then extract data. Here's the code to set the wheels in motion.
import asyncio
from playwright.async_api import async_playwright

async def scrape_instagram():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # No UI, fast execution
        page = await browser.new_page()

        # Open the profile page
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")

        # Click "Show more posts" if it's available
        try:
            await page.get_by_role("button", name="Show more posts from").click()
        except Exception:
            pass  # The button isn't always present

        # Simulate scrolling to load content
        for _ in range(5):  # Adjust scroll count as needed
            await page.evaluate('window.scrollBy(0, 700);')
            await page.wait_for_timeout(3000)  # Allow time for content to load
            await page.wait_for_load_state("networkidle")

        # Grab the content
        content = await page.content()
        await browser.close()
        return content

# Run the scraper and keep the rendered HTML
content = asyncio.run(scrape_instagram())
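Note that asyncio.run returns whatever the coroutine returns, so the rendered HTML ends up in content. If you want to eyeball what the headless browser actually saw before writing any parsing logic, a quick sketch (page_source.html is just an illustrative filename):

with open("page_source.html", "w", encoding="utf-8") as f:
    f.write(content)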
Step 3: Use lxml and XPath for HTML Parsing
Once you've grabbed the page content, it’s time to parse the HTML and extract the post URLs. Here’s how we do it using lxml and XPath.
from lxml import html

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    post_urls = tree.xpath(post_urls_xpath)

    # Convert relative URLs to absolute ones
    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]
    return post_urls
Now that we’ve got the URLs, let's save them in a nice JSON file.
import json

def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
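Here's a minimal sketch wiring the pieces together. One detail worth knowing: after scrolling, the same /p/ link often appears more than once in the HTML, so deduplicating (e.g., with dict.fromkeys, which preserves order) is worth doing before saving:

profile_url = "https://www.instagram.com/profile_name/"
page_content = asyncio.run(scrape_instagram())
post_urls = list(dict.fromkeys(extract_post_urls(page_content)))  # drop duplicate links, keep order
save_data(profile_url, post_urls)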
Step 4: Control Infinite Scrolling with Playwright
Instagram, like many dynamic websites, uses infinite scrolling. This means content loads as you scroll, which is great for user experience but a bit tricky for scraping. We already have the scroll logic in place:
await page.evaluate('window.scrollBy(0, 700);')
await page.wait_for_timeout(3000) # Wait for posts to load
await page.wait_for_load_state("networkidle")
This ensures that we don't grab incomplete data by waiting for the page to finish loading after every scroll.
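A fixed scroll count is a blunt instrument: too few scrolls and you miss posts, too many and you waste time. An alternative sketch (assuming new posts grow document.body.scrollHeight, which is typical of infinite feeds) is to scroll until the page height stops changing:

async def scroll_until_done(page, max_rounds=20, pause_ms=3000):
    # Keep scrolling until the page height stops growing; max_rounds
    # is an arbitrary safety cap, not a Playwright setting.
    last_height = await page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give new posts time to load
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:  # nothing new loaded, we're done
            break
        last_height = new_height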
Step 5: Leverage Proxies to Stay Undetected
Instagram has powerful anti-bot measures in place. To stay under the radar, you’ll need to use proxies. Playwright makes this easy.
Here’s how you can set up a proxy:
async def scrape_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://your-proxy-server:port"}
        )
        page = await browser.new_page()
        await page.goto("https://www.instagram.com/profile_name/", wait_until="networkidle")
        # Continue scraping...
Proxies help you avoid blocks by routing requests through other IP addresses; rotating proxies take this further by changing the IP between sessions. You can also pass authentication credentials for proxy servers, as shown below.
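For an authenticated proxy, Playwright's proxy option accepts username and password keys alongside server (the address and credentials below are placeholders):

browser = await p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://your-proxy-server:port",  # placeholder address
        "username": "your-username",                # placeholder credential
        "password": "your-password",                # placeholder credential
    }
)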
Putting It All Together
Here’s the complete code to scrape Instagram, handle infinite scrolling, and use proxies:
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json

async def scrape_instagram(profile_url, proxy=None):
    async with async_playwright() as p:
        browser_options = {'headless': True}
        if proxy:
            browser_options['proxy'] = proxy

        browser = await p.chromium.launch(**browser_options)  # Unpack options as keyword arguments
        page = await browser.new_page()
        await page.goto(profile_url, wait_until="networkidle")

        # Click "Show more posts" (if available)
        try:
            await page.click('button:has-text("Show more posts from")')
        except Exception as e:
            print(f"Button not found: {e}")

        # Scroll to load posts
        for _ in range(5):  # Adjust the number of scrolls
            await page.evaluate('window.scrollBy(0, 500);')
            await page.wait_for_timeout(3000)
            await page.wait_for_load_state("networkidle")

        content = await page.content()
        await browser.close()
        return content

def extract_post_urls(page_content):
    tree = html.fromstring(page_content)
    post_urls_xpath = '//a[contains(@href, "/p/")]/@href'
    post_urls = tree.xpath(post_urls_xpath)
    base_url = "https://www.instagram.com"
    post_urls = [f"{base_url}{url}" for url in post_urls]
    return post_urls

def save_data(profile_url, post_urls):
    data = {profile_url: post_urls}
    with open('instagram_posts.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print("Data saved to instagram_posts.json")

async def main():
    profile_url = "https://www.instagram.com/profile_name/"
    # Placeholder proxy details; replace with real values or pass proxy=None
    proxy = {"server": "http://proxy-server:port", "username": "username", "password": "password"}
    page_content = await scrape_instagram(profile_url, proxy=proxy)
    post_urls = extract_post_urls(page_content)
    save_data(profile_url, post_urls)

if __name__ == '__main__':
    asyncio.run(main())
Final Thoughts
Scraping dynamic websites can be challenging. With Playwright, you can automate browser actions and extract data from websites that use JavaScript and AJAX. By combining this with lxml for parsing and proxies to avoid detection, you can scrape Instagram and other content-heavy platforms without being blocked. Start building your scraper and automate your data extraction.