
The Ultimate Guide to Scraping Medium Articles Using Python

If you want to track how a writer is evolving or need a quick way to extract valuable insights from a specific Medium article, scraping Medium with Python is an effective solution. Whether you're a content analyst, data scientist, or developer looking to automate the process, you're in the right place.
In this guide, we’ll break down how to extract key information—like article title, author name, publication, and body content—from Medium’s website. With just a few lines of Python, you’ll be able to scrape articles in no time.

Why Should You Care About Scraping Medium?
Medium is a treasure trove of valuable insights. However, sifting through it manually can be time-consuming. Scraping allows you to automate the process of collecting articles, tracking authors, and analyzing trends. It’s also a valuable skill for any data-driven project.
Let’s get started. But before we dive into the code, there are a few things you need to set up.

Prerequisites: What You Need to Install
Before you begin, you'll need to install these libraries:

  • Requests: To send HTTP requests and get content from the web.
  • lxml: For parsing HTML and extracting specific data.
  • Pandas: To save the scraped data in a CSV format for easy analysis.

Here’s how to install them:

pip install requests
pip install lxml
pip install pandas

Step 1: Get the Headers and Proxies Right
Medium doesn’t want bots crawling its pages. So, to avoid getting blocked, you’ll need to simulate a real browser request using proper headers. And if you want to play it safe, use proxies to mask your IP address and rotate them to prevent hitting any rate limits.

Setting Up Headers
Headers make your request look like it’s coming from a regular user. Here’s an example:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-IN,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}

Using Proxies
If you're scraping a lot, rotating your IPs helps. Here’s how you can set it up:

proxies = {
    'http': 'http://IP:PORT',
    'https': 'https://IP:PORT'
}
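The snippet above points every request at a single proxy. If you want to rotate between several, one simple approach is to pick an address at random for each request. This is only a sketch; the addresses are placeholders you’d replace with your own proxy endpoints.

import random

# Placeholder proxy endpoints -- replace with your own
proxy_pool = [
    'http://IP1:PORT',
    'http://IP2:PORT',
    'http://IP3:PORT',
]

def random_proxies():
    """Pick one proxy from the pool and return it in requests' format."""
    chosen = random.choice(proxy_pool)
    return {'http': chosen, 'https': chosen}

You would then pass proxies=random_proxies() to each requests.get() call instead of a fixed dictionary.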

Step 2: Sending the Request
With your headers and proxies set, you’re ready to fetch the article. Here’s the Python code to send a request to the article’s URL:

import requests

response = requests.get(
    'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017', 
    headers=headers,
    proxies=proxies
)
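Before parsing anything, it’s worth confirming that the request actually succeeded. A minimal check could look like this:

# Raise an exception if Medium returned an error status (4xx/5xx)
response.raise_for_status()
print(response.status_code)  # 200 means the page was fetched successfully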

Step 3: Extracting Data with lxml
Once you have the page content, you’ll need to parse it. With lxml, that’s simple. You can extract specific elements like the title, author name, publication, and more using XPath.
Here’s how to do it:

from lxml.html import fromstring

# Parse the page content
parser = fromstring(response.text)

# Extract the data
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]
content = '\n '.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))
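One caveat: the class string in the last XPath (ci bh ga gb gc gd) is auto-generated by Medium and tends to change between page builds, so that selector can break without warning. A less brittle fallback, assuming the article body still lives inside an <article> element, is to collect every paragraph under it:

# Fallback: grab all paragraph text inside the <article> element.
# This is more stable than obfuscated class names, but may pick up
# a few extra lines such as image captions.
paragraphs = parser.xpath('//article//p//text()')
content = '\n '.join(p.strip() for p in paragraphs if p.strip())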

Step 4: Saving Your Data
Now that you have your data, let’s store it for later use. We’ll use Pandas to create a DataFrame and save it as a CSV file. This makes it easy to analyze and track multiple articles over time.

import pandas as pd

# Store data in a dictionary
article_data = {
    'Title': title,
    'Author': author,
    'Publication': publication_name,
    'Date': publication_date,
    'Content': content
}

# Convert to DataFrame and save it as CSV
df = pd.DataFrame([article_data])
df.to_csv('medium_article_data.csv', index=False)

print("Data saved to medium_article_data.csv")
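If you want to track several articles over time, the same pattern scales: loop over a list of URLs, build one dictionary per article, and write everything to a single CSV. Here’s a rough sketch (the URLs are placeholders, and it reuses the headers and proxies defined earlier):

import time

import requests
import pandas as pd
from lxml.html import fromstring

urls = [
    'https://medium.com/example-publication/first-article',   # placeholder
    'https://medium.com/example-publication/second-article',  # placeholder
]

rows = []
for url in urls:
    resp = requests.get(url, headers=headers, proxies=proxies)
    page = fromstring(resp.text)
    rows.append({
        'Title': page.xpath('//h1[@data-testid="storyTitle"]/text()')[0],
        'Author': page.xpath('//a[@data-testid="authorName"]/text()')[0],
        'Date': page.xpath('//span[@data-testid="storyPublishDate"]/text()')[0],
    })
    time.sleep(3)  # pause between requests to stay polite

pd.DataFrame(rows).to_csv('medium_articles.csv', index=False)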

Complete Code
Here’s the complete script for scraping an article from Medium:

import requests
from lxml.html import fromstring
import pandas as pd

# Setting headers to simulate a browser
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-IN,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}

# Optional: Using proxies
proxies = {
    'http': 'http://IP:PORT',
    'https': 'https://IP:PORT'
}

# Sending the request
url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'
response = requests.get(url, headers=headers, proxies=proxies)

# Parsing the content
parser = fromstring(response.text)

# Extracting the relevant data
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]
# Note: these class names are auto-generated by Medium and may change over time
content = '\n '.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))

# Organize data into a dictionary
article_data = {
    'Title': title,
    'Author': author,
    'Publication': publication_name,
    'Date': publication_date,
    'Content': content
}

# Save the data to CSV
df = pd.DataFrame([article_data])
df.to_csv('medium_article_data.csv', index=False)

print("Data saved to medium_article_data.csv")
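If any of the expected elements is missing (for example on a paywalled or redesigned page), the [0] indexing above will raise an IndexError. For longer-running scrapes, a small defensive helper, shown here purely as an illustrative example, avoids that:

def first_or_none(tree, xpath_expr):
    """Return the first XPath match, or None if nothing matched."""
    matches = tree.xpath(xpath_expr)
    return matches[0] if matches else None

title = first_or_none(parser, '//h1[@data-testid="storyTitle"]/text()')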

Best Practices for Ethical Scraping
It’s essential to scrape responsibly. Here are a few things to keep in mind:

  • Check Medium’s robots.txt: Make sure you’re allowed to scrape the site.
  • Respect Rate Limits: Don’t overload Medium’s servers with requests. Use delays or proxies to space out your traffic; a short delay helper is sketched after this list.
  • Read the Terms of Service: Some websites may have legal restrictions on scraping, so always check before proceeding.
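A simple way to respect rate limits is to sleep for a randomized interval between requests. This is just an illustrative helper; tune the bounds to whatever feels reasonable for your use case:

import random
import time

def polite_pause(min_seconds=2, max_seconds=5):
    """Sleep a random interval to space out consecutive requests."""
    time.sleep(random.uniform(min_seconds, max_seconds))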

Wrapping Up
You now have the ability to scrape Medium articles using Python effectively. With this Python script, you can collect, analyze, and store data with ease. Track authors, evaluate content, or gather insights for your next big project. Just remember to scrape responsibly.
