DEV Community

The Reasons to Scrape IMDB Data

There’s an endless stream of valuable data available on the internet—and IMDB’s Top 250 movies is a goldmine. Whether you're researching movie trends or just want to explore the world of cinema, scraping data from IMDB can open up new opportunities for insight and analysis.
But how do you access this data without running into roadblocks? In this tutorial, I’ll walk you through scraping IMDB’s Top 250 movies using Python, extracting key details like movie titles, summaries, ratings, and genres. Ready? Let’s dive in.

Reasons to Scrape IMDB Data

IMDB is a treasure trove of information—movie titles, reviews, ratings, and genres—all neatly packaged for users. But how can you extract this data for your own analysis? By scraping. It’s a straightforward process when you use the right tools. Scraping can help you unlock insights about movie trends, analyze ratings, or feed your personal project.
Here’s the key: when scraping, we need to mimic human behavior to avoid detection. This means not overloading servers, using proxies, and making our requests look natural. Done right, scraping is both efficient and ethical.

Step 1: Prepare to Scrape Data

We’ll use Python’s requests library for making requests, lxml for parsing the HTML, and pandas for saving the results to CSV. First, let’s install these libraries. Open your terminal and run:

pip install requests lxml pandas

These will allow us to download web pages, parse the content, and extract the relevant data.

Configuring HTTP Headers

When scraping, we don’t want our requests to look like they come from a bot. To avoid detection, we’ll configure HTTP headers to simulate a real web browser. Here’s an example:

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

# A timeout keeps the script from hanging if the server never answers
response = requests.get('https://www.imdb.com/chart/top/', headers=headers, timeout=10)

By setting a user-agent, you make the request look like it’s coming from a real user, not a bot.
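
The request above assumes IMDB answers with the page on the first try. As a minimal sketch, a small retry wrapper can check the status code before any parsing happens. The helper name `fetch_with_retry` and its parameters are hypothetical, not part of requests; it accepts any callable shaped like `requests.get`, which also makes it easy to try without network access.

```python
import time


def fetch_with_retry(get, url, retries=3, backoff=1.0, sleep=time.sleep, **kwargs):
    """Call get(url, **kwargs) until it returns HTTP 200 or retries run out.

    `get` is any callable shaped like requests.get (a hypothetical injection
    point, so the helper can be exercised without touching the network).
    """
    last_status = None
    for attempt in range(retries):
        response = get(url, **kwargs)
        last_status = response.status_code
        if last_status == 200:
            return response
        # Wait longer after each failure: backoff, 2*backoff, 4*backoff, ...
        if attempt < retries - 1:
            sleep(backoff * (2 ** attempt))
    raise RuntimeError(
        f"GET {url} failed after {retries} attempts (last status {last_status})"
    )
```

With requests installed, a call might look like `response = fetch_with_retry(requests.get, 'https://www.imdb.com/chart/top/', headers=headers, timeout=10)`.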

Step 2: Handle Proxies

Websites often limit how many requests you can make from a single IP address. To avoid being blocked, we’ll use proxies. This helps us spread requests across multiple IPs, making our scraping activity more discreet.
Here’s how to use a proxy:

proxies = {
    "http": "http://your_proxy_server",
    "https": "https://your_proxy_server"
}

response = requests.get('https://www.imdb.com/chart/top/', headers=headers, proxies=proxies, timeout=10)

This ensures that your real IP address remains hidden, reducing the risk of being flagged.
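
A single proxy only moves all traffic to one other IP. To actually spread requests across several, one option is to rotate through a pool. The sketch below assumes a hypothetical helper `proxy_rotation` and placeholder proxy addresses; it only builds the `proxies` dictionaries that `requests.get` expects.

```python
from itertools import cycle


def proxy_rotation(proxy_urls):
    """Yield requests-style proxies dicts, cycling through proxy_urls forever."""
    for url in cycle(proxy_urls):
        # The same proxy endpoint is used for both plain and TLS traffic here.
        yield {"http": url, "https": url}


# Placeholder addresses; replace them with real proxy endpoints.
rotation = proxy_rotation(["http://proxy_one:8080", "http://proxy_two:8080"])
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first proxy again
```

Each request can then pass `proxies=next(rotation)` so consecutive requests leave through different addresses.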

Step 3: Parse the HTML

Once you’ve made your request, it’s time to parse the content. IMDB’s Top 250 page embeds structured data in JSON-LD format, which makes it easy for us to extract movie information. We’ll use lxml for parsing HTML and json for handling the structured data.

from lxml.html import fromstring
import json

# Parse the HTML response
parser = fromstring(response.text)

# Extract JSON-LD data
raw_data = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
json_data = json.loads(raw_data)

This gives us all the movie data in a structured format, which is easy to work with.
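
Taking the first JSON-LD script with `[0]` fails with an IndexError if the page has none, and picks the wrong block if the list is not the first one. A more defensive sketch, assuming the movie list is marked with the schema.org `@type` of `ItemList` (an assumption worth checking against the live page), could look like this. The helper name `find_item_list` is hypothetical.

```python
import json


def find_item_list(raw_blocks):
    """Return the first JSON-LD object whose @type is ItemList, or None."""
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and data.get("@type") == "ItemList":
            return data
    return None
```

It can be fed the full XPath result: `json_data = find_item_list(parser.xpath('//script[@type="application/ld+json"]/text()'))`, followed by a check for `None` before continuing.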

Step 4: Extract the Data

Now, let's extract the key movie details: title, description, rating, genre, etc. We’ll loop through the JSON data and pull out the information we need.

# Fall back to an empty list so the loop is skipped if the key is missing
movies_details = json_data.get('itemListElement', [])
movies_data = []

for movie in movies_details:
    movie_data = movie['item']
    movie_info = {
        'name': movie_data['name'],
        'description': movie_data['description'],
        'rating': movie_data['aggregateRating']['ratingValue'],
        'genre': movie_data['genre'],
        'url': movie_data['url']
    }
    movies_data.append(movie_info)

Now you have the movie details neatly stored in a list.
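
The loop above raises a KeyError as soon as one entry lacks a field such as `aggregateRating`. A tolerant variant, sketched here with a hypothetical helper `extract_movie`, returns `None` for missing fields instead. The sample entries are made up for illustration and are not real IMDB data.

```python
def extract_movie(entry):
    """Build a flat record from one itemListElement entry, tolerating gaps."""
    item = entry.get("item") or {}
    rating = item.get("aggregateRating") or {}
    return {
        "name": item.get("name"),
        "description": item.get("description"),
        "rating": rating.get("ratingValue"),
        "genre": item.get("genre"),
        "url": item.get("url"),
    }


# Made-up entries, shaped like the structure used in the loop above.
sample_entry = {
    "item": {
        "name": "Example Movie",
        "description": "A placeholder description.",
        "aggregateRating": {"ratingValue": 9.0},
        "genre": "Drama",
        "url": "https://example.com/title/",
    }
}
complete = extract_movie(sample_entry)
partial = extract_movie({"item": {"name": "No Rating Yet"}})
```

The whole list can then be built with `movies_data = [extract_movie(m) for m in movies_details]`.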

Step 5: Save the Extracted Information

Once we’ve extracted the data, the next step is storing it. For easy analysis, let's save it in a CSV file using pandas. Here’s how:

import pandas as pd

# Convert the list to a DataFrame
df = pd.DataFrame(movies_data)

# Save to CSV
df.to_csv('IMDB_top_250_movies.csv', index=False)
print("IMDB Top 250 movies data saved to IMDB_top_250_movies.csv")

You’ve scraped and stored the Top 250 movies in just a few steps.
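
If installing pandas only to write a file feels heavy, the standard library's csv module can save the same list of dictionaries. This is an alternative sketch, not the approach used in this tutorial; the helper name `write_movies_csv` and the sample record are placeholders.

```python
import csv
import io


def write_movies_csv(movies, handle):
    """Write a list of movie dicts to an open text handle as CSV."""
    fieldnames = ["name", "description", "rating", "genre", "url"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(movies)


# Placeholder record, only to show the call; real rows come from movies_data.
sample = [{
    "name": "Example Movie",
    "description": "A placeholder.",
    "rating": 9.0,
    "genre": "Drama",
    "url": "https://example.com/title/",
}]
buffer = io.StringIO()
write_movies_csv(sample, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, open the file with `open('IMDB_top_250_movies.csv', 'w', newline='', encoding='utf-8')` and pass that handle.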

The Full Script

Here’s the full code for your reference:

import requests
from lxml.html import fromstring
import json
import pandas as pd

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

proxies = {
    "http": "http://your_proxy_server",
    "https": "https://your_proxy_server"
}

response = requests.get('https://www.imdb.com/chart/top/', headers=headers, proxies=proxies, timeout=10)

# Parse the HTML response
parser = fromstring(response.text)

# Extract the JSON-LD data
raw_data = parser.xpath('//script[@type="application/ld+json"]/text()')[0]
json_data = json.loads(raw_data)

# Extract movie details
movies_details = json_data.get('itemListElement', [])
movies_data = []

for movie in movies_details:
    movie_data = movie['item']
    movie_info = {
        'name': movie_data['name'],
        'description': movie_data['description'],
        'rating': movie_data['aggregateRating']['ratingValue'],
        'genre': movie_data['genre'],
        'url': movie_data['url']
    }
    movies_data.append(movie_info)

# Save to CSV
df = pd.DataFrame(movies_data)
df.to_csv('IMDB_top_250_movies.csv', index=False)
print("IMDB Top 250 movies data saved to IMDB_top_250_movies.csv")

Following Ethical Standards in Scraping

Before you start scraping any site, remember: ethics first.

  1. Follow the robots.txt: Check IMDB’s robots.txt file to see what is allowed for scraping.
  2. Be gentle: Don't bombard their servers with requests. Space them out.
  3. Comply with terms of service: Always make sure you’re not violating any rules. Read the site’s terms before scraping, even for personal use, and avoid excessive data requests.
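
The robots.txt check in the first point can be automated with Python's built-in urllib.robotparser. The sketch below uses a hypothetical helper `is_allowed` and made-up rules, not IMDB's actual robots.txt, so it runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration; they are not IMDB's actual robots.txt.
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]
chart_ok = is_allowed(example_rules, "my-scraper", "https://example.com/chart/top/")
private_ok = is_allowed(example_rules, "my-scraper", "https://example.com/private/page")
```

Against a live site, `RobotFileParser` can fetch the file itself through `set_url(...)` followed by `read()`. For the second point, a `time.sleep(...)` call between requests is a simple way to space them out.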

Conclusion

Scraping IMDB’s Top 250 movies using Python is a great way to gather movie data for analysis or personal projects. By using the right tools and following ethical practices, you can gather rich, structured data from the web without running into issues. Start your Python environment and begin scraping.
