TL;DR
Web scraping helps you collect and organize data from websites. It is a great way to spot trends, gather insights, and make informed decisions. When done correctly, it can be a powerful tool for everything from research to business strategy.
In this tutorial, you will learn how to scrape data from websites efficiently in Python using Scrapy, a powerful data scraping framework that can also act as a general-purpose web crawler.
Prerequisites
Before we begin, you should have a basic understanding of Python. You'll also need to set up a few tools on your computer:
- Python: Make sure Python 3 is installed.
- A code editor: We'll use PyCharm, but you can use any Python-friendly editor.
- Pip: Python's package manager for installing libraries.
To check if Python and Pip are installed, run these commands in your terminal:
python3 --version
pip3 --version
What is Scrapy?
Scrapy is a powerful framework designed for extracting data from websites. It helps you collect and organize structured data for tasks like data analysis, mining insights, or even archiving information.
While Scrapy is primarily built for web scraping, it can also handle tasks like extracting data from APIs or serving as a general-purpose web crawler.
How to create a new Scrapy project
In this section, I'll guide you through installing Scrapy, setting up a Scrapy project, and understanding the essential files and folders needed to scrape data effectively.
First, create a new pure Python project in PyCharm. PyCharm automatically creates and activates a virtual environment for the project.
Run the following command in the terminal to install Scrapy within the virtual environment.
pip3 install scrapy
Create a new Scrapy project by running the code snippet below:
scrapy startproject <project_name>
For example:
scrapy startproject test_scrapy
This command generates a new Scrapy project with the following file structure:
test_scrapy
├── scrapy.cfg            # Project configuration file used by the scrapy command
└── test_scrapy
    ├── __init__.py       # Initializes the project as a package
    ├── items.py          # Define the structure of the data you want to scrape
    ├── middlewares.py    # Customize how Scrapy processes requests and responses
    ├── pipelines.py      # Process scraped data (e.g., cleaning or saving it)
    ├── settings.py       # Configure your Scrapy project settings
    └── spiders           # Contains the spiders you create for scraping websites
        └── __init__.py   # Initializes the spiders folder
The following explains the files and folders within the Scrapy project:
- scrapy.cfg: The project's configuration file, used by the scrapy command-line tool.
- spiders (folder): This is where you define your web crawlers, called spiders. Each spider targets specific websites to scrape.
- items.py: Defines the structure (or schema) of the data you want to extract.
- middlewares.py: Allows you to customize how requests and responses are processed during scraping.
- pipelines.py: Cleans, validates, or saves the scraped data to storage systems like databases or files.
- settings.py: Configures your project, including user-agent strings, timeouts, and other settings that control the scraping behaviour.
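For example, a few commonly adjusted options in settings.py might look like the following (a minimal sketch; the values shown are illustrative, and the generated file already ships with sensible defaults):
# settings.py (illustrative values)
BOT_NAME = "test_scrapy"
# respect each site's robots.txt rules
ROBOTSTXT_OBEY = True
# identify your crawler with a descriptive user-agent string
USER_AGENT = "test_scrapy (+https://example.com/contact)"
# pause between requests to avoid overloading the target site
DOWNLOAD_DELAY = 1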
How to scrape data from web pages using Scrapy
In this section, you'll learn how to scrape single-page and multi-page websites using Scrapy. We'll create two spiders, one for each case, using the example pages available at Scrape This Site.
Single Page Web Scraping Example
Here, I'll walk you through scraping all the country data on this webpage: https://www.scrapethissite.com/pages/simple/
Create a new spider within the spiders folder using the following code snippet:
scrapy genspider countryspider scrapethissite.com
The code snippet above creates a countryspider.py file within the spiders folder.
The countryspider.py file contains the following code snippet:
import scrapy


class CountryspiderSpider(scrapy.Spider):
    name = "countryspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://scrapethissite.com"]

    def parse(self, response):
        pass
This code defines the CountryspiderSpider class, a spider designed to scrape data from websites. Here's what each part does:
- name: The unique name of the spider, used when running it from the command line.
- allowed_domains: Specifies the domains the spider is allowed to scrape, preventing it from accessing other websites.
- start_urls: The initial URL where the spider will begin scraping.
- parse() function: A placeholder where all the web scraping logic will be written. This function processes the data extracted from the target website.
Next, let's update the parse function and scrape the webpage. Before we proceed, update the start_urls attribute to the exact webpage URL to be scraped:
import scrapy


class CountryspiderSpider(scrapy.Spider):
    name = "countryspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/simple/"]

    def parse(self, response):
        pass
Next, let's scrape the webpage and identify the correct HTML elements to extract data from. Start by running the following commands in your terminal:
scrapy shell
fetch("https://www.scrapethissite.com/pages/simple/")
The scrapy shell command opens an interactive shell where you can experiment with scraping various data from the website. You can use the shell together with the browser's inspect tab to try out the different elements you need to scrape from the website.
The fetch command retrieves all the webpage's data and stores it in a response variable for further analysis.
After identifying the necessary HTML attributes for scraping, you can use XPath or CSS selectors to extract the data.
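For example, you could try out selectors like these directly in the shell (a quick sketch; the exact selectors come from inspecting the page's markup in your browser):
# list all the country names on the page
response.xpath("//h3[@class='country-name']/i/following-sibling::text()").getall()
# grab the first country block and read its capital
response.css("div.country")[0].css("span.country-capital::text").get()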
Next, update the parse function within the CountryspiderSpider class as shown below:
import scrapy


class CountryspiderSpider(scrapy.Spider):
    name = "countryspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/simple/"]

    def parse(self, response):
        # contains a list of all the country names
        country_names = response.xpath("//h3[@class='country-name']/i/following-sibling::text()").getall()
        container = response.css("div.country")

        # removes the extra new line spacing in the text
        countries = [name.strip() for name in country_names]

        # loops through all the data on the page and returns an object containing the specified data
        for name, country_data in zip(countries, container):
            yield {
                'country_name': name,
                'country_capital': country_data.css('div.country-info>span.country-capital::text').get(),
                'country_population': country_data.css('div.country-info>span.country-population::text').get(),
                'country_area': country_data.css('div.country-info>span.country-area::text').get(),
            }
Finally, run the following command to scrape the website, extract the needed data, and save it to a JSON file:
scrapy crawl countryspider -O countries.json
Multi-page Web Scraping Example
Here, you'll learn how to scrape multiple pages using Scrapy by crawling the Scrape This Site: Hockey Team Example.
Add a new spider within the spiders folder using the following code snippet:
scrapy genspider hockeyspider scrapethissite.com
Update the start_urls attribute within the hockeyspider.py file, as shown below:
import scrapy


class HockeyspiderSpider(scrapy.Spider):
    name = "hockeyspider"
    allowed_domains = ["scrapethissite.com"]
    # 👇🏻 update the value of the start_urls
    start_urls = ["https://www.scrapethissite.com/pages/forms/?per_page=100"]

    def parse(self, response):
        pass
Run the following commands and experiment with extracting the data within the table using XPath or CSS selectors:
scrapy shell
fetch("https://www.scrapethissite.com/pages/simple/")
Before scraping the web pages, define the data structure in the items.py file. The following code represents the data attributes we'll be extracting from the web pages:
import scrapy


class HockeyItem(scrapy.Item):
    name = scrapy.Field()
    year = scrapy.Field()
    wins = scrapy.Field()
    losses = scrapy.Field()
    percent_win = scrapy.Field()
    goals_for = scrapy.Field()
    goals_against = scrapy.Field()
    diffs = scrapy.Field()
Next, update the hockeyspider.py file to include the newly defined HockeyItem class:
import scrapy
# import the HockeyItem class
from ..items import HockeyItem


# function that removes the extra spacing
def stripdata(data):
    return [dt.strip() for dt in data]


class HockeyspiderSpider(scrapy.Spider):
    name = "hockeyspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/forms/?per_page=100"]

    def parse(self, response):
        pass
Update the parse function to retrieve all the data from the table and return it using the HockeyItem class:
def parse(self, response):
    raw_team_names = response.css("td.name::text").getall()
    raw_team_years = response.css("td.year::text").getall()
    raw_team_wins = response.css("td.wins::text").getall()
    raw_team_losses = response.css("td.losses::text").getall()
    raw_percent_win = response.css("td.pct::text").getall()
    raw_goals_for = response.css("td.gf::text").getall()
    raw_goals_against = response.css("td.ga::text").getall()
    raw_diffs = response.css("td.diff::text").getall()

    team_names = stripdata(raw_team_names)
    team_years = stripdata(raw_team_years)
    team_wins = stripdata(raw_team_wins)
    team_losses = stripdata(raw_team_losses)
    team_percent_win = stripdata(raw_percent_win)
    team_goals_for = stripdata(raw_goals_for)
    team_goals_against = stripdata(raw_goals_against)
    team_diffs = stripdata(raw_diffs)

    for name, year, wins, losses, percent_win, goals_for, goals_against, diffs \
            in zip(team_names, team_years, team_wins, team_losses, team_percent_win, team_goals_for, team_goals_against, team_diffs):
        # create a fresh item for each table row instead of reusing a single instance
        hockey_item = HockeyItem()
        hockey_item['name'] = name
        hockey_item['year'] = year
        hockey_item['wins'] = wins
        hockey_item['losses'] = losses
        hockey_item['percent_win'] = percent_win
        hockey_item['goals_for'] = goals_for
        hockey_item['goals_against'] = goals_against
        hockey_item['diffs'] = diffs
        yield hockey_item
The code snippet above only crawls the first page and retrieves the items from it.
To scrape multiple pages, we need to modify the parse function to follow the Next button and continue scraping until it reaches the last page:
def parse(self, response):
    # ... existing code

    for name, year, wins, losses, percent_win, goals_for, goals_against, diffs \
            in zip(team_names, team_years, team_wins, team_losses, team_percent_win, team_goals_for, team_goals_against, team_diffs):
        # ... existing code
        yield hockey_item

    # follow the next page if available
    next_page = response.css("ul.pagination > li:last-child a::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
The next_page variable captures the URL for the next page by extracting the link from the pagination element. The response.follow() function then tells Scrapy to follow that link and continue scraping the next page.
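As a side note, response.follow() resolves relative URLs for you. A roughly equivalent sketch using scrapy.Request and response.urljoin() would look like this:
# build the absolute URL manually, then schedule the request
next_page = response.css("ul.pagination > li:last-child a::attr(href)").get()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)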
Finally, run the following command in your terminal to crawl the webpages and save the data in a CSV file:
scrapy crawl hockeyspider -O hockey.csv
Scrapy supports several formats for output, including JSON, JSONL, CSV, XML, Marshal, and Pickle.
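If you'd rather not pass the output file on the command line each time, you can also configure feed exports in settings.py. Here's a minimal sketch (the file names are just examples):
# settings.py: write the scraped items to JSON and CSV feeds on every crawl
FEEDS = {
    "hockey.json": {"format": "json", "overwrite": True},
    "hockey.csv": {"format": "csv", "overwrite": True},
}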
You’ve completed a hands-on guide to web scraping using Scrapy.
Next Steps
So far, you've learnt how to scrape single and multiple pages using Scrapy. However, Scrapy offers many more powerful features to help you build advanced web scraping projects, and the official Scrapy documentation is a good place to explore them.
Thank you for reading! 🥳
Writer's Corner
Hi, I am open to freelance technical writing gigs and remote opportunities. Let's work together. 📧: asaoludavid234 at gmail dot com