TL;DR
Web scraping helps you collect and organize data from websites. It is a great way to spot trends, gather insights, and make informed decisions. When done correctly, it can be a powerful tool for everything from research to business strategy.
In this tutorial, you will learn how to scrape data from websites efficiently in Python using Scrapy, a powerful data scraping framework that can also act as a general-purpose web crawler.
Prerequisites
Before we begin, you should have a basic understanding of Python. You'll also need to set up a few tools on your computer:
- Python: Make sure Python 3 is installed.
- A code editor: We'll use PyCharm, but you can use any Python-friendly editor.
- Pip: Python's package manager for installing libraries.
To check if Python and Pip are installed, run these commands in your terminal:
python3 --version
pip3 --version
What is Scrapy?
Scrapy is a powerful framework designed for extracting data from websites. It helps you collect and organize structured data for tasks like data analysis, mining insights, or even archiving information.
While Scrapy is primarily built for web scraping, it can also handle tasks like extracting data from APIs or serving as a general-purpose web crawler.
How to create a new Scrapy project
In this section, I'll guide you through installing Scrapy, setting up a Scrapy project, and understanding the essential files and folders needed to scrape data effectively.
First, create a new pure Python project in PyCharm. PyCharm automatically creates and activates a virtual environment for the project.
Run the following command in the terminal to install Scrapy within the virtual environment.
pip3 install scrapy
Create a new Scrapy project by running the code snippet below:
scrapy startproject <project_name>
For example:
scrapy startproject test_scrapy
This command generates a new Scrapy project with the following file structure:
test_scrapy
├── scrapy.cfg            # Project configuration file used by the scrapy command
└── test_scrapy
    ├── __init__.py       # Initializes the project as a package
    ├── items.py          # Define the structure of the data you want to scrape
    ├── middlewares.py    # Customize how Scrapy processes requests and responses
    ├── pipelines.py      # Process scraped data (e.g., cleaning or saving it)
    ├── settings.py       # Configure your Scrapy project settings
    └── spiders           # Contains the spiders you create for scraping websites
        └── __init__.py   # Initializes the spiders folder
The following explains the files and folders within the Scrapy project:
- scrapy.cfg: The project's configuration file, used by the scrapy command-line tool.
- spiders (folder): This is where you define your web crawlers, called spiders. Each spider targets specific websites to scrape.
- items.py: Defines the structure (or schema) of the data you want to extract.
- middlewares.py: Allows you to customize how requests and responses are processed during scraping.
- pipelines.py: Cleans, validates, or saves the scraped data to storage systems like databases or files.
- settings.py: Configures your project, including user-agent strings, timeouts, and other settings that control the scraping behaviour.
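For example, a few commonly adjusted options in settings.py might look like the following (a minimal sketch; the values shown are illustrative, and the generated file already ships with sensible defaults):
# settings.py (illustrative values)
BOT_NAME = "test_scrapy"
# respect each site's robots.txt rules
ROBOTSTXT_OBEY = True
# identify your crawler with a descriptive user-agent string
USER_AGENT = "test_scrapy (+https://example.com/contact)"
# pause between requests to avoid overloading the target site
DOWNLOAD_DELAY = 1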
How to scrape data from web pages using Scrapy
In this section, you'll learn how to scrape single-page and multi-page websites using Scrapy. We'll create two spiders, one for each case, using the example pages available at Scrape This Site.
Single Page Web Scraping Example
Here, I'll walk you through scraping all the country data on this webpage: https://www.scrapethissite.com/pages/simple/
Create a new spider within the spiders folder using the following code snippet:
scrapy genspider countryspider scrapethissite.com
The code snippet above creates a countryspider.py file within the spiders folder.
The countryspider.py file contains the following code snippet:
import scrapy


class CountryspiderSpider(scrapy.Spider):
    name = "countryspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://scrapethissite.com"]

    def parse(self, response):
        pass
This code defines the CountryspiderSpider class, a spider designed to scrape data from websites. Here's what each part does:
- name: The unique name of the spider, used when running it from the command line.
- allowed_domains: Specifies the domains the spider is allowed to scrape, preventing it from accessing other websites.
- start_urls: The initial URL where the spider will begin scraping.
- parse() function: A placeholder where all the web scraping logic will be written. This function processes the data extracted from the target website.
Next, let's update the parse function and scrape the webpage. Before we proceed, update the start_urls attribute to the exact webpage URL to be scraped:
import scrapy


class CountryspiderSpider(scrapy.Spider):
    name = "countryspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/simple/"]

    def parse(self, response):
        pass
Next, let's scrape the webpage and identify the correct HTML elements to extract data from. Start by running the following commands in your terminal:
scrapy shell
fetch("https://www.scrapethissite.com/pages/simple/")
The scrapy shell command opens an interactive shell where you can experiment with scraping various data from the website. You can use the shell together with the browser's inspect tab to try out the different elements you need to scrape from the website.
The fetch command retrieves all the webpage's data and stores it in a response variable for further analysis.
After identifying the necessary HTML attributes for scraping, you can use XPath or CSS selectors to extract the data.
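For example, you could try out selectors like these directly in the shell (a quick sketch; the exact selectors come from inspecting the page's markup in your browser):
# list all the country names on the page
response.xpath("//h3[@class='country-name']/i/following-sibling::text()").getall()
# grab the first country block and read its capital
response.css("div.country")[0].css("span.country-capital::text").get()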
Next, update the parse function within the CountryspiderSpider class as shown below:
import scrapy


class CountryspiderSpider(scrapy.Spider):
    name = "countryspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/simple/"]

    def parse(self, response):
        # contains a list of all the country names
        country_names = response.xpath("//h3[@class='country-name']/i/following-sibling::text()").getall()
        container = response.css("div.country")

        # removes the extra new line spacing in the text
        countries = [name.strip() for name in country_names]

        # loops through all the data on the page and returns an object containing the specified data
        for name, country_data in zip(countries, container):
            yield {
                'country_name': name,
                'country_capital': country_data.css('div.country-info>span.country-capital::text').get(),
                'country_population': country_data.css('div.country-info>span.country-population::text').get(),
                'country_area': country_data.css('div.country-info>span.country-area::text').get(),
            }
Finally, run the following command to scrape the website, extract the needed data, and save it to a JSON file:
scrapy crawl countryspider -O countries.json
Multi-page Web Scraping Example
Here, you'll learn how to scrape multiple pages using Scrapy by crawling the Scrape This Site: Hockey Team Example.
Add a new spider within the spiders folder using the following code snippet:
scrapy genspider hockeyspider scrapethissite.com
Update the start_urls attribute within the hockeyspider.py file, as shown below:
import scrapy


class HockeyspiderSpider(scrapy.Spider):
    name = "hockeyspider"
    allowed_domains = ["scrapethissite.com"]
    # 👇🏻 update the value of the start_urls
    start_urls = ["https://www.scrapethissite.com/pages/forms/?per_page=100"]

    def parse(self, response):
        pass
Run the following commands and experiment with extracting the data within the table using XPath or CSS selectors:
scrapy shell
fetch("https://www.scrapethissite.com/pages/simple/")
Before scraping the web pages, define the data structure in the items.py file. The following code represents the data attributes we'll be extracting from the web pages:
import scrapy


class HockeyItem(scrapy.Item):
    name = scrapy.Field()
    year = scrapy.Field()
    wins = scrapy.Field()
    losses = scrapy.Field()
    percent_win = scrapy.Field()
    goals_for = scrapy.Field()
    goals_against = scrapy.Field()
    diffs = scrapy.Field()
Next, update the hockeyspider.py file to include the newly defined HockeyItem class:
import scrapy
# import the HockeyItem class
from ..items import HockeyItem


# function that removes the extra spacing
def stripdata(data):
    return [dt.strip() for dt in data]


class HockeyspiderSpider(scrapy.Spider):
    name = "hockeyspider"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/forms/?per_page=100"]

    def parse(self, response):
        pass
Update the parse function to retrieve all the data from the table and return it using the HockeyItem class:
def parse(self, response):
    raw_team_names = response.css("td.name::text").getall()
    raw_team_years = response.css("td.year::text").getall()
    raw_team_wins = response.css("td.wins::text").getall()
    raw_team_losses = response.css("td.losses::text").getall()
    raw_percent_win = response.css("td.pct::text").getall()
    raw_goals_for = response.css("td.gf::text").getall()
    raw_goals_against = response.css("td.ga::text").getall()
    raw_diffs = response.css("td.diff::text").getall()

    team_names = stripdata(raw_team_names)
    team_years = stripdata(raw_team_years)
    team_wins = stripdata(raw_team_wins)
    team_losses = stripdata(raw_team_losses)
    team_percent_win = stripdata(raw_percent_win)
    team_goals_for = stripdata(raw_goals_for)
    team_goals_against = stripdata(raw_goals_against)
    team_diffs = stripdata(raw_diffs)

    for name, year, wins, losses, percent_win, goals_for, goals_against, diffs \
            in zip(team_names, team_years, team_wins, team_losses, team_percent_win, team_goals_for, team_goals_against, team_diffs):
        # create a fresh item for each table row instead of reusing a single instance
        hockey_item = HockeyItem()
        hockey_item['name'] = name
        hockey_item['year'] = year
        hockey_item['wins'] = wins
        hockey_item['losses'] = losses
        hockey_item['percent_win'] = percent_win
        hockey_item['goals_for'] = goals_for
        hockey_item['goals_against'] = goals_against
        hockey_item['diffs'] = diffs
        yield hockey_item
The code snippet above only crawls the first page and retrieves the items from it.
To scrape multiple pages, we need to modify the parse function to follow the Next button and continue scraping until it reaches the last page:
def parse(self, response):
    # ... existing code

    for name, year, wins, losses, percent_win, goals_for, goals_against, diffs \
            in zip(team_names, team_years, team_wins, team_losses, team_percent_win, team_goals_for, team_goals_against, team_diffs):
        # ... existing code
        yield hockey_item

    # follow the next page if available
    next_page = response.css("ul.pagination > li:last-child a::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
The next_page variable captures the URL for the next page by extracting the link from the pagination element. The response.follow() function then tells Scrapy to follow that link and continue scraping the next page.
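As a side note, response.follow() resolves relative URLs for you. A roughly equivalent sketch using scrapy.Request and response.urljoin() would look like this:
# build the absolute URL manually, then schedule the request
next_page = response.css("ul.pagination > li:last-child a::attr(href)").get()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)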
Finally, run the following command in your terminal to crawl the webpages and save the data in a CSV file:
scrapy crawl hockeyspider -O hockey.csv
Scrapy supports several formats for output, including JSON, JSONL, CSV, XML, Marshal, and Pickle.
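If you'd rather not pass the output file on the command line each time, you can also configure feed exports in settings.py. Here's a minimal sketch (the file names are just examples):
# settings.py: write the scraped items to JSON and CSV feeds on every crawl
FEEDS = {
    "hockey.json": {"format": "json", "overwrite": True},
    "hockey.csv": {"format": "csv", "overwrite": True},
}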
You’ve completed a hands-on guide to web scraping using Scrapy.
Next Steps
So far, you've learnt how to scrape single and multiple pages using Scrapy. However, Scrapy offers many more powerful features to help you build advanced web scraping projects, and the official Scrapy documentation is a good place to explore them.
Thank you for reading! 🥳
Writer's Corner
Hi, I am open to freelance technical writing gigs and remote opportunities. Let's work together. 📧: asaoludavid234 at gmail dot com