BeautifulSoup (bs4) was created over a decade-and-a-half ago. And it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.
In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!
1. No Dependencies
gazpacho is installed at the command line:
pip install gazpacho
With no extra dependencies:
pip freeze
# gazpacho==1.1
In contrast, bs4 depends on soupsieve and is usually paired with lxml for faster parsing. I won't tell you how to write software, but minimizing dependencies is usually a good idea...
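If you want to check this yourself, the standard library can report what a package declares as requirements. A quick sketch (assuming both libraries are installed; the exact strings depend on your installed versions):
from importlib.metadata import requires
print(requires("gazpacho"))        # None, i.e. no third-party requirements
print(requires("beautifulsoup4"))  # something like ['soupsieve >1.2']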
2. Batteries Included
The html for this blog post can be fetched and made parse-able with Soup.get:
from gazpacho import Soup
url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)
Unfortunately, you'll need requests on top of bs4 to do the same thing:
import requests
from bs4 import BeautifulSoup
url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
bsoup = BeautifulSoup(html, "html.parser")  # a parser must be specified to avoid a warning
3. Simple finding
bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object, making it hard to know what to use and when to use it:
len(dir(BeautifulSoup()))
# 184
In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:
[method for method in dir(Soup()) if not method.startswith("_")]
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']
Looking at that list it's clear that to find the title of this post (nested inside of an h1 tag), for example, we'll need to use .find:
soup.find('h1')
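And everything else on that short list hangs off the object that .find returns. A quick sketch (assuming the page has just one h1, so .find gives back a single Soup object):
h1 = soup.find("h1")
h1.tag    # 'h1'
h1.attrs  # the tag's attributes, as a dict
h1.text   # the post title, as a plain string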
4. Prototyping to Production
gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.
To guarantee and enforce return types in production, the mode= argument in .find can be set manually:
title = (soup
    .find("header", {'id': 'main-title'}, mode="first")
    .find("h1", mode="all")[0]
    .text
)
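The rule of thumb (illustrated below with link tags; exact return values depend on the page): mode="first" always gives back a single Soup object, mode="all" always gives back a list, and leaving it alone lets gazpacho decide.
first_link = soup.find("a", mode="first")  # a single Soup object
all_links = soup.find("a", mode="all")     # a list of Soup objects
some_links = soup.find("a")                # automatic: whatever fits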
In contrast, bs4 has 27 find methods and they all return something different:
[method for method in dir(BeautifulSoup()) if 'find' in method]
5. PEP 561 Compliant
As of version 1.1, gazpacho is PEP 561 compliant, meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) code-base:
help(soup.find)
# Signature:
# soup.find(
#     tag: str,
#     attrs: Union[Dict[str, Any], NoneType] = None,
#     *,
#     partial: bool = True,
#     mode: str = 'automatic',
#     strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]
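Which means a type checker can catch mistakes in your own scraping code before you run it. A minimal sketch (the file name and the exact error message are just for illustration):
# check.py
from gazpacho import Soup
soup = Soup("<h1>hello</h1>")
soup.find(123)  # mypy check.py flags this: "tag" expects a str, not an int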
6. Automatic Formatting
The html on dev.to (and for this post) is well formatted. But if it weren't:
header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags"> <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B"> <span class="crayons-tag__prefix">#</span> python </a> <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:"> <span class="crayons-tag__prefix">#</span> webscraping </a> <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:"> <span class="crayons-tag__prefix">#</span> gazpacho </a> <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368"> <span class="crayons-tag__prefix">#</span> hacktoberfest </a></div>
gazpacho would be able to automatically format and indent the bad/malformed html:
tags = Soup(bad_html)
Making things easier to read:
print(tags)
# <div class="mb-4 spec__tags">
# <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
# <span class="crayons-tag__prefix">#</span>
# python
# </a>
# <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
# <span class="crayons-tag__prefix">#</span>
# webscraping
# </a>
# <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
# <span class="crayons-tag__prefix">#</span>
# gazpacho
# </a>
# <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
# <span class="crayons-tag__prefix">#</span>
# hacktoberfest
# </a>
# </div>
7. Speed
gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:
%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While bs4 takes nearly twice as long to do the same thing:
%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
8. Partial Matching
gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:
<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">
And can be matched exactly with:
soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)
Or partially (the default behaviour) with:
sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)
# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text
9. Debt-free
gazpacho is Python 3 first, formatted with Black, typed with mypy, and only ~400 sloc. It's easy to read through the source:
import inspect
source = inspect.getsource(Soup.find)
print(source)
And unlike bs4, it isn't riddled with Python 2 technical debt.
10. Open (and Friendly)!
Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform), and looking for contributors.
If you're participating in #hacktoberfest, we'd love to have you. There are a couple of open issues that could use some help!
Top comments (9)
Yes, it's true that bs4 has a lot of methods, and knowing and remembering them is tough, but for most work you can simply use the common ones.
However, where I faced the most difficulty is with dynamic pages; for example, you might go to an ecommerce site and scrape a search results page.
In that type of scenario, how would gazpacho work? @maxhumber maybe you can make a tutorial video with selenium and gazpacho?
There are some examples of gazpacho + selenium on this website: scrape.world/
I've used bs4 in 3 projects by now and it never occurred to me to search for alternatives. I'll certainly give gazpacho a chance; it seems pretty easy.
Cool!
One thing I've never understood about Beautiful Soup is how un-user-friendly the docs are. Gazpacho looks so simple in comparison - I'll definitely check it out!
Right? 🙈
Why not just simply lxml with xpath? (Who says we have to use BeautifulSoup?) My favorite is Cheerio (in Node.js / web browser), a jQuery analog, though.
I remember attending your workshops (including the ones where you talked about how one of your cousins plunged into a barn without his parachute opening), and I wonder why I won't be teaching people how to use Gazpacho tomorrow. I've prepared some extended Pandas exercises (it's only an Intro to Big Data). Gazpacho should get a Django Girls-style tutorial website and conquer the world.
Bow Bow Pow hahahaha