BeautifulSoup (bs4) was created over a decade-and-a-half ago. And it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.
In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!
1. No Dependencies
gazpacho is installed at the command line:
pip install gazpacho
With no extra dependencies:
pip freeze
# gazpacho==1.1
In contrast, bs4 depends on soupsieve and is usually paired with lxml for faster parsing. I won't tell you how to write software, but minimizing dependencies is usually a good idea...
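If you want to check this yourself, the standard library can report what a package declares as requirements. A quick sketch (assuming both libraries are installed; the exact strings depend on your installed versions):
from importlib.metadata import requires
print(requires("gazpacho"))        # None, i.e. no third-party requirements
print(requires("beautifulsoup4"))  # something like ['soupsieve >1.2']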
2. Batteries Included
The html for this blog post can be fetched and made parse-able with Soup.get:
from gazpacho import Soup
url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)
Unfortunately, you'll need requests on top of bs4 to do the same thing:
import requests
from bs4 import BeautifulSoup
url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
bsoup = BeautifulSoup(html, "html.parser")  # a parser must be specified to avoid a warning
3. Simple finding
bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object, making it hard to know what to use and when to use it:
len(dir(BeautifulSoup()))
# 184
In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:
[method for method in dir(Soup()) if not method.startswith("_")]
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']
Looking at that list it's clear that to find the title of this post (nested inside of an h1 tag), for example, we'll need to use .find:
soup.find('h1')
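And everything else on that short list hangs off the object that .find returns. A quick sketch (assuming the page has just one h1, so .find gives back a single Soup object):
h1 = soup.find("h1")
h1.tag    # 'h1'
h1.attrs  # the tag's attributes, as a dict
h1.text   # the post title, as a plain string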
4. Prototyping to Production
gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.
To guarantee and enforce return types in production, the mode= argument in .find can be set manually:
title = (soup
    .find("header", {'id': 'main-title'}, mode="first")
    .find("h1", mode="all")[0]
    .text
)
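The rule of thumb (illustrated below with link tags; exact return values depend on the page): mode="first" always gives back a single Soup object, mode="all" always gives back a list, and leaving it alone lets gazpacho decide.
first_link = soup.find("a", mode="first")  # a single Soup object
all_links = soup.find("a", mode="all")     # a list of Soup objects
some_links = soup.find("a")                # automatic: whatever fits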
In contrast, bs4 has 27 find methods and they all return something different:
[method for method in dir(BeautifulSoup()) if 'find' in method]
5. PEP 561 Compliant
As of version 1.1, gazpacho is PEP 561 compliant, meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) code-base:
help(soup.find)
# Signature:
# soup.find(
#     tag: str,
#     attrs: Union[Dict[str, Any], NoneType] = None,
#     *,
#     partial: bool = True,
#     mode: str = 'automatic',
#     strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]
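Which means a type checker can catch mistakes in your own scraping code before you run it. A minimal sketch (the file name and the exact error message are just for illustration):
# check.py
from gazpacho import Soup
soup = Soup("<h1>hello</h1>")
soup.find(123)  # mypy check.py flags this: "tag" expects a str, not an int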
6. Automatic Formatting
The html on dev.to (and for this post) is well formatted. But if it weren't:
header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags"> <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B"> <span class="crayons-tag__prefix">#</span> python </a> <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:"> <span class="crayons-tag__prefix">#</span> webscraping </a> <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:"> <span class="crayons-tag__prefix">#</span> gazpacho </a> <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368"> <span class="crayons-tag__prefix">#</span> hacktoberfest </a></div>
gazpacho would be able to automatically format and indent the bad/malformed html:
tags = Soup(bad_html)
Making things easier to read:
print(tags)
# <div class="mb-4 spec__tags">
# <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
# <span class="crayons-tag__prefix">#</span>
# python
# </a>
# <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
# <span class="crayons-tag__prefix">#</span>
# webscraping
# </a>
# <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
# <span class="crayons-tag__prefix">#</span>
# gazpacho
# </a>
# <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
# <span class="crayons-tag__prefix">#</span>
# hacktoberfest
# </a>
# </div>
7. Speed
gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:
%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While bs4 takes nearly twice as long to do the same thing:
%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
8. Partial Matching
gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:
<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">
And can be matched exactly with:
soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)
Or partially (the default behaviour) with:
sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)
# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text
9. Debt-free
gazpacho is Python 3 first, formatted with Black, typed with mypy, and only ~400 sloc. It's easy to read through the source:
import inspect
source = inspect.getsource(Soup.find)
print(source)
And unlike bs4, it isn't riddled with Python 2 technical debt.
10. Open (and Friendly)!
Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform), and looking for contributors.
If you're participating in #hacktoberfest, we'd love to have you. There are a couple of open issues that could use some help!
Top comments (9)
Yes, it's true that bs4 has a lot of methods, and knowing and remembering them is tough, but for most work you can simply use the common ones.
However, where I faced the most difficulty is with dynamic pages; for example, you might go to an ecommerce site and scrape a search results page.
In that type of scenario, how would gazpacho work? @maxhumber maybe you can make a tutorial video with selenium and gazpacho?
There are some examples of gazpacho + selenium on this website: scrape.world/
I've used bs4 in 3 projects by now and it never occurred to me to search for alternatives. I'll certainly give gazpacho a chance; it seems pretty easy.
Cool!
One thing I've never understood about Beautiful Soup is how un-user-friendly the docs are. Gazpacho looks so simple in comparison - I'll definitely check it out!
Right? 🙈
Why not just simply lxml with xpath? (Who says we have to use BeautifulSoup?) My favorite is Cheerio (in Node.js / web browser), a jQuery analog, though.
I remember attending your workshops (including the ones where you talked about how one of your cousins plunged into a barn without his parachute opening), and I wonder why I won't be teaching people how to use Gazpacho tomorrow. I've prepared some extended Pandas exercises (it's only an Intro to Big Data). Gazpacho should get a Django Girls-style tutorial website and conquer the world.
Bow Bow Pow hahahaha