Prologue
Hello! In this post I am going to describe the process of writing a scraper script in Python with the help of the Beautiful Soup library.
Installing the dependencies
First of all, since Beautiful Soup is a 3rd-party community project, you have to install it via the PyPI registry.
pip install beautifulsoup4
Philosophy of Beautiful Soup
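Beautiful Soup (BS) is a library that sits atop an HTML/XML parser (in our case, the former) and exposes the parsed document as a tree of Python objects that you can search and navigate.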
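To make that idea a bit more concrete, here is a minimal sketch that hands a small HTML string to Beautiful Soup together with Python's built-in html.parser. The SAMPLE_HTML snippet and the variable names are made up for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, soup!</p></body></html>"
)

# Beautiful Soup delegates the actual parsing to the parser named here
# ('html.parser' ships with Python) and wraps the result in a tree.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
first_paragraph = soup.p.get_text()   # text inside the first <p> tag

print(page_title)
print(first_paragraph)
```

Nothing here touches the network, so it is a handy way to experiment before pointing the script at a real page.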
Basic Script
Now that we know how it works, let's write a tiny script:
from urllib.request import urlopen
from bs4 import BeautifulSoup
WEBSITE = "https://google.com"
html = urlopen(WEBSITE)
bs = BeautifulSoup(html.read(), 'html.parser')
In this example, we also make use of urllib.request from the standard library, which just downloads the HTML for us.
Then, we call read() on the html variable, which holds the response for the google.com document, and pass the result to BeautifulSoup along with the name of the parser.
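If you want to try the read-then-parse step without hitting the network, something like the following may help. It is only a sketch: fake_response is a hypothetical stand-in built with io.BytesIO, which, like the object returned by urlopen(), has a read() method that returns bytes.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the urlopen() response: a file-like object
# whose read() returns the raw bytes of a (made-up) page.
fake_response = io.BytesIO(b"<html><body><h1>Offline page</h1></body></html>")

raw_bytes = fake_response.read()                  # same call as html.read()
soup = BeautifulSoup(raw_bytes, "html.parser")    # BS accepts bytes or str

heading = soup.h1.get_text()
print(heading)
```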
Parsing data
Sometimes, we want to get specific parts of a document, such as a paragraph or an image.
You can search for a specific HTML tag in BeautifulSoup with the find() method.
Let's scrape the Google logo tag from their homepage!
Add the following lines of code to the already existing file:
google_logo = bs.find('img', { 'id': 'hplogo' })
print(google_logo)
These two lines of code will hopefully produce this output:
<img
alt="Google"
height="92"
id="hplogo"
src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
style="padding:28px 0 14px"
width="272"/>
So, how does this work?
Well, we are calling the find() method and passing it two arguments: the tag name and a dictionary of attributes.
To be exact, we are telling it that we are searching for the first <img> tag whose id attribute is 'hplogo'.
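One thing worth knowing is that find() returns None when nothing matches, which can happen if the page markup changes. Here is a small sketch of that behaviour on a made-up snippet; the second image, the src paths, and the 'does-not-exist' id are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = """
<div>
  <img id="banner" src="/img/banner.png" alt="Banner">
  <img id="hplogo" src="/img/logo.png" alt="Logo">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same call shape as in the post: tag name plus a dict of attributes.
logo = soup.find("img", {"id": "hplogo"})
missing = soup.find("img", {"id": "does-not-exist"})

# Check for None before reading attributes from the result.
if logo is not None:
    logo_src = logo["src"]   # attributes are read like dictionary keys
    print(logo_src)

print(missing)
```

Checking for None first avoids a TypeError when the tag you expect is not in the document.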
Epilogue
That's all.
To learn more about Beautiful Soup, read the docs