There is no shortage of excellent datasets on the internet, but you might want to show prospective employers that you're able to find and scrape your own data as well. Plus, knowing how to scrape the web means you can find and use datasets that match your interests, regardless of whether they've already been compiled.
Scraping your own data also gives you the ability to build a custom dataset for testing as well as for larger projects.
Today we're going to scrape the well-known news site "Times of India", find all the link tags available on the page, extract the URLs they contain, and validate them.
Setting up
We will start by importing the required libraries and saving the URL we want to scrape in the "URL" variable.
from bs4 import BeautifulSoup
import requests
import validators
URL = "https://timesofindia.indiatimes.com/"
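Note that beautifulsoup4, requests, and validators are third-party packages. If they aren't installed yet, they can usually be installed with pip:
pip install beautifulsoup4 requests validators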
Requesting the Page
The first thing we are going to do is request the page using the requests library. We send a GET request and receive a response containing the page content; in other words, we have downloaded the page.
...
soup = None
res = requests.get(URL)
if res.status_code == 200: ...
Here we save the response in the res variable and then check whether the request was successful: if the status_code is 200, we parse the page with Beautiful Soup to extract information.
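Some news sites refuse requests that don't look like they come from a browser, and a slow server can make requests.get() hang. As an optional precaution (not part of the original snippet; the User-Agent string and the 10-second timeout below are just example values), you can pass custom headers and a timeout:
# Optional, hedged variant of the request above.
# The header value and the timeout are example choices, not requirements of the site.
headers = {"User-Agent": "Mozilla/5.0 (compatible; dataset-scraper/1.0)"}
res = requests.get(URL, headers=headers, timeout=10)
if res.status_code != 200:
    print("Request failed with status code:", res.status_code)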
Parsing Content
In the above code we created the soup variable; once parsing succeeds, it will store the parsed page.
...
soup = None
if res.status_code == 200:
    soup = BeautifulSoup(res.content, 'html.parser')
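As a quick sanity check (an optional extra, not part of the original script), you can print the page title once parsing succeeds:
# Optional sanity check: print the <title> of the parsed page, if one exists.
if soup is not None and soup.title is not None:
    print(soup.title.string)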
Filtering Link Tags
We have successfully requested and parsed the web page; now it's time to filter all the link tags out of the page.
...
allLinkTag = soup.find_all('a')  # every <a> (link) tag on the page
unverified = []
for link in allLinkTag:
    href = link.get('href')  # some <a> tags have no href attribute
    if href:
        unverified.append(href)
Here we have collected all the link tags and stored them in the allLinkTag list, and we have created an unverified list to hold the URLs found in those link tags. We fill it by iterating over the link tags and extracting the href attribute from each one, skipping any tag that doesn't have one.
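A large news page usually repeats many of its links (navigation menus, footers, and so on). If you only want each URL once, one optional extra (not in the original script) is to deduplicate the list while preserving order:
# Optional: remove duplicate URLs while keeping their original order.
# dict.fromkeys() preserves insertion order in Python 3.7+.
unverified = list(dict.fromkeys(unverified))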
Validating URLs
Now that we have all the URLs from the page, it is time to validate them.
...
validUrls = []
inValidUrls = []
for url in unverified:
    if validators.url(url):
        validUrls.append(url)
    else:
        inValidUrls.append(url)
We iterate over all the unverified URLs and validate each one using the url() function from the validators library. If a URL is valid, we append it to the validUrls list; otherwise we append it to the inValidUrls list.
With that, we have separated the valid and invalid URLs scraped from the page.
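In practice, many of the "invalid" entries are simply relative links that start with / rather than a full https:// address. If you would rather keep them, one option (a sketch, not part of the original script) is to resolve every href against the base URL with urllib.parse.urljoin before validating:
from urllib.parse import urljoin

# Resolve relative hrefs against the base URL, then re-run the validation.
resolved = [urljoin(URL, url) for url in unverified]
validUrls = [url for url in resolved if validators.url(url)]
inValidUrls = [url for url in resolved if not validators.url(url)]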
Using the above procedure, you can scrape as many websites as you want and build a custom dataset for testing or for your own projects.
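To turn the scraped links into a reusable dataset, you could, for example, write the valid URLs to a CSV file with Python's built-in csv module (the filename below is just a placeholder):
import csv

# Write one URL per row so the file can be loaded later as a dataset.
with open("times_of_india_links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])  # header row
    writer.writerows([url] for url in validUrls)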
Check out the full code on GitHub:
GitHub: Web Scraping 1