Scrape multiple images on the web
This article is about scraping multiple images from a web page. The basic requirement is to get all images from the web page and save them into a local folder, and the additional requirement is to save the images with their titles so that these files can be easily managed or processed later on.
Some download tools can save all the images on a web page into a folder, but they mostly name the files with ids or random strings that are hard to understand. So I am implementing this scenario with the Python module Clicknium, because it is easy to get started with and works well for capturing a list of similar elements.
Let's have a look at the web page for the image list as below. Each item contains an image, title and price. The expected result is a folder containing all the images with title as their names.
We will cover the scraping in 3 parts as below:
- Development tool preparation
- Capture locator for the image
- Write automation code
Development tool preparation
- Install Visual Studio Code and Clicknium extension.
- Follow the instructions of the quick start document in Clicknium extension to complete the setup.
Capture locator for the image
After setting up the development environment, open an empty folder in VSCode and create a new .py file.
Start capturing the locator by clicking the button below or pressing "Ctrl+F10".
Once the "Clicknium Recorder" is invoked, click the "Similar elements" button in the recorder to capture a locator for the image list.
After clicking the button, a wizard pops up guiding you to generate a locator which can match all expected images.
Hover the mouse cursor over the element and add the first target element by pressing "Ctrl+Click". It can be any image in the list.
Once the element is added, the wizard shows how many similar elements are matched by the current locator. Since only one element has been added so far, the locator matches only that element. We can capture another image from the list to match more.
After adding 3 images to the wizard, we can see that 21 elements are now matched by the auto-generated locator.
As there are 22 images in total on the web page, we continue to add more image elements to the wizard until all 22 elements are matched by the auto-generated locator. (If the matched number is not what we expect, we can always add more elements.) Click the "Save" button to complete the wizard.
After capturing the locator, we can open it in Visual Studio Code to see its details as below. The detailed properties can be updated manually if the locator can be optimized further.
From the locator editor panel, we can also click the "Validate" button to ensure that all 22 matched elements are the expected ones. After clicking the "Validate" button, a wizard lets us step through the matched elements one by one. If any of them is incorrect, we may
- recapture the locator by going through the wizard again
- or manually modify the locator in the locator edit panel above.
Capture image titles in the same way as above. The locator definition is as below:
Write automation code
With the locators in place, we can now write the code as below:
- Get images and titles
- Download image and save it with title as file name
import os
import requests
import shutil
from clicknium import clicknium as cc, locator, ui

# attach to the opened browser, the url is a fake site
tab = cc.edge.attach_by_title_url(url = "https://gallerydemo.com/pages/outerwear")

# get images and titles
imgs = tab.find_elements(locator.msedge.gallerydept.img_out)
titles = tab.find_elements(locator.msedge.gallerydept.span_out)

# iterate every image element
for x in range(len(imgs)):
    src = imgs[x].get_property("src")
    tstr = titles[x].get_text()
    # download image with url and save to folder with title as name
    res = requests.get("https:" + src, stream = True)
    if res.status_code == 200:
        file = "c:\\test\\gallery\\" + tstr + ".png"
        # use different name if the title is duplicated
        if os.path.exists(file):
            file = "c:\\test\\gallery\\" + tstr + str(x) + ".png"
        with open(file, 'wb') as f:
            shutil.copyfileobj(res.raw, f)
        print('Image successfully downloaded: ', tstr)
    else:
        print('Image couldn\'t be retrieved')
- The complete code can be found on GitHub.
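The script prepends "https:" to each src value, which works when the page uses protocol-relative image URLs (starting with "//"). On other pages the src may already be absolute or may be relative to the page. A minimal sketch for handling all three cases with the standard library is shown here; the helper name absolute_image_url and the example.com URLs are hypothetical and only for illustration.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url: str, src: str) -> str:
    """Resolve an image src against the URL of the page it was found on.

    Hypothetical helper: handles protocol-relative ("//host/a.png"),
    root-relative ("/files/a.png"), and already absolute URLs.
    """
    return urljoin(page_url, src)


# The page URL is the fake site used in this article.
page = "https://gallerydemo.com/pages/outerwear"
print(absolute_image_url(page, "//cdn.example.com/img/coat.png"))
print(absolute_image_url(page, "/files/coat.png"))
print(absolute_image_url(page, "https://cdn.example.com/img/coat.png"))
```

With a helper like this, the download call could pass the resolved URL to requests.get instead of concatenating "https:" by hand.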
The execution result is as below. The images are saved in the folder c:\test\gallery, each named with the same title that is shown on the web page.
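Because the titles are used directly as file names, a title containing characters such as ":" or "/" would fail to save on Windows, and duplicated titles need a fallback name. The following is a rough sketch of one way to handle both, not part of the original script; the helper names safe_file_name and unique_path are assumptions made for this example.

```python
import os
import re

# Characters that Windows does not allow in file names, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_file_name(title: str, fallback: str = "image") -> str:
    """Replace invalid characters with "_" and trim trailing dots and spaces."""
    cleaned = INVALID_CHARS.sub("_", title).strip(" .")
    return cleaned or fallback


def unique_path(folder: str, name: str, ext: str = ".png") -> str:
    """Return a path in folder that does not exist yet, adding _1, _2, ... if needed."""
    candidate = os.path.join(folder, name + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{name}_{counter}{ext}")
        counter += 1
    return candidate


# Example: build a target path for a title with characters that are invalid on Windows.
print(unique_path("gallery", safe_file_name('Coat: "Red" 1/2')))
```

In the loop above, the path returned by unique_path could replace the hand-built file string, so that duplicated titles get a numbered suffix instead of overwriting each other.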
Conclusion
In this article I demonstrated how to scrape images from a web page. With the Clicknium "Similar elements" function, it is easy to locate the images with a few mouse clicks and then write simple code against the generated locator.
The important part is capturing the similar elements: the more elements you add, the more accurate the auto-generated locator becomes. A good practice is to add elements from different locations, such as different columns and different rows, so that the wizard has broader coverage from which to generate a correct locator.
Check the Document for more detail about Clicknium.