Karthik Subramanian for AWS Community Builders

Posted on Jun 19, 2023 • Edited on Aug 16, 2023 • Originally published at Medium

Web Scraping with Selenium & AWS Lambda

#selenium #lambda #webscraping #aws

In my last post I created a Lambda function that accepts a request, stores it in a DynamoDB table, and sends a message to an SQS queue.

Let’s now create another Lambda function that reads from that queue and processes the request by scraping the URL with Selenium.

Installing Selenium

Create a new file under src called “chrome-deps.txt” and copy the following into it -

acl adwaita-cursor-theme adwaita-icon-theme alsa-lib at-spi2-atk at-spi2-core
atk avahi-libs cairo cairo-gobject colord-libs cryptsetup-libs cups-libs dbus
dbus-libs dconf desktop-file-utils device-mapper device-mapper-libs elfutils-default-yama-scope
elfutils-libs emacs-filesystem fribidi gdk-pixbuf2 glib-networking gnutls graphite2
gsettings-desktop-schemas gtk-update-icon-cache gtk3 harfbuzz hicolor-icon-theme hwdata jasper-libs
jbigkit-libs json-glib kmod kmod-libs lcms2 libX11 libX11-common libXau libXcomposite libXcursor libXdamage
libXext libXfixes libXft libXi libXinerama libXrandr libXrender libXtst libXxf86vm libdrm libepoxy
liberation-fonts liberation-fonts-common liberation-mono-fonts liberation-narrow-fonts liberation-sans-fonts
liberation-serif-fonts libfdisk libglvnd libglvnd-egl libglvnd-glx libgusb libidn libjpeg-turbo libmodman
libpciaccess libproxy libsemanage libsmartcols libsoup libthai libtiff libusbx libutempter libwayland-client
libwayland-cursor libwayland-egl libwayland-server libxcb libxkbcommon libxshmfence lz4 mesa-libEGL mesa-libGL
mesa-libgbm mesa-libglapi nettle pango pixman qrencode-libs rest shadow-utils systemd systemd-libs trousers ustr
util-linux vulkan vulkan-filesystem wget which xdg-utils xkeyboard-config

Create another file called “install-browser.sh” and copy the following -

#!/bin/bash
# Stop at the first failed command so a bad download does not go unnoticed.
set -euo pipefail

echo "Downloading Chromium..."

# -f makes curl fail on HTTP errors instead of saving the error page as a zip.
# CHROMIUM_VERSION is set in the Dockerfile (ENV CHROMIUM_VERSION=...).
curl -fsSL "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$CHROMIUM_VERSION%2Fchrome-linux.zip?generation=1652397748160413&alt=media" -o /tmp/chromium.zip
unzip /tmp/chromium.zip -d /tmp/
mv /tmp/chrome-linux/ /opt/chrome

echo "Downloading ChromeDriver..."

curl -fsSL "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F$CHROMIUM_VERSION%2Fchromedriver_linux64.zip?generation=1652397753719852&alt=media" -o /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/
mv /tmp/chromedriver_linux64/chromedriver /opt/chromedriver

Update the Dockerfile to look like this -

FROM public.ecr.aws/lambda/python:3.9 as stage

# Hack to install chromium dependencies

RUN yum install -y -q sudo unzip

# Current stable version of Chromium

ENV CHROMIUM_VERSION=1002910

# Install Chromium

COPY install-browser.sh /tmp/

RUN /usr/bin/bash /tmp/install-browser.sh

FROM public.ecr.aws/lambda/python:3.9 as base

COPY chrome-deps.txt /tmp/

RUN yum install -y $(cat /tmp/chrome-deps.txt)

COPY --from=stage /opt/chrome /opt/chrome

COPY --from=stage /opt/chromedriver /opt/chromedriver

COPY create.py ${LAMBDA_TASK_ROOT}
COPY process.py ${LAMBDA_TASK_ROOT}

COPY requirements.txt ${LAMBDA_TASK_ROOT}

COPY db/ ${LAMBDA_TASK_ROOT}/db/

RUN python3.9 -m pip install -r requirements.txt -t .
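
Before wiring Selenium in, it can help to confirm that the two binaries copied from the first build stage actually ended up where the code will look for them. The following is a minimal sketch, not part of the original project: the helper name missing_binaries is hypothetical, and /opt/chrome/chrome assumes the Chromium executable is named "chrome" inside the unzipped snapshot directory.

```python
import os
from typing import Iterable, List

# Paths derived from the Dockerfile's COPY --from=stage lines.
# ASSUMPTION: the executable inside /opt/chrome is named "chrome".
EXPECTED_BINARIES = ("/opt/chrome/chrome", "/opt/chromedriver")


def missing_binaries(paths: Iterable[str] = EXPECTED_BINARIES) -> List[str]:
    """Return the paths that are absent or not executable."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.X_OK))]


if __name__ == "__main__":
    problems = missing_binaries()
    if problems:
        print(f"Missing or not executable: {problems}")
    else:
        print("Chromium and ChromeDriver are in place")
```

Running this inside the built container gives a quick signal that the multi-stage copy worked before any scraping code is involved.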

Update the requirements.txt file and add

selenium==4.4.2

And install the dependency

pip install -r src/requirements.txt

Process the request

Create a new file under src for the new lambda function called “process.py”
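
The function's code is not reproduced here, so what follows is only a rough sketch of what such a handler could look like, not the author's implementation. It assumes each SQS message body is JSON with a "url" field (a hypothetical schema), that the Chromium executable is /opt/chrome/chrome, and that the listed Chrome flags suit the Lambda container; it also leaves out the DynamoDB update that marks the request complete.

```python
import json
from typing import Any, Dict, List

# Locations come from the Dockerfile; the "chrome" executable name inside
# /opt/chrome is an ASSUMPTION about the snapshot layout.
CHROME_BINARY = "/opt/chrome/chrome"
CHROMEDRIVER = "/opt/chromedriver"

# Flags often used for headless Chrome in containers (ASSUMED, not from the post).
CHROME_ARGS = [
    "--headless",
    "--no-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--single-process",
    "--window-size=1280,1024",
]


def urls_from_event(event: Dict[str, Any]) -> List[str]:
    """Extract the URL from each SQS record body (hypothetical {"url": ...} schema)."""
    urls = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        urls.append(body["url"])
    return urls


def scrape_title(url: str) -> str:
    """Load the page in headless Chromium and return its title."""
    # Imported lazily so the parsing helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.binary_location = CHROME_BINARY
    for arg in CHROME_ARGS:
        options.add_argument(arg)

    driver = webdriver.Chrome(service=Service(CHROMEDRIVER), options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always release the browser, even if the page load fails.
        driver.quit()


def handler(event, context):
    """Hypothetical Lambda entry point: scrape every URL in the SQS batch."""
    results = {url: scrape_title(url) for url in urls_from_event(event)}
    return {"statusCode": 200, "body": json.dumps(results)}
```

Keeping the event parsing separate from the browser code makes the parsing easy to unit test without launching Chromium.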

Finally, modify the template.yaml file to tell SAM about the new lambda -

Since we created a new Lambda function, we need to tell AWS where to grab its image from. Modify the samconfig.toml file and add another entry to the image_repositories array for ProcessFunction, with exactly the same value as the CreateFunction entry. The Dockerfile above copies both create.py and process.py into a single image, so both functions can share one repository. So if the row looked like this before -

image_repositories = ["CreateFunction=541434768954.dkr.ecr.us-east-2.amazonaws.com/serverlessarchexample8b9687a4/createfunction286a02c8repo"]

It should now look like this -

image_repositories = ["CreateFunction=541434768954.dkr.ecr.us-east-2.amazonaws.com/serverlessarchexample8b9687a4/createfunction286a02c8repo",
"ProcessFunction=541434768954.dkr.ecr.us-east-2.amazonaws.com/serverlessarchexample8b9687a4/createfunction286a02c8repo"]
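
A typo in one of these long repository URIs is easy to miss. As a small illustration, here is a sketch that parses the "Name=uri" entries and checks that the two functions point at the same repository; the helper names are hypothetical and the check works on a plain list of strings rather than reading samconfig.toml itself.

```python
from typing import Dict, List


def parse_image_repositories(entries: List[str]) -> Dict[str, str]:
    """Turn ["Name=repo-uri", ...] into {"Name": "repo-uri", ...}."""
    repos = {}
    for entry in entries:
        name, _, uri = entry.partition("=")
        if not name or not uri:
            raise ValueError(f"Malformed entry: {entry!r}")
        repos[name] = uri
    return repos


def share_repository(entries: List[str],
                     first: str = "CreateFunction",
                     second: str = "ProcessFunction") -> bool:
    """True when both functions are listed and map to the same repository URI."""
    repos = parse_image_repositories(entries)
    return first in repos and repos.get(first) == repos.get(second)


if __name__ == "__main__":
    repo = ("541434768954.dkr.ecr.us-east-2.amazonaws.com/"
            "serverlessarchexample8b9687a4/createfunction286a02c8repo")
    entries = [f"CreateFunction={repo}", f"ProcessFunction={repo}"]
    print(share_repository(entries))
```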

Test the changes

Build the app -

sam build

To mimic receiving an event from the queue, we invoke the lambda by passing it a sample payload.

Under the events directory, update the contents of the event.json file -
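
The payload itself is not reproduced here. As a rough guide, an SQS-triggered Lambda receives a "Records" list in which each record carries the message text in "body". The sketch below builds such an event in Python and prints it as JSON; the message fields ("id", "url"), the queue ARN, and the other identifiers are placeholders, not values from the project.

```python
import json
from typing import Any, Dict


def build_sqs_event(message: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a message dict in the envelope Lambda receives from SQS."""
    return {
        "Records": [
            {
                # All identifiers below are placeholders for local testing.
                "messageId": "00000000-0000-0000-0000-000000000000",
                "receiptHandle": "placeholder-receipt-handle",
                # SQS delivers the body as a string, so the dict is serialized.
                "body": json.dumps(message),
                "attributes": {},
                "messageAttributes": {},
                "eventSource": "aws:sqs",
                "eventSourceARN": "arn:aws:sqs:us-east-2:123456789012:example-queue",
                "awsRegion": "us-east-2",
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical message schema; match it to what the create function sends.
    event = build_sqs_event({"id": "example-id", "url": "https://example.com"})
    print(json.dumps(event, indent=2))
```

The printed JSON can be saved as events/event.json once the message fields match what the first Lambda actually puts on the queue.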

Now we run the app locally with the following command -

sam local invoke --env-vars ./tests/env.json -e ./events/event.json ProcessFunction

The output should look like -

Check the local DynamoDB table to verify that the request was marked complete -

Deploying the changes

Deploy the changes to AWS with the following command -

sam deploy

The output should look like this -

Just like before, test the changes by triggering a request from Postman and validating the data in the DynamoDB table -

You’ll notice that the message from the last test was also processed successfully.

Source Code

Here is the source code for the project created here.

Next: Part 5: Writing a CSV to S3 from AWS Lambda
