DEV Community

Fahmi Noor Fiqri

Web Scraping for Product Analysis and Price Comparison

This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models

Product research plays an important role in market research and search engine optimization, and for me personally, it helps me find the best price for a product I want to buy. For some time, I have been looking into the E-Katalog LKPP, a government-controlled online marketplace. This marketplace is supposed to give government bodies, schools, and institutions access to all kinds of products, from stationery to laptops and much more.

One of my family members owns a laptop bought from this marketplace and, oh boy, it was crappy. It was a laptop from an obscure brand, and it was crazy expensive for what it offered compared to laptops from other brands at the same price.

So, I became interested in comparing product prices between LKPP and other online marketplaces to find out whether there was a significant difference.

In this post, I will tell you how I used the Bright Data platform: the Scraping Browser to scrape the LKPP website, and the Web Scraping API to collect product data from the other online marketplaces for comparison.

Let’s dig in!

What I Built

I built a dashboard where you can explore product statistics from multiple marketplaces (LKPP, Tokopedia, and Lazada) and compare them. Also, with the power of open source LLMs, we can cluster the products to uncover interesting relationships.

Overall, we can divide the process into several steps, as shown below.

Development pipeline

First, I used the Scraping Browser to collect data from E-Katalog LKPP. Then, using this data, I extracted popular product keywords to search for in two other marketplaces, namely Tokopedia and Lazada. For these two, I used the Web Scraping API as a convenient way to collect the product data.
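
The keyword extraction step can be sketched with simple n-gram counting, which is also what the Keyword Explorer section is described as using. This is only a minimal sketch: the tokenizer, the function names, and the sample titles are assumptions made for illustration, not code from the project.

```python
import re
from collections import Counter


def tokenize(title: str) -> list[str]:
    """Lowercase a product title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def ngram_counts(titles: list[str], n: int = 2) -> Counter:
    """Count every run of n consecutive tokens across all titles."""
    counts: Counter = Counter()
    for title in titles:
        tokens = tokenize(title)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up titles, used only to show the shape of the result.
sample_titles = [
    "Laptop Core i5 8GB RAM 512GB SSD",
    "Laptop Core i5 16GB RAM 1TB SSD",
    "Notebook Core i3 8GB RAM 256GB SSD",
]

bigram_counts = ngram_counts(sample_titles, n=2)
print(bigram_counts.most_common(3))
```

The most frequent n-grams can then serve as search keywords for the other marketplaces.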

Once I had data from the three sources, I used Ollama with the Llama 3.1 model and DSPy to extract structured data (processor, memory, and storage) from the scraped product descriptions. I also used an embedding model to create text embeddings and then clustered the data to explore similar products across the marketplaces.
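
To make the clustering step concrete, here is a tiny pure-Python K-Means (the Product Cloud section names K-Means as the algorithm). It is a sketch under stated assumptions, not the project's code: the 2-D vectors are toy values standing in for real text embeddings, the starting centroids are picked by hand, and a real pipeline would more likely use a library implementation.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(axis) / len(points) for axis in zip(*points)]


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the previous centroid if a cluster ends up empty.
            updated.append(mean(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Toy 2-D vectors standing in for text embeddings (assumption for illustration).
vectors = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(vectors, centroids=[vectors[0], vectors[3]])
print(labels)
```

Products whose vectors end up with the same label are the ones the dashboard would show as neighbors.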

Finally, I used Streamlit to deploy the app.

Demo

You can access the web app here.

Bright Data Hackathon

Demo: here

This repo contains the source code for my submission to the Bright Data Web Scraping Hackathon at DEV.to.

Setup

Use uv to install dependencies. Clone this repo and run uv sync to install the packages.

Refer to the documentation for guidance on how to use the scripts in this repo.




The Streamlit app is divided into four sections:

Dashboard: This section shows the product price distribution and the most popular brands, GPUs, and storage.

Dashboard page

Keyword Explorer: This section contains a basic keyword research tool based on n-gram frequencies.

Keyword Explorer page

Product Cloud: This section shows a 3D cluster view of product names based on K-Means clustering. The points are pre-computed using t-SNE dimensionality reduction, and the text embeddings are generated with the Nomic Embed Text model.

Product Cloud page

Compare Price: In this section, you can enter a product name and it will show a comparison between the products in three different marketplaces, along with a statistical test (t-test).

Compare Price page
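
The post does not say which t-test variant the Compare Price page runs. As a hedged illustration, the sketch below computes Welch's two-sample t statistic, which does not assume equal variances between the two price samples. The function name and the prices are made up for this example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up prices in arbitrary units, only to exercise the function.
prices_a = [10.0, 12.0, 14.0]
prices_b = [8.0, 9.0, 10.0]
t_stat, dof = welch_t(prices_a, prices_b)
print(round(t_stat, 3), round(dof, 3))
```

Turning the statistic into a p-value needs the t-distribution CDF; in practice, a call such as scipy.stats.ttest_ind(a, b, equal_var=False) handles both steps.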

How I Used Bright Data

As described in previous sections, I mainly used Bright Data’s Scraping Browser and Web Scraping API services.

Bright Data Scraping Browser excels at unlocking access to any website with its powerful unblocking and proxy features. Even though the LKPP website uses Cloudflare protection, the scraping process ran smoothly with Scraping Browser. I used Playwright for scraping, and the integration was as simple as changing a single line:

# from this
browser = await p.chromium.launch(headless=False, slow_mo=50)

# to this
browser = await p.chromium.connect_over_cdp("wss://AUTH_HERE@brd.superproxy.io:9222", slow_mo=50)
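
For context, here is a slightly fuller sketch of the same connection that keeps the credentials out of the source code. The helper names and the BRIGHTDATA_AUTH environment variable are hypothetical choices for this example; only the endpoint shape and the connect_over_cdp call come from the snippet above.

```python
import os


def build_cdp_endpoint(auth: str, host: str = "brd.superproxy.io", port: int = 9222) -> str:
    """Build the WebSocket endpoint in the same shape as the snippet above."""
    if not auth:
        raise ValueError("missing Scraping Browser credentials")
    return f"wss://{auth}@{host}:{port}"


async def fetch_title(url: str) -> str:
    """Open `url` through the remote browser and return the page title."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    # BRIGHTDATA_AUTH is a hypothetical variable name chosen for this sketch.
    endpoint = build_cdp_endpoint(os.environ.get("BRIGHTDATA_AUTH", ""))
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(endpoint, slow_mo=50)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.title()
        finally:
            await browser.close()


# Usage (needs Playwright installed and valid credentials):
# import asyncio
# print(asyncio.run(fetch_title("https://example.com")))
```

The rest of the Playwright script stays the same whether the browser is launched locally or connected remotely, which is why the swap is a one-line change.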

Now for the public marketplace data, namely Tokopedia and Lazada, Bright Data’s Web Scraping API provides an intuitive and convenient way to scrape data without requiring us to write a custom scraping script. This saved me a lot of time, so I could focus on analyzing the data and creating the Streamlit app.

Prize Categories

Although I submitted this project under the third prompt, I believe it could fall into any of the categories.

Final Thoughts

This has been an interesting journey, especially seeing how we can leverage web scraping and GenAI to extract structured information from the web. Bright Data’s powerful Scraping Browser and convenient Web Scraping API allowed me to collect a large amount of data in a very short time. This made the web scraping process a breeze and let me shift my focus to delivering insights from the scraped data. No more CAPTCHAs, and no more writing custom scripts for popular websites.
