Shola Jegede
ScrapeMate: Effortlessly Extract Data from Any Website, Even with Infinite Scroll and Complex Pagination

This is a submission for the Bright Data Web Scraping Challenge: Scrape Data from Complex, Interactive Websites

What I Built

ScrapeMate is a lightweight, user-friendly web scraping tool designed for anyone who needs quick and accurate data extraction. It lets users input any website URL and specify the fields they want to extract, making it a versatile solution for researchers, developers, marketers, and more.

Why I Built It

Web scraping can be a hassle, especially with interactive or complex websites. ScrapeMate simplifies this process with a minimalistic interface and powerful scraping capabilities. The idea is to make web data extraction accessible to everyone, regardless of technical expertise.

Demo

You can try ScrapeMate here: https://scrapemate.streamlit.app

Here’s how it works:

  • Enter the URL you want to scrape.
  • List the fields you need (e.g., names, prices, location, contact info).
  • Click "Launch ScrapeMate", and let ScrapeMate fetch the data for you!
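Under the hood, the "URL + field list" flow can be sketched with BeautifulSoup (part of the stack). This is only an illustration: `extract_fields` and the CSS-selector map are hypothetical names, and the real app infers fields with AI rather than fixed selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical field-to-selector map, for illustration only.
# ScrapeMate's actual extraction is AI-driven, not selector-based.
SELECTORS = {"name": ".product-name", "price": ".product-price"}

def extract_fields(html, fields):
    """Return one dict per item card, keyed by the requested field names."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(".product"):
        row = {}
        for f in fields:
            el = card.select_one(SELECTORS[f])
            row[f] = el.get_text(strip=True) if el else None
        rows.append(row)
    return rows

html = """
<div class="product"><span class="product-name">Widget A</span>
<span class="product-price">$19.99</span></div>
<div class="product"><span class="product-name">Widget B</span>
<span class="product-price">$24.50</span></div>
"""
print(extract_fields(html, ["name", "price"]))
```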

Here’s a quick snapshot of ScrapeMate in action:

  • Screenshot: inputting a URL and field names
  • Screenshot: scraping in progress
  • Screenshot: extracted data preview

Features

  • Simple, User-Friendly Interface (built with Streamlit UI)
  • Dynamic Content Handling (works with JavaScript-loaded pages)
  • Infinite Scroll & Pagination Support (handles endless feeds and multi-page content)
  • Batch Scraping (scrape multiple URLs at once)
  • Accurate and Structured Data Extraction (clean, precise data every time)
  • Real-Time Data Scraping (extract live data like stock prices and news updates)
  • Custom Field Selection (choose exactly what data you need)
  • Fast and Efficient Data Collection (automate data collection and save time)
  • Versatile Use Cases (ideal for researchers, developers, marketers, and content creators)
  • Data Download Options (download scraped data as CSV or JSON for easy analysis)
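The CSV/JSON download option can be sketched with Pandas (already in the stack); the rows below are made-up example data, not real scraper output.

```python
import pandas as pd

# Hypothetical scraped rows; in ScrapeMate these would come from the
# extraction step, shaped by the user's field list.
rows = [
    {"name": "Widget A", "price": "19.99"},
    {"name": "Widget B", "price": "24.50"},
]

df = pd.DataFrame(rows)
csv_bytes = df.to_csv(index=False).encode("utf-8")  # payload for a CSV download
json_str = df.to_json(orient="records")             # payload for a JSON download
```

In a Streamlit app, `csv_bytes` and `json_str` would typically be handed to `st.download_button` so users can save the results locally.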

How I Used Bright Data

Bright Data’s robust infrastructure made it possible for ScrapeMate to handle complex, interactive websites effectively. Here’s what I focused on:

  • Dynamic Content: Many sites use JavaScript to load data, which can stump traditional scrapers. Bright Data’s Scraping Browser helped bypass these challenges seamlessly.
  • Infinite Scroll & Pagination: Websites with infinite scroll or complex pagination are notorious for frustrating scrapers. ScrapeMate overcomes this by using Bright Data’s Scraping Browser capabilities to simulate scrolling and pagination, allowing the tool to automatically load new content as needed.
  • Scalability: ScrapeMate allows users to input multiple URLs at once, and Bright Data’s support for batch requests made this process highly efficient. This means that ScrapeMate can scale effortlessly from small scraping jobs to large-scale data extraction tasks.
  • Precision: By leveraging Bright Data’s structured data outputs, ScrapeMate ensures clean, accurate results every time.
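The infinite-scroll handling above can be sketched as a driver-agnostic helper: keep scrolling to the bottom until the page height stops growing, then return the fully loaded HTML. This is a simplified illustration, not ScrapeMate's exact code; `load_all_results` and its parameters are hypothetical names.

```python
import time

def load_all_results(driver, pause=2.0, max_rounds=20):
    """Scroll an infinite-scroll page until its height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Trigger the next batch of lazily loaded content.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we've reached the end
        last_height = new_height
    return driver.page_source
```

The `max_rounds` cap keeps the loop from running forever on feeds that truly never end.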

Bright Data Implementation

import os

from selenium.webdriver import ChromeOptions
from selenium.webdriver.remote.remote_connection import RemoteConnection
from selenium.webdriver.remote.webdriver import WebDriver

# HEADLESS_OPTIONS, HEADLESS_OPTIONS_DOCKER, and is_running_in_docker()
# are defined elsewhere in the ScrapeMate codebase.

def setup_selenium(attended_mode=False):
    """
    Set up a Selenium WebDriver for the Bright Data Scraping Browser (SBR).
    """

    # Define options for Chrome
    options = ChromeOptions()

    # Apply the appropriate options based on the environment
    if is_running_in_docker():
        for option in HEADLESS_OPTIONS_DOCKER:
            options.add_argument(option)
    else:
        for option in HEADLESS_OPTIONS:
            options.add_argument(option)

    # Fetch the Bright Data WebDriver endpoint from the environment
    SBR_WEBDRIVER = os.getenv("SBR_WEBDRIVER")
    if not SBR_WEBDRIVER:
        raise EnvironmentError("SBR_WEBDRIVER environment variable is not set.")

    try:
        # Connect to the Bright Data remote WebDriver
        print("Connecting to Bright Data Scraping Browser...")
        sbr_connection = RemoteConnection(SBR_WEBDRIVER)
        driver = WebDriver(command_executor=sbr_connection, options=options)
        print("Connected to Bright Data successfully!")
    except Exception as e:
        print(f"Failed to connect to Bright Data Scraping Browser: {e}")
        raise

    return driver
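`setup_selenium` reads the Scraping Browser endpoint from the `SBR_WEBDRIVER` environment variable. A hypothetical `.env` entry might look like this; the placeholder values are made up, and the real endpoint comes from your Bright Data zone credentials:

```
# .env — hypothetical values; copy the real endpoint from your Bright Data zone
SBR_WEBDRIVER="https://brd-customer-<customer_id>-zone-<zone_name>:<password>@brd.superproxy.io:9515"
```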

Who Can Use ScrapeMate

  • Researchers: Save hours on data collection for papers, studies, or literature reviews.
  • Developers: Automate tasks like pulling product catalogs or monitoring site changes.
  • Marketers: Gather insights on trends, customer sentiment, or competitor strategies.
  • Content Creators: Collect ideas, references, and data for blogs or presentations.

Team Submission

This submission was made by Shola Jegede (https://dev.to/sholajegede).

Access the Full Codebase

Want to explore the complete implementation and set it up for yourself? Check out the fully implemented codebase on GitHub. Feel free to clone, experiment, and adapt it to your needs. Contributions and stars are always welcome!

GitHub: sholajegede / scrapemate

ScrapeMate

Developed using Python and Bright Data's Scraping Browser, ScrapeMate is an intelligent scraping tool that extracts data from any website effortlessly using AI. Built for researchers, content creators, analysts, and businesses.


Tech Stack

  • Python
  • Bright Data
  • Streamlit UI
  • Selenium
  • Groq AI
  • BeautifulSoup4
  • Pandas





Top comments (10)

Noah Adrian Montgomery

Are there any limits on the number of URLs I can scrape at the same time?

Shola Jegede

Right now, no; it can scrape multiple URLs at once.

Hy Meier

Is there an API for this tool? It would be awesome to integrate it into existing workflows.

Shola Jegede

Not yet. Are you thinking of a specific use case for the API, or a general-purpose one?

AbdulFattaah Popoola

Did you test it with websites that require login authentication? Do you know if that is possible?

Shola Jegede

I haven't tested it with websites that require auth yet.

Stephen Rashuk

I really like the idea of being able to scrape multiple URLs at once. Does it allow you to prioritize or batch those URLs in specific groups?

Shola Jegede

Not yet; that functionality hasn't been added.

volfcan

It's throwing me an error (see attached screenshot).

Shola Jegede

The Bright Data WebDriver credits have been exhausted, so I removed it.

To use it, clone the repo to your own computer, set up Bright Data (I think you can still get free credits via the link they shared for this hackathon), and add your own WebDriver URL; it will work then.
