Aymen K

Originally published at ainovae.hashnode.dev

AI Web Scraping Without Limits—Scrape Anything with Crawl4AI

Web scraping is one of the most in-demand skills in AI and tech today. Companies are actively seeking professionals who can extract valuable data efficiently.

In this guide, I’ll walk you through building an AI-powered web scraper using Crawl4AI, leveraging LLMs to intelligently extract and process structured data from any website—whether you’re scraping leads, gathering research data, or building a custom dataset, this will save you time and effort ⏳.

By the end, you’ll have a fully functional web scraper that can extract leads from YellowPages, process them with AI, and save the results to a CSV file—all with minimal effort.

And the best part? It costs practically nothing to run! 💡

Let’s dive in! 🔥

💡 Access the project in my GitHub repository now!


🤖 What is Crawl4AI?

Web scraping has come a long way, and Crawl4AI is here to take it to the next level! ⚡

Crawl4AI is an open-source web crawling and scraping framework designed for speed, scalability, and seamless integration with LLMs (e.g., GPT-4o, Claude). It combines traditional scraping methods with AI-driven data extraction, making it ideal for data pipelines, automation workflows, and AI agents.

Crawl4AI github repository

🔑 Key Features

  • LLM-Friendly Output – Generates clean Markdown-formatted data, perfect for retrieval-augmented generation (RAG) and direct ingestion into LLMs.
  • Smart Data Extraction – Combines AI-powered parsing with traditional methods (CSS, XPath) for maximum versatility.
  • Advanced Browser Control – Handles JavaScript-heavy websites with proxy support, session management, and stealth scraping.
  • High Performance – Supports parallel crawling and chunk-based extraction for efficient, scalable data collection.
  • Cost-Effective & Open-Source – Eliminates the need for costly subscriptions or expensive APIs, offering full customization and scalability without breaking the bank.

Crawl4AI empowers you to extract data intelligently, efficiently, and at scale—unlocking new possibilities for automation and AI-driven workflows. 💡
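
If you just want to see Crawl4AI in action before we build the full project, here's a minimal sketch (using an arbitrary example URL) that fetches a page and prints its LLM-ready Markdown:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Launch a headless browser, fetch the page, and convert it to clean Markdown
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())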

🛠️ What We Will Build

Businesses and agencies often need access to local business information to find potential clients, analyze competitors, or generate leads. Instead of manually searching directories, an AI scraper can automate this process—saving time and effort.

To showcase the power of AI web scraping with Crawl4AI, I chose to build a scraper that extracts local business information from YellowPages. 🏢📊

Yellowpages scraper

This scraper automatically navigates through listings, collecting key details like:

🔹 Business Name

🔹 Address

🔹 Phone Number

🔹 Website & Additional Info

Once extracted, the data is structured and saved into a CSV file, making it easy to use for lead generation, market research, or business analytics.

This project demonstrates how Crawl4AI + AI models can quickly extract and process web data with minimal effort. Let’s break down how it works! 🚀


⚙️ How It Works

Before running our scraper, let's break down how the code works and give you a quick overview of how LLM-powered scraping with Crawl4AI functions.

(Don’t worry—I’ll keep it short and focus on the essential parts! 😉)

1️⃣ Browser Configuration

First, we need to configure the browser settings. Crawl4AI uses Playwright under the hood, so we get full control over how the browser behaves, including:

  • Headless mode (whether the browser runs in the background)

  • Proxy settings (to avoid getting blocked)

  • User agents & timeouts (to mimic real users)

Here’s how we define our browser configuration:

from crawl4ai import BrowserConfig

def get_browser_config() -> BrowserConfig:
    return BrowserConfig(
        browser_type="chromium",  # Simulate a Chromium-based browser
        headless=True,  # Run in headless mode (no UI)
        verbose=True,  # Enable detailed logs for debugging
    )
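
The configuration above keeps things simple. If you need the proxy and user-agent tricks mentioned in the bullets, BrowserConfig accepts those options too; here's a hedged sketch where the proxy URL and user-agent string are placeholders, not values from the project:

from crawl4ai import BrowserConfig

def get_stealth_browser_config() -> BrowserConfig:
    return BrowserConfig(
        browser_type="chromium",
        headless=True,
        # Placeholder proxy URL - swap in your own provider
        proxy="http://user:pass@proxy.example.com:8080",
        # Placeholder user agent to mimic a real desktop browser
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        verbose=True,
    )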

2️⃣ Defining the LLM Extraction Strategy

Now comes the AI part! 🚀 Crawl4AI allows us to use an LLM extraction strategy to tell the model exactly what to extract from each page.

Here’s how we define our strategy:

import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy

llm_strategy = LLMExtractionStrategy(
    provider="gemini/gemini-2.0-flash",  # LLM provider (Gemini, OpenAI, etc.)
    api_token=os.getenv("GEMINI_API_KEY"),  # API key for authentication
    schema=BusinessData.model_json_schema(),  # JSON schema of expected data
    extraction_type="schema",  # Use structured schema extraction
    instruction=(
        "Extract all business information: 'name', 'address', 'website', "
        "'phone number' and a one-sentence 'description' from the content."
    ),
    input_format="markdown",  # Define input format
    verbose=True,  # Enable logging for debugging
)

📌 Structuring the Output

To ensure the extracted data follows a consistent structure, we use a Pydantic model:

from pydantic import BaseModel, Field

class BusinessData(BaseModel):
    name: str = Field(..., description="The business name.")
    address: str = Field(..., description="The business address.")
    phone_number: str = Field(..., description="The business phone number.")
    website: str = Field(..., description="The business website URL.")
    description: str = Field(..., description="A short description of the business.")

💡 Why use Pydantic?

It ensures the LLM returns structured data that we can easily validate and process.
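
For instance, every record the LLM returns can be validated against the model before we keep it; here's a small illustrative sketch with made-up values:

from pydantic import ValidationError

raw_record = {
    "name": "Smile Dental Clinic",
    "address": "123 Main St, Toronto, ON",
    "phone_number": "(416) 555-0123",
    "website": "https://example.com",
    "description": "A family dental clinic in downtown Toronto.",
}

try:
    # Raises ValidationError if a field is missing or has the wrong type
    business = BusinessData.model_validate(raw_record)
    print(business.name, business.phone_number)
except ValidationError as e:
    print("Skipping malformed record:", e)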

📍 Multiple LLM Choices

You may have noticed that in the LLM strategy I chose the Gemini 2.0 Flash LLM. However, since Crawl4AI uses LiteLLM for its LLM calls, you can swap it for OpenAI, Claude, DeepSeek, Groq, or any other supported LLM! (See the full list here.)
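
For example, pointing the same strategy at OpenAI only means changing the provider string and API key (a sketch using LiteLLM's provider/model naming):

llm_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",          # LiteLLM-style provider/model name
    api_token=os.getenv("OPENAI_API_KEY"),  # matching API key
    schema=BusinessData.model_json_schema(),
    extraction_type="schema",
    instruction=(
        "Extract all business information: 'name', 'address', 'website', "
        "'phone number' and a one-sentence 'description' from the content."
    ),
    input_format="markdown",
)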

3️⃣ Scraping the web page

Now that we have the browser and LLM strategy set up, we need a function to scrape each page and extract business details:

import json
from typing import List, Set, Tuple

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# is_duplicated is a small helper from the project (a minimal version is sketched later)

async def fetch_and_process_page(
    crawler: AsyncWebCrawler,
    page_number: int,
    base_url: str,
    css_selector: str,
    llm_strategy: LLMExtractionStrategy,
    session_id: str,
    seen_names: Set[str],
) -> Tuple[List[dict], bool]:

    url = base_url.format(page_number=page_number)
    print(f"Loading page {page_number}...")

    # Fetch page content with the extraction strategy
    result = await crawler.arun(
        url=url,
        config=CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,  # No cached data
            extraction_strategy=llm_strategy,  # Define extraction method
            css_selector=css_selector,  # Target specific page elements
            session_id=session_id,  # Unique ID for the session
        ),
    )

    # Parse extracted content
    extracted_data = json.loads(result.extracted_content)

    # Process extracted businesses
    all_businesses = []
    for business in extracted_data:
        if is_duplicated(business["name"], seen_names):
            print(f"Duplicate business '{business['name']}' found. Skipping.")
            continue  # Avoid duplicates

        seen_names.add(business["name"])
        all_businesses.append(business)

    if not all_businesses:
        print(f"No valid businesses found on page {page_number}.")
        return [], False

    print(f"Extracted {len(all_businesses)} businesses from page {page_number}.")
    return all_businesses, False  # Continue crawling

This function:

  • Includes the necessary LLM strategy and CSS selector in the crawler config.

  • Loads the webpage by calling the arun method.

  • Extracts business details using the LLM strategy.

  • Filters duplicates with the is_duplicated helper to prevent redundant data (a minimal version is sketched after this list).

  • Returns a list of all collected local businesses.
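
The is_duplicated helper comes from the project's utilities; a minimal version could be as simple as this (the repo's implementation may differ):

from typing import Set

def is_duplicated(business_name: str, seen_names: Set[str]) -> bool:
    # A record counts as a duplicate if we've already seen its business name
    return business_name in seen_names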

💡 Pro Tip

The session_id helps maintain consistent browsing behavior across pagination, which is crucial for websites that track user sessions!

🔍 Targeting the Right Data with CSS Selectors

To help the LLM focus on the relevant sections of the page, we use CSS selectors. These selectors allow us to pinpoint specific HTML elements that contain the desired data, ensuring a smoother and cleaner extraction for the LLM.

📸 The screenshot below shows the HTML structure of a Yellow Pages listing, where we can use CSS selectors to extract business details precisely.

Web Scraping - CSS Selectors
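
In the project, the selector is defined as a CSS_SELECTOR constant that gets passed to the crawler config. A hypothetical value might look like this (the actual class name depends on Yellow Pages' current markup, so inspect the page and adjust):

# Hypothetical value: target the container that wraps each business listing
CSS_SELECTOR = "div.listing__content"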


4️⃣ Putting Everything Together

Scraping just one page isn’t enough—we want to crawl all business listings. So let’s tie everything together into a full web crawling workflow! 🎯

import asyncio

# fetch_and_process_page, get_browser_config, get_llm_strategy, save_data_to_csv,
# BusinessData, SCRAPER_INSTRUCTIONS, BASE_URL, CSS_SELECTOR and MAX_PAGES
# come from the project's modules (see the repository).

async def crawl_yellowpages():
    """
    Main function to scrape business data.
    """
    # Initialize configurations
    browser_config = get_browser_config()
    llm_strategy = get_llm_strategy(
        llm_instructions=SCRAPER_INSTRUCTIONS,  # Extraction instructions
        output_format=BusinessData  # Output schema
    )
    session_id = "crawler_session"

    # Initialize state variables
    page_number = 1
    all_records = []
    seen_names = set()

    # Start the web crawler session
    async with AsyncWebCrawler(config=browser_config) as crawler:
        while True:
            records, no_results_found = await fetch_and_process_page(
                crawler,
                page_number,
                BASE_URL,
                CSS_SELECTOR,
                llm_strategy,
                session_id,
                seen_names,
            )

            if no_results_found:
                print("No more records found. Stopping crawl.")
                break

            if not records:
                print(f"No records extracted from page {page_number}.")
                break  

            all_records.extend(records)
            page_number += 1  # Move to the next page

            # Stop after a maximum number of pages
            if page_number > MAX_PAGES:
                break

            # Pause to prevent rate limits
            await asyncio.sleep(2)

    # Save extracted data
    if all_records:
        save_data_to_csv(records=all_records, data_struct=BusinessData, filename="businesses_data.csv")
    else:
        print("No records found.")

    # Show LLM usage stats
    llm_strategy.show_usage()

🚀 What this function does:

  • Sets up the browser & LLM strategy.

  • Invokes fetch_and_process_page to scrape the local business data from each page.

  • Runs a while loop and uses pagination to scrape multiple pages.

  • Saves all extracted business data to a CSV file (a minimal save_data_to_csv sketch follows this list).

  • Displays LLM usage statistics to track input/output token counts for cost estimation.
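
The save_data_to_csv helper isn't shown above; here's a minimal sketch of what it might look like, using the Pydantic model's field names as the CSV header (the repo's version may differ):

import csv
from typing import List, Type

from pydantic import BaseModel

def save_data_to_csv(records: List[dict], data_struct: Type[BaseModel], filename: str) -> None:
    # Use the model's field names as the header so columns match the schema
    fieldnames = list(data_struct.model_fields.keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
    print(f"Saved {len(records)} records to '{filename}'.")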

With just a few steps, we've built a powerful AI scraper that can extract local business listings! Now, let’s put it to the test and see it in action. ⚡🔍


🔥 Try It Out!

🛠 Step 1: Clone the Project

To get started, clone the GitHub repository and install the necessary dependencies (preferably in a virtual environment):

# Clone the repository from GitHub  
git clone https://github.com/kaymen99/llm-web-scraper  
cd llm-web-scraper  

# Create a virtual environment to manage dependencies  
python -m venv venv  

# Activate the virtual environment  
source venv/bin/activate # On macOS/Linux
# On Windows: venv\Scripts\activate  

# Install the required dependencies from the requirements.txt
pip install -r requirements.txt 

# Install playwright browsers 
playwright install

🛠 Step 2: Set Up Your Environment Variables

Create a .env file in the root directory with the following content:

GEMINI_API_KEY=your_gemini_api_key_here

💡 Tip: You can use any LLM supported by LiteLLM—just ensure you provide the correct API key!
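
The project reads this key with os.getenv. If you're wiring things up in your own script, a python-dotenv call at startup makes the variable available (assuming the python-dotenv package is installed):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory
api_key = os.getenv("GEMINI_API_KEY")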

🛠 Step 3: Customize the Scraper (Optional)

Inside the project directory, you'll find a config.py file where you can modify key settings, such as:

  • The website URL to scrape.

  • The LLM provider being used.

  • The maximum number of pages to crawl.

  • Scraper instructions.

For example, to scrape different types of businesses, update the BASE_URL:

# - Plumbers in Vancouver: "https://www.yellowpages.ca/search/si/{page_number}/Plumbers/Vancouver+BC"
# - Restaurants in Montreal: "https://www.yellowpages.ca/search/si/{page_number}/Restaurants/Montreal+QC"
BASE_URL = "https://www.yellowpages.ca/search/si/{page_number}/Dentists/Toronto+ON"

To switch to a different LLM provider, update these lines:

LLM_MODEL = "gpt-4o-mini"
API_TOKEN = os.getenv("OPENAI_API_KEY")

🛠 Step 4: Run the Scraper

Start the crawler with:

python main.py

The program will:

  • Scrape local business listings page by page.

  • Save all extracted data to businesses_data.csv.

  • Display LLM token usage statistics after completion.

📸 See the results:

🚀 Go on, give it a spin and watch it in action!

💰 Cost Breakdown

I chose Google's Gemini 2.0 Flash LLM for my AI scraper. Let's take a look at the token usage:

📸 Screenshot:

AI web scraper cost breakdown

From the usage data, we can see that the scraper processes approximately 13,000 input tokens and 2,000 output tokens per page. Let’s calculate how much it costs to scrape a single Yellow Pages page with our AI scraper:

Usage           Tokens Used   Pricing (per 1M tokens)   Cost
Input Tokens    13,000        $0.10                     $0.0013
Output Tokens   2,000         $0.40                     $0.0008
Total           15,000        -                         $0.0021 (≈ $0.002)
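
Here's the same arithmetic as a quick back-of-the-envelope script, extended to a 1,000-page crawl:

INPUT_TOKENS_PER_PAGE = 13_000
OUTPUT_TOKENS_PER_PAGE = 2_000
INPUT_PRICE_PER_M = 0.10   # $ per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40  # $ per 1M output tokens

cost_per_page = (
    INPUT_TOKENS_PER_PAGE * INPUT_PRICE_PER_M
    + OUTPUT_TOKENS_PER_PAGE * OUTPUT_PRICE_PER_M
) / 1_000_000
print(f"Cost per page: ${cost_per_page:.4f}")                  # ~$0.0021
print(f"Cost for 1,000 pages: ${cost_per_page * 1_000:.2f}")   # ~$2.10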

So, the cost for scraping a single page is only ~$0.002—practically free! 💸

🌍 Use Cases

Our AI web scraper isn’t just a tool—it’s a game-changer for automating web data collection. Here’s how it can be applied:

  • 🎯 Lead Generation – Extract business details like emails, phone numbers, and addresses to build targeted outreach lists effortlessly.
  • 📊 Market Research – Analyze trends, customer behavior, and industry insights by gathering real-time data from various sources.
  • ⚔️ Competitor Analysis – Monitor pricing, services, and customer reviews to stay ahead in your industry.
  • 🤖 AI Data Enrichment – Leverage LLMs to clean, categorize, and enhance scraped data for deeper insights.
  • 📚 Research & Analysis – Extract structured data from directories, reports, and publications to fuel business or academic studies.

Whether you’re a marketer, researcher, or developer, this AI scraper streamlines data extraction—fast, efficient, and automated! 🚀

🎯 Final Thoughts

🎉 Congrats! You've successfully built your own AI-powered scraper using Crawl4AI, giving you the ability to collect as many potential leads as you need for your business or clients.

This scraper is highly adaptable—just plug in the website and specify the data you want to extract, then let it do the rest! 🚀

💡 Got ideas to improve it? Drop them in the comments!

👉 Want to learn more? Follow my blog and check out my GitHub for more AI projects & tutorials.

Happy scraping! 🔥
