Crawl4AI is an open-source, LLM-friendly web crawler and scraper built to empower developers with fast, efficient, and customizable data extraction capabilities. Whether you’re building retrieval-augmented generation (RAG) pipelines or integrating web data into AI agents, Crawl4AI is designed to handle modern web challenges like dynamic content and high concurrency. In this guide, we’ll cover everything from installation to advanced usage—all with real code examples and references.
What Is Crawl4AI?
Crawl4AI is a versatile tool for:
- Extracting Clean Markdown & Structured Data: Automatically converts HTML into Markdown, JSON, or raw HTML.
- LLM Integration: Offers both traditional CSS/XPath extraction and LLM-based strategies for complex content.
- Asynchronous Processing: Leverages concurrency to crawl multiple pages in parallel.
- Customization & Flexibility: Fine-tune browser behavior (headless mode, user agent, proxy, etc.) and extraction strategies.
Its key features make it ideal for data pipelines, AI agents, and real-time web scraping applications.
References:
Crawl4AI Documentation
Crawl4AI on GitHub
Step 1: Installation & Setup
Using Pip
Install Crawl4AI and its core dependencies using pip:
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
Don’t forget to export your OpenAI API key (if you plan to use LLM-based extraction):
export OPENAI_API_KEY='your_api_key_here'
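Crawl4AI drives a real browser through Playwright, so after the pip install you will usually also need to fetch the browser binaries. In recent releases this is handled by the bundled setup command (or by Playwright directly); check the docs for the exact step your version expects:
crawl4ai-setup
# or, if you prefer to manage Playwright yourself:
python -m playwright install chromium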
Using Docker
For a containerized setup—ideal for scaling or quick deployments—you can build and run the official Docker image. For example:
- Clone the Repository:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
- Build the Docker Image:
docker build -f Dockerfile -t crawl4ai:latest --build-arg INSTALL_TYPE=all .
- Run the Container:
docker run -p 11235:11235 -e OPENAI_API_KEY=<your-api-key> crawl4ai:latest
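Once the container is running, Crawl4AI is exposed as an HTTP API on port 11235. The exact endpoints depend on the image version (the Docker tutorial below documents them), but a quick way to confirm the service is up is a health-check request; the /health path here is an assumption based on recent images:
curl http://localhost:11235/health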
Reference:
Crawl4AI Docker Setup Tutorial
Step 2: Your First Crawl
Let’s start with a simple Python script to perform a basic crawl and generate Markdown output.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Fetch a webpage and convert its HTML to Markdown
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print the first 300 characters

if __name__ == "__main__":
    asyncio.run(main())
This script initializes an asynchronous crawler, fetches the content from "https://example.com", and prints the first part of the generated Markdown. It’s a minimal example to get you started.
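Before relying on the output, it is worth checking what else the returned result object carries. A minimal sketch, assuming the field names of recent Crawl4AI releases (success, error_message, status_code, html, cleaned_html, links):
import asyncio
from crawl4ai import AsyncWebCrawler

async def inspect_result():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        # Always check success before consuming the payload
        if not result.success:
            print(f"Crawl failed: {result.error_message}")
            return
        print(result.status_code)                     # HTTP status of the fetch
        print(len(result.html))                       # Raw page HTML
        print(len(result.cleaned_html))               # HTML after boilerplate pruning
        print(result.links.get("internal", [])[:5])   # Links discovered on the page
        print(result.markdown[:200])                  # The Markdown used above

if __name__ == "__main__":
    asyncio.run(inspect_result())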
Step 3: Basic Configuration & Customization
Crawl4AI provides configurable classes to fine-tune both the browser behavior and the crawling process:
- BrowserConfig: Adjust settings like headless mode, user agent, or JavaScript execution.
- CrawlerRunConfig: Manage caching, extraction strategies, and timeouts.
Here’s an example that customizes the browser to run in headless mode and bypasses cache for fresh content:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # Use headless mode
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com", config=run_conf)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
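BrowserConfig covers more than headless mode. A short sketch with a custom user agent and a proxy, both mentioned earlier; the parameter names below match recent releases, but verify them against your installed version:
from crawl4ai import BrowserConfig

browser_conf = BrowserConfig(
    headless=True,
    user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)",  # present a custom UA string
    proxy="http://user:pass@proxy.example.com:8080"        # route requests through a proxy
)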
Reference:
Crawl4AI Documentation – Core Concepts
Step 4: Data Extraction Techniques
Crawl4AI supports multiple extraction strategies. Here, we cover both CSS-based and LLM-based methods.
CSS-Based Extraction
Extract structured data with simple CSS selectors:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"}
        ]
    }
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
LLM-Based Extraction
For more complex pages, you can leverage an LLM to intelligently extract data. Define a Pydantic model for your schema and use the LLM extraction strategy:
import os
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Product price")

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=LLMExtractionStrategy(
                    provider="openai/gpt-4o",
                    api_token=os.getenv("OPENAI_API_KEY"),
                    # .schema() still works but is deprecated on Pydantic v2;
                    # Product.model_json_schema() is the newer equivalent
                    schema=Product.schema(),
                    extraction_type="schema",
                    instruction="Extract product name and price from the page."
                )
            )
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Reference:
Crawl4AI LLM Extraction Tutorial
Step 5: Multi-URL Concurrency & Dynamic Content
Multi-URL Crawling
Crawl4AI can crawl multiple pages concurrently using the arun_many() method. This is especially useful for scraping large websites or aggregating data:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def parallel_crawl():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"URL: {result.url} - Markdown length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"Error crawling {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(parallel_crawl())
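The stream=True flag makes results available as soon as each page finishes. Leave it out and arun_many() instead returns the complete list once every URL is done, which is often simpler when you only need the aggregate:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def batch_crawl(urls):
    # Without stream=True, arun_many() gathers everything and returns a list
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=run_conf)
        for result in results:
            print(result.url, "ok" if result.success else result.error_message)

if __name__ == "__main__":
    asyncio.run(batch_crawl(["https://example.com/page1", "https://example.com/page2"]))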
Handling Dynamic Content
For websites that load data via JavaScript (e.g., “Load More” buttons), you can inject custom JavaScript to simulate user interactions:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def dynamic_crawl():
    browser_conf = BrowserConfig(headless=True, java_script_enabled=True)
    js_code = """
    (async () => {
        const loadMore = document.querySelector("#load-more-button");
        if (loadMore) {
            loadMore.click();
            await new Promise(resolve => setTimeout(resolve, 2000));
        }
    })();
    """
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_code]
    )
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com/dynamic-content", config=run_conf)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(dynamic_crawl())
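Clicking and then sleeping for a fixed two seconds works, but it is brittle. CrawlerRunConfig also accepts a wait_for option that blocks until a CSS selector (or a JavaScript condition) is satisfied, which tends to be more reliable. A sketch reusing the js_code string from the example above; the #content-loaded selector is hypothetical:
run_conf = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    js_code=[js_code],                 # still click "Load More"
    wait_for="css:#content-loaded"     # then wait for this (hypothetical) marker element
)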
Reference:
Crawl4AI YouTube Tutorial
Advanced Usage: Integrating with AI Agents
Crawl4AI isn’t just for standalone web scraping—it’s built to integrate seamlessly with AI agents. With its flexible extraction strategies, you can create pipelines that:
- Scrape raw data from multiple sources.
- Structure and clean data using custom schemas.
- Feed structured data directly into AI models for further analysis (see the sketch below).
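To make that concrete, here is a minimal, hypothetical sketch of such a pipeline: crawl a product listing, extract structured records with a CSS schema, then hand the cleaned JSON to an LLM for a summary. The URL, selectors, and prompt are placeholders, and the OpenAI call assumes you have the openai package installed and OPENAI_API_KEY set.
import asyncio
import json
import os

from openai import OpenAI
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def scrape_products() -> list[dict]:
    # 1. Scrape raw data and structure it with a (hypothetical) CSS schema
    schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"}
        ]
    }
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/products", config=run_conf)
        return json.loads(result.extracted_content) if result.success else []

def summarize(products: list[dict]) -> str:
    # 2. Feed the structured data to an LLM for downstream analysis
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize pricing trends in this product data:\n"
                       + json.dumps(products, indent=2)
        }]
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    products = asyncio.run(scrape_products())
    if products:
        print(summarize(products))
From here, the same structured records could just as easily be pushed into a vector store for RAG or handed off to an agent framework instead of a one-off summary call.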
For a deep dive into integrating Crawl4AI with AI workflows, check out additional resources on Medium or Revanth Quick Learn for practical examples.
Conclusion
Crawl4AI is a game changer for anyone looking to harness web data in an AI-driven world. In this post, we covered:
- Installation: Get started using pip or Docker.
- Basic Crawling: A simple script to convert HTML to Markdown.
- Customization: Configure browser and crawler parameters for optimal performance.
- Data Extraction: Use CSS or LLM-based strategies for structured data.
- Advanced Use Cases: Concurrency, dynamic content handling, and AI agent integration.
By leveraging Crawl4AI, you can build robust data pipelines, enhance AI models with fresh data, and unlock new possibilities for automation. Happy crawling!