Crawl4AI is an open-source, LLM-friendly web crawler and scraper built to empower developers with fast, efficient, and customizable data extraction capabilities. Whether you’re building retrieval-augmented generation (RAG) pipelines or integrating web data into AI agents, Crawl4AI is designed to handle modern web challenges like dynamic content and high concurrency. In this guide, we’ll cover everything from installation to advanced usage—all with real code examples and references.
What Is Crawl4AI?
Crawl4AI is a versatile tool for:
- Extracting Clean Markdown & Structured Data: Automatically converts HTML into Markdown, JSON, or raw HTML.
- LLM Integration: Offers both traditional CSS/XPath extraction and LLM-based strategies for complex content.
- Asynchronous Processing: Leverages concurrency to crawl multiple pages in parallel.
- Customization & Flexibility: Fine-tune browser behavior (headless mode, user agent, proxy, etc.) and extraction strategies.
Its key features make it ideal for data pipelines, AI agents, and real-time web scraping applications.
References:
Crawl4AI Documentation
Crawl4AI on GitHub
Step 1: Installation & Setup
Using Pip
Install Crawl4AI and its core dependencies using pip:
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
Don’t forget to export your OpenAI API key (if you plan to use LLM-based extraction):
export OPENAI_API_KEY='your_api_key_here'
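Crawl4AI drives a real browser through Playwright, so after the pip install you will usually also need to fetch the browser binaries. In recent releases this is handled by the bundled setup command (or by Playwright directly); check the docs for the exact step your version expects:
crawl4ai-setup
# or, if you prefer to manage Playwright yourself:
python -m playwright install chromium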
Using Docker
For a containerized setup—ideal for scaling or quick deployments—you can build and run the official Docker image. For example:
- Clone the Repository:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
- Build the Docker Image:
docker build -f Dockerfile -t crawl4ai:latest --build-arg INSTALL_TYPE=all .
- Run the Container:
docker run -p 11235:11235 -e OPENAI_API_KEY=<your-api-key> crawl4ai:latest
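Once the container is running, Crawl4AI is exposed as an HTTP API on port 11235. The exact endpoints depend on the image version (the Docker tutorial below documents them), but a quick way to confirm the service is up is a health-check request; the /health path here is an assumption based on recent images:
curl http://localhost:11235/health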
Reference:
Crawl4AI Docker Setup Tutorial
Step 2: Your First Crawl
Let’s start with a simple Python script to perform a basic crawl and generate Markdown output.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Fetch a webpage and convert its HTML to Markdown
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print the first 300 characters

if __name__ == "__main__":
    asyncio.run(main())
This script initializes an asynchronous crawler, fetches the content from "https://example.com", and prints the first part of the generated Markdown. It’s a minimal example to get you started.
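Before relying on the output, it is worth checking what else the returned result object carries. A minimal sketch, assuming the field names of recent Crawl4AI releases (success, error_message, status_code, html, cleaned_html, links):
import asyncio
from crawl4ai import AsyncWebCrawler

async def inspect_result():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        # Always check success before consuming the payload
        if not result.success:
            print(f"Crawl failed: {result.error_message}")
            return
        print(result.status_code)                     # HTTP status of the fetch
        print(len(result.html))                       # Raw page HTML
        print(len(result.cleaned_html))               # HTML after boilerplate pruning
        print(result.links.get("internal", [])[:5])   # Links discovered on the page
        print(result.markdown[:200])                  # The Markdown used above

if __name__ == "__main__":
    asyncio.run(inspect_result())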
Step 3: Basic Configuration & Customization
Crawl4AI provides configurable classes to fine-tune both the browser behavior and the crawling process:
- BrowserConfig: Adjust settings like headless mode, user agent, or JavaScript execution.
- CrawlerRunConfig: Manage caching, extraction strategies, and timeouts.
Here’s an example that customizes the browser to run in headless mode and bypasses cache for fresh content:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # Use headless mode
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com", config=run_conf)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
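BrowserConfig covers more than headless mode. A short sketch with a custom user agent and a proxy, both mentioned earlier; the parameter names below match recent releases, but verify them against your installed version:
from crawl4ai import BrowserConfig

browser_conf = BrowserConfig(
    headless=True,
    user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)",  # present a custom UA string
    proxy="http://user:pass@proxy.example.com:8080"        # route requests through a proxy
)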
Reference:
Crawl4AI Documentation – Core Concepts
Step 4: Data Extraction Techniques
Crawl4AI supports multiple extraction strategies. Here, we cover both CSS-based and LLM-based methods.
CSS-Based Extraction
Extract structured data with simple CSS selectors:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"}
        ]
    }
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
LLM-Based Extraction
For more complex pages, you can leverage an LLM to intelligently extract data. Define a Pydantic model for your schema and use the LLM extraction strategy:
import os
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Product price")

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=LLMExtractionStrategy(
                    provider="openai/gpt-4o",
                    api_token=os.getenv("OPENAI_API_KEY"),
                    # .schema() still works but is deprecated on Pydantic v2;
                    # Product.model_json_schema() is the newer equivalent
                    schema=Product.schema(),
                    extraction_type="schema",
                    instruction="Extract product name and price from the page."
                )
            )
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Reference:
Crawl4AI LLM Extraction Tutorial
Step 5: Multi-URL Concurrency & Dynamic Content
Multi-URL Crawling
Crawl4AI can crawl multiple pages concurrently using the arun_many() method. This is especially useful for scraping large websites or aggregating data:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def parallel_crawl():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"URL: {result.url} - Markdown length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"Error crawling {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(parallel_crawl())
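The stream=True flag makes results available as soon as each page finishes. Leave it out and arun_many() instead returns the complete list once every URL is done, which is often simpler when you only need the aggregate:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def batch_crawl(urls):
    # Without stream=True, arun_many() gathers everything and returns a list
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=run_conf)
        for result in results:
            print(result.url, "ok" if result.success else result.error_message)

if __name__ == "__main__":
    asyncio.run(batch_crawl(["https://example.com/page1", "https://example.com/page2"]))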
Handling Dynamic Content
For websites that load data via JavaScript (e.g., “Load More” buttons), you can inject custom JavaScript to simulate user interactions:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def dynamic_crawl():
    browser_conf = BrowserConfig(headless=True, java_script_enabled=True)
    js_code = """
    (async () => {
        const loadMore = document.querySelector("#load-more-button");
        if (loadMore) {
            loadMore.click();
            await new Promise(resolve => setTimeout(resolve, 2000));
        }
    })();
    """
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_code]
    )
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com/dynamic-content", config=run_conf)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(dynamic_crawl())
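Clicking and then sleeping for a fixed two seconds works, but it is brittle. CrawlerRunConfig also accepts a wait_for option that blocks until a CSS selector (or a JavaScript condition) is satisfied, which tends to be more reliable. A sketch reusing the js_code string from the example above; the #content-loaded selector is hypothetical:
run_conf = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    js_code=[js_code],                 # still click "Load More"
    wait_for="css:#content-loaded"     # then wait for this (hypothetical) marker element
)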
Reference:
Crawl4AI YouTube Tutorial
Advanced Usage: Integrating with AI Agents
Crawl4AI isn’t just for standalone web scraping—it’s built to integrate seamlessly with AI agents. With its flexible extraction strategies, you can create pipelines that:
- Scrape raw data from multiple sources.
- Structure and clean data using custom schemas.
- Feed structured data directly into AI models for further analysis (see the sketch below).
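To make that concrete, here is a minimal, hypothetical sketch of such a pipeline: crawl a product listing, extract structured records with a CSS schema, then hand the cleaned JSON to an LLM for a summary. The URL, selectors, and prompt are placeholders, and the OpenAI call assumes you have the openai package installed and OPENAI_API_KEY set.
import asyncio
import json
import os

from openai import OpenAI
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def scrape_products() -> list[dict]:
    # 1. Scrape raw data and structure it with a (hypothetical) CSS schema
    schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"}
        ]
    }
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/products", config=run_conf)
        return json.loads(result.extracted_content) if result.success else []

def summarize(products: list[dict]) -> str:
    # 2. Feed the structured data to an LLM for downstream analysis
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize pricing trends in this product data:\n"
                       + json.dumps(products, indent=2)
        }]
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    products = asyncio.run(scrape_products())
    if products:
        print(summarize(products))
From here, the same structured records could just as easily be pushed into a vector store for RAG or handed off to an agent framework instead of a one-off summary call.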
For a deep dive into integrating Crawl4AI with AI workflows, check out additional resources on Medium or Revanth Quick Learn for practical examples.
Conclusion
Crawl4AI is a game changer for anyone looking to harness web data in an AI-driven world. In this post, we covered:
- Installation: Get started using pip or Docker.
- Basic Crawling: A simple script to convert HTML to Markdown.
- Customization: Configure browser and crawler parameters for optimal performance.
- Data Extraction: Use CSS or LLM-based strategies for structured data.
- Advanced Use Cases: Concurrency, dynamic content handling, and AI agent integration.
By leveraging Crawl4AI, you can build robust data pipelines, enhance AI models with fresh data, and unlock new possibilities for automation. Happy crawling!