Web scraping is an art, and Firecrawl is your paintbrush. Scraping can be difficult because we're constantly faced with blockers like JavaScript-heavy content, CAPTCHAs, and strict rate limits. Fortunately, Firecrawl is designed to address these common problems. This guide will take you through Firecrawl's capabilities, showing you how to scrape, crawl, and extract data like a pro.
Getting Started with Firecrawl
Let's begin with a quick setup. To scrape a single page and extract clean markdown, with Firecrawl handling all the complexities in the background, use the /scrape endpoint.
Here’s a simple example using Python:
# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# scrape_url returns the scraped data as a dict (v0 SDK)
content = app.scrape_url("https://docs.firecrawl.dev")
print(content["markdown"])  # The scraped content in markdown format
But Firecrawl isn’t just about scraping plain web pages. Let’s dive into some advanced options that make Firecrawl truly shine.
Advanced Scraping Options
Scraping PDFs
By default, the /scrape endpoint extracts text content from PDFs. If you want to skip PDF parsing, simply set pageOptions.parsePDF to false.
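For example, a request that disables PDF parsing might look like this (a minimal sketch using the same /scrape endpoint; the PDF URL is a placeholder):

curl -X POST https://api.firecrawl.dev/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://example.com/whitepaper.pdf",
    "pageOptions": {
      "parsePDF": false
    }
  }'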
Page Options: Fine-Tuning Your Scrape
Firecrawl gives you control over what and how you scrape. Here’s a breakdown of the key pageOptions parameters:
- onlyMainContent: Returns only the main content of the page, ignoring headers, footers, and sidebars.
- includeHtml: Adds an html key with the processed HTML version of the content to the response.
- includeRawHtml: Adds a rawHtml key with the raw, unprocessed HTML to the response.
- screenshot: Captures a screenshot of the top of the page.
- waitFor: Waits the specified number of milliseconds for the page to load before scraping.
Example: Combining Page Options
Here’s how you might combine these options in a single request:
curl -X POST https://api.firecrawl.dev/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "pageOptions": {
      "onlyMainContent": true,
      "includeHtml": true,
      "includeRawHtml": true,
      "screenshot": true,
      "waitFor": 5000
    }
  }'
With this request, Firecrawl returns only the main content (with both the processed and raw HTML included), captures a screenshot of the top of the page, and waits 5 seconds for the page to fully load before scraping.
Extractor Options: Getting Structured Data
Beyond scraping, Firecrawl helps you extract structured data from any content using the extractorOptions parameter.
- mode: Choose between llm-extraction (from cleaned data) and llm-extraction-from-raw-html (directly from raw HTML).
- extractionPrompt: Describe what information you want to extract.
- extractionSchema: Define the structure of the extracted data.
Example: Extracting Data with a Schema
curl -X POST https://api.firecrawl.dev/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev/",
    "extractorOptions": {
      "mode": "llm-extraction",
      "extractionPrompt": "Extract the company mission, SSO support, open-source status, and YC status.",
      "extractionSchema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "supports_sso": { "type": "boolean" },
          "is_open_source": { "type": "boolean" },
          "is_in_yc": { "type": "boolean" }
        },
        "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
      }
    }
  }'
This request not only scrapes the page but also extracts the company mission, SSO support, open-source status, and YC affiliation as structured data that matches your schema.
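The same extraction can be run from Python. Here's a sketch assuming firecrawl-py's scrape_url accepts the request body as a params dict, as the v0 SDK does:

# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Pass the same extractorOptions the REST API accepts
result = app.scrape_url(
    "https://docs.firecrawl.dev/",
    {
        "extractorOptions": {
            "mode": "llm-extraction",
            "extractionPrompt": "Extract the company mission, SSO support, open-source status, and YC status.",
            "extractionSchema": {
                "type": "object",
                "properties": {
                    "company_mission": {"type": "string"},
                    "supports_sso": {"type": "boolean"},
                    "is_open_source": {"type": "boolean"},
                    "is_in_yc": {"type": "boolean"}
                },
                "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
            }
        }
    },
)

# In the v0 API, the extracted fields come back under llm_extraction
print(result["llm_extraction"])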
Crawling Multiple Pages
Sometimes one page isn't enough. That's where the /crawl endpoint comes in; it allows you to scrape an entire site. You can specify a base URL, and Firecrawl will handle the rest, capturing all accessible subpages.
Example: Customizing Your Crawl
This setup shows you how to customize your crawl with specific options:
curl -X POST https://api.firecrawl.dev/v0/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "crawlerOptions": {
      "includes": ["/blog/*", "/products/*"],
      "excludes": ["/admin/*", "/login/*"],
      "returnOnlyUrls": false,
      "maxDepth": 2,
      "mode": "fast",
      "limit": 1000
    }
  }'
In this configuration, Firecrawl will:
- Crawl pages matching the /blog/* and /products/* subpaths.
- Skip pages matching /admin/* and /login/*.
- Crawl up to two levels deep and up to 1000 pages in total.
- Use the fast crawling mode for quicker results.
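If you prefer the SDK, the same crawl can be started from Python. This is a sketch assuming the v0 firecrawl-py crawl_url signature, where wait_until_done=True blocks until the crawl finishes and returns the scraped pages as a list:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Start the crawl and wait for it to complete
results = app.crawl_url(
    "https://docs.firecrawl.dev",
    {
        "crawlerOptions": {
            "includes": ["/blog/*", "/products/*"],
            "excludes": ["/admin/*", "/login/*"],
            "maxDepth": 2,
            "mode": "fast",
            "limit": 1000
        }
    },
    wait_until_done=True,
)

# Each entry is one crawled page
for page in results:
    print(page["metadata"]["sourceURL"])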
Combining Page and Crawler Options
For more control, combine pageOptions with crawlerOptions in a single request:
curl -X POST https://api.firecrawl.dev/v0/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "pageOptions": {
      "onlyMainContent": true,
      "includeHtml": true,
      "includeRawHtml": true,
      "screenshot": true,
      "waitFor": 5000
    },
    "crawlerOptions": {
      "includes": ["/blog/*", "/products/*"],
      "maxDepth": 2,
      "mode": "fast"
    }
  }'
With this setup, Firecrawl will deliver precisely the data you need, exactly how you need it.
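One practical note: crawls run asynchronously. The /crawl endpoint responds with a job ID rather than the scraped pages, and you fetch the results from the status endpoint once the job completes. A minimal sketch against the v0 API used throughout this guide (substitute the jobId value from your /crawl response):

curl -X GET https://api.firecrawl.dev/v0/crawl/status/YOUR_JOB_ID \
  -H 'Authorization: Bearer YOUR_API_KEY'

The response includes the job status and, once the crawl finishes, the scraped data for every page.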
You can get started with 500 free Firecrawl credits (no credit card required), or you can self-host the open-source version.