SuperScraper is an open-source Actor that combines features from various web scraping services, including ScrapingBee, ScrapingAnt, and ScraperAPI.
A key capability is its standby mode, which runs the Actor as a persistent API server. This eliminates the usual start-up delay, a common pain point in many systems, and lets users interact with it immediately through direct API calls.
This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.
SuperScraper is built on Crawlee (apify/crawlee), a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, supports both headful and headless modes, and handles proxy rotation.
What is SuperScraper?
SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.
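For example, once SuperScraper runs in standby mode, a scrape is just one HTTP request to its standby URL. The snippet below is only a sketch: the URL is a placeholder, and the ScrapingBee-style url query parameter and bearer-token header are assumptions about the API surface, so check the project's README for the exact parameters.

// Minimal sketch of calling a standby Actor's HTTP API (placeholder URL and token).
const standbyUrl = 'https://<your-superscraper-standby-host>/?url=https://example.com';
const response = await fetch(standbyUrl, {
    headers: { Authorization: 'Bearer <YOUR_APIFY_TOKEN>' }, // assumed auth scheme
});
console.log(await response.text()); // the scraped content comes back in the HTTP response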
How to enable standby mode
To activate standby mode, you enable it in the Actor's configuration on the Apify platform and make sure the Actor starts an HTTP server that listens for incoming requests.
Server setup
The project uses the built-in Node.js http module to create a server that listens on the desired port. After the server starts, a check ensures users interact with it by sending HTTP requests rather than running it as a traditional, one-off Actor run. This keeps SuperScraper operating as a persistent server.
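As a rough sketch (not SuperScraper's exact code), the server setup could look like this. The ACTOR_STANDBY_PORT and APIFY_META_ORIGIN environment variable names and the standby check are assumptions used only to illustrate the idea:

import { createServer } from 'node:http';
import { Actor } from 'apify';
import { log } from 'crawlee';

await Actor.init();

// Assumption: the platform tells a standby Actor which port to listen on via an env variable.
const port = Number(process.env.ACTOR_STANDBY_PORT ?? 8080);

// Assumption: the meta-origin env variable distinguishes standby traffic from a regular run,
// so we can tell users to call the HTTP API instead of starting the Actor traditionally.
if (process.env.APIFY_META_ORIGIN !== 'STANDBY') {
    await Actor.exit('This Actor is meant to be used via its standby HTTP API, not run directly.');
}

const server = createServer(async (req, res) => {
    // Each incoming HTTP request is translated into a Crawlee request here (shown later in the post).
});

server.listen(port, () => log.info(`SuperScraper is listening on port ${port}`));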
Handling multiple crawlers
SuperScraper processes user requests using multiple instances of Crawlee's PlaywrightCrawler. Since each PlaywrightCrawler instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
For example, if the user sends one request using “normal” proxies and another using residential US proxies, each proxy configuration needs its own crawler. To manage them, the crawlers are stored in a key-value map, where the key is the stringified proxy configuration.
const crawlers = new Map<string, PlaywrightCrawler>();
Here’s a part of the code that gets executed when a new request from the user arrives; if the crawler for this proxy configuration exists in the map, it will be used. Otherwise, a new crawler gets created. Then, we add the request to the crawler’s queue so it can be processed.
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);
await crawler.addRequests([request]);
The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the MemoryStorage client. This approach is used for two key reasons:
- Performance: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.
- Isolation: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.
export const createAndStartCrawler = async (crawlerOptions: CrawlerOptions = DEFAULT_CRAWLER_OPTIONS) => {
const client = new MemoryStorage({ persistStorage: false });
const queue = await RequestQueue.open(undefined, { storageClient: client });
const proxyConfig = await Actor.createProxyConfiguration(crawlerOptions.proxyConfigurationOptions);
const crawler = new PlaywrightCrawler({
keepAlive: true,
proxyConfiguration: proxyConfig,
maxRequestRetries: 4,
requestQueue: queue,
});
// ...continued below: start the crawler, register it in the map, and return it.
At the end of the function, we start the crawler and log a warning if it ever stops (with keepAlive enabled, it should keep running). Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return it.
crawler.run().then(
() => log.warning(`Crawler ended`, crawlerOptions),
() => { }
);
crawlers.set(JSON.stringify(crawlerOptions), crawler);
log.info('Crawler ready 🚀', crawlerOptions);
return crawler;
};
Mapping standby HTTP requests to Crawlee requests
When creating the server, it accepts a request listener function that takes two arguments: the user’s request and a response object. The response object is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is also used as the request’s uniqueKey.
const responses = new Map<string, ServerResponse>();
Saving response objects
The following function stores a response object in the key-value map:
export function addResponse(responseId: string, response: ServerResponse) {
responses.set(responseId, response);
}
Updating crawler logic to store responses
Here’s the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);
addResponse(request.uniqueKey!, res);
await crawler.requestQueue!.addRequest(request);
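For context, here is roughly how the Crawlee request can be built from the user's HTTP request before those two calls run. This is a sketch, not SuperScraper's exact code, and the ScrapingBee-style url query parameter is an assumption; the point is that a random ID is generated once and used both as the key for the stored response object and as the request's uniqueKey.

import { randomUUID } from 'node:crypto';

// Sketch: inside the request listener, read the target URL from the query string.
const params = new URL(req.url ?? '/', 'http://localhost').searchParams;

const request = {
    url: params.get('url')!,
    uniqueKey: randomUUID(), // shared key: stored response object + Crawlee request
    userData: { params: Object.fromEntries(params) }, // remaining options travel with the request
};

The addResponse and addRequest calls shown above then use this uniqueKey, so once the crawler finishes, the matching HTTP response object can be looked up again.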
Sending scraped data back
Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:
export const sendSuccResponseById = (responseId: string, result: unknown, contentType: string) => {
const res = responses.get(responseId);
if (!res) {
log.info(`Response for request ${responseId} not found`);
return;
}
res.writeHead(200, { 'Content-Type': contentType });
res.end(result);
responses.delete(responseId);
};
Error handling
There is similar logic to send a response back if an error occurs during scraping:
export const sendErrorResponseById = (responseId: string, result: string, statusCode: number = 500) => {
const res = responses.get(responseId);
if (!res) {
log.info(`Response for request ${responseId} not found`);
return;
}
res.writeHead(statusCode, { 'Content-Type': 'application/json' });
res.end(result);
responses.delete(responseId);
};
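To tie these two helpers back to the crawler created earlier, the natural place to call them is in the crawler's requestHandler and failedRequestHandler. The handlers below are a simplified sketch that just returns the rendered HTML; they are not SuperScraper's actual handlers, which support many more options.

const crawler = new PlaywrightCrawler({
    keepAlive: true,
    proxyConfiguration: proxyConfig,
    maxRequestRetries: 4,
    requestQueue: queue,
    // Simplified: grab the rendered HTML and send it to the waiting HTTP response.
    async requestHandler({ request, page }) {
        const html = await page.content();
        sendSuccResponseById(request.uniqueKey, html, 'text/html');
    },
    // Simplified: after all retries fail, report the error to the waiting HTTP response.
    async failedRequestHandler({ request }, error) {
        sendErrorResponseById(request.uniqueKey, JSON.stringify({ errorMessage: error.message }));
    },
});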
Adding timeouts during migrations
During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.
export const addTimeoutToAllResponses = (timeoutInSeconds: number = 60) => {
const migrationErrorMessage = {
errorMessage: 'Actor had to migrate to another server. Please, retry your request.',
};
const responseKeys = [...responses.keys()]; // responses is a Map, so read its keys directly
for (const key of responseKeys) {
setTimeout(() => {
sendErrorResponseById(key, JSON.stringify(migrationErrorMessage));
}, timeoutInSeconds * 1000);
}
};
Managing migrations
SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.
Actor.on('migrating', () => {
    addTimeoutToAllResponses(60);
});
Users receive clear feedback during server migrations, maintaining stable operation.
Build your own
This guide showed how to build and manage a standby web scraper using Apify’s platform and Crawlee. The implementation handles multiple proxy configurations through separate PlaywrightCrawler instances while managing request-response cycles efficiently to support diverse scraping needs.
Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.
To get started, explore the project on GitHub or learn more about Crawlee to build your own scalable web scraping tools.