Leapcell

Playwright Amazon Scraper: Products & Reviews (Javascript)

Web Automation and Data Collection with Playwright (Node.js Version)

Playwright is a library for testing and automating web pages in Chromium, Firefox, and WebKit. Developed by Microsoft, it exposes a single API for fast, reliable cross-browser automation.

Collecting Amazon Product Information with Playwright

We can use Playwright to simulate user behavior, such as visiting Amazon (www.amazon.com) and crawling product information and reviews. By using CSS selectors or XPath, we can precisely locate web page elements and extract their text or attributes.

Example: Crawling the Amazon Best Sellers List

We will use Playwright to collect Amazon's international best sellers list. The steps are as follows:

  1. Visit the target page, for example: https://www.amazon.com/b/?ie=UTF8&node=16857165011&ref_=sv_b_3
  2. Select all book elements (with the class names a-section and a-spacing-base)
  3. Iterate through the book elements and extract information such as titles, prices, ratings, and the number of reviews
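
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a verified scraper: only the card class names come from step 2, while the inner field selectors, the use of the cover image's alt text as the title, and the helper names normalizeBook and scrapeBestSellers are assumptions made for this example. Inspect the live page and adjust them before relying on the output.

```javascript
// Sketch only: everything except CARD_SELECTOR is an assumption about the page markup.
const CARD_SELECTOR = '.a-section.a-spacing-base'; // class names from step 2
const FIELD_SELECTORS = {
  price: '.a-price .a-offscreen', // assumed selector
  rating: '.a-icon-alt',          // assumed selector
  reviewCount: '.a-size-small',   // assumed selector
};

// Runs in Node: trim each field and turn empty or missing values into null.
function normalizeBook(raw) {
  const clean = (value) => {
    const text = (value ?? '').trim();
    return text === '' ? null : text;
  };
  return {
    title: clean(raw.title),
    price: clean(raw.price),
    rating: clean(raw.rating),
    reviewCount: clean(raw.reviewCount),
  };
}

// `page` is a Playwright Page, e.g. from `await context.newPage()`.
async function scrapeBestSellers(page, url) {
  // Step 1: visit the target page and wait for at least one card.
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await page.waitForSelector(CARD_SELECTOR);

  // Steps 2 and 3: this callback runs in the browser, once over all cards.
  const rawBooks = await page.$$eval(
    CARD_SELECTOR,
    (cards, selectors) =>
      cards.map((card) => {
        const text = (selector) =>
          card.querySelector(selector)?.textContent ?? null;
        return {
          // Assumption: the cover image's alt text holds the book title.
          title: card.querySelector('img')?.getAttribute('alt') ?? null,
          price: text(selectors.price),
          rating: text(selectors.rating),
          reviewCount: text(selectors.reviewCount),
        };
      }),
    FIELD_SELECTORS
  );

  return rawBooks.map(normalizeBook);
}
```

Calling scrapeBestSellers(page, url) with the URL from step 1 should resolve to an array of objects with title, price, rating, and reviewCount fields, where any field the assumed selectors did not match is null. Raw extraction happens in the browser and cleanup happens in Node because the $$eval callback cannot see Node-side functions.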

Deploying a Playwright Example on Leapcell

Playwright Deployment Example on Leapcell

This guide provides a streamlined approach to deploying Playwright tests on Leapcell. Follow the link above for a step-by-step tutorial.

Node.js Implementation Code

The following Node.js script uses Playwright to search Amazon for "laptop" and collect the title, rating, and review count from each result's product page:

const { chromium } = require('playwright');

(async () => {
    // Launch the browser
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();

    // Open the Amazon home page
    await page.goto('https://www.amazon.com/');

    // Search for the keyword "laptop"
    await page.fill('#twotabsearchtextbox', 'laptop');
    await page.click('#nav-search-submit-button');

    // Wait for the search results to render
    await page.waitForSelector('.s-result-item h2 a');

    // Get the list of product links
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('.s-result-item h2 a'))
            .map(a => a.href);
    });

    // Collect product details data
    const results = [];
    for (const link of links) {
        const productPage = await context.newPage();
        await productPage.goto(link, { waitUntil: 'domcontentloaded' });

        // Read an element's text with a short timeout so a missing element
        // does not stall for the default 30 seconds; fall back to 'N/A'
        const readText = (selector) =>
            productPage.textContent(selector, { timeout: 5000 })
                .then(text => (text ?? '').trim() || 'N/A')
                .catch(() => 'N/A');

        const title = await readText('#productTitle');
        const rating = await readText('#averageCustomerReviews .a-icon-alt');
        const reviewCount = await readText('#acrCustomerReviewText');

        results.push({ title, rating, reviewCount });

        await productPage.close();
    }

    // Output the collected data
    console.log(results);

    // Close the browser
    await browser.close();
})();

Code Analysis

  • Initializing Playwright: Use chromium.launch({ headless: true }) to launch the browser.
  • Searching Amazon: Use page.goto() to open the home page, page.fill() to type the keyword into the search box, and page.click() to submit the search.
  • Extracting Product Links: Use document.querySelectorAll() inside page.evaluate() to collect the URL of every product on the results page.
  • Collecting Product Details:
    • Open each product's page.
    • Get the product title (#productTitle).
    • Get the rating (#averageCustomerReviews .a-icon-alt).
    • Get the number of reviews (#acrCustomerReviewText).
  • Outputting Data and Closing the Browser
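
The script stores the rating and review count as raw text, so turning them into numbers for analysis needs one more step. The helpers below are a small sketch of that step. They assume the rating text looks like "4.5 out of 5 stars" and the review count text looks like "1,234 ratings". Neither format is guaranteed by Amazon, and the names parseRating and parseReviewCount are hypothetical.

```javascript
// Assumed input format: "4.5 out of 5 stars". Returns a number, or null if it does not match.
function parseRating(text) {
  const match = /(\d+(?:\.\d+)?)\s+out of\s+\d+/.exec(text ?? '');
  return match ? Number(match[1]) : null;
}

// Assumed input format: "1,234 ratings". Keeps only the digits, so
// abbreviated counts such as "1.2K" are NOT handled correctly.
function parseReviewCount(text) {
  const digits = (text ?? '').replace(/\D/g, '');
  return digits === '' ? null : Number(digits);
}

// Example: convert one record shaped like the entries in `results`.
function toNumericRecord(record) {
  return {
    title: record.title,
    rating: parseRating(record.rating),
    reviewCount: parseReviewCount(record.reviewCount),
  };
}
```

Both parsers return null for the 'N/A' fallback, which keeps missing values distinguishable from a real zero.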

Code Optimization

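  1. Error Handling: Some products have no rating or review count. In that case textContent() waits for the selector until it times out (30 seconds by default) and then rejects, so .catch(() => 'N/A') supplies a fallback instead of crashing the script. Passing a shorter timeout option keeps missing fields from slowing the run.
  2. Automation Efficiency: Open each product page with context.newPage() so that all pages share one browser context (cookies and cache) instead of launching a new browser for every product.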
  3. Avoiding Being Blocked:
    • You can use proxy access (such as Playwright's proxy option).
    • You can adjust the userAgent to make it more like a real user.
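
As a rough sketch of how these options fit together: Playwright's launch() accepts a proxy option with a server field, and newContext() accepts a userAgent string. The helper names buildBrowserOptions and openContext, and the proxy address and user agent string in the usage note, are placeholders invented for this example.

```javascript
// Build the option objects separately so they can be inspected without a browser.
function buildBrowserOptions({ proxyServer, userAgent } = {}) {
  const launchOptions = { headless: true };
  if (proxyServer) {
    launchOptions.proxy = { server: proxyServer }; // Playwright's `proxy` launch option
  }
  const contextOptions = {};
  if (userAgent) {
    contextOptions.userAgent = userAgent; // applied to every page in the context
  }
  return { launchOptions, contextOptions };
}

// Launch Chromium and open one shared context using the options above.
async function openContext(settings) {
  // Loaded here so buildBrowserOptions can be used without Playwright installed.
  const { chromium } = require('playwright');
  const { launchOptions, contextOptions } = buildBrowserOptions(settings);
  const browser = await chromium.launch(launchOptions);
  const context = await browser.newContext(contextOptions);
  return { browser, context };
}
```

For example, openContext({ proxyServer: 'http://proxy.example.com:8080', userAgent: 'Mozilla/5.0 (placeholder)' }) would return a browser and a context whose pages all use that proxy and user agent. Both values are stand-ins to be replaced with real ones.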

Using Playwright and Node.js, we can efficiently automate data collection from Amazon pages, which suits scenarios such as e-commerce data analysis and competitor research.

Leapcell: The Next-Gen Serverless Platform for Web Hosting, Async Tasks, and Redis

Finally, I would like to recommend the best platform for deploying Playwright: Leapcell

1. Multi-Language Support

  • Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

  • Pay only for usage: no requests, no charges.

3. Unbeatable Cost Efficiency

  • Pay-as-you-go with no idle charges.
  • Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

  • Intuitive UI for effortless setup.
  • Fully automated CI/CD pipelines and GitOps integration.
  • Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

  • Auto-scaling to handle high concurrency with ease.
  • Zero operational overhead — just focus on building.

Explore more in the documentation!

Leapcell Twitter: https://x.com/LeapcellHQ
