Web Automation and Data Collection with Playwright (Node.js Version)
Playwright is a library for testing and automating web pages. Developed by Microsoft, it drives Chromium, Firefox, and WebKit through a single API, which makes cross-browser automation tasks fast and reliable.
Collecting Amazon Product Information with Playwright
We can use Playwright to simulate user behavior, such as visiting Amazon (www.amazon.com) and crawling product information and reviews. By using CSS selectors or XPath, we can precisely locate web page elements and extract their text or attributes.
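As a minimal sketch of the two selector styles, the helper below reads an element's text and one attribute through Playwright's locator API. It assumes page is an already-open Playwright Page; the helper name readElement and the selectors in the usage comment are hypothetical placeholders, not taken from Amazon's actual markup.

```javascript
// Hypothetical helper: read the text and one attribute of the first element
// matching a selector. `page` is assumed to be an open Playwright Page.
// Playwright treats a selector as XPath when it starts with "xpath=" or "//";
// otherwise it is parsed as CSS.
async function readElement(page, selector, attribute = 'href') {
  const matches = page.locator(selector);

  // count() does not wait for the element, so a missing match returns null
  if ((await matches.count()) === 0) {
    return null;
  }

  const element = matches.first();
  const text = await element.textContent();
  const value = await element.getAttribute(attribute);
  return { text: text ? text.trim() : '', [attribute]: value };
}

// Usage (placeholder selectors):
//   await readElement(page, 'h2 a');          // CSS selector
//   await readElement(page, 'xpath=//h2/a');  // XPath selector
```

Checking count() first is a design choice: it avoids waiting for an element that is simply not on the page, at the cost of not waiting for content that renders late.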
Example: Crawling the Amazon Best Sellers List
We will use Playwright to collect the international best-sellers list on Amazon. The steps are as follows:
- Visit the target page, for example: https://www.amazon.com/b/?ie=UTF8&node=16857165011&ref_=sv_b_3
- Select all book elements (with the class names a-section and a-spacing-base)
- Iterate through the book elements and extract information such as titles, prices, ratings, and the number of reviews
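The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested scraper: page is assumed to be an open Playwright Page, the function and constant names are hypothetical, and the per-field selectors in FIELD_SELECTORS are guesses that should be checked against the live page. It also relies on locator.all(), which is only available in recent Playwright releases.

```javascript
const BEST_SELLERS_URL =
  'https://www.amazon.com/b/?ie=UTF8&node=16857165011&ref_=sv_b_3';

// ASSUMPTION: these inner selectors are placeholders for illustration only.
// Amazon's markup changes often, so verify them before relying on them.
const FIELD_SELECTORS = {
  title: '.a-link-normal span',
  price: '.a-price .a-offscreen',
  rating: '.a-icon-alt',
  reviewCount: '.a-size-small',
};

// Return the trimmed text of the first match inside `scope`, or null if absent.
async function textOrNull(scope, selector) {
  const matches = scope.locator(selector);
  if ((await matches.count()) === 0) return null;
  const text = await matches.first().textContent();
  return text ? text.trim() : null;
}

async function collectBestSellers(page) {
  // Step 1: visit the target page
  await page.goto(BEST_SELLERS_URL, { waitUntil: 'domcontentloaded' });

  // Step 2: select every element that carries both class names
  const cards = await page.locator('.a-section.a-spacing-base').all();

  // Step 3: iterate through the elements and read each field
  const books = [];
  for (const card of cards) {
    const book = {};
    for (const [field, selector] of Object.entries(FIELD_SELECTORS)) {
      book[field] = await textOrNull(card, selector);
    }
    books.push(book);
  }
  return books;
}
```

Fields that are missing from a card come back as null rather than throwing, so one incomplete entry does not abort the whole run.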
Deploying a Playwright Example on Leapcell
Playwright Deployment Example on Leapcell
This guide provides a streamlined approach to deploying Playwright tests on Leapcell. Follow the link above for a step-by-step tutorial.
Node.js Implementation Code
The following Node.js script uses Playwright to search Amazon for "laptop" and collect the title, rating, and review count of each result:
const { chromium } = require('playwright');

(async () => {
  // Launch the browser in headless mode
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Visit the Amazon home page
  await page.goto('https://www.amazon.com/');

  // Search for the keyword "laptop"
  await page.fill('#twotabsearchtextbox', 'laptop');
  await page.click('#nav-search-submit-button');

  // Wait for the results page to finish loading
  await page.waitForLoadState('networkidle');

  // Get the list of product links
  const links = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.s-result-item h2 a'))
      .map(a => a.href);
  });

  // Trim extracted text, falling back to 'N/A' when nothing was returned
  const clean = (text) => (text || 'N/A').trim();

  // Collect product details, opening each product in its own page
  const results = [];
  for (const link of links) {
    const productPage = await context.newPage();
    try {
      await productPage.goto(link, { waitUntil: 'networkidle' });
      // textContent() rejects if the element never appears, so catch and fall back
      const title = await productPage.textContent('#productTitle').catch(() => 'N/A');
      const rating = await productPage.textContent('#averageCustomerReviews .a-icon-alt').catch(() => 'N/A');
      const reviewCount = await productPage.textContent('#acrCustomerReviewText').catch(() => 'N/A');
      results.push({ title: clean(title), rating: clean(rating), reviewCount: clean(reviewCount) });
    } finally {
      // Close the page even if navigation failed
      await productPage.close();
    }
  }

  // Output the collected data
  console.log(results);

  // Close the browser
  await browser.close();
})();
Code Analysis
- Initializing Playwright: Use chromium.launch({ headless: true }) to launch the browser.
- Navigating to the Amazon Search Page: Use page.goto() to visit the website, fill in the search box, and submit the search.
- Extracting Product Links: Use document.querySelectorAll() to get the URLs of all products.
- Collecting Product Details:
  - Open each product's page.
  - Get the product title (#productTitle).
  - Get the rating (#averageCustomerReviews .a-icon-alt).
  - Get the number of reviews (#acrCustomerReviewText).
- Outputting Data and Closing the Browser
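The rating and review count come back as display strings, so analysis usually needs them as numbers. The sketch below shows one way to do that. It assumes the text looks like "4.5 out of 5 stars" and "1,234 ratings"; those formats and both function names are assumptions for illustration, and abbreviated counts such as "1.2K" are not handled.

```javascript
// Extract the first decimal number from text such as "4.5 out of 5 stars".
// Returns null when the text contains no number (for example 'N/A').
function parseRating(text) {
  const match = /(\d+(?:\.\d+)?)/.exec(text || '');
  return match ? Number(match[1]) : null;
}

// Keep only the digits from text such as "1,234 ratings".
// Returns null when the text contains no digits (for example 'N/A').
function parseReviewCount(text) {
  const digits = (text || '').replace(/[^\d]/g, '');
  return digits ? Number(digits) : null;
}

// Example with hand-written sample strings (not real scraped data)
const sample = {
  rating: parseRating(' 4.5 out of 5 stars '),
  reviewCount: parseReviewCount('1,234 ratings'),
};
console.log(sample); // rating is 4.5, reviewCount is 1234
```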
Code Optimization
- Error Handling: Some products may not have ratings or review counts. Use .catch(() => 'N/A') to prevent the code from crashing.
- Automation Efficiency: Use await context.newPage() to open each product in the existing context. Pages in the same context share cookies and cache, which avoids the cost of launching a new browser for every product.
- Avoiding Being Blocked:
  - You can use proxy access (such as Playwright's proxy option).
  - You can adjust the userAgent to make it more like a real user.
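The points above might be combined as in the following sketch. The proxy address and user agent string are hypothetical placeholder values, and the helper names are made up for illustration. The proxy option of chromium.launch(), the userAgent option of browser.newContext(), and the timeout option of page.textContent() are existing Playwright options.

```javascript
// ASSUMPTION: placeholder values; use a proxy you control and a current UA string.
const PROXY_SERVER = 'http://proxy.example.com:8080';
const USER_AGENT =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Options for chromium.launch(): route browser traffic through the proxy.
function buildLaunchOptions(proxyServer = PROXY_SERVER) {
  return { headless: true, proxy: { server: proxyServer } };
}

// Options for browser.newContext(): override the default user agent.
function buildContextOptions(userAgent = USER_AGENT) {
  return { userAgent };
}

// Read optional text with a short timeout, so a missing element costs at most
// `timeoutMs` instead of the full default timeout.
async function safeText(page, selector, fallback = 'N/A', timeoutMs = 2000) {
  try {
    const text = await page.textContent(selector, { timeout: timeoutMs });
    return text ? text.trim() : fallback;
  } catch {
    return fallback;
  }
}

// Usage, assuming `chromium` comes from require('playwright'):
//   const browser = await chromium.launch(buildLaunchOptions());
//   const context = await browser.newContext(buildContextOptions());
//   const page = await context.newPage();
//   const rating = await safeText(page, '#averageCustomerReviews .a-icon-alt');
```

The short timeout in safeText is a trade-off: it keeps the loop fast when a field is absent, but a slow page may report 'N/A' for a field that would have appeared a moment later.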
Using Playwright and Node.js, we can efficiently automate Amazon web page data collection, which is suitable for scenarios such as e-commerce data analysis and competitor research.
Leapcell: The Next-Gen Serverless Platform for Web Hosting, Async Tasks, and Redis
Finally, I would like to recommend the best platform for deploying Playwright: Leapcell
1. Multi-Language Support
- Develop with JavaScript, Python, Go, or Rust.
2. Deploy unlimited projects for free
- Pay only for usage: no requests, no charges.
3. Unbeatable Cost Efficiency
- Pay-as-you-go with no idle charges.
- Example: $25 supports 6.94M requests at a 60ms average response time.
4. Streamlined Developer Experience
- Intuitive UI for effortless setup.
- Fully automated CI/CD pipelines and GitOps integration.
- Real-time metrics and logging for actionable insights.
5. Effortless Scalability and High Performance
- Auto-scaling to handle high concurrency with ease.
- Zero operational overhead, so you can focus on building.
Explore more in the documentation!
Leapcell Twitter: https://x.com/LeapcellHQ