Web scraping is the automated extraction of data from websites, often with productivity tools such as Puppeteer and Playwright. Data scientists, software developers, and research analysts use it to gather information for competitive analysis, compare prices on e-commerce websites, and build apps that send email notifications when prices change, as in the travel sector.
Using Bright Data Scraping Browser and GPT (Generative Pre-trained Transformer) to analyze feedback about products, whether yours or a competitor's, yields actionable insights from both negative and positive reviews that can help you meet customers’ needs and boost sales. As an example, we will demonstrate how GPT can turn reviews posted by users on the Udemy learning platform into helpful suggestions.
This technique serves more than just individuals; brands and companies can use it to understand what people say about their products.
Everything that you will learn in this article is for ethical purposes. Bright Data is used here to turn websites into structured data that is meaningful to any user without getting blocked or rate limited and without relying on APIs (application programming interfaces).
Let’s get started!
GitHub
Find the source code in this repo. Fork and clone it to test it yourself.
The repo contains two folders: reviews, a React frontend application that displays the reviews from Udemy and the suggestions from GPT, and headless-web-scraping, a Node server that saves the scraped data in a JSON (JavaScript Object Notation) file.
Demo
For a practical demonstration of the client-side app, check it out here.
Prerequisites
Before building or writing a line of code, check the following requirements:
- Node.js >=16, which comes with the package manager, npm
- Knowledge of JavaScript and React
- A code editor like VS Code or any other integrated development environment (IDE)
- Basic understanding of CSS
Set up Bright Data Scraping Browser
The Scraping Browser is compatible with Puppeteer and Playwright and comes with built-in website unblocking actions.
To begin, sign up on the Bright Data website for free; it comes with a $20/GB “no commitment” plan.
Some of the great benefits of using Bright Data architecture are:
- Quick
- Flexible
- Cost-efficient
Discover how to leverage web scraping to your advantage.
After signup, go to your dashboard and click on the Proxies and Scraping Infrastructure icon on the window's left pane.
Next, click on the Add button dropdown and select Scraping Browser. Give the proxy a name under the Solution name field and click the Add button to continue.
The next screen will display the host, username, and password values used to connect to the Scraping Browser.
Let’s get the project running by installing the boilerplate.
Installation
In this section, you will learn the basics of initializing a new project with Node.js and Vite. The web scraper in Node.js will handle the scripts for retrieving and storing the web data, while the UI (user interface) in React will display the info from the server and GPT.
In this project, create a folder that will hold both the frontend and backend code like this:
.
└── Bright_data
    ├── headless-web-scraping
    └── reviews
Node.js
To set up a Node project, first, create a new directory with the command in the terminal:
mkdir headless-web-scraping
Next, change into the new directory:
cd headless-web-scraping
Initialize the project:
npm init -y
The -y flag accepts all the defaults and skips the interactive prompt, which would otherwise ask questions about the project to fill in the package.json file.
Next, install the following dependencies, which will be recorded in package.json:
npm install dotenv puppeteer-core
- dotenv: This library is responsible for loading environment variables from the .env file into process.env
- puppeteer-core: An automation library that does not ship with the browser itself
Now, create the index.js file in the root directory and copy-paste this code:
index.js
console.log("Hello world!")
Before running this script, head to the package.json file and update the scripts section as follows:
{
  "name": "headless-web-scraping",
  ...
  "scripts": {
    "start": "node index.js"
  },
  ...
}
Run the script:
npm run start
This should return:
Hello world!
React
The UI folder for this app is called reviews. Run this command within the directory reviews to scaffold a new Vite React project.
npm create vite@latest ./
The ./ signifies that all the files and folders should be created within the current folder. Running the command will also prompt for responses in the terminal. Choose the React and JavaScript options, but you can use any other framework you are comfortable with.
With the setup complete, follow the instructions in the terminal to install the dependencies and start the development server with these commands:
npm install
npm run dev
Open your browser to see the UI; the development server runs on port 5173.
It is time to include Tailwind CSS, a utility-first CSS framework whose classes are applied directly in the JSX to build modern websites.
Check out this guide and follow the instructions on installing Tailwind CSS in a Vite project.
Creating a JavaScript Web Scraper in Node.js
Return to the Access parameters tab on your created zone and copy the host, username, and password values.
Creating Environment Variables
Environment variables are essential in Node.js for keeping sensitive data like secret keys and credentials out of the source code, where they could be exposed to unauthorized access.
Create a .env file in the root folder of headless-web-scraping and paste these values into it, with AUTH holding the username and password and HOST holding the host:
.env
AUTH="<AUTH>"
HOST="<HOST>"
To load these credentials, update index.js with the following:
index.js
const puppeteer = require("puppeteer-core");
require("dotenv").config(); // load AUTH and HOST from .env into process.env
const fs = require("fs");

const auth = process.env.AUTH;
const host = process.env.HOST;

async function run() {
  let browser;
  try {
    // Connect to the remote Scraping Browser over WebSocket
    browser = await puppeteer.connect({
      browserWSEndpoint: `wss://${auth}@${host}`,
    });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000); // 2 minutes
    await page.goto(
      "https://www.udemy.com/course/nodejs-express-mongodb-bootcamp/"
    );

    // Runs in the page context: collect the name and text of each review
    const reviews = await page.evaluate(() =>
      Array.from(
        document.querySelectorAll(
          ".reviews--reviews-desktop--3cOLE .review--review-container--knyTv"
        ),
        (e) => ({
          // Optional chaining avoids a crash when an element is missing
          reviewerName: e.querySelector(".ud-heading-md")?.innerText ?? "",
          reviewerText: e.querySelector(".ud-text-md span")?.innerText ?? "",
          // Random id between 0 and 99; not guaranteed to be unique
          id: Math.floor(Math.random() * 100),
        })
      )
    );

    // Await the write so that any error reaches the catch block below
    const outputFilename = "reviews.json";
    await fs.promises.writeFile(outputFilename, JSON.stringify(reviews, null, 2));
    console.log("file saved");
  } catch (e) {
    console.error("run failed", e);
  } finally {
    await browser?.close();
  }
}

if (require.main === module) run();
Some things to note in the code above:
- The imported modules: puppeteer-core, dotenv, and the file system (fs) module
- Within the run() function, the puppeteer.connect() method is responsible for connecting to a remote browser through a proxy service (Bright Data Scraping Browser)
- The browserWSEndpoint property is the WebSocket URL where the remote browser is running. The values interpolated into the template literal are the parameters from the Bright Data dashboard stored in the .env file: the credentials (username and password) and the host
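Because the endpoint is assembled from two environment variables, a missing value produces a malformed URL that only fails once the connection is attempted. A minimal sketch of a guard that fails earlier with a clearer message is shown here. The helper name buildEndpoint is hypothetical (it is not part of Puppeteer or Bright Data), and the sketch assumes AUTH has the form username:password and HOST has the form host:port:

```javascript
// Hypothetical helper: builds the WebSocket endpoint passed to puppeteer.connect().
// Assumption: AUTH is "<username>:<password>" and HOST is "<host>:<port>".
function buildEndpoint(env) {
  const { AUTH, HOST } = env;
  const missing = [];
  if (!AUTH) missing.push("AUTH");
  if (!HOST) missing.push("HOST");
  if (missing.length > 0) {
    throw new Error(`Missing environment variable(s): ${missing.join(", ")}`);
  }
  return `wss://${AUTH}@${HOST}`;
}

// Example with placeholder values (not real credentials).
console.log(buildEndpoint({ AUTH: "user:pass", HOST: "example.host:9222" }));
// prints "wss://user:pass@example.host:9222"
```

In index.js, the same function could be called as buildEndpoint(process.env) after dotenv has loaded the .env file.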
The other details from the code block above are standard Puppeteer code:
- Open a new page
- Set the default navigation timeout to 2 minutes
- Go to the course page on Udemy
- Inspect the HTML page using the page.evaluate() method, which loops through the matching elements in the DOM to get the reviewer name and the review text
- Use Math.floor() together with Math.random() to generate a random id
- Save the result in JSON format using the fs module
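One caveat with the random id: values between 0 and 99 can repeat, which matters later if the id is used as a React key. As an optional alternative, the scraped array could be cleaned up before it is written to disk. This is a minimal sketch, not part of the original project; normalizeReviews is a hypothetical helper and the sample entries are placeholders:

```javascript
// Hypothetical post-processing step for the array returned by page.evaluate().
// Trims whitespace, drops entries with no review text, and assigns sequential
// ids in place of the random ones, so that no two reviews share an id.
function normalizeReviews(rawReviews) {
  return rawReviews
    .map((review) => ({
      reviewerName: (review.reviewerName || "").trim(),
      reviewerText: (review.reviewerText || "").trim(),
    }))
    .filter((review) => review.reviewerText.length > 0)
    .map((review, index) => ({ ...review, id: index + 1 }));
}

// Placeholder input: two entries share the same random id and one has no text.
const sample = [
  { reviewerName: " Reviewer A ", reviewerText: " Great course. ", id: 11 },
  { reviewerName: "Reviewer B", reviewerText: "", id: 11 },
  { reviewerName: "Reviewer C", reviewerText: "Some sections are outdated.", id: 42 },
];
console.log(normalizeReviews(sample));
```

The result keeps Reviewer A and Reviewer C, with ids 1 and 2.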
Run the script:
npm run start
The output is saved within the headless-web-scraping folder as reviews.json and should look like this:
[
  {
    "reviewerName": "Yash U.",
    "reviewerText": "This was a very intensive course covering almost all backend stuff. A huge thanks to the instructor - Jonas and also to the community. A lot of bugs and problems were already posted in the Q&A section and it helped a lot. Towards the end of the course, there were a few things that were outdated and a lot of people were disappointed in the comments but for me these things helped a lot. You learn to search and find solutions on your own and this is what is required in real world. Hence, despite these issues towards the end, I would absolutely recommend this course to anyone who wants to start learning backend development.",
    "id": 11
  },
  {
    "reviewerName": "Shyam Nath R S.",
    "reviewerText": "As always with Jonas's other courses like JS, HTML and CSS I understood"
  },
  ...
]
Using GPT
If you don’t have a ChatGPT account, sign up and create one.
Copy one of the reviewerText values from the JSON output and paste it into ChatGPT. For a walkthrough, watch the video below.
You should get something similar to this:
The suggestions or improvements:
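If you would rather not copy each review by hand, the prompt text can be assembled from an entry in reviews.json and then pasted into ChatGPT. The sketch below is only an illustration: buildPrompt is a hypothetical helper, and the prompt wording is an assumption, not the prompt used in the walkthrough.

```javascript
// Hypothetical helper: turns one scraped review into text to paste into ChatGPT.
// The wording of the prompt is an assumption made for illustration.
function buildPrompt(review) {
  return [
    "Here is a review of an online course:",
    `"${review.reviewerText}"`,
    "List concrete suggestions the instructor could act on to improve the course.",
  ].join("\n");
}

// Placeholder review in the same shape as an entry of reviews.json.
const review = {
  reviewerName: "Reviewer A",
  reviewerText: "Some sections are outdated.",
  id: 1,
};
console.log(buildPrompt(review));
```

Keeping the prompt in one function makes it easy to reuse the same wording for every review, so the suggestions are comparable across reviews.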
Creating the UI in React
React is a JavaScript library used by developers for building user interfaces with reusable components.
Now that we have the reviews and suggestions, let’s create the UI to display the data.
In the reviews project, create a new folder called components in the src directory with the following files:
.
└── reviews
    └── src
        └── components
            ├── Footer.jsx
            ├── ImproveSuggestion.jsx
            ├── ReviewImprovementSuggestions.jsx
            ├── Reviews.jsx
            └── Text.jsx
Also, let’s create a file called reviews.js in a folder named data to hold the responses from GPT as an array of objects, as shown:
src/data/reviews.js
.
└── reviews
    └── src
        └── data
            └── reviews.js
Get the entire data in this gist.
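The gist is not reproduced here, but the props read in App.jsx further down suggest that each entry carries id, reviewerName, reviewText, and improvementSuggestions. Note that the scraper saves the text under reviewerText, while the UI reads reviewText. The following is a minimal sketch of combining the scraped reviews with the suggestions collected from GPT, assuming that shape; toUiReviews and suggestionsById are hypothetical names, and the data is placeholder content:

```javascript
// Hypothetical helper: combines scraped reviews with suggestions collected from GPT
// into the shape assumed by the UI (id, reviewerName, reviewText, improvementSuggestions).
function toUiReviews(scrapedReviews, suggestionsById) {
  return scrapedReviews.map((review) => ({
    id: review.id,
    reviewerName: review.reviewerName,
    // The scraper stores the text as reviewerText; the UI reads reviewText.
    reviewText: review.reviewerText,
    // Left undefined when there are no suggestions for this review.
    improvementSuggestions: suggestionsById[review.id],
  }));
}

// Placeholder data for illustration only.
const scraped = [
  { id: 1, reviewerName: "Reviewer A", reviewerText: "Some sections are outdated." },
  { id: 2, reviewerName: "Reviewer B", reviewerText: "Great course." },
];
const suggestions = { 1: ["Refresh the outdated sections."] };
console.log(toUiReviews(scraped, suggestions));
```

In the React project, the resulting array would be exported from src/data/reviews.js as a named export called reviews, which is what App.jsx imports.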
Let’s update the code in the project accordingly:
Footer.jsx
const Footer = () => {
  return (
    <>
      <footer className='mt-auto'>
        <div className='mt-5 text-center text-gray-500'>
          <address>
            Built by{" "}
            <span className='text-blue-600'>
              {/* _blank opens the link in a new tab; rel guards the opener */}
              <a
                href='https://twitter.com/terieyenike'
                target='_blank'
                rel='noopener noreferrer'>
                Teri
              </a>
            </span>{" "}
            © 2023
          </address>
          <div>
            <p>
              Fork, clone, and star this
              <a
                href='https://github.com/Terieyenike/'
                target='_blank'
                rel='noopener noreferrer'
                className='text-blue-600'>
                <span> repo</span>
              </a>
            </p>
          </div>
          <p className='text-sm'>Bright Data .GPT .React .Tailwind CSS</p>
        </div>
      </footer>
    </>
  );
};

export default Footer;
Change the values in the JSX if you so desire.
ImproveSuggestion.jsx
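// Renders a single suggestion as a list item.
// The <li> is returned directly so that it is a valid child of the parent <ul>.
const ImproveSuggestion = ({ suggestion }) => {
  return <li className='mt-2'>{suggestion}</li>;
};

export default ImproveSuggestion;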
ReviewImprovementSuggestions.jsx
import ImproveSuggestion from "./ImproveSuggestion";

const ReviewImprovementSuggestions = ({ suggestions }) => {
  return (
    <div>
      <h3 className='text-xl font-bold mt-3'>Improvement Suggestions:</h3>
      <ul className='list-disc'>
        {suggestions.map((suggestion, index) => (
          <ImproveSuggestion key={index} suggestion={suggestion} />
        ))}
      </ul>
    </div>
  );
};

export default ReviewImprovementSuggestions;
Reviews.jsx
import ReviewImprovementSuggestions from "./ReviewImprovementSuggestions";

const Reviews = ({ reviewerName, reviewText, improvementSuggestions }) => {
  return (
    <div className='mb-8'>
      <h3 className='text-xl font-bold'>
        <span>Reviewer name:</span>
      </h3>
      <p className='mb-3'>{reviewerName}</p>
      <h3 className='text-xl font-bold'>
        <span>Review:</span>
      </h3>
      <p>{reviewText}</p>
      {improvementSuggestions && (
        <ReviewImprovementSuggestions suggestions={improvementSuggestions} />
      )}
    </div>
  );
};

export default Reviews;
Text.jsx
const Text = () => {
  return (
    <>
      <div className='bg-emerald-800 text-slate-50 p-5 mb-10'>
        <h1 className='text-2xl font-bold md:text-4xl'>
          Using Scraping Browser and GPT for actionable product insights.
        </h1>
        <p className='text-sm mt-3 md:text-xl'>
          Extract reviews from a specific product page{" "}
          <span className='font-bold'>Udemy</span> using Bright Data, Scraping
          Browser and GPT to analyze them to offer business insights.
        </p>
      </div>
    </>
  );
};

export default Text;
Several of the components above receive their data through props passed down from a parent component. Check out the React documentation to learn more.
The React UI will still display the default boilerplate template in the browser. To show the changes made to the files in the components folder, let’s update the entry point of the project, App.jsx, with this code:
src/App.jsx
import Reviews from "./components/Reviews";
import Text from "./components/Text";
import Footer from "./components/Footer";
import { reviews } from "./data/reviews";
import "./App.css";

function App() {
  return (
    <>
      <div className='flex flex-col container mx-auto max-w-6xl w-4/5 py-8 min-h-screen'>
        <Text />
        {reviews.map((review) => (
          <Reviews
            key={review.id}
            reviewerName={review.reviewerName}
            reviewText={review.reviewText}
            improvementSuggestions={review.improvementSuggestions}
          />
        ))}
        <Footer />
      </div>
    </>
  );
}

export default App;
Starting the development server will display the project like this:
Conclusion
Because it avoids website bans and works seamlessly with libraries like Puppeteer, Bright Data Scraping Browser is an excellent option for developers who need to deliver high-quality scraped data.
Scraping the web presents difficulties because websites use preventive measures such as CAPTCHAs to safeguard user data, and these measures can block automated access to a company's endpoints.
In this article, you learned how to inspect a web page's elements and extract the necessary data using Node.js, gathering user reviews from Udemy and storing them in a JSON file. The project's final step was using GPT to generate suggestions from those reviews and showing the outcome in a user interface.
Finally, these services and tools can help brands, companies, and individuals align their products with customer expectations. For the Udemy case study, GPT provided ways to improve the course and make it more suitable for learners. Websites are encouraged to allow reviews from actual product users, since these reviews supply the material for critical analysis using GPT technology.
Try the Scraping Browser today!