“Useful” data is the backbone of informed decision-making and strategic planning for any successful business. However, the real challenge isn’t just scraping data — it’s extracting valuable information from the data you’ve collected. This step often requires significant time and effort, making it a bottleneck in the process.
In this article, you’ll learn how to build a scraper that not only extracts web data but also processes and analyzes it in seconds. You’ll walk through the process step by step: integrating the Bright Data Scraping Browser to bypass common scraping obstacles like CAPTCHAs and IP bans, using LangChain and OpenAI to process data in real time, and building a sleek user interface with Next.js to display your results.
By the end of this article, you’ll have an application that can scrape any web page, analyze its content with AI, and answer your questions about it.
Prerequisites:
- Familiarity with JavaScript and TypeScript, and frameworks like React or Next.js.
- Node.js (v16 or higher): required to run the Next.js application.
- npm or yarn: used to manage dependencies in your project.
Getting Your Credentials
Before starting the project, you’ll need access credentials for the Bright Data Scraping Browser and an OpenAI API key. This section provides a step-by-step guide to obtaining these credentials. Skip this step if you already have them.
1. Bright Data Scraping Browser Access Details
To bypass anti-scraping mechanisms like CAPTCHAs and IP bans, you need the Scraping Browser.
Bright Data’s Scraping Browser is a powerful tool that automatically manages all website unlocking operations under the hood, including CAPTCHA solving, browser fingerprinting, automatic retries, header and cookie selection, JavaScript rendering, and more, so you can save time and resources.
- Sign up: on the homepage, click “Start Free Trial”. If you already have an account, log in.
- Click on “Get Proxy Products”.
- Click on the “Add” button and select “Scraping Browser.”
- Next, you will be taken to the “Add zone” page, where you will be required to choose a name for your new Scraping Browser proxy zone. After that, click on “Add”.
- After this, your proxy zone credentials will be created. (You will need these details in your code to bypass any anti-scraping mechanisms used on any website.)
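Your zone’s access details combine into a single WebSocket endpoint that Puppeteer will connect to. It generally takes the shape below (the placeholders come from your zone’s credentials; confirm the exact host and port in your dashboard):

```
wss://brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>:<PASSWORD>@brd.superproxy.io:9222
```

You’ll store this full string in your .env file later in the tutorial.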
2. OpenAI API Key
OpenAI’s API will be used to process and analyze the scraped data. Here’s how to get your API key:
Create an OpenAI Account if you haven’t already done so.
Click on “Start Building” and fill in the necessary details.
Once logged in, navigate to the API section in your dashboard.
In the API Keys section of the dashboard, click Create New Key.
Copy the generated key and store it in a secure location, as OpenAI will not display it again. If this is your first time, you may need to set up a billing account. Add your payment details to unlock access to GPT-4 and other models.
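Before wiring the key into the app, you can sanity-check it with a short Node script (assuming Node 18+ for the built-in fetch; this helper is illustrative and not part of the project):

```typescript
// Quick check that the OpenAI key is valid: list the models it can access
async function main() {
  const res = await fetch('https://api.openai.com/v1/models', {
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  });

  if (!res.ok) {
    throw new Error(`Key check failed: ${res.status} ${res.statusText}`);
  }

  const { data } = await res.json();
  console.log(`Key works; ${data.length} models available.`);
}

main().catch(console.error);
```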
Building the Scraper
Now that you have your credentials ready, it’s time to dive into building the application. The project will involve creating a Next.js application, setting up components and routes, and implementing the logic for web content extraction and analysis. Here’s a breakdown of the steps:
1. Setting up the Next.js application
Start by creating a new Next.js project. During setup, enable TypeScript and Tailwind CSS, and opt in to the src folder and the App Router.
npx create-next-app@latest web-content-ai-scraper
cd web-content-ai-scraper
Install the required dependencies:
npm install axios puppeteer-core dotenv cheerio @langchain/community @langchain/openai @langchain/core langchain
2. Project Structure
Organize your project with the following structure:
web-content-ai-scraper/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   └── query-web-content/
│   │   │       └── route.ts
│   │   ├── components/
│   │   │   ├── QueryResults.tsx
│   │   │   └── WebContentQuerierForm.tsx
│   │   ├── layout.tsx
│   │   └── page.tsx
├── utils/
│   └── WebContentQuerier.ts
└── .env
3. Setting Up Environment Variables
Create a .env file in the root of your project to store your API keys and credentials securely:

OPENAI_API_KEY=<your_openai_api_key>
BROWSER_WS=<your_bright_data_websocket_endpoint>

BROWSER_WS is the Scraping Browser WebSocket endpoint from your Bright Data zone; the names must match what the code reads later. Ensure you add the .env file to your .gitignore to prevent accidental exposure of sensitive information.
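A one-line addition is enough (note that the default Next.js .gitignore may only cover .env*.local variants):

```
# .gitignore
.env
```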
4. Implementing the Components
To handle the user input and display the results, we will create two React components that will live in the src/app/components folder.
1. QueryResults Component
The QueryResults.tsx file displays the query results and the context sources returned from the backend.
Here’s the complete code:
import React from 'react';
interface QueryResultsProps {
answer: string;
// eslint-disable-next-line @typescript-eslint/no-explicit-any
context: any[];
}
const QueryResults: React.FC<QueryResultsProps> = ({ answer, context }) => {
return (
<div className="mt-6 space-y-4">
<div className="bg-blue-50 border-l-4 border-blue-500 p-4">
<h2 className="text-lg font-semibold text-gray-900 mb-2">
Query Result
</h2>
<p className="text-gray-700">{answer}</p>
</div>
<details className="bg-gray-50 border border-gray-200 rounded-lg">
<summary className="px-4 py-2 cursor-pointer text-gray-800 font-medium">
View Context Sources
</summary>
<div className="p-4 space-y-2">
{context.map((doc, index) => (
<div
key={index}
className="bg-white p-3 rounded-md shadow-sm border border-gray-100"
>
<p className="text-sm text-gray-600 line-clamp-2">
{doc.pageContent}
</p>
</div>
))}
</div>
</details>
</div>
);
};
export default QueryResults;
Explanation:
- The QueryResults component accepts two props: answer (the processed output from the AI) and context (an array of context sources related to the query).
- The query result is displayed in a highlighted box with a title.
- A collapsible details section lets users view the context sources, keeping the display clean and user-friendly.
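As a quick illustration, the component renders like any other React component; the values below are hypothetical:

```tsx
<QueryResults
  answer="The page describes a subscription pricing model."
  context={[{ pageContent: 'Sample context chunk extracted from the page...' }]}
/>
```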
2. WebContentQuerierForm Component
The WebContentQuerierForm.tsx file creates the user interface for entering the URL and query.
Here’s the complete code:
'use client';
import React, { useState } from 'react';
interface WebContentQuerierFormProps {
onSubmit: (url: string, query: string) => void;
isLoading: boolean;
}
const WebContentQuerierForm: React.FC<WebContentQuerierFormProps> = ({
onSubmit,
isLoading,
}) => {
const [url, setUrl] = useState('');
const [query, setQuery] = useState('');
const handleSubmit = (e: React.FormEvent) => {
e.preventDefault();
// Basic validation
if (!url.trim() || !query.trim()) {
alert('Please enter both a URL and a query');
return;
}
onSubmit(url, query);
};
return (
<form onSubmit={handleSubmit} className="space-y-4">
<div>
<label
htmlFor="url"
className="block text-sm font-medium text-gray-700"
>
Web Page URL
</label>
<input
type="url"
id="url"
value={url}
onChange={(e) => setUrl(e.target.value)}
placeholder="https://example.com"
required
className="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-300 focus:ring focus:ring-indigo-200 focus:ring-opacity-50"
/>
</div>
<div>
<label
htmlFor="query"
className="block text-sm font-medium text-gray-700"
>
Your Query
</label>
<textarea
id="query"
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder="Enter your question about the web page"
required
rows={3}
className="mt-1 block w-full rounded-md border-gray-300 shadow-sm focus:border-indigo-300 focus:ring focus:ring-indigo-200 focus:ring-opacity-50"
/>
</div>
<div>
<button
type="submit"
disabled={isLoading}
className={`w-full flex justify-center py-2 px-4 border border-transparent rounded-md shadow-sm text-sm font-medium text-white
${
isLoading
? 'bg-gray-400 cursor-not-allowed'
: 'bg-indigo-600 hover:bg-indigo-700 focus:outline-none focus:ring-2 focus:ring-offset-2 focus:ring-indigo-500'
}`}
>
{isLoading ? 'Processing...' : 'Query Web Content'}
</button>
</div>
</form>
);
};
export default WebContentQuerierForm;
Explanation:
- The form includes input fields for a URL and a query.
- Validation ensures both fields are filled out before submission.
- A submit button triggers the onSubmit function, which processes the input.
- The button dynamically updates to indicate the loading status (Processing…) while the form is submitted.
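And a hypothetical usage, with a stub handler in place of the real query logic:

```tsx
<WebContentQuerierForm
  onSubmit={(url, query) => console.log('Would query', url, 'with:', query)}
  isLoading={false}
/>
```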
The next step will involve setting up the backend API to handle and integrate the query with Bright Data and OpenAI.
5. Creating the API Route
In this section, we will set up an API route to handle the request from the frontend form, process the provided URL and query, and return the result to be displayed.
File: src/app/api/query-web-content/route.ts
Below is the complete code for the API route:
import { NextRequest, NextResponse } from 'next/server';
import { WebContentQuerier } from '../../../../utils/WebContentQuerier';
export async function POST(request: NextRequest) {
try {
const { url, query } = await request.json();
if (!url || !query) {
return NextResponse.json(
{ message: 'URL and query are required' },
{ status: 400 }
);
}
const querier = new WebContentQuerier();
// Extract web page content
await querier.extractWebPageContent(url);
// Query the content
const result = await querier.queryContent(query);
return NextResponse.json(result);
} catch (error) {
console.error('Error processing web content query:', error);
return NextResponse.json(
{
message: 'Error processing web content query',
error: error instanceof Error ? error.message : 'Unknown error'
},
{ status: 500 }
);
}
}
How It Works
1. Request Handling: the route uses the POST method to receive data from the frontend. The url and query are extracted from the request body via the NextRequest object.
2. Input Validation: if either url or query is missing, the API responds with a 400 status and an error message.
3. Web Content Processing: an instance of the WebContentQuerier utility is created. The extractWebPageContent method fetches the content of the web page at the provided url, and the queryContent method runs the provided query against the extracted content to generate a meaningful result.
4. Error Handling: any errors encountered during the process are logged to the console, and the API responds with a 500 status that includes the error message for debugging.
5. Response: on successful execution, the API sends back the processed result (the answer and related context sources) as a JSON response.

Once the utility is implemented, you can exercise the route directly, as shown in the sketch below.
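With the dev server running, a minimal Node script can smoke-test the endpoint (assuming the default localhost:3000 and Node 18+ for the built-in fetch; this script is illustrative and not part of the app):

```typescript
// Hypothetical smoke test for the query-web-content route
async function main() {
  const res = await fetch('http://localhost:3000/api/query-web-content', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url: 'https://example.com',
      query: 'What is this page about?',
    }),
  });

  if (!res.ok) {
    const { message } = await res.json();
    throw new Error(`Request failed (${res.status}): ${message}`);
  }

  const { answer, context } = await res.json();
  console.log(answer); // the AI's answer
  console.log(`${context.length} context sources`); // related context chunks
}

main().catch(console.error);
```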
The next step is to implement the WebContentQuerier utility.
6. Writing the Core Logic
The core logic of our project is encapsulated in the WebContentQuerier class. This utility handles fetching web page content, processing and indexing it, and answering queries with context-based answers. Below is the complete implementation and a breakdown of its functionality.
File: utils/WebContentQuerier.ts
import { ChatOpenAI } from "@langchain/openai";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
import puppeteer from 'puppeteer-core';
import * as cheerio from 'cheerio';
import dotenv from 'dotenv';
// Load environment variables
dotenv.config();
export class WebContentQuerier {
private model: ChatOpenAI;
private embeddings: OpenAIEmbeddings;
private vectorStore: MemoryVectorStore | null;
constructor() {
// Initialize OpenAI model and embeddings
this.model = new ChatOpenAI({
openAIApiKey: process.env.OPENAI_API_KEY,
temperature: 0.7,
model: 'gpt-3.5-turbo',
});
this.embeddings = new OpenAIEmbeddings({
openAIApiKey: process.env.OPENAI_API_KEY,
});
this.vectorStore = null;
}
async extractWebPageContent(url: string) {
const BROWSER_WS = process.env.BROWSER_WS; // Scraping browser WebSocket URL
if (!BROWSER_WS) {
throw new Error("Scraping browser WebSocket URL is not defined in .env file.");
}
    let browser;
    try {
      console.log('Connecting to Scraping Browser...');
      // Connect to the Scraping Browser
      browser = await puppeteer.connect({
        browserWSEndpoint: BROWSER_WS,
      });
const page = await browser.newPage();
page.setDefaultNavigationTimeout(60000);
console.log(`Navigating to URL: ${url}`);
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
// Extract the content
const html = await page.content();
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
const pageText = $('body').text()
.replace(/\s+/g, ' ')
.trim();
// Create a document-like object
const docs = [{
pageContent: pageText,
metadata: { source: url }
}];
// Split the document into smaller chunks
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
const splitDocs = await textSplitter.splitDocuments(docs);
// Create vector store in memory
this.vectorStore = await MemoryVectorStore.fromDocuments(
splitDocs,
this.embeddings
);
      console.log(`Successfully extracted and indexed content from ${url}`);
      return splitDocs.length;
    } catch (error) {
      console.error('Error extracting web page content:', error);
      throw error;
    } finally {
      // Close the browser connection whether or not extraction succeeded
      if (browser) {
        await browser.close();
      }
    }
  }
async queryContent(query: string, maxResults: number = 4) {
if (!this.vectorStore) {
throw new Error(
'No content has been loaded. Call extractWebPageContent first.'
);
}
try {
// Perform similarity search
const relevantDocs = await this.vectorStore.similaritySearch(
query,
maxResults
);
// Use the relevant documents as context for the query
const context = relevantDocs.map((doc) => doc.pageContent).join('\n\n');
// Create chat messages with system and human messages
const messages = [
new SystemMessage(
'You are a helpful assistant that answers questions based strictly on the given context.'
),
new HumanMessage(
`Context:\n${context}\n\nQuery: ${query}\n\nProvide a concise and accurate answer based strictly on the context.`
),
];
// Generate an answer using the chat messages
const response = await this.model.invoke(messages);
return {
answer: response.content,
context: relevantDocs,
};
} catch (error) {
console.error('Error querying content:', error);
throw error;
}
}
// Optional method to clear the vector store
clearContent() {
this.vectorStore = null;
console.log('Vector store cleared');
}
}
export default WebContentQuerier;
Explanation
extractWebPageContent(url: string)
This method extracts and indexes the content from a web page.
- Scraping Content: connects to the Scraping Browser via Puppeteer using the WebSocket endpoint (BROWSER_WS), navigates to the URL, and retrieves the HTML content of the page.
- Content Parsing: uses Cheerio to extract and clean the textual content of the page ($('body').text()).
- Document Splitting: splits the text into smaller chunks using RecursiveCharacterTextSplitter to ensure better embeddings and search performance.
- Vector Store Creation: converts the split documents into embeddings using OpenAIEmbeddings and stores them in a memory-based vector store.
- Return Value: returns the number of chunks created.
queryContent(query: string, maxResults: number = 4)
This method answers user queries based on the indexed content.
- Similarity Search: uses the vector store to find the text chunks most relevant to the query.
- Context Generation: combines the content of the relevant documents into a single string that serves as the context for the query.
- Chat Completion: sends the context and query as system and human messages to the ChatOpenAI model. The system message instructs the AI to respond based only on the given context; you can modify the prompts for more accurate answers.
- Return Value: returns an object containing the AI’s answer and the relevant context documents.

A minimal end-to-end usage sketch of the class is shown below.
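This sketch assumes OPENAI_API_KEY and BROWSER_WS are set in your .env and that you run it with a TypeScript runner such as npx tsx; the URL and question are placeholders:

```typescript
import { WebContentQuerier } from './utils/WebContentQuerier';

async function main() {
  const querier = new WebContentQuerier();

  // Scrape the page, split it into chunks, and index them in memory
  const chunkCount = await querier.extractWebPageContent('https://example.com');
  console.log(`Indexed ${chunkCount} chunks`);

  // Ask a question against the indexed content
  const { answer, context } = await querier.queryContent('What is this page about?');
  console.log(answer);
  console.log(`Based on ${context.length} context chunks`);

  // Optionally clear the in-memory index when done
  querier.clearContent();
}

main().catch(console.error);
```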
7. Updating Layout and Page Files
The layout.tsx file defines the global layout for your application. Here’s the code for the layout.tsx file:
import type { Metadata } from 'next';
import { Inter } from 'next/font/google';
import './globals.css';
const inter = Inter({ subsets: ['latin'] });
export const metadata: Metadata = {
title: 'Web Content Querier',
description: 'Query web content using AI',
};
export default function RootLayout({
children,
}: {
children: React.ReactNode;
}) {
return (
<html lang="en">
<body className={`${inter.className} bg-gray-100`}>{children}</body>
</html>
);
}
The page.tsx file defines the content and functionality for the home (/) route of the app. This page calls the /api/query-web-content endpoint and renders the results.
'use client';
import { useState } from 'react';
import WebContentQuerierForm from './components/WebContentQuerierForm';
import QueryResults from './components/QueryResults';
export default function Home() {
const [queryResult, setQueryResult] = useState<{
answer: string;
// eslint-disable-next-line @typescript-eslint/no-explicit-any
context: any[];
} | null>(null);
const [isLoading, setIsLoading] = useState(false);
const [error, setError] = useState<string | null>(null);
const handleQuery = async (url: string, query: string) => {
setIsLoading(true);
setError(null);
setQueryResult(null);
try {
const response = await fetch('/api/query-web-content', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({ url, query }),
});
if (!response.ok) {
const errorData = await response.json();
throw new Error(errorData.message || 'Failed to fetch query results');
}
const data = await response.json();
setQueryResult(data);
} catch (err) {
setError(
err instanceof Error ? err.message : 'An unknown error occurred'
);
} finally {
setIsLoading(false);
}
};
return (
<div className="min-h-screen py-6 flex flex-col justify-center sm:py-12">
<div className="relative py-3 sm:max-w-xl sm:mx-auto">
<div className="absolute inset-0 bg-gradient-to-r from-cyan-400 to-light-blue-500 shadow-lg transform -skew-y-6 sm:skew-y-0 sm:-rotate-6 sm:rounded-3xl"></div>
<div className="relative px-4 py-10 bg-white shadow-lg sm:rounded-3xl sm:p-20">
<div className="max-w-md mx-auto">
<div className="divide-y divide-gray-200">
<div className="py-8 text-base leading-6 space-y-4 text-gray-700 sm:text-lg sm:leading-7">
<h1 className="text-3xl font-extrabold text-center text-gray-900 mb-6">
Web Content Querier
</h1>
<WebContentQuerierForm
onSubmit={handleQuery}
isLoading={isLoading}
/>
{error && (
<div
className="bg-red-100 border border-red-400 text-red-700 px-4 py-3 rounded relative"
role="alert"
>
{error}
</div>
)}
{isLoading && (
<div className="flex justify-center items-center">
<div className="animate-spin rounded-full h-8 w-8 border-b-2 border-gray-900"></div>
</div>
)}
{queryResult && (
<QueryResults
answer={queryResult.answer}
context={queryResult.context}
/>
)}
</div>
</div>
</div>
</div>
</div>
</div>
);
}
Wrapping Up
By combining AI and web scraping, you have created a powerful and efficient system for extracting and querying web content. This approach leverages the precision of AI models like OpenAI’s GPT-3.5-turbo alongside robust scraping tools like the Bright Data Scraping Browser to transform raw web data into meaningful insights. This can be useful in research and content analysis, customer support automation, and a lot more.
The code for this tutorial is available on GitHub; feel free to tweak and improve it for your own use case.