Web scraping real-world examples without using AI
Hello people,
I have been working extensively with APIs, and over the past few weeks I have written a couple of articles on AI agents for web scraping and on web scraping APIs, both paid and free; those are shared on the platform.
But today we will be talking about something else.
For the past 2 years we have been writing a weekly newsletter to our subscriber community, and at least 90% of the issues, if not all, contain web resource links: domains, tools, agents, GitHub repositories, YouTube videos, and so on.
In today's blog, we will go through the process of collecting the 200+ links from the newsletters we have already sent to our subscribers.
Introduction
Let's start with a quick introduction to the problem statement.
Problem: Collect all the useful and unique links, or more specifically the unique domains, from all the emails stored in the database.
Database: Firebase Firestore
Backend: Express/node.js
Quick solution: fetch the data from the database, extract the links from the email content, filter the unique links into a new set, and finally store all of them in a new database collection.
{
  id: "", // uuid string
  createdAt: {}, // Firebase Firestore server timestamp
  data: {}, // email body as editorjs output
  subject: "" // email subject
}
Each email document contains id as a string, createdAt as a Firestore server timestamp (an object with seconds and nanoseconds keys), data as the editorjs output holding the email body, and subject as the email's subject line.
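As an aside, if you ever need createdAt as a JavaScript Date on the backend, the raw seconds/nanoseconds shape can be converted by hand. A minimal sketch; the timestampToDate helper below is our own illustration, not part of firebase-admin:

```javascript
// Hypothetical helper: convert a Firestore-style timestamp object
// ({ seconds, nanoseconds }) into a JavaScript Date.
const timestampToDate = (ts) =>
  new Date(ts.seconds * 1000 + Math.floor(ts.nanoseconds / 1e6));

const createdAt = { seconds: 1672531200, nanoseconds: 0 };
console.log(timestampToDate(createdAt).toISOString()); // "2023-01-01T00:00:00.000Z"
```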
Once this is clear, we can move on to the data object itself.
const data = {
  time: 1672531200000, // timestamp of the last edit
  blocks: [
    {
      id: "header1",
      type: "header",
      data: { text: "Welcome to Editor.js", level: 2 }
    }
  ],
  version: "2.27.2" // editorjs version string
};
A good question to ask here is why our data object is so complicated. We use the editorjs module to build a Notion-style editor for writing emails; it feels nicer to write emails in a good editor, so we built one with editorjs.
The editorjs module returns the output data object shown above, with blocks as an array of content blocks, time as a timestamp, and version as a string.
Each block in the blocks array has id, type, and data keys, and the data key holds the content: a paragraph's inner text or inner HTML, an image source, a link URL and placeholder, a button label and redirect URL, and so on. If this seems hard to follow, keep reading to understand it better.
const blocks = [
  {
    id: "hINeYPKvBW",
    type: "paragraph",
    data: {
      text: "The first week of Feb 2023 begins, If you wonder why I always try to mention the date of the day in the letter is to remind you and me about the time & date regularly."
    }
  },
  {
    id: "hINeYPKvBW",
    type: "link",
    data: {
      link: "www.google.com"
    }
  }
];
The id is a random id generated by the editor, and the entire blocks object is generated by editorjs on the fly. So far we store that output directly in the database, which is not a good choice in terms of performance, but we have our reasons for keeping the database schema in the same shape as the frontend. Yes, at iHateReading we believe the frontend should dictate backend APIs and database schema more than backend developers should, which is a controversial statement, but never mind.
The sample editorjs data object below shows how each type of HTML DOM element is stored as a block in an array. This is why such modules are called block-style editors; the Notion app is built on the same concept.
{
  "time": 1672531200000,
  "blocks": [
    {
      "id": "header1",
      "type": "header",
      "data": {
        "text": "Welcome to Editor.js",
        "level": 2
      }
    },
    {
      "id": "paragraph1",
      "type": "paragraph",
      "data": {
        "text": "Editor.js is a block-style editor that provides flexibility in content creation. Here is an example of different blocks you can use."
      }
    },
    {
      "id": "image1",
      "type": "image",
      "data": {
        "file": {
          "url": "https://via.placeholder.com/800x400"
        },
        "caption": "This is a sample image block.",
        "withBorder": true,
        "stretched": false,
        "withBackground": false
      }
    },
    {
      "id": "link1",
      "type": "linkTool",
      "data": {
        "link": "https://editorjs.io",
        "meta": {
          "title": "Editor.js",
          "description": "A block-style editor for clean and structured content creation.",
          "image": {
            "url": "https://editorjs.io/images/og-image.png"
          }
        }
      }
    },
    {
      "id": "button1",
      "type": "button",
      "data": {
        "text": "Click Me",
        "url": "https://example.com"
      }
    },
    {
      "id": "header2",
      "type": "header",
      "data": {
        "text": "Getting Started",
        "level": 3
      }
    },
    {
      "id": "paragraph2",
      "type": "paragraph",
      "data": {
        "text": "You can add more blocks, rearrange them, and style them as per your needs. Start by exploring all the block options available in Editor.js!"
      }
    }
  ],
  "version": "2.27.2"
}
Once it's clear how our data is stored in the database, we can fetch it in the Node/Express app and get the links out of each email.
Fetching the emails is just a few lines of code using the Firestore SDK for Node.js, i.e. firebase-admin.
const newsletters = await admin
  .firestore()
  .collection("Emails")
  .orderBy("createdAt", "desc")
  .get();
const emails = newsletters.docs.map((item) => item.data());
Extracting links from Emails
The emails array contains the email objects; the next step is to convert each email's data object into an HTML string so we can extract the links from it.
How do we convert the data object into an HTML string? Brute force: loop over each block of the editor's data object, convert it into an HTML fragment, and append the fragments together at the end.
You could use an open-source package such as editorjs-html, but we also have custom blocks (such as Button), so we decided to write our own editorjs-to-HTML parser.
The parser simply runs over the editorjs blocks, converts each one into a DOM element in string format, and appends the pieces together to form the final HTML content string.
If that is still unclear, see the code below: it fetches all the emails from the database, then loops over each email and converts the editor's data object into an HTML string using the convertDataToHtml method.
const newsletters = await admin
  .firestore()
  .collection("Emails")
  .orderBy("createdAt", "desc")
  .get();
const emails = newsletters.docs.map((item) => item.data());

const links = new Set();
emails.forEach((email) => {
  const html = convertDataToHtml(email.data.blocks);
  // html now holds this email's content as an HTML string
});
The convertDataToHtml method takes the editorjs blocks array as input and parses it into an HTML string, again using a simple loop.
export const convertDataToHtml = (blocks) => {
  let convertedHtml = ``;
  if (blocks !== undefined && blocks !== null) {
    blocks.forEach((block) => {
      switch (block.type) {
        case "header":
          convertedHtml += `<h${block.data.level} style="font-weight:600; margin:20px 0px;">${block.data.text}</h${block.data.level}>`;
          break;
        case "embed":
          convertedHtml += `<div><iframe width="100%" height="400" src="${block.data.embed}" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></div>`;
          break;
        case "code":
          convertedHtml += `<pre style="background: #f4f4f4; padding: 10px; border-radius: 4px; overflow: auto;"><code style="font-family: 'Courier New', Courier, monospace; white-space: pre-wrap;">${block.data.code}</code></pre>`;
          break;
        case "paragraph":
          convertedHtml += `<p style="margin:20px 0px">${block.data.text}</p>`;
          break;
        case "delimiter":
          convertedHtml += `<div style="margin:auto">* * *</div>`;
          break;
        case "image":
          convertedHtml += `<img src="${block.data.file.url}" style="width: 100%; max-height: 600px; margin:20px 0px; object-fit:contain" /><div style="text-align:center">${block.data.caption}</div>`;
          break;
        case "checklist":
          convertedHtml += `<ul style="list-style-type:none; padding-left: 0;">`;
          block.data.items.forEach((item) => {
            const isChecked = item.checked ? "checked" : "";
            convertedHtml += `<li><input type="checkbox" ${isChecked}> ${item.text}</li>`;
          });
          convertedHtml += `</ul>`;
          break;
        case "list":
          convertedHtml += `<ul style="margin:10px">`;
          block.data.items.forEach((li) => {
            convertedHtml += `<li style="list-style:inside; margin:4px 0px;">${li}</li>`;
          });
          convertedHtml += `</ul>`;
          break;
        case "hyperlink":
          // quote the href value so URLs with spaces or special characters stay intact
          convertedHtml += `<a href="${block.data.link}" target="_blank" style="color:black; margin:4px 0px; font-weight:bold; text-decoration:underline;">${block.data.text}</a>`;
          break;
        case "link":
          convertedHtml += `<a href="${block.data.link}" style="color:black; margin:4px 0px; font-weight:bold; text-decoration:underline;">${block.data.text}</a>`;
          break;
        case "gist":
          convertedHtml += `<div class="gistcontainer" id="gist1"><script src="${block.data.gistLink}"></script></div>`;
          break;
        case "button":
          convertedHtml += `<div style="margin:10px 0px; cursor:pointer;"><a target="_blank" href="${block.data.link}"><button style="background:black; text-decoration: none; color: white; border-radius:4px; display:flex; justify-content: center; margin: auto; text-align:center; padding:10px; border:none">${block.data.text}</button></a></div>`;
          break;
        default:
          console.log("Unknown block type", block);
          break;
      }
    });
  }
  return convertedHtml;
};
Now it should make sense why the conversion matters and why we prefer a custom editorjs-to-HTML parser.
A good alternative to our approach is to store the HTML string directly in the database and render it as the email body. We do that as well, but to keep full control over each content object we prefer storing the editorjs output as it is.
Filtering Unique Links
The next part is to extract the links from one HTML string, repeat the same process for all emails, and return all the unique links.
An HTML string contains "a" tags, the link DOM elements that hold the URLs. In our case, we will use Cheerio, an npm library for parsing and manipulating HTML.
Cheerio is widely used for web scraping and reading HTML, so we prefer a module that is used and accepted in the industry.
Next, using Cheerio, we can extract all the links from the HTML string as shown below.
import { load } from "cheerio";

const extractLinks = (html) => {
  const $ = load(html);
  const links = [];
  $("a").each((index, element) => {
    const link = $(element).attr("href");
    if (link) {
      links.push(link.trim());
    }
  });
  return links;
};
This returns all the links present in an HTML string: we push each link into an array and return the final array.
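If you would rather avoid the Cheerio dependency, a regular expression can pull href values out of anchor tags. This is a rough sketch of our own that only handles double-quoted href attributes, so treat it as a fallback rather than a replacement:

```javascript
// Naive, dependency-free sketch: collect double-quoted href values
// from anchor tags. Cheerio remains the safer choice for messy HTML.
const extractLinksNaive = (html) => {
  const links = [];
  const re = /<a\b[^>]*\bhref="([^"]+)"/g;
  let match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1].trim());
  }
  return links;
};

console.log(
  extractLinksNaive('<p><a href="https://editorjs.io">Editor.js</a></p>')
); // [ 'https://editorjs.io' ]
```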
Finally, using a forEach loop, we iterate over each HTML string to extract the links and store them all in a Set. Why a Set over an array is a good question to ask; you can simply google the difference between Sets and arrays in JavaScript to see the reasoning behind our choice.
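As a quick illustration of that choice (the sample links here are made up):

```javascript
// A Set silently drops duplicates, so a link repeated across emails
// is stored only once; an array would keep every copy.
const links = new Set();
["https://editorjs.io", "https://editorjs.io", "https://example.com"].forEach(
  (link) => links.add(link)
);
console.log(links.size); // 2
```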
export const getUniqueLinksFromNewsletters = async (req, res) => {
  try {
    const newsletters = await admin
      .firestore()
      .collection("Emails")
      .orderBy("createdAt", "desc")
      .get();
    const emails = newsletters.docs.map((item) => item.data());
    const links = new Set();
    emails.forEach((email) => {
      const html = convertDataToHtml(email.data.blocks);
      extractLinks(html).forEach((link) => {
        try {
          // new URL throws for invalid or relative URLs, so only valid links survive
          new URL(link);
          links.add(link);
        } catch (e) {
          console.log(e, "error in link");
        }
      });
    });
    res.send([...links]);
  } catch (e) {
    console.log(e, "error");
    res.status(500).send("Error");
  }
};
Data Cleaning
Data cleaning is done by removing unwanted domains. How do we do that?
It should make sense by now: we use domain names alone to decide uniqueness among the links, and with that, our work is done.
How do you get the domain name from a URL? Consider that today's exercise, google it or ask ChatGPT.
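As a hint, Node's built-in URL class does the heavy lifting. Note that a bare link like www.google.com (no scheme) makes the constructor throw, which is exactly why the handlers above wrap the parsing in a try/catch:

```javascript
// Extract the domain (hostname) from a full URL using the built-in URL class.
const url = new URL("https://www.ihatereading.in/collection/universo");
console.log(url.hostname); // www.ihatereading.in

// A schemeless string such as "www.google.com" throws a TypeError instead,
// so callers should parse inside a try/catch.
```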
Using the Set's has method, we filter out the unwanted domains as well. For example, I removed all the domains backlinking to our own website ihatereading.in, among others, as shown below.
const unwantedDomains = new Set([
  "www.ihatereading.in",
  "4pn312wb.r.us-east-1.awstrack.me",
  "url6652.creators.gumroad.com",
  "ehhhejh.r.af.d.sendibt2.com",
  "tracking.tldrnewsletter.com",
]);

export const getUniqueLinksFromNewsletters = async (req, res) => {
  try {
    const newsletters = await admin
      .firestore()
      .collection("Emails")
      .orderBy("createdAt", "desc")
      .get();
    const emails = newsletters.docs.map((item) => item.data());
    const links = new Set();
    emails.forEach((email) => {
      const html = convertDataToHtml(email.data.blocks);
      extractLinks(html).forEach((link) => {
        try {
          const url = new URL(link);
          // keep the link only if its domain is not in the blocklist
          if (!unwantedDomains.has(url.hostname)) {
            links.add(link);
          }
        } catch (e) {
          console.log(e, "error in link");
        }
      });
    });
    res.send([...links]);
  } catch (e) {
    console.log(e, "error");
    res.status(500).send("Error");
  }
};
In this way, we fetched 600+ unique links from our 100+ emails: a collection of resources, all of them useful for research, reading, and learning, which we can sample or build on further.
Conclusion
For the past 2 years, we have been writing tech-savvy newsletters to our subscribers: weekly emails about programming news in frontend, AI, backend, jobs, and mobile app development.
Each email contains special product links, resources, cool tweets, or important YouTube video links that we can use for many research purposes.
We created a 4-step pipeline: fetch the emails, convert them into HTML strings, extract all the links from the HTML, and finally filter and return the unique list of links.
Collection: https://ihatereading.in/collection/universo
Originally published on ihatereading.in
I am consistently writing a lot about programming on our dedicated website iHateReading
That’s it for today, see you in the next one
Shrey