This blog was initially posted to Crawlbase Blog
Looking to grow your business? SuperPages is a great place to get valuable lead info. SuperPages is one of the largest online directories with listings of businesses across the US. With millions of businesses categorized by industry, location, and more, it’s a good place to find detailed info on potential customers or clients.
In this guide, we’ll show you how to scrape SuperPages to get business information. With Python and a few simple libraries, you can get business names, phone numbers, addresses, and more. This will give you a list of leads to expand your marketing or build partnerships.
Once we have the core scraper set up, we’ll also look into optimizing our results using Crawlbase Smart Proxy to ensure data accuracy and efficiency when handling larger datasets.
Why Scrape SuperPages for Leads?
SuperPages is a top US business directory with millions of listings across industries. Whether you’re in sales, marketing, or research, it has the information you need to create targeted lead lists for outreach. Listings range from small local businesses to national companies, and each entry includes a business name, address, phone number, and business category.
By scraping SuperPages, you can collect all this information in one place, save time on manual searching, and focus on reaching out to prospects. Instead of browsing page after page, you’ll have a structured dataset ready for analysis and follow-up.
Let’s dive in and see what information you can extract from SuperPages.
Key Data to Extract from SuperPages
When scraping SuperPages, you need to know what data to extract for lead generation. SuperPages has multiple pieces of data for each business, and by targeting specific fields, you can create a clean dataset for outreach and marketing purposes.
Here are some of the main data fields:
- Business Name: The primary identifier for each business so you can group your leads.
- Category: SuperPages categorizes businesses by industry, e.g., “Restaurants” or “Legal Services.” This will help you segment your leads by industry.
- Address and Location: Full address details, including city, state, and zip code, so you can target local marketing campaigns.
- Phone Number: Important for direct contact, especially if you’re building a phone-based outreach campaign.
- Website URL: Many listings have a website link, so you have another way to engage and get more info about each business.
- Ratings and Reviews: If available, this data can give you insight into customer sentiment and reputation so you can target businesses based on their quality and customer feedback.
With a clear idea of what to extract, we’re ready to set up our Python environment in the next section.
Setting Up Your Python Environment
Before we can start scraping SuperPages data, we need to set up the right Python environment. This includes installing Python, the necessary libraries, and an Integrated Development Environment (IDE) to write and run our code.
Installing Python and Required Libraries
First, make sure you have Python installed on your computer. You can download the latest version from python.org. After installation, you can test if Python is working by running this command in your terminal or command prompt:
python --version
Next, you will need to install the required libraries. For this tutorial, we will use Requests for making HTTP requests and BeautifulSoup for parsing HTML. You can install these libraries by running the following command:
pip install requests beautifulsoup4
These libraries will help you interact with SuperPages and scrape data from the HTML.
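To confirm everything installed correctly, you can run a quick import check:

```python
# Quick check that both libraries are importable
import requests
import bs4

print(requests.__version__, bs4.__version__)
```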
Choosing an IDE
To write and run your Python scripts, you need an IDE. Here are some options:
- VS Code: A lightweight code editor with good Python support and many extensions.
- PyCharm: A more full-featured Python IDE with code completion and debugging tools.
- Jupyter Notebook: An interactive environment for experimentation and visualization.
Choose the IDE that you prefer. Once your environment is set up, you're ready to start writing the code for scraping SuperPages listings.
Scraping SuperPages Listings
In this section, we’ll cover scraping SuperPages listings. This includes inspecting the HTML to find the selectors, writing the scraper, handling pagination to get data from multiple pages, and saving the data in a JSON file for easy access.
Inspecting HTML for Selectors
Before we start writing the scraper, we need to inspect the SuperPages listings page to find the HTML structure and CSS selectors that contain the data we want. Here’s how:
- Open the Listings Page: Go to a SuperPages search results page (e.g., search for “Home Services” in a location you’re interested in).
- Inspect the Page: Right-click on the page and select “Inspect” or press `Ctrl + Shift + I` to open Developer Tools.
- Find the Relevant Elements:
  - Business Name: The business name is in an `<a>` tag with the class `.business-name`, and within this `<a>`, the name itself is in a `<span>` tag.
  - Address: The address is in a `<span>` tag with the class `.street-address`.
  - Phone Number: The phone number is in an `<a>` tag with the classes `.phone` and `.primary`.
  - Website Link: If available, the business website link is in an `<a>` tag with the class `.weblink-button`.
  - Detail Page Link: The link to the business detail page is in an `<a>` tag with the class `.business-name`.
Look for any other data you want to extract, such as ratings or business hours. Now, you’re ready to write the scraper in the next section.
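If you want to double-check a selector before writing the full scraper, you can try it against a small snippet of HTML in BeautifulSoup. The fragment below is a simplified, made-up stand-in for a listing card, not SuperPages’ exact markup:

```python
from bs4 import BeautifulSoup

# Simplified, made-up fragment of a listing card, just to sanity-check selectors
sample_html = '''
<div class="result">
  <a class="business-name" href="/los-angeles-ca/bpp/example-business">
    <span>Example Business</span>
  </a>
  <span class="street-address">123 Example St, Los Angeles, CA 90001</span>
  <a class="phone primary" href="tel:5551234567">555-123-4567</a>
</div>
'''

soup = BeautifulSoup(sample_html, "html.parser")
print(soup.select_one("a.business-name span").text.strip())  # Example Business
print(soup.select_one("span.street-address").text.strip())   # 123 Example St, Los Angeles, CA 90001
print(soup.select_one("a.phone.primary").text.strip())       # 555-123-4567
```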
Writing the Listings Scraper
Now that we have the selectors, we can write the scraper. We’ll use `requests` to fetch the page and `BeautifulSoup` to parse the HTML and extract the data. Here’s the basic code to scrape listings:
```python
import requests
from bs4 import BeautifulSoup
import json

# Function to fetch listings from a single page
def fetch_listings(page_number):
    url = f"https://www.superpages.com/search?search_terms=Home%20Services&geo_location_terms=Los%20Angeles%2C%20CA&page={page_number}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    }
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        listings = []

        for business in soup.select("div.search-results > div.result"):
            name = business.select_one("a.business-name span").text.strip() if business.select_one("a.business-name span") else ""
            address = business.select_one("span.street-address").text.strip() if business.select_one("span.street-address") else ""
            phone = business.select_one("a.phone.primary").text.strip() if business.select_one("a.phone.primary") else ""
            website = business.select_one("a.weblink-button")["href"] if business.select_one("a.weblink-button") else ""
            detail_page_link = 'https://www.superpages.com' + business.select_one("a.business-name")["href"] if business.select_one("a.business-name") else ""

            listings.append({
                "name": name,
                "address": address,
                "phone": phone,
                "website": website,
                "detail_page_link": detail_page_link
            })

        return listings
    else:
        print("Failed to retrieve page.")
        return []
```
This code fetches data from a given page of results. It extracts each business’s name, address, phone number, website, and detail page link, and stores them in a list of dictionaries.
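Before adding pagination, you can give the function a quick test on a single page (assuming the selectors still match the live site):

```python
# Quick test: fetch the first results page and print how many listings were parsed
listings = fetch_listings(1)
print(f"Found {len(listings)} listings on page 1")
```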
Handling Pagination
To get more data, we need to handle pagination so the scraper can go through multiple pages. SuperPages changes the page number in the URL, so it’s easy to add pagination by looping through page numbers. We can create a function like the one below to scrape multiple pages:
```python
# Function to fetch listings from multiple pages
def fetch_all_listings(total_pages):
    all_listings = []

    for page in range(1, total_pages + 1):
        print(f"Scraping page {page}...")
        listings = fetch_listings(page)
        all_listings.extend(listings)

    return all_listings
```
Now, `fetch_all_listings()` will gather data from the specified number of pages by calling `fetch_listings()` repeatedly.
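If you’re scraping a lot of pages, it’s worth pausing between requests so you don’t hammer the site. Below is a minimal sketch of that tweak; the one-second delay is just an illustrative value you can adjust:

```python
import time

# Variant of fetch_all_listings() that pauses between pages
def fetch_all_listings_with_delay(total_pages, delay_seconds=1):
    all_listings = []
    for page in range(1, total_pages + 1):
        print(f"Scraping page {page}...")
        all_listings.extend(fetch_listings(page))
        time.sleep(delay_seconds)  # pause before requesting the next page
    return all_listings
```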
Saving Data in a JSON File
Once we’ve gathered all the data, it’s important to save it in a JSON file for easy access. Here’s how to save the data in JSON format:
```python
# Function to save listings data to a JSON file
def save_to_json(data, filename="superpages_listings.json"):
    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")
```
This code saves the data in a file named `superpages_listings.json`. Each entry will have the business name, address, phone number, website, and detail page link.
Complete Code Example
Below is the complete code that combines all the steps:
```python
import requests
from bs4 import BeautifulSoup
import json

# Function to fetch listings from a single page
def fetch_listings(page_number):
    url = f"https://www.superpages.com/search?search_terms=Home%20Services&geo_location_terms=Los%20Angeles%2C%20CA&page={page_number}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    }
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        listings = []

        for business in soup.select("div.search-results > div.result"):
            name = business.select_one("a.business-name span").text.strip() if business.select_one("a.business-name span") else ""
            address = business.select_one("span.street-address").text.strip() if business.select_one("span.street-address") else ""
            phone = business.select_one("a.phone.primary").text.strip() if business.select_one("a.phone.primary") else ""
            website = business.select_one("a.weblink-button")["href"] if business.select_one("a.weblink-button") else ""
            detail_page_link = 'https://www.superpages.com' + business.select_one("a.business-name")["href"] if business.select_one("a.business-name") else ""

            listings.append({
                "name": name,
                "address": address,
                "phone": phone,
                "website": website,
                "detail_page_link": detail_page_link
            })

        return listings
    else:
        print("Failed to retrieve page.")
        return []

# Function to fetch listings from multiple pages
def fetch_all_listings(total_pages):
    all_listings = []

    for page in range(1, total_pages + 1):
        print(f"Scraping page {page}...")
        listings = fetch_listings(page)
        all_listings.extend(listings)

    return all_listings

# Function to save listings data to a JSON file
def save_to_json(data, filename="superpages_listings.json"):
    with open(filename, "w") as file:
        json.dump(data, file, indent=4)
    print(f"Data saved to {filename}")

# Main function to run the complete scraper
def main():
    total_pages = 5  # Define the number of pages you want to scrape
    all_listings_data = fetch_all_listings(total_pages)
    save_to_json(all_listings_data)

# Run the main function
if __name__ == "__main__":
    main()
```
Example Output:
```json
[
    {
        "name": "Evergreen Cleaning Systems",
        "address": "3325 Wilshire Blvd Ste 622, Los Angeles, CA 90010",
        "phone": "213-375-1597Call Now",
        "website": "https://www.evergreencleaningsystems.com",
        "detail_page_link": "https://www.superpages.com/los-angeles-ca/bpp/evergreen-cleaning-systems-540709574?lid=1002188497939"
    },
    {
        "name": "Merry Maids",
        "address": "14741 Kittridge Street, Van Nuys, CA 91405",
        "phone": "818-465-8982Call Now",
        "website": "http://www.merrymaids.com",
        "detail_page_link": "https://www.superpages.com/van-nuys-ca/bpp/merry-maids-542022905?lid=1002108319143"
    },
    {
        "name": "Any Day Anytime Cleaning Service",
        "address": "27612 Cherry Creek Dr, Santa Clarita, CA 91354",
        "phone": "661-297-2702Call Now",
        "website": "",
        "detail_page_link": "https://www.superpages.com/santa-clarita-ca/bpp/any-day-anytime-cleaning-service-513720439?lid=1002021283815"
    },
    {
        "name": "Ultrasonic Blind Services",
        "address": "2049 Pacific Coast Hwy, Ste 217, Lomita, CA 90717",
        "phone": "424-257-6603Call Now",
        "website": "http://www.ultrasonicblindservices.com",
        "detail_page_link": "https://www.superpages.com/lomita-ca/bpp/ultrasonic-blind-services-514581803?lid=1002166431055"
    },
    .... more
]
```
Scraping SuperPages Business Details
After capturing the basic information from listings, it’s time to dig deeper into individual business details by visiting each listing’s dedicated page. This step will help you gather more in-depth information, such as operating hours, customer reviews, and additional contact details.
Inspecting HTML for Selectors
First, we’ll inspect the HTML structure of a SuperPages business detail page to identify where each piece of information is located. Here’s how:
- Open a Business Details Page: Click on any business name from the search results to open its details page.
- Inspect the Page: Right-click and choose “Inspect” or press `Ctrl + Shift + I` to open Developer Tools.
- Find Key Elements:
  - Business Name: Found in an `<h1>` tag with the class `.business-name`.
  - Operating Hours: Displayed in rows within a `.biz-hours` table, where each day’s hours are in a `<tr>` with `th.day-label` and `td.day-hours`.
  - Contact Information: Located in key-value pairs inside a `.details-contact` section, with each key in `<dt>` tags and each value in corresponding `<dd>` tags.
With these selectors identified, you’re ready to move to the next step.
Writing the Business Details Scraper
Now, let’s use these selectors in a Python script to scrape the specific details from each business page. First, we’ll make a request to each business detail page URL. Then, we’ll use BeautifulSoup to parse and extract the specific information.
Here’s the code to scrape business details from each page:
```python
import requests
from bs4 import BeautifulSoup
import json

def get_business_details(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the business name
    name = soup.find('h1', class_='business-name').text.strip() if soup.find('h1', class_='business-name') else ""

    # Extract operating hours in key-value pairs
    hours = {
        row.find('th', class_='day-label').text.strip(): row.find('td', class_='day-hours').text.strip()
        for row in soup.select('.biz-hours tr')
    }

    # Extract contact information as key-value pairs
    contact_info = {
        dt.text.strip().replace(':', ''): dd.text.strip()
        for dt, dd in zip(soup.select('.details-contact dt'), soup.select('.details-contact dd'))
    }

    # Store the details in a dictionary
    details = {
        'name': name,
        'hours': hours,
        'contact_info': contact_info
    }

    return details
```
Saving Data in a JSON File
To make it easier to work with the scraped data later, we’ll save the business details in a JSON file. This lets you store and access information in an organized way.
```python
def save_to_json(data, filename='business_details.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
```
Complete Code Example
Here’s the complete code that includes everything from fetching business details to saving them in a JSON file.
```python
import requests
from bs4 import BeautifulSoup
import json

def get_business_details(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the business name
    name = soup.find('h1', class_='business-name').text.strip() if soup.find('h1', class_='business-name') else ""

    # Extract operating hours in key-value pairs
    hours = {
        row.find('th', class_='day-label').text.strip(): row.find('td', class_='day-hours').text.strip()
        for row in soup.select('.biz-hours tr')
    }

    # Extract contact information as key-value pairs
    contact_info = {
        dt.text.strip().replace(':', ''): dd.text.strip()
        for dt, dd in zip(soup.select('.details-contact dt'), soup.select('.details-contact dd'))
    }

    # Store the details in a dictionary
    details = {
        'name': name,
        'hours': hours,
        'contact_info': contact_info
    }

    return details

def save_to_json(data, filename='business_details.json'):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)

def main():
    urls = [
        'https://www.superpages.com/los-angeles-ca/bpp/evergreen-cleaning-systems-540709574?lid=1002188497939',
        'https://www.superpages.com/van-nuys-ca/bpp/merry-maids-542022905?lid=1002108319143',
        # Add more business detail URLs here
    ]

    all_business_details = []

    for url in urls:
        business_details = get_business_details(url)
        all_business_details.append(business_details)

    # Save all details to a JSON file
    save_to_json(all_business_details)

if __name__ == '__main__':
    main()
```
Example Output:
```json
[
    {
        "name": "Evergreen Cleaning Systems",
        "hours": {
            "Mon - Fri": "7:00 am - 8:00 pm",
            "Sat": "7:00 am - 6:00 pm",
            "Sun": "Closed"
        },
        "contact_info": {
            "Phone": "Main - 213-375-1597",
            "Address": "3325 Wilshire Blvd Ste 622 Los Angeles, CA 90010",
            "Email": "Contact Us",
            "Link": "https://www.evergreencleaningsystems.com"
        }
    },
    {
        "name": "Merry Maids",
        "hours": {
            "Mon - Fri": "7:30 am - 5:30 pm",
            "Sat": "7:00 am - 3:00 pm"
        },
        "contact_info": {
            "Phone": "Main - 818-465-8982",
            "Address": "14741 Kittridge Street Van Nuys, CA 91405",
            "Email": "Contact Us",
            "Link": "http://www.merrymaids.com"
        }
    }
]
```
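In practice, you’ll usually feed the detail scraper from the listings you collected earlier instead of typing URLs by hand. Here’s a minimal sketch of that idea; it assumes you already ran the listings scraper above to produce `superpages_listings.json`, and it reuses `get_business_details()` and `save_to_json()` from this section:

```python
import json

# Load the listings scraped earlier and pull out their detail page links
with open("superpages_listings.json") as file:
    listings = json.load(file)

detail_urls = [item["detail_page_link"] for item in listings if item["detail_page_link"]]

# Visit each detail page and save the combined results
all_business_details = [get_business_details(url) for url in detail_urls]
save_to_json(all_business_details)
```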
Optimizing SuperPages Scraper with Crawlbase Smart Proxy
To make our SuperPages scraper more robust and faster, we can use Crawlbase Smart Proxy. Smart Proxy provides IP rotation and anti-bot protection, which is essential for avoiding rate limits and blocks during long data collection runs.
Adding Crawlbase Smart Proxy to our setup is easy. Sign up on Crawlbase and get an API token. We’ll use the Smart Proxy URL along with our token to make requests appear as though they’re coming from various locations. This will help us avoid detection and ensure uninterrupted scraping.
Here's how we can modify our code to use Crawlbase Smart Proxy:
```python
import requests
from bs4 import BeautifulSoup

# Replace _USER_TOKEN_ with your Crawlbase Token
proxy_url = 'http://_USER_TOKEN_:@smartproxy.crawlbase.com:8012'

def get_business_details(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
    }
    proxies = {"http": proxy_url, "https": proxy_url}
    response = requests.get(url=url, headers=headers, proxies=proxies, verify=False)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the business name
    name = soup.find('h1', class_='business-name').text.strip() if soup.find('h1', class_='business-name') else ""

    # Extract operating hours in key-value pairs
    hours = {
        row.find('th', class_='day-label').text.strip(): row.find('td', class_='day-hours').text.strip()
        for row in soup.select('.biz-hours tr')
    }

    # Extract contact information as key-value pairs
    contact_info = {
        dt.text.strip().replace(':', ''): dd.text.strip()
        for dt, dd in zip(soup.select('.details-contact dt'), soup.select('.details-contact dd'))
    }

    # Store the details in a dictionary
    details = {
        'name': name,
        'hours': hours,
        'contact_info': contact_info
    }

    return details
```
By routing our requests through Crawlbase, we add essential IP rotation and anti-bot measures that increase our scraper’s reliability and scalability. This setup is ideal for collecting large amounts of data from SuperPages without interruptions or blocks, keeping the scraper efficient and effective.
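The same proxy configuration works for the listings scraper too. One option is to move the proxied request into a small shared helper and call it from both `fetch_listings()` and `get_business_details()`; the helper below is a hypothetical sketch of that refactor, not part of the Crawlbase SDK:

```python
import requests

# Replace _USER_TOKEN_ with your Crawlbase Token (same value as above)
proxy_url = 'http://_USER_TOKEN_:@smartproxy.crawlbase.com:8012'
proxies = {"http": proxy_url, "https": proxy_url}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0"
}

# Hypothetical shared helper: scraper functions can call this instead of requests.get() directly
def fetch_via_smart_proxy(url):
    return requests.get(url, headers=headers, proxies=proxies, verify=False)
```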
Final Thoughts
In this blog, we covered how to scrape SuperPages to get leads. We learned to extract business data like names, addresses, and phone numbers. We used Requests and BeautifulSoup to create a simple scraper to get that data.
We also covered how to handle pagination to get all the listings on the site. By using Crawlbase Smart Proxy, we made our scraper more reliable and efficient so we don’t get blocked during data collection.
By following the steps outlined in this guide, you can build your scraper and start extracting essential data. If you want to do more web scraping, check out our guides on scraping other key websites.
📜 Scrape Costco Product Data Easily
📜 How to Scrape Houzz Data
📜 How to Scrape Tokopedia
📜 Scrape OpenSea Data with Python
📜 How to Scrape Gumtree Data in Easy Steps
If you have any questions or feedback, our support team is here to help you. Happy scraping!
Frequently Asked Questions
Q. How can I avoid being blocked while scraping SuperPages?
To avoid getting blocked, add delays between requests, limit the request frequency, and rotate IP addresses. Tools like Crawlbase Smart Proxy can simplify this process by rotating IP addresses for you so your scraper runs smoothly. Avoid making requests too frequently and follow good scraping practices.
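As a rough illustration, you could wrap each request in a small retry helper with a pause between attempts; the retry count and delay below are arbitrary placeholder values, not SuperPages-specific limits:

```python
import time
import requests

# Hypothetical helper: retry a request a few times, pausing between attempts
def polite_get(url, headers, retries=3, delay_seconds=2):
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response
        print(f"Attempt {attempt + 1} returned status {response.status_code}, retrying...")
        time.sleep(delay_seconds)
    return None
```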
Q. Why am I getting no results when trying to scrape SuperPages?
If your scraper is not returning results, check that your HTML selectors match the structure of SuperPages. Sometimes, minor changes in the website’s HTML structure require you to update your selectors. Also, make sure you’re handling pagination correctly if you’re trying to get multiple pages of results.
Q. How can I save the scraped data in formats other than JSON?
If you need your scraped data in other formats like CSV or Excel, you can modify the script easily. For CSV, use Python’s `csv` module to save data in rows. For Excel, the `pandas` library has a `.to_excel()` function that works well. This flexibility can help you analyze or share the data in a way that suits your needs.
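For example, here’s a minimal sketch that converts the list of dictionaries produced by `fetch_all_listings()` into CSV and Excel files with pandas (the Excel step needs the `openpyxl` package installed); the sample rows are placeholders:

```python
import pandas as pd

# listings would normally come from fetch_all_listings(); this row is a placeholder
listings = [
    {"name": "Example Business", "address": "123 Example St, Los Angeles, CA", "phone": "555-123-4567"},
]

df = pd.DataFrame(listings)
df.to_csv("superpages_listings.csv", index=False)     # CSV output
df.to_excel("superpages_listings.xlsx", index=False)  # Excel output (requires openpyxl)
```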