As the need for data-driven decision making continues to grow, developers are turning to web scraping with Python tools like Beautiful Soup, Scrapy, and Selenium to efficiently extract information from both static and dynamic web pages. In this step-by-step tutorial, you will learn how to use popular libraries such as Requests and Beautiful Soup to scrape the data you need.
What is Web Scraping
Web scraping, at its core, is the process of automatically extracting large amounts of data from websites. Unlike traditional data gathering methods, web scraping leverages code to interact with web pages, mimicking how a human would browse a site, but with far greater efficiency and speed. Python, with its rich ecosystem of libraries such as BeautifulSoup, Scrapy, and Selenium, has become one of the most popular languages for web scraping due to its ease of use and flexibility.
The main objective of web scraping is to convert unstructured web data, often found in HTML or JavaScript formats, into structured formats like CSV, JSON, or databases, which can then be analyzed or integrated into various applications. This process is especially useful for industries like e-commerce, finance, market research, and SEO, where extracting and analyzing competitor data, tracking pricing trends, or gathering large datasets is crucial for decision-making.
Python Web Scraping 101
Web scraping is a must-have skill for extracting publicly available data from the web, and Python is one of the most popular languages for this job. In Python web scraping, you typically start by sending HTTP requests to a website, retrieving the HTML content, and then parsing it to extract the desired data.
Here are the basic steps involved:
1. Sending Requests
To begin scraping, you first need to send a request to the target website. This is commonly done using the Requests library in Python, which makes HTTP requests to web servers.
import requests
url = 'https://example.com'
response = requests.get(url)
print(response.content)
2. Parsing the Data
Once you've retrieved the webpage, the next step is parsing the HTML content to extract relevant data.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string
print(title)
3. Handling Dynamic Content
Many modern websites rely on JavaScript to load content dynamically, which means the data you want may not appear in the initial HTML source. For these cases, Selenium is a great tool, as it simulates a real browser and can interact with dynamic pages.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://example.com')
content = driver.page_source
With these few steps, you will be able to start basic web scraping. Next, we will give you a detailed introduction: How to Scrape a Website Using Python (Step-by-Step)
How to Scrape a Website Using Python (Step-by-Step)
To build a data scraping tool using Python, you need to download and install the following tools.
Python: https://www.python.org/downloads/ This is the core software for running Python. You can download the version we need from the official website, as shown in the figure below. However, it is recommended not to download the latest version. You can download 1-2 versions before the latest version.
Python IDE: Any IDE that supports Python will do, but we recommend PyCharm, an IDE development tool designed specifically for Python. For the PyCharm version, we recommend the free PyCharm Community Edition.
Pip: You can use the Python Package Index to install the libraries required to run your program with a single command.
Note: If you are a Windows user, don't forget to check the "Add python.exe to PATH" option in the installation wizard. This will allow Windows to use Python and commands in the terminal. Since Python 3.4 or later includes it by default, you don't need to install it manually.
Through the above steps, the environment for crawling data using Python has been set up. Next, we can use the pycharm we installed to crawl data on the website.
Step 1: Start PyCharm and select File > New Project... in the menu bar.
Step 2: Then, in the pop-up window, select Pure Python from the left menu and set up your project as shown below:
Note: In the red box below, select the Python installation path you downloaded in the first step of the environment configuration.
Step 3: You can create a project called python-scraper, check the option to create a main.py welcome script in the folder, and click the Create button. After a while of PyCharm setting up the project, you should see something like this:
Step 4: Then, right click to create a new Python file.
At present, our environment for crawling data has been set up. The next step is how to crawl the data we need on the web page.
1. Before actually crawling data, we must first do some preliminary work, such as understanding the URL, observing parameters, etc., and we can use the browser's built-in developer tools to learn and practice these points. Let's take Amazon as an example to crawl the price of a certain product and some other data we need.
https://www.amazon.com/s?k=computer&page=3
You can deconstruct any of them into two main parts:
- Base URL: The path to the store portion of the site. Here it's https://www.amazon.com/s?k=computer&page=3.
- Specific page location: The path to a specific product. The URL might end in .html, .php, or have no extension at all. The base URL is the same for all products on the site. The difference between each page is the second half of the URL, which contains a string that specifies which product page the server should return. Typically, URLs for the same type of page have a similar format overall.
In addition, URLs can also contain extra information:
- Path parameters: These are used to capture specific values in RESTful methods (for example, in https://www.example.com/users/14, 14 is a path parameter).
- Query parameters: These are added to the end of the URL after the question mark (?). They usually encode the filter values to be sent to the server when performing a search (for example, in https://www.example.com/search?search=blabla&sort=newest, search=blabla and sort=newest are query parameters). Note that any query parameter string contains the following:
- ?: This marks the beginning.
- key=value A list of parameters separated by &: key is the name of a parameter, while value shows its value. The query string contains parameters in key-value pairs separated by & characters.
In other words, URLs are more than just simple location strings of HTML documents. They can also contain parameter information that the server can use to run queries and fill pages with specific data.
In the example, 3 is the path parameter, and cpmputer is the value of the search query. This URL will instruct the server to run a paged search query and get all results containing the string cpmputer, and then return only the results from the third page.
2. Now you are familiar with the website. The next step is to delve into the HTML code of the page, studying its structure and content to understand how to extract data from it.
All modern browsers come with a set of advanced development tools, and most offer the same functionality. These tools enable you to explore the HTML code of a web page and work with it. In this Python web scraping tutorial, you'll see Chrome's DevTools in action.
Right-click an HTML element and select Inspect to open the DevTools window. If the right-click menu is disabled for a website, do the following:
- On macOS: View > Developer > Developer Select Tools in the menu bar.
- On Windows and Linux: Click the ⋮ menu button in the top-right corner and select the More Tools > Developer tools option.
They allow you to inspect the structure of the Document Object Model (DOM) of a web page. This in turn can help you gain a deeper understanding of the source code. In the DevTools section, enter the option Elements to access the DOM.
In Amazon, after opening the developer tools, we click "Elements" as shown in the figure above, and then use the arrow on the left side of "Elements" to click anywhere on the page where we want to get the data. In this way, the location of the data we need will be displayed in the HTML source code in "Elements". Then we can crawl data according to the tag where it is located.
If you find it difficult to understand the difference between DOM and HTML:
- HTML code represents the content of the web document written by the developer.
- DOM is a dynamic memory representation of HTML code created by the browser. In JavaScript, you can manipulate the DOM of the page to change its content, structure, and style.
3. Suppose you want to crawl data from the following location:
Example
https://www.amazon.com/s?k=computer&page=3
First, you need to retrieve the HTML code of the target page. In other words, you must download the HTML document associated with the page URL. To do this, use the Python requests library.
In the Terminal tab of your PyCharm project, launch the following command to install requests:
Terminal
pip install requests
Open the scraper.py file and initialize it with the following line of code:
scrape.py
import requests# download the HTML document# with an HTTP GET request
response = requests.get("https://www.amazon.com/s?k=computer&page=3")# print the HTML codeprint(response.text)
This code snippet imports the requests dependencies. It then uses the get() function to perform an HTTP GET request to the target page URL and returns a Python representation of the response containing an HTML document.
You can also get the complete Python request code by pasting the request URL of the page directly here: https://curlconverter.com/python/
import requests
cookies = {
'session-id': '140-2992726-6859939',
'session-id-time': '2082787201l',
'i18n-prefs': 'USD',
'ubid-main': '132-6184525-8448226',
'lc-main': 'en_US',
'skin': 'noskin',
}
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'no-cache',
'device-memory': '8',
'downlink': '1.5',
'dpr': '1',
'ect': '3g',
'pragma': 'no-cache',
'priority': 'u=0, i',
'rtt': '300',
'sec-ch-device-memory': '8',
'sec-ch-dpr': '1',
'sec-ch-ua': '"Not A(Brand";v="8", "Chromium";v="132", "Google Chrome";v="132"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-ch-ua-platform-version': '"10.0.0"',
'sec-ch-viewport-width': '1070',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
'viewport-width': '1070',
}
params = {
'k': 'computer',
'page': '3',
}
response = requests.get('https://www.amazon.com/s', params=params, cookies=cookies, headers=headers)
print(response.text)
In pycharm, add print(response.text) at the end of the code. After running the code, you will get an HTML page code as shown below.
⚠️ Common Mistake: Forgetting Error Handling Logic!
A GET request to a server can fail for a number of reasons. The server may be temporarily unavailable, the URL may be wrong, or your IP may have been blocked. So you may want to handle the error as follows:
response = requests.get("https://www.amazon.com/s?k=computer&page=3")
if the response is 2xx
if response.ok:scraping logic here...
else:log the error response
in case of 4xx or 5xx
print(response)
This way, the script will not crash if there is an error with the request, and will only continue running on a 2xx response.
4. Parse HTML content with Beautiful Soup
In the previous step, you retrieved an HTML document from the server. If you look at it, you'll see a long string of code, and the only way to understand it is to extract the required data through HTML parsing.
Beautiful Soup is a Python library for parsing XML and HTML content, which exposes an API to explore HTML code. In other words, it allows you to select HTML elements and easily extract data from them.
To install the library, launch the following command in your terminal:
pip install beautifulsoup4
Then, use it to parse the content requests retrieved in this way:
import requests
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup
# download the target page
response = requests.get("https://www.amazon.com/s?k=computer&page=3")
# parse the HTML content of the page
soup = BeautifulSoup(response.content, "html.parser")
The BeautifulSoup() constructor accepts some content and a string specifying the parser to use. "html.parser" instructs Beautiful Soup to use the HTML parser.
⚠️ Common mistake: Passing response.text instead of response.content to BeautifulSoup().
The content object's attributes hold the HTML data in the form of raw bytes of the response, which is easier to decode than the text representation stored in the text attribute. To avoid character encoding issues, it is best to use response.content.
response.textBeautifulSoup()
It should be noted that websites contain data in many formats. Single elements, lists, and tables are just a few examples. If you want your Python scraping tool to be effective, you need to know how to use Beautiful Soup in many situations.
As shown in the figure below, in the data we searched, if we want to get the data information of a certain product, we may need to parse it layer by layer from the parent tag, and then get the price, product name, etc.
[Picture]
Beautiful Soup provides a variety of methods to select HTML elements from the DOM, among which the id field is the most effective method to select a single element. As the name implies, the field id uniquely identifies the HTML node on the page. But this also depends on the design of the HMTL page of the website we want to crawl. If there is no id, then we may need to obtain the data we need through the content that is as unique as possible.
The data-asin attribute in the above figure represents the unique code of the product. By combining it with the role attribute, we can get the code in the HMTL corresponding to this product:
target_div = soup.find('div', attrs={'data-asin': 'B0DJXW94BL', 'role': 'listitem'})
If you have a tag with an id, you can write code like this:
product_search_element = soup.find(id="woocommerce-product-search-field-0")
We use "find()" to find tags to get the desired data. The detailed usage of "find()" method is as follows:
- By tag: Use the find() function without parameters:
scraper.py
# get the first <h1> element on the page
h1_element = soup.find("h1")
- By class: find() via the class_ parameter
scraper.py
# find the first element on the page with "search_field" class
search_input_element = soup.find(class_="search_field")
- By attributes: find() via attrs parameter
scraper.py
# find the first element on the page with the name="s" HTML attribute
search_input_element = soup.find(attrs={"name": "s"})
You can also retrieve HTML nodes using CSS selectors and select(): select_one()
# find the first element identified by the "input.search-field" CSS selector
search_input_element = soup.select_one("input.search-field")
On a text content HTML element, extract its text using the get_text() method:
h1_title = soup.select_one(".beta.site-title").getText()
print(h1_title)
⚠️ Common mistake: Not checking for None.
When find() and select_one() cannot find the desired element, they return None. Because pages change over time, you should always perform a non-None check, like this:
product_search_element = soup.find(id="woocommerce-product-search-field-0")
#making sure product_search_element is present on the pagebefore trying to access its data
if product_search_element is not None:
placeholder_string = product_search_element["placeholder"]
The above example allows us to find the HTML code of a certain product. If we want to return more precise data, such as the product title, we can use the same method to first find the tag where the product title is located, and then get it from the subtag of the product information we just obtained.
By observing that the product title is in the "span" tag, you can get the title text data by getting the "span" tag under the "h2" tag under the "a" tag:
if target_div:
a_tags = target_div.find_all('a')
for a_tag in a_tags:
h2_tag = a_tag.find('h2')
if h2_tag:
span_tag = h2_tag.find('span')
if span_tag:
print(span_tag.get_text())
else:
print("No matching div found.")
Note: There may be multiple "a", "h2" and other similar tag elements in different web pages, so it is best to use some relatively unique tag attributes to obtain data when using them.
The above is a small demo on how to use Python to crawl web page data. If you want to be more proficient and obtain relevant data more accurately, you may need a lot of practice, otherwise it may only waste our own time.
A simpler crawling method recommendation: Scrapeless Scraping API
While scraping with Python is a flexible and powerful method, it usually requires writing a lot of code and dealing with complex technical details. For those who want to simplify the scraping process, Scrapeless Scraping API provides a more convenient solution.
🚀 Through AI-driven functions, Scrapeless allows users to easily extract public data without writing complex code or dealing with issues such as proxy and IP management.
🌍 Whether you need to access more than 80 million real IPs or private data center IPs, Scrapeless can provide efficient and reliable data scraping services.
⚡ Its simple API interface makes integration very easy, and users can quickly start scraping tasks with simple configuration, saving a lot of development time and effort.
For users who want to avoid the complexity of traditional scraping methods, Scrapeless is undoubtedly a more convenient option worth recommending.
How to Use Scrapeless Scraping API for Data Extraction:
Step 1. Log into the Scrapeless Dashboard and go to "Amazon".
Step 2. According to your scraping requirements, enter the corresponding URL and set the corresponding Action, then click Start Scraping.
Step 3. Get the crawling results and export them.
You may also need:
How to Scrape Google Trends Data With Python?
How to Use BeautifulSoup for Web Scraping in Python
How to Scrape Amazon Search Result Data: Python Guide 2025
Seamlessly Integrate Scrapeless into Your Project
If you need to integrate Scrapeless into your project, you can also click to view the completed documentation.
Product
import http.client
import json
conn = http.client.HTTPSConnection("api.scrapeless.com")
payload = json.dumps({
"actor": "scraper.amazon",
"input": {
"url": "https://www.amazon.com/dp/B0BQXHK363",
"action": "product"
}
})
headers = {
'Content-Type': 'application/json'
}
conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Seller
import http.client
import json
conn = http.client.HTTPSConnection("api.scrapeless.com")
payload = json.dumps({
"actor": "scraper.amazon",
"input": {
"url": "",
"action": "seller"
}
})
headers = {
'Content-Type': 'application/json'
}
conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Keywords
import http.client
import json
conn = http.client.HTTPSConnection("api.scrapeless.com")
payload = json.dumps({
"actor": "scraper.amazon",
"input": {
"action": "keywords",
"keywords": "iPhone 12",
"page": "5",
"domain": "com"
}
})
headers = {
'Content-Type': 'application/json'
}
conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Best Practices for Web Scraping with Python
Web scraping with Python requires a balance between efficiency, ethics, and legal compliance. Here are a few best practices for web scraping Python projects:
1. Respect a site’s terms of service and robots.txt files
One of the most important aspects of web scraping is respecting a site’s terms of service and robots.txt files. Always check these files before scraping to make sure you’re not violating the site’s rules. This helps you avoid legal issues and ensures that your scraping activities are compliant with the website’s guidelines.
2. Minimize the load on the site’s server
Another best practice is to minimize the load on the site’s server. Scraping data too aggressively can overwhelm the server, negatively impacting the site’s performance, and may result in being blocked. When using web scraping Python tools, implement techniques like rate limiting and crawl delay to ensure that your scraping activities do not cause disruptions.
3. Ensure the data you collect is handled ethically and securely
Finally, ensure the data you collect is handled ethically and securely. Data privacy is more important than ever, and respecting the privacy of the users whose data you collect is critical. Adopting ethical web scraping Python practices will not only keep your operations compliant, but it will also build trust within the industry. Make sure to store and process the data in a secure manner, adhering to privacy regulations like GDPR.
Advanced Web Scraping Techniques
As websites become more dynamic and anti-scraping technology continues to upgrade, web scraping technology in 2025 requires developers to master a series of advanced skills to ensure efficiency while dealing with legal and ethical challenges. Here are several advanced techniques for modern web scraping:
- Use browser simulation and automation technology
Many websites now use browser automation technology to enhance the user experience, which also brings new challenges to data scraping. For example, by simulating user behavior to bypass certain anti-scraping measures. Using browser automation tools such as Puppeteer or Playwright can simulate user interactions, which is particularly useful for crawling content that requires interaction (such as verification codes or login processes). Compared with traditional static web scraping methods, this technology can better cope with the complexity of modern websites.
- Leverage proxy pools and IP rotation
In large-scale scraping projects, preventing being blocked by the target website is a long-standing challenge. To avoid IP blocking, developers usually use proxy pools to bypass anti-scraping mechanisms by rotating multiple IPs. Using an automation platform like Scrapeless can help manage a large number of proxies, reduce the risk of blocked IPs, and ensure the stability of data scraping.
- AI-driven intelligent data extraction
Artificial intelligence is driving data scraping technology forward. AI can not only help identify and extract specific data points, but also identify images, text, and structured data in complex pages. Machine learning algorithms can be used for image recognition and sentiment analysis to improve the precision and accuracy of scraping.
- Use APIs to improve scraping efficiency
For many companies, using professional crawler APIs is an efficient way to simplify the data scraping process. APIs such as Scrapeless can provide powerful proxy management), CAPTCHA bypass, and IP rotation functions, which can effectively reduce the burden of manual management while increasing the success rate of data scraping.
In summary, web scraping technology in 2025 will need to combine traditional scraping tools with more advanced technologies to cope with the challenges of dynamic websites, large-scale data extraction, and anti-crawling technologies. Using API solutions like Scrapeless can make the scraping process more efficient, reliable, and scalable, helping developers save time and reduce technical difficulty.
Web Scraping with Scrapeless -- in just 3 minutes
Don't bother with complicated configurations and anti-scraping measures anymore. With Scrapeless, you can easily manage proxies, bypass CAPTCHAs, and scale your scraping projects with ease. Start experiencing more reliable, fast, and secure data extraction than ever before. Try Scrapeless today and take your web scraping to the next level!
FAQs about Web Scraping with Python
1. Is web scraping legal?
The legality of web scraping depends on various factors, including the website's terms of service and local laws. It's crucial to review a site's robots.txt file and its acceptable use policy before scraping to ensure compliance. Unauthorized scraping may lead to legal consequences, so it's best to approach it ethically.
Additionally, using tools like Scrapeless ensures compliance with ethical and legal standards while scraping.
2. What are the key challenges in web scraping?
Web scraping can present several challenges, including dealing with anti-scraping measures like CAPTCHAs, IP blocking, and dynamic content loading. Websites may also have complex structures that make data extraction difficult.
3. How do I handle CAPTCHA while scraping websites?
Many websites use CAPTCHA to prevent automated scraping. To bypass CAPTCHA, you can integrate CAPTCHA-solving services or use tools like Selenium to mimic human-like behavior. However, for a more streamlined experience, platforms like Scrapeless often provide features to manage CAPTCHA automatically, reducing the complexity of scraping such sites.
Conclusion: Web Scraping with Python
As the demand for web scraping explodes, web scraping with Python remains one of the most important means.
However, as scraping becomes increasingly complex due to more advanced anti-bot measures, the need for smarter, more efficient solutions is obvious. This is where tools like Scrapeless come into play. By leveraging Scrapeless's cutting-edge scraping API, you can streamline the scraping process, bypass CAPTCHAs, and avoid common obstacles that typically slow down manual coding. With Scrapeless, you can scale your scraping projects with minimal effort while maintaining high data accuracy.
Don't let scraping challenges slow you down - explore Scrapeless today and take your web scraping to the next level.
Want to simplify your web scraping process?
Get started with Scrapeless today! Enjoy a free trial and discover how our powerful tools can save you time and effort. No coding skills required—just efficient, hassle-free scraping!
Top comments (0)