98IP Proxy

Building a Web Crawler with Python: Extracting Data from Web Pages

A web crawler, also known as a web spider, is an automated program that traverses web pages on the Internet to collect and extract data. Thanks to its concise syntax, rich library ecosystem, and active community, Python has become a preferred language for building web crawlers. This article walks you through building a simple web crawler in Python from scratch to extract data from web pages. Along the way, we cover in detail how to deal with anti-crawler mechanisms, and mention 98IP Proxy as one possible solution.

I. Environment Preparation

1.1 Install Python

First, make sure that Python is installed on your computer. Python 3 is recommended, as it offers better performance and broader library support. You can download the installer for your operating system from the official Python website.

1.2 Install Necessary Libraries

The following Python libraries are usually required to build a web crawler:

  • requests: sends HTTP requests.
  • BeautifulSoup (beautifulsoup4): parses HTML documents and extracts data.
  • pandas: handles data processing and storage (optional).
  • Standard libraries such as time and random: add delays and randomize requests to help circumvent anti-crawler mechanisms.

You can install the third-party libraries with pip (Python's package manager):

pip install requests beautifulsoup4 pandas

II. Writing the Crawler Program

2.1 Sending HTTP Requests

Use the requests library to send HTTP requests to obtain web page content:

import requests

url = 'http://example.com'  # Replace with the URL of the target page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}  # Set a User-Agent header to emulate a browser
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.text
else:
    print(f"Request failed with status code: {response.status_code}")

2.2 Parsing HTML Documents

Use BeautifulSoup to parse HTML documents and extract required data:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_content, 'html.parser')

# Example: extract the text content of all <h1> heading tags
titles = soup.find_all('h1')
for title in titles:
    print(title.get_text())
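
Headings are just one example. As a small additional sketch (the CSS class name below is a made-up illustration, not from any particular site), you can extract link text and URLs, or target specific elements with a CSS selector, in the same way:

# Extract the text and href attribute of every <a> link on the page
for link in soup.find_all('a', href=True):
    print(link.get_text(strip=True), '->', link['href'])

# Use a CSS selector to target specific elements (the class name is hypothetical)
for item in soup.select('div.article-summary'):
    print(item.get_text(strip=True))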

2.3 Dealing with Anti-Crawler Mechanisms

To protect their data, many websites adopt anti-crawler measures such as IP blocking and CAPTCHA verification. To work around these mechanisms, you can try the following methods:

  • Set request headers: emulate a browser by including fields such as User-Agent and Accept, as shown in the example above.
  • Use proxy IPs: send requests through a proxy server to hide your real IP address. Services such as 98IP Proxy provide large pools of proxy IPs that can help you get around IP blocking.

Example of using 98IP Proxy:
First, obtain the proxy IP address and port provided by 98IP Proxy. Then, when sending a request, pass this proxy information to the requests library.

proxy_ip = '203.0.113.10'  # Replace with a proxy IP provided by 98IP (placeholder value)
proxy_port = '8080'        # Replace with the matching port (placeholder value)

proxies = {
    'http': f'http://{proxy_ip}:{proxy_port}',   # Proxy used for plain HTTP requests
    'https': f'http://{proxy_ip}:{proxy_port}',  # Proxy used for HTTPS requests; use the https:// scheme here only if the proxy endpoint itself speaks HTTPS
}

# Note: in practice you may need to obtain multiple proxy IPs from the 98IP service and rotate through them to avoid a single IP being blocked.
# You also need to handle proxy failures, for example by catching an exception and switching to another valid proxy IP (see the sketch below).

response = requests.get(url, headers=headers, proxies=proxies)
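
Building on the note in the code above, here is a minimal sketch of rotating through a small pool of proxies and retrying when one fails. The proxy addresses, ports, and retry count are hypothetical placeholders rather than values supplied by 98IP:

import random
import requests

# Hypothetical proxy pool; in practice these entries would come from the 98IP service
proxy_pool = [
    {'http': 'http://203.0.113.10:8080', 'https': 'http://203.0.113.10:8080'},
    {'http': 'http://203.0.113.11:8080', 'https': 'http://203.0.113.11:8080'},
]

def fetch_with_proxy(url, headers, max_retries=3):
    """Try the request through randomly chosen proxies, retrying when one fails."""
    for _ in range(max_retries):
        proxy = random.choice(proxy_pool)
        try:
            resp = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.exceptions.RequestException:
            continue  # this proxy failed or timed out; try another one
    return None  # give up after max_retries attempts

A real crawler would also drop proxies that fail repeatedly and refresh the pool from the proxy service as needed.
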
  • Delay requests: add a random delay between requests to simulate human browsing behavior (see the sketch after this list).
  • Handle CAPTCHAs: for CAPTCHA verification, you can consider OCR (optical character recognition) or a third-party CAPTCHA-recognition service. Note, however, that repeated attempts to bypass CAPTCHAs may violate a website's terms of use.
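
A minimal sketch of the random-delay idea; the URL list and the 1–3 second range are arbitrary examples:

import random
import time

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # abbreviated; use a full User-Agent string as in section 2.1
urls = ['http://example.com/page1', 'http://example.com/page2']  # hypothetical list of target pages

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    # ... parse and store the page here ...
    time.sleep(random.uniform(1, 3))  # wait 1-3 seconds before the next request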

III. Data Storage and Processing

3.1 Storing Data

You can store the extracted data in local files, databases, or cloud storage services. Here is an example of saving data to a CSV file:

import pandas as pd

# Assuming you have extracted the required data and stored it in a list
data = [
    {'title': 'Heading 1', 'content': 'Content 1'},
    {'title': 'Heading 2', 'content': 'Content 2'},
    # ...
]

df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)  # Save to CSV file
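
Besides CSV files, the same records could go into a local database. Below is a minimal sketch using Python's built-in sqlite3 module; the database file, table, and column names are arbitrary examples:

import sqlite3

# Reuses the `data` list from the snippet above
conn = sqlite3.connect('output.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (title TEXT, content TEXT)')
conn.executemany(
    'INSERT INTO pages (title, content) VALUES (?, ?)',
    [(item['title'], item['content']) for item in data]
)
conn.commit()
conn.close()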

3.2 Processing Data

Use libraries such as pandas to further process and analyze the data: cleaning, transformation, aggregation, and so on.
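
As a small example of what such processing might look like, the snippet below loads the CSV produced earlier, cleans it, and computes a simple aggregate; the content_length column is an invented illustration:

import pandas as pd

df = pd.read_csv('output.csv')

# Cleaning: drop duplicate rows and rows with a missing title
df = df.drop_duplicates().dropna(subset=['title'])

# Transformation: add a column with the length of each content field
df['content_length'] = df['content'].str.len()

# Aggregation: average content length per title
print(df.groupby('title')['content_length'].mean())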

IV. Summary and Outlook

This article introduced how to build a simple web crawler in Python to extract data from web pages. We covered the key steps: environment preparation, writing the crawler program, dealing with anti-crawler mechanisms (including using 98IP Proxy as one possible solution), and data storage and processing. By combining libraries such as requests, BeautifulSoup, and pandas, you can build and run web crawlers efficiently.

However, web crawling is not a static skill. As website structures become more complex and anti-crawler techniques keep evolving, you will need to keep learning new tools and methods to meet new challenges. Please also make sure to comply with relevant laws and regulations and with each website's terms of use, and respect others' intellectual property and data privacy. If you run into difficulties processing large-scale data or building complex crawler systems, consider seeking professional technical support or consulting services.
