In the data-driven era, Python crawler technology has become an important means of obtaining network data. JSON (JavaScript Object Notation), a lightweight data exchange format, has become a popular choice for transmitting and storing network data because it is easy for humans to read and write and easy for machines to parse and generate. This article explores in depth how to crawl and parse JSON data with Python crawler technology, and how to combine it with 98IP proxy IPs to improve crawler stability and efficiency, with specific code examples throughout.
I. Python crawler basics
1.1 Introduction to Python crawler
A Python crawler is a web crawler program written in Python that can automatically access web pages, extract the required data, and save it locally or to a database. Python's rich libraries and tools, such as requests and json, make crawler development very convenient.
1.2 JSON data format
JSON is a text-based data exchange format that is easy for humans to read and write and easy for machines to parse and generate. It stores data as key-value pairs and can represent both simple values and complex nested data structures.
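For example, the following snippet uses a hypothetical weather payload to show a JSON document that mixes simple key-value pairs with a nested object, parsed into a Python dictionary by the built-in json module:
import json
# A hypothetical weather payload: simple key-value pairs plus a nested object
raw = '{"location": "London", "current": {"temp_c": 12.5, "condition": "Cloudy"}}'
data = json.loads(raw)             # parse the JSON text into a Python dict
print(data["current"]["temp_c"])   # access a nested value: 12.5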
II. Python crawler crawls JSON data
2.1 Determine the target website
First, you need to determine a target website that provides JSON data. This is usually an API interface that returns its data in JSON format. For example, we will use a hypothetical weather API interface.
2.2 Send HTTP request
Use Python's requests library to send HTTP requests to access the API interface of the target website.
import requests
# Target API Interface URL
url = 'https://api.exampleweather.com/v1/current.json?key=YOUR_API_KEY&q=London'
# Send a GET request
response = requests.get(url)
2.3 Process HTTP response
After sending the request, you need to process the HTTP response. If the response status code is 200, the request was successful and the response content can be parsed further.
# Check the response status code
if response.status_code == 200:
    # Parse the JSON data
    data = response.json()
    print(data)
else:
    print(f"Request failed with status code: {response.status_code}")
2.4 Parse JSON data
The JSON data in the HTTP response can be parsed with Python's json library, although the requests library already encapsulates this in the json() method, which can be called directly.
The above code already includes this step, that is, data = response.json().
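If you prefer to use the json module directly, the same result can be obtained by parsing the raw response text; a minimal equivalent sketch:
import json
# Equivalent to data = response.json(): parse the raw response body yourself
data = json.loads(response.text)
print(data)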
III. Combine 98IP proxy IP to improve crawler stability
3.1 Why do you need a proxy IP?
During crawling, frequent requests to the target website may cause your IP to be blocked. Using a proxy IP can get around this obstacle and improve the stability of the crawler.
3.2 Select 98IP proxy IP
98IP is a professional proxy IP service provider offering stable, efficient, and secure proxy IP services. After registering, users can obtain a proxy IP list.
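As a rough sketch, assuming the proxy list has been exported from the 98IP dashboard into a local text file with one ip:port per line (a hypothetical format, not the provider's documented one), it could be loaded like this:
# Hypothetical local file exported from the 98IP dashboard, one "ip:port" per line
def load_proxy_list(path='proxies.txt'):
    with open(path, encoding='utf-8') as f:
        # Prefix each entry with the scheme and skip empty lines
        return ['http://' + line.strip() for line in f if line.strip()]

proxy_list = load_proxy_list()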
3.3 Configure proxy IP
When using the requests library to send HTTP requests, you can configure the proxy IP by setting the proxies parameter.
# Assuming a list of proxy IPs obtained from 98IP
proxies = {
    'http': 'http://proxy_ip:port',
    'https': 'https://proxy_ip:port',
}
# Use a proxy IP when sending GET requests
response = requests.get(url, proxies=proxies)
Note: proxy_ip:port above must be replaced with the actual proxy IP address and port number.
3.4 Switch proxy IP
To avoid a single proxy IP being blocked, you can switch proxy IPs regularly. This can be achieved by writing a function that randomly selects an IP from the proxy IP list provided by 98IP.
import random

# Hypothetical proxy IP list
proxy_list = [
    'http://proxy1_ip:port',
    'http://proxy2_ip:port',
    # ... more proxy IPs
]

# Randomly select a proxy IP
def get_random_proxy():
    return random.choice(proxy_list)

# Get a random proxy IP and configure it
proxies = {
    'http': get_random_proxy(),
    'https': get_random_proxy(),  # Note: http and https usually use the same proxy IP, but they can differ.
}

# Use the random proxy IP when sending the request
response = requests.get(url, proxies=proxies)
Note: In actual applications, the proxy IP list needs to be updated regularly to ensure the validity of the proxy IP.
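One simple way to check validity is to send a test request through each proxy and keep only those that respond; a minimal sketch, using https://httpbin.org/ip as an assumed test URL:
import requests

def filter_valid_proxies(proxy_list, test_url='https://httpbin.org/ip'):
    valid = []
    for proxy in proxy_list:
        try:
            # A proxy that answers within 5 seconds is considered usable
            r = requests.get(test_url, proxies={'http': proxy, 'https': proxy}, timeout=5)
            if r.status_code == 200:
                valid.append(proxy)
        except requests.RequestException:
            pass  # Unreachable or blocked proxy: drop it
    return valid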
IV. Practical case: crawling JSON data from an API interface (with proxy IPs)
4.1 Target website analysis
Suppose we want to crawl an API interface that provides weather information and returns its data in JSON format.
4.2 Write crawler code
The following is a complete crawler code example that uses proxy IPs.
import requests
import random
import time

# Target API interface URL
url = 'https://api.exampleweather.com/v1/current.json?key=YOUR_API_KEY&q=London'

# Hypothetical proxy IP list (needs to be replaced with actual proxy IPs)
proxy_list = [
    'http://proxy1_ip:port',
    'http://proxy2_ip:port',
    # ... more proxy IPs
]

# Function to randomly select a proxy IP
def get_random_proxy():
    return random.choice(proxy_list)

# Crawler main function
def crawl_weather_data():
    while True:
        try:
            # Get a random proxy IP and configure it
            proxies = {
                'http': get_random_proxy(),
                'https': get_random_proxy(),
            }
            # Send a GET request
            response = requests.get(url, proxies=proxies, timeout=10)
            # Check the response status code
            if response.status_code == 200:
                # Parse the JSON data
                data = response.json()
                print(data)
                # Save data locally or to a database as needed
                break  # Assume we only need the data once, so exit the loop
            else:
                print(f"Request failed with status code: {response.status_code}, retrying...")
        # Catch possible exceptions such as network errors or proxy IP failures
        except requests.RequestException as e:
            print(f"Request exception: {e}, retrying...")
        # Wait for some time before retrying
        time.sleep(5)

# Run the crawler
crawl_weather_data()
4.3 Run the crawler and analyze the results
Running the above crawler code prints the crawled weather data to the console; the data can also be saved to a local file or database as needed.
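For example, the parsed dictionary can be written to a local JSON file; a minimal sketch, assuming data holds the parsed result (for instance, returned from crawl_weather_data or saved where the comment in the loop indicates):
import json
# Write the parsed weather data to a local JSON file
with open('weather_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)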
V. Precautions and best practices
5.1 Comply with laws and regulations
When using crawler technology, be sure to comply with relevant laws and regulations and the website's terms of use, and do not launch malicious attacks, infringe on others' privacy, or otherwise violate ethics or the law.
5.2 Reasonably set the request frequency
To avoid putting excessive load on the target website, set a reasonable request frequency and avoid overly frequent access. You can add appropriate delays between requests.
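For example, a small random delay between consecutive requests keeps the access pattern gentle; a minimal sketch:
import random
import time

# Sleep for a random interval between 2 and 5 seconds before the next request
def polite_delay(min_seconds=2, max_seconds=5):
    time.sleep(random.uniform(min_seconds, max_seconds))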
5.3 Regularly update the proxy IP list
Since proxy IPs may be blocked or become invalid, it is recommended to update the proxy IP list regularly to keep the crawler running stably. You can write a script that automatically obtains the latest proxy IP list from proxy IP service providers such as 98IP.
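A possible refresh script, assuming the provider exposes an HTTP extraction endpoint that returns one ip:port per line (the URL below is a placeholder, not 98IP's real API):
import requests

# Placeholder extraction endpoint; replace with the URL from your 98IP account
PROXY_API_URL = 'https://api.98ip.example.com/get?num=10'

def refresh_proxy_list():
    resp = requests.get(PROXY_API_URL, timeout=10)
    resp.raise_for_status()
    # Assume the endpoint returns plain text, one "ip:port" per line
    return ['http://' + line.strip() for line in resp.text.splitlines() if line.strip()]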
5.4 Capture and handle exceptions
When writing crawler code, it is recommended to use try-except statements to catch and handle possible exceptions and improve the robustness of the code. The code example above already includes exception handling.
VI. Conclusion
This article explained in detail how to crawl and parse JSON data with Python crawler technology, and how to combine it with 98IP proxy IPs to improve crawler stability and efficiency. With the practical case and code examples, readers can understand and master this technique more intuitively. I hope this article is helpful and proves useful in real projects.