DEV Community

98IP 代理

How to use Scrapy and proxy IP to crawl data efficiently

Web crawlers are now a standard tool for enterprises and individuals who need to collect information from the Internet. Scrapy, an open-source crawler framework, is widely used for its efficiency and scalability. During data collection, however, frequent requests often trigger the target website's anti-crawler mechanisms and get the crawler's IP blocked. Combining Scrapy with proxy IPs is an effective way to address this. This article explains how to use Scrapy with proxy IPs for efficient data collection and provides practical code examples; 98IP Proxy is mentioned briefly as one optional proxy IP service.

I. Scrapy Framework Basics

1.1 Overview of Scrapy Architecture

The Scrapy framework consists of several core components: Spider, Item, Item Loader, Pipeline, Downloader Middlewares, and Extensions. A Spider defines the crawling logic and generates requests; an Item defines the structure of the crawled data; an Item Loader provides a convenient way to populate an Item; a Pipeline processes crawled Items, for example by cleaning and storing the data; Downloader Middlewares modify requests before they are sent and responses after they are downloaded; Extensions provide additional functions such as statistics and debugging tools.

1.2 Scrapy project creation and configuration

First, create a new Scrapy project with the scrapy startproject myproject command. Then create a new Python file in the project's spiders directory, define a Spider class in it, and write the crawling logic. Define the data structure to be crawled in items.py and the data processing flow in pipelines.py. Finally, run a Spider with the scrapy crawl spidername command, where spidername is the value of the Spider's name attribute.

II. Application of proxy IP in Scrapy

2.1 Why do you need a proxy IP?

To protect its data from malicious crawling, a target website usually sets up anti-crawler mechanisms such as IP blocking and CAPTCHA verification. A proxy IP hides the crawler's real IP address, and rotating through multiple proxy IPs spreads requests across addresses so that no single IP is blocked, which improves the success rate and efficiency of data collection.

2.2 Configure proxy IP in Scrapy

To use proxy IPs in Scrapy, we need a custom Downloader Middleware that sets request.meta['proxy'], the key that Scrapy's built-in HttpProxyMiddleware reads. The following is a simple example:

# middlewares.py
import random

class RandomProxyMiddleware:
    # Let's say we have a list containing proxy IP addresses
    PROXY_LIST = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        # ... More proxy IPs can be added
    ]

    def process_request(self, request, spider):
        # Randomly select a proxy IP from the list
        proxy = random.choice(self.PROXY_LIST)
        request.meta['proxy'] = proxy

Then, enable this Middleware in the settings.py file:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 543,
}

Note: The PROXY_LIST here is just an example. In real applications, we can use a third-party proxy IP service, such as 98IP Proxy, to obtain proxy IPs dynamically. 98IP Proxy provides a proxy IP pool and an API, which can simplify the configuration and management of proxy IPs.
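
As a rough sketch of what obtaining proxies dynamically could involve, the helper below turns a plain-text API response into the proxy URLs that request.meta['proxy'] expects. The response format (one host:port per line) and the function name parse_proxy_lines are assumptions made for illustration; the article does not describe the actual 98IP Proxy API, so check the provider's documentation for the real format.

```python
def parse_proxy_lines(text, scheme="http"):
    """Convert newline-separated ``host:port`` entries into proxy URLs.

    Assumes a hypothetical plain-text response format. Blank lines,
    malformed entries, and duplicates are skipped.
    """
    proxies = []
    seen = set()
    for line in text.splitlines():
        entry = line.strip()
        host, sep, port = entry.rpartition(":")
        # Keep only entries that look like host:port with a numeric port
        if not sep or not host or not port.isdigit():
            continue
        url = f"{scheme}://{host}:{port}"
        if url not in seen:
            seen.add(url)
            proxies.append(url)
    return proxies


# Documentation-range addresses used purely as sample input
sample = "203.0.113.10:8080\n\n203.0.113.11:3128\nnot-a-proxy\n203.0.113.10:8080\n"
print(parse_proxy_lines(sample))
```

The resulting list could be assigned to PROXY_LIST in the middleware above, with the HTTP call that fetches the text left to whichever client and endpoint the provider specifies.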

2.3 Rotation and error handling of proxy IP

To keep a single proxy IP from being blocked due to frequent use, we can implement rotation logic in the Middleware. To handle request failures (such as an invalid proxy IP or a timed-out response from the target website), we also need error handling logic. The following is an improved Middleware example:

# middlewares.py (Improved version)
import random
import time

from scrapy import signals
from scrapy.exceptions import NotConfigured


class ProxyRotatorMiddleware:
    # Fill this with proxy IPs obtained from a service such as 98IP Proxy;
    # while it is empty, the middleware disables itself via NotConfigured
    PROXY_LIST = []
    PROXY_POOL = set()  # Used to store currently available proxy IPs
    PROXY_ERROR_COUNT = {}  # For logging the number of errors per proxy IP
    MAX_ERRORS = 3  # Errors tolerated per proxy IP before it is removed

    def __init__(self, crawler):
        if not self.PROXY_LIST:
            raise NotConfigured('PROXY_LIST is empty')
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds the middleware through this hook; without it,
        # __init__ would not receive the crawler and spider_opened would never run
        middleware = cls(crawler)
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        self.refresh_proxy_pool()  # Refresh the proxy IP pool at the start of the crawler

    def process_request(self, request, spider):
        if not self.PROXY_POOL:
            self.refresh_proxy_pool()  # Refresh proxy IP pool when it is empty

        request.meta['proxy'] = random.choice(list(self.PROXY_POOL))

        # Record proxy IP usage time for subsequent rotation
        request.meta['proxy_used_at'] = time.time()

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            self.PROXY_ERROR_COUNT[proxy] = self.PROXY_ERROR_COUNT.get(proxy, 0) + 1
            if self.PROXY_ERROR_COUNT[proxy] > self.MAX_ERRORS:
                # Remove a proxy IP from the pool if it has more than three errors
                self.PROXY_POOL.discard(proxy)

        # Returning None lets Scrapy's built-in RetryMiddleware handle the retry.
        # process_exception runs in decreasing priority order, so register this
        # middleware above RetryMiddleware's 550 (and below HttpProxyMiddleware's 750)
        # if every failure should be counted before the request is rescheduled.
        return None

    def refresh_proxy_pool(self):
        # Get a new list of proxy IPs from services such as 98IP Proxy (code omitted);
        # as a placeholder, reload PROXY_LIST and reset the error counts
        self.PROXY_POOL = set(self.PROXY_LIST)
        self.PROXY_ERROR_COUNT = {}

In this improved Middleware, PROXY_POOL stores the currently available proxy IPs and PROXY_ERROR_COUNT records the number of errors for each proxy IP. The refresh_proxy_pool method obtains a new proxy IP list from a service such as 98IP Proxy; it is called when the crawler starts and whenever the pool is empty. The process_exception method handles request failures and removes a proxy IP from the pool once it has accumulated more than three errors.
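
The bookkeeping described above can also be isolated from Scrapy so that it is easy to test on its own. The sketch below is one possible arrangement rather than the article's middleware: ProxyPool, fetch_proxies, and max_errors are hypothetical names, and fetch_proxies stands in for whatever call retrieves proxies from your provider.

```python
import random


class ProxyPool:
    """Tracks usable proxies and evicts those that fail too often."""

    def __init__(self, fetch_proxies, max_errors=3):
        # fetch_proxies: hypothetical callable returning an iterable of proxy URLs
        self.fetch_proxies = fetch_proxies
        self.max_errors = max_errors
        self.pool = set()
        self.error_count = {}

    def refresh(self):
        """Reload the pool from the source and reset the error counters."""
        self.pool = set(self.fetch_proxies())
        self.error_count = {}

    def pick(self):
        """Return a random proxy, refreshing first if the pool is empty."""
        if not self.pool:
            self.refresh()
        if not self.pool:
            return None  # the source returned nothing
        return random.choice(sorted(self.pool))

    def report_failure(self, proxy):
        """Count one error for a proxy and evict it past the threshold."""
        self.error_count[proxy] = self.error_count.get(proxy, 0) + 1
        if self.error_count[proxy] > self.max_errors:
            self.pool.discard(proxy)


pool = ProxyPool(lambda: ["http://proxy1.example.com:8080",
                          "http://proxy2.example.com:8080"])
pool.refresh()
bad = "http://proxy1.example.com:8080"
for _ in range(4):  # four failures exceed the threshold of three
    pool.report_failure(bad)
print(pool.pick())  # only proxy2 is left in the pool
```

A middleware could then call pick() in process_request and report_failure() in process_exception, which keeps the Scrapy-specific code short.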

III. Efficient data crawling strategy

3.1 Concurrent requests and speed limits

Scrapy supports concurrent requests, but excessive concurrency puts heavy load on the target website and makes it more likely that the crawler will be blocked. We therefore need to set the number of concurrent requests and the download delay sensibly, using settings such as CONCURRENT_REQUESTS and DOWNLOAD_DELAY in settings.py.
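
As an illustration, a conservative configuration might look like the fragment below. The setting names are standard Scrapy settings, but the values are assumptions chosen for the example and should be tuned to the target website.

```python
# settings.py (illustrative values, not tuned for any particular site)
CONCURRENT_REQUESTS = 8              # total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap per target domain
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the wait between 0.5x and 1.5x of the delay

# AutoThrottle adjusts the delay based on observed response latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```

Starting low and raising the limits while watching the error rate is usually safer than starting high and being blocked.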

3.2 Data deduplication and denoising

During data collection, duplicate or noisy data may appear. To improve data quality, we can implement deduplication and denoising logic in a Pipeline. For example, we can use a set to store the IDs of items already captured so that duplicates can be dropped, or use regular expressions to strip irrelevant content.
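
A minimal sketch of such a Pipeline follows. The class name DedupCleanPipeline and the item fields id and title are hypothetical; the fallback DropItem class exists only so the example also runs where Scrapy is not installed.

```python
import re

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        pass

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


class DedupCleanPipeline:
    """Drops items whose id was already seen and tidies the title text."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item["id"]
        if item_id in self.seen_ids:
            raise DropItem(f"Duplicate item: {item_id}")
        self.seen_ids.add(item_id)
        # Strip leftover HTML tags, then collapse runs of whitespace
        text = TAG_RE.sub(" ", item["title"])
        item["title"] = SPACE_RE.sub(" ", text).strip()
        return item


pipeline = DedupCleanPipeline()
print(pipeline.process_item({"id": 1, "title": "<b>Hello</b>\n  world "}, None))
```

In a project the class would be enabled through ITEM_PIPELINES in settings.py. Note that the set lives in memory, so it only removes duplicates within a single crawl.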

3.3 Exception handling and logging

During data collection, various abnormal situations may occur, such as network errors or changes in the target website's structure. To discover and handle these problems promptly, we need exception handling logic in the code and detailed log records. Scrapy has built-in logging, and settings such as LOG_LEVEL and LOG_FORMAT control the level and format of log output.
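
For example, the logging and retry behavior could be configured as in the fragment below. The setting names are standard Scrapy settings; the values and the log file name crawl.log are assumptions for illustration.

```python
# settings.py (illustrative values)
LOG_LEVEL = "INFO"        # hide DEBUG noise; use "DEBUG" while developing
LOG_FILE = "crawl.log"    # hypothetical file name; omit to log to standard error
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"

# Retry transient failures a few times before giving up
RETRY_ENABLED = True
RETRY_TIMES = 3
```

Inside a Spider, self.logger is available for custom messages, so parsing problems caused by a changed page structure can be logged at WARNING or ERROR level and found later in the log file.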

IV. Conclusion

Using Scrapy together with proxy IPs for efficient data collection is a complex but interesting process. Configuring Scrapy's Downloader Middlewares sensibly, using a reliable proxy IP service (such as 98IP Proxy), implementing proxy IP rotation and error handling, and adopting the crawling strategies described above can substantially improve the success rate and efficiency of data collection. Data collection is also a sensitive area. Developers should comply with the relevant laws and regulations and with the target website's terms of use, and should respect the website's data rights and its users' privacy. They should likewise use proxy IPs responsibly and follow the proxy service's usage rules, so as to avoid legal problems or service bans caused by proxy abuse.