Complete Guide to Scrapy Proxy IP Pool Integration


IP blocking is one of the most common challenges in large-scale crawling projects. A stable and efficient proxy IP pool lets crawlers hide their identity, spread requests across many source addresses, and avoid IP bans. This article explains how to integrate a proxy IP pool into Scrapy, covering dynamic proxy switching, proxy pool management, and quality checks, to help you improve the stability and success rate of your crawler.

Basic concepts of proxy IP

Proxy IPs are an important tool for getting past anti-crawling measures. The core principle is that the client's request is no longer sent directly to the target server but is forwarded through a proxy server, which hides the client's real IP.

Client → Proxy server → Target server
  ↓          ↓              ↓
Sends request → Forwards request → Returns response

Main categories of proxies

  • Classification by protocol: HTTP, HTTPS, SOCKS4, SOCKS5
  • Classification by degree of anonymity:
      • Transparent proxy: The target server can identify the real IP; not recommended for crawlers
      • Anonymous proxy: Hides the real IP, but tells the server that a proxy is in use
      • High-anonymity proxy: Completely hides the real IP, and the server cannot detect that a proxy is in use (highly recommended)

💡 For crawlers, a high-anonymity proxy is the most stable and safest choice.
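
One way to check which anonymity level a proxy actually provides is to send a request through it to an endpoint that echoes back what it received, then inspect the result. The sketch below assumes you already have the echoed source address and headers as plain Python values. The function name `classify_anonymity` and the list of proxy-revealing headers are illustrative assumptions, not a standard.

```python
import re

# Headers that commonly reveal a proxy (illustrative, not exhaustive).
PROXY_HEADERS = ("via", "x-forwarded-for", "forwarded", "proxy-connection")


def classify_anonymity(real_ip, seen_origin, seen_headers):
    """Classify a proxy from what the target server observed.

    real_ip      -- the client's own public IP
    seen_origin  -- the source address the server saw
    seen_headers -- the request headers as received by the server
    """
    headers = {k.lower(): v for k, v in seen_headers.items()}

    # Split header values into tokens so "11.2.3.45" does not match "1.2.3.4".
    tokens = set()
    for value in headers.values():
        tokens.update(re.split(r"[\s,;=]+", value))

    if seen_origin == real_ip or real_ip in tokens:
        return "transparent"       # the real IP leaked
    if any(name in headers for name in PROXY_HEADERS):
        return "anonymous"         # IP hidden, but proxy use is visible
    return "high-anonymity"        # no trace of the client or the proxy


print(classify_anonymity("1.2.3.4", "9.9.9.9", {"Via": "1.1 squid"}))
```

A real check would also have to discover `real_ip` (for example by calling the same echo endpoint without a proxy), which is left out here.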

Proxy IP type and selection

In real projects, proxy IPs can be obtained from several sources, each suited to different scenarios.

| Type | Advantages | Disadvantages | Applicable scenarios |
| --- | --- | --- | --- |
| Free proxy | Zero cost | Poor stability, slow speed, low survival rate | Small-scale testing, learning exercises |
| Paid proxy | Good stability, fast speed, reliable service | Ongoing cost | Commercial projects, large-scale continuous collection |
| Self-built proxy | Fully controllable, high security, flexible customization | High technical threshold, high initial cost | Long-term large-scale projects, high security requirements |

For most teams, paid proxies combined with a self-managed proxy pool are the most cost-effective option.

Basic proxy middleware implementation

In Scrapy, proxy switching is usually implemented through Downloader Middleware. Let's start with the simplest implementation and gradually build a usable proxy middleware.

Simple random proxy middleware

import random
import logging

class SimpleProxyMiddleware:
    """Simple random proxy middleware."""

    def __init__(self):
        self.proxies = [
            'http://proxy1.com:8080',
            'http://proxy2.com:8080',
            'http://proxy3.com:8080',
        ]
        self.logger = logging.getLogger(__name__)

    def process_request(self, request, spider):
        # Assign a random proxy only if the request does not already have one
        if 'proxy' not in request.meta:
            proxy = random.choice(self.proxies)
            request.meta['proxy'] = proxy
            self.logger.info(f"Assigned proxy {proxy} to {request.url}")
        return None

To enable it, add this middleware to DOWNLOADER_MIDDLEWARES in settings.py.
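
As a rough illustration, the registration might look like the following. The module path `myproject.middlewares` is a placeholder, and the priority 740 is just one reasonable choice: Scrapy's built-in HttpProxyMiddleware sits at 750 by default, and a lower number means `process_request` runs earlier.

```python
# settings.py
# "myproject.middlewares" is a placeholder; use your own project's module path.
DOWNLOADER_MIDDLEWARES = {
    # Run before the built-in HttpProxyMiddleware (priority 750) so that
    # request.meta['proxy'] is already set when the built-in middleware runs.
    "myproject.middlewares.SimpleProxyMiddleware": 740,
}
```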

Middleware that supports configuration and retry

The following middleware reads the proxy list from the Scrapy settings and retries failed requests a limited number of times.

import random
import logging

class ConfigurableProxyMiddleware:
    """Configurable proxy middleware with support for authentication and retries."""

    def __init__(self, proxy_list, retry_times=3):
        self.proxy_list = proxy_list
        self.retry_times = retry_times
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy list and retry limit from the Scrapy settings
        proxy_list = crawler.settings.getlist('PROXY_LIST', [])
        retry_times = crawler.settings.getint('PROXY_RETRY_TIMES', 3)
        return cls(proxy_list, retry_times)

    def process_request(self, request, spider):
        if request.meta.get('proxy'):
            return None  # a proxy is already set; leave it alone

        if self.proxy_list:
            proxy = random.choice(self.proxy_list)
            request.meta['proxy'] = proxy
            request.meta['download_timeout'] = 30
            self.logger.info(f"Assigned proxy {proxy} to {request.url}")
        return None

    def process_response(self, request, response, spider):
        # Log abnormal status codes returned through a proxy
        if response.status in [403, 404, 500]:
            proxy = request.meta.get('proxy')
            if proxy:
                self.logger.warning(f"Proxy {proxy} returned status {response.status}")
        return response

    def process_exception(self, request, exception, spider):
        # Called when a request raises an exception (e.g. the proxy is unreachable)
        proxy = request.meta.get('proxy')
        if proxy:
            self.logger.error(f"Proxy {proxy} raised an exception: {exception}")

        retry_times = request.meta.get('proxy_retry_times', 0)
        if retry_times < self.retry_times:
            new_request = request.copy()
            new_request.meta['proxy_retry_times'] = retry_times + 1
            # Drop the failed proxy so process_request picks a proxy again on retry;
            # otherwise the retry would reuse the proxy that just failed
            new_request.meta.pop('proxy', None)
            new_request.dont_filter = True  # bypass Scrapy's duplicate filter
            return new_request
        return None

Configuration example (settings.py):

PROXY_LIST = [
    'http://user:pass@proxy1.com:8080',
    'https://proxy2.com:8080',
]
PROXY_RETRY_TIMES = 3

Proxy pool management system

As the number of proxies grows, a simple list is no longer sufficient. We need a dedicated manager that tracks the quality and availability of each proxy and provides efficient access. The following example uses Redis to implement a high-performance proxy pool manager.

Redis-based proxy pool

import redis
import time

class RedisProxyPoolManager:
    """Redis-backed proxy pool manager."""

    def __init__(self, redis_host='localhost', redis_port=6379, redis_db=0):
        self.redis_client = redis.Redis(
            host=redis_host, port=redis_port, db=redis_db, decode_responses=True
        )
        self.pool_key = 'proxy_pool:available'   # sorted set: proxy URL -> score
        self.bad_key = 'proxy_pool:bad'          # set of retired proxies
        self.meta_prefix = 'proxy_pool:meta:'    # one metadata hash per proxy

    def add_proxy(self, proxy, proxy_type='http'):
        """Add a proxy to the pool with an initial score of 100."""
        # The proxy URL itself is the sorted-set member, so adding the same
        # proxy twice cannot create duplicates; nx=True keeps an existing score
        if self.redis_client.zadd(self.pool_key, {proxy: 100}, nx=True):
            self.redis_client.hset(self.meta_prefix + proxy, mapping={
                'type': proxy_type,
                'added_time': time.time(),
                'success_count': 0,
                'failure_count': 0,
            })
            self.redis_client.srem(self.bad_key, proxy)

    def get_proxy(self):
        """Return the proxy with the highest current score, or None."""
        proxies = self.redis_client.zrevrange(self.pool_key, 0, 0)
        return proxies[0] if proxies else None

    def mark_proxy_good(self, proxy):
        """Raise the score after a successful request."""
        self._update_proxy_score(proxy, 5)

    def mark_proxy_bad(self, proxy):
        """Lower the score after a failed request."""
        self._update_proxy_score(proxy, -20)

    def _update_proxy_score(self, proxy, delta):
        """Update a proxy's score, keeping it within 0-100."""
        score = self.redis_client.zscore(self.pool_key, proxy)
        if score is None:
            return  # unknown proxy, or already retired

        counter = 'success_count' if delta > 0 else 'failure_count'
        self.redis_client.hincrby(self.meta_prefix + proxy, counter, 1)

        new_score = max(0, min(100, score + delta))
        if new_score <= 0:
            # Retire the proxy: remove it from the pool and remember it as bad
            self.redis_client.zrem(self.pool_key, proxy)
            self.redis_client.sadd(self.bad_key, proxy)
        else:
            self.redis_client.zadd(self.pool_key, {proxy: new_score})

Design notes:

  • A Redis sorted set stores the proxies, with the score serving as a quantitative measure of proxy quality.
  • Successful proxies gain points and failed proxies lose points. Low-scoring proxies sink to the bottom of the ranking and drop out of use.
  • Each proxy is timestamped when it is inserted, which makes it possible to periodically clean out stale entries.
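
To make the scoring arithmetic concrete, here is a small in-memory sketch that uses the same numbers as the manager above (start at 100, +5 on success, -20 on failure, clamped to 0-100). It does not touch Redis, and the helper names are made up for illustration.

```python
INITIAL_SCORE = 100
REWARD = 5       # added after a successful request
PENALTY = -20    # added after a failed request


def apply_delta(score, delta):
    """Return the new score, clamped to the range 0-100."""
    return max(0, min(100, score + delta))


def failures_until_zero(start=INITIAL_SCORE):
    """Count the consecutive failures needed to bring a score down to 0."""
    score, failures = start, 0
    while score > 0:
        score = apply_delta(score, PENALTY)
        failures += 1
    return failures


print(failures_until_zero())                            # 5
print(apply_delta(100, REWARD))                         # 100 (already at the cap)
print(apply_delta(apply_delta(100, PENALTY), REWARD))   # 85
```

With these values a failure costs four times what a success earns, so a fresh proxy reaches zero after five consecutive failures and needs four successes to recover from a single one.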

Dynamic proxy switching strategy

Once you have a proxy pool, you also need an intelligent switching strategy so that the crawler automatically selects the best proxy and switches promptly when a proxy fails.

Intelligent proxy switching middleware

The following middleware tracks the success rate, response time, and number of consecutive failures of each proxy, calculates a dynamic score from these metrics, and then selects a proxy by weighted random choice.

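import random
import time
from collections import deque

class SmartProxySwitchMiddleware:
    """Proxy middleware that scores proxies and switches between them."""

    def __init__(self, proxy_list):
        # Seed statistics for every configured proxy so each one starts out
        # selectable (an empty stats table would never yield a proxy)
        self.proxy_stats = {
            proxy: {
                'success_count': 0,
                'failure_count': 0,
                'consecutive_failures': 0,
                'score': 100,
                'response_times': deque(maxlen=10)  # last 10 response times
            }
            for proxy in proxy_list
        }
        self.switch_threshold = 3   # consecutive failures before the penalty applies

    @classmethod
    def from_crawler(cls, crawler):
        # Reuse the PROXY_LIST setting shown earlier
        return cls(crawler.settings.getlist('PROXY_LIST', []))

    def process_request(self, request, spider):
        available = self._get_available_proxies()
        if available:
            selected = self._select_proxy(available)
            request.meta['proxy'] = selected
            # Keep our own copy of the choice: later middlewares may rewrite
            # meta['proxy'] (for example, to strip credentials from the URL)
            request.meta['smart_proxy'] = selected
            request.meta['request_start_time'] = time.time()
        return None

    def process_response(self, request, response, spider):
        proxy = request.meta.get('smart_proxy')
        if proxy in self.proxy_stats:
            stats = self.proxy_stats[proxy]
            if response.status == 200:
                stats['success_count'] += 1
                stats['consecutive_failures'] = 0
                if 'request_start_time' in request.meta:
                    rt = time.time() - request.meta['request_start_time']
                    stats['response_times'].append(rt)
            else:
                stats['failure_count'] += 1
                stats['consecutive_failures'] += 1

            self._update_proxy_score(proxy)
        return response

    def process_exception(self, request, exception, spider):
        # Timeouts and connection errors count as failures too
        proxy = request.meta.get('smart_proxy')
        if proxy in self.proxy_stats:
            stats = self.proxy_stats[proxy]
            stats['failure_count'] += 1
            stats['consecutive_failures'] += 1
            self._update_proxy_score(proxy)
        return None

    def _get_available_proxies(self):
        """Return the proxies whose score is at least 30."""
        return [p for p, s in self.proxy_stats.items() if s['score'] >= 30]

    def _select_proxy(self, available):
        """Pick a proxy at random, weighted by score."""
        scores = [self.proxy_stats[p]['score'] for p in available]
        total = sum(scores)
        if total <= 0:
            return random.choice(available)
        weights = [s / total for s in scores]
        return random.choices(available, weights=weights)[0]

    def _update_proxy_score(self, proxy):
        """Recompute the score from success rate, response time, and consecutive failures."""
        stats = self.proxy_stats[proxy]
        total_req = stats['success_count'] + stats['failure_count']
        success_rate = stats['success_count'] / max(1, total_req)

        # Success rate contributes up to 60 points
        success_score = success_rate * 60

        # Response time contributes up to 40 points (lower average is better)
        if stats['response_times']:
            avg_time = sum(stats['response_times']) / len(stats['response_times'])
        else:
            avg_time = 0
        time_score = max(0, 40 - (avg_time * 10))

        # Once the threshold is reached, deduct 10 points per consecutive
        # failure, up to 50; isolated failures below it carry no penalty
        if stats['consecutive_failures'] >= self.switch_threshold:
            failure_penalty = min(stats['consecutive_failures'] * 10, 50)
        else:
            failure_penalty = 0

        stats['score'] = max(0, success_score + time_score - failure_penalty)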

Key points of the strategy:

  • Weighted random selection avoids sending every request to the same "best" proxy, which reduces the risk of that proxy being blocked.
  • Response time affects the score, so proxies that respond too slowly are gradually abandoned.
  • Consecutive failures are penalized quickly, so failing proxies drop out of the available list fast.
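
The worked example below restates the scoring formula as a standalone function and then runs a weighted random draw over three hypothetical proxies. It is a sketch under one assumption: the consecutive-failure penalty only applies once the `switch_threshold` of 3 is reached, as the comment on that attribute suggests. The proxy names and statistics are invented for illustration.

```python
import random


def compute_score(success, failure, avg_time, consecutive_failures, threshold=3):
    """Score = success part (max 60) + speed part (max 40) - failure penalty (max 50)."""
    success_rate = success / max(1, success + failure)
    success_score = success_rate * 60
    time_score = max(0, 40 - avg_time * 10)
    if consecutive_failures >= threshold:
        penalty = min(consecutive_failures * 10, 50)
    else:
        penalty = 0
    return max(0, success_score + time_score - penalty)


scores = {
    # 90% success, 1.0 s average: 54 + 30 - 0, about 84
    "fast": compute_score(9, 1, avg_time=1.0, consecutive_failures=0),
    # 90% success, 3.5 s average: 54 + 5 - 0, about 59
    "slow": compute_score(9, 1, avg_time=3.5, consecutive_failures=0),
    # 50% success, 1.0 s average, 3 failures in a row: 30 + 30 - 30, about 30
    "flaky": compute_score(5, 5, avg_time=1.0, consecutive_failures=3),
}

# Weighted random draw: higher scores are picked more often, but not exclusively.
rng = random.Random(42)
names = list(scores)
picks = rng.choices(names, weights=[scores[n] for n in names], k=1000)
for name in names:
    print(name, round(scores[name]), picks.count(name))
```

Even the lowest-scoring proxy still receives a share of the traffic while it remains at or above the availability cutoff of 30, which is what keeps load from concentrating on a single proxy.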

Frequently Asked Questions and Best Practices

FAQ

  1. Proxy connection timeouts: Set a reasonable download_timeout (about 30 seconds is recommended) and combine it with the middleware's retry mechanism. When a timeout occurs, mark the proxy as failed and retry.

  2. Proxy IP blocked by the target website: Implement a proxy quality scoring system to eliminate low-quality proxies promptly. At the same time, use DOWNLOAD_DELAY, AutoThrottle, and similar mechanisms to control the request rate and avoid triggering anti-crawling measures.

  3. Proxies switching too frequently: Set a consecutive-failure threshold (e.g. switch_threshold = 3) so that a single accidental failure does not trigger a switch, which reduces unnecessary overhead.
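
The settings mentioned in these answers could be combined roughly as follows. The setting names are standard Scrapy settings except `PROXY_RETRY_TIMES`, which is the custom setting read by ConfigurableProxyMiddleware. All values are illustrative starting points, not tuned recommendations.

```python
# settings.py
DOWNLOAD_TIMEOUT = 30                   # seconds before a request counts as timed out
DOWNLOAD_DELAY = 1.0                    # base delay between requests to the same site

AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0           # upper bound when the site responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote site

PROXY_RETRY_TIMES = 3                   # custom: retry limit for the proxy middleware
```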

Best Practices

  • Small-scale crawlers: Use a small number of paid high-anonymity proxies directly; a complex proxy pool is unnecessary.
  • Medium-scale crawlers: Build a lightweight Redis-based proxy pool that combines quality checks with automatic elimination.
  • Large-scale crawlers: Build your own proxy pool cluster with intelligent routing, real-time monitoring, and automatic scaling.
  • Security: Verify the availability of every new proxy before it enters the pool; force requests carrying sensitive data to use HTTPS proxies.
  • Performance optimization: Reuse proxy connections (HTTP connection pooling) and use asynchronous requests to reduce the cost of establishing connections through the proxy.
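
The availability check mentioned under security can be sketched with the standard library alone. The names `fetch_via_proxy`, `is_proxy_usable`, and `filter_usable` are hypothetical, as is the test URL. The fetch function is passed in as a parameter so the check can be exercised without network access.

```python
import urllib.request


def fetch_via_proxy(proxy, url, timeout):
    """Fetch `url` through `proxy` and return the HTTP status code."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.status


def is_proxy_usable(proxy, test_url="https://example.com/", timeout=10,
                    fetch=fetch_via_proxy):
    """Return True if a request through `proxy` comes back with HTTP 200."""
    try:
        return fetch(proxy, test_url, timeout) == 200
    except Exception:
        # Connection errors, timeouts, and HTTP errors all mean "not usable"
        return False


def filter_usable(candidates, **kwargs):
    """Keep only the candidates that pass the availability check."""
    return [proxy for proxy in candidates if is_proxy_usable(proxy, **kwargs)]
```

Only proxies returned by `filter_usable` would then be handed to the pool's `add_proxy` method. In practice the checks would likely run concurrently, since each one can block for up to the full timeout.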

💡 Core Point: The proxy IP pool is the infrastructure for large-scale crawlers. Through reasonable management strategies and quality control, you can significantly improve the stability and success rate of crawlers and calmly deal with various anti-crawling challenges.
