Complete Guide to Scrapy Proxy IP Pool Integration


IP bans are among the most common obstacles in large-scale crawling projects. A stable, efficient proxy IP pool lets a crawler mask its own address and spread requests across many source IPs, which reduces the chance of being banned. This article walks through integrating a proxy IP pool into Scrapy, covering dynamic proxy switching, proxy pool management, and proxy quality checks, to help you improve the stability and success rate of your crawler.

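Table of contents

Basic concepts of proxy IP
Proxy IP type and selection
Basic proxy middleware implementation
Proxy pool management system
Dynamic proxy switching strategy
Frequently Asked Questions and Best Practices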

Basic concepts of proxy IP

A proxy IP is a key tool for getting past anti-crawling measures. The principle is simple: instead of sending a request directly to the target server, the client sends it to a proxy server, which forwards it. The target server therefore sees the proxy's IP rather than the client's real IP.

Client → Proxy server → Target server
  ↓        ↓        ↓
Send request → Forward request → Return response

Main categories of proxies

By protocol: HTTP, HTTPS, SOCKS4, SOCKS5
By anonymity level:
Transparent proxy: the target server can still see the real IP; not recommended for crawlers
Anonymous proxy: hides the real IP but reveals to the server that a proxy is in use
High-anonymity proxy: hides the real IP and sends no proxy-identifying headers, so the server generally cannot tell that a proxy is in use (highly recommended)

💡 For crawlers, a high-anonymity proxy is usually the most stable and safest choice.
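
One way to make the three anonymity levels concrete is to look at the headers the target server actually receives. The sketch below assumes a hypothetical `classify_anonymity` helper and assumes you have already fetched a header-echo page through the proxy and parsed the echoed headers into a dict. The header names it checks are common conventions, not a complete detection method.

```python
# Headers that proxies commonly add (an assumed, non-exhaustive list)
PROXY_MARKERS = ('via', 'x-forwarded-for', 'forwarded', 'proxy-connection')

def classify_anonymity(echoed_headers, real_ip):
    """Classify a proxy from the headers the target server saw.

    echoed_headers: dict of header name -> value as received by the server
    real_ip: the client's real public IP address
    """
    headers = {name.lower(): str(value) for name, value in echoed_headers.items()}
    # Transparent: the real IP leaks through one of the header values
    if any(real_ip in value for value in headers.values()):
        return 'transparent'
    # Anonymous: the real IP is hidden, but a proxy header is present
    if any(marker in headers for marker in PROXY_MARKERS):
        return 'anonymous'
    # High anonymity: nothing in the headers hints at a proxy
    return 'high-anonymity'
```

The substring check on the IP is deliberately simple; a stricter version would parse each header value before comparing addresses.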

Proxy IP type and selection

In real projects, proxy IPs can come from several sources, and each suits different scenarios.

Type	Advantages	Disadvantages	Applicable scenarios
Free proxies	Zero cost	Poor stability, slow, low survival rate	Small-scale testing, learning exercises
Paid proxies	Stable, fast, reliable service	Ongoing cost	Commercial projects, large-scale continuous collection
Self-built proxies	Fully controllable, high security, flexible customization	High technical barrier, high upfront cost	Long-term large-scale projects, high security requirements

For most teams, paid proxies managed through a self-built proxy pool are the most cost-effective combination.

Basic proxy middleware implementation

In Scrapy, proxy switching is usually implemented through Downloader Middleware. Let's start with the simplest implementation and gradually build a usable proxy middleware.

Simple random proxy middleware

import random
import logging

class SimpleProxyMiddleware:
    """Simple random proxy middleware."""

    def __init__(self):
        self.proxies = [
            'http://proxy1.com:8080',
            'http://proxy2.com:8080',
            'http://proxy3.com:8080',
        ]
        self.logger = logging.getLogger(__name__)

    def process_request(self, request, spider):
        # Assign a random proxy only if the request does not already have one
        if 'proxy' not in request.meta:
            proxy = random.choice(self.proxies)
            request.meta['proxy'] = proxy
            self.logger.info(f"Assigned proxy {proxy} to {request.url}")
        return None

To enable it, add this middleware to the DOWNLOADER_MIDDLEWARES setting in settings.py.
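
As a rough illustration, the settings entry might look like the fragment below. The module path `myproject.middlewares` is a placeholder, and the priority 350 is an assumed value chosen so that this middleware's `process_request` runs before Scrapy's built-in HttpProxyMiddleware (priority 750 in the default middleware table).

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; point this at wherever the class actually lives
    'myproject.middlewares.SimpleProxyMiddleware': 350,
}
```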

Middleware that supports configuration and retry

The following middleware reads its proxy list from the Scrapy settings and retries a failed request a limited number of times.

import random
import logging

class ConfigurableProxyMiddleware:
    """Configurable proxy middleware with support for authenticated proxies and retries."""

    def __init__(self, proxy_list, retry_times=3):
        self.proxy_list = proxy_list
        self.retry_times = retry_times
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy list and the retry limit from the Scrapy settings
        proxy_list = crawler.settings.getlist('PROXY_LIST', [])
        retry_times = crawler.settings.getint('PROXY_RETRY_TIMES', 3)
        return cls(proxy_list, retry_times)

    def process_request(self, request, spider):
        if request.meta.get('proxy'):
            return None  # A proxy is already set; leave the request alone

        if self.proxy_list:
            proxy = random.choice(self.proxy_list)
            request.meta['proxy'] = proxy
            request.meta['download_timeout'] = 30
            self.logger.info(f"Assigned proxy {proxy} to {request.url}")
        return None

    def process_response(self, request, response, spider):
        # Log abnormal status codes returned through a proxy
        if response.status in [403, 404, 500]:
            proxy = request.meta.get('proxy')
            if proxy:
                self.logger.warning(f"Proxy {proxy} returned status {response.status}")
        return response

    def process_exception(self, request, exception, spider):
        # Called when a request sent through a proxy raises an exception
        proxy = request.meta.get('proxy')
        if proxy:
            self.logger.error(f"Proxy {proxy} request failed: {exception}")

        retry_times = request.meta.get('proxy_retry_times', 0)
        if retry_times < self.retry_times:
            new_request = request.copy()
            # Drop the failed proxy so process_request picks a new one on retry;
            # otherwise the retry would reuse the proxy that just failed
            new_request.meta.pop('proxy', None)
            new_request.meta['proxy_retry_times'] = retry_times + 1
            new_request.dont_filter = True  # Bypass Scrapy's duplicate filter
            return new_request
        return None

Configuration example (settings.py):

PROXY_LIST = [
    'http://user:pass@proxy1.com:8080',
    'https://proxy2.com:8080',
]
PROXY_RETRY_TIMES = 3
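
The first entry embeds credentials in the proxy URL. To show what that means on the wire, the sketch below splits such a URL into a credential-free proxy address and a Basic `Proxy-Authorization` value. The helper name `split_proxy_credentials` is hypothetical, and Scrapy's built-in HttpProxyMiddleware normally performs this step for you when credentials appear in `request.meta['proxy']`, so the code is illustrative only.

```python
import base64
from urllib.parse import urlsplit, unquote

def split_proxy_credentials(proxy_url):
    """Return (proxy_url_without_credentials, proxy_authorization_value_or_None)."""
    parts = urlsplit(proxy_url)
    if parts.username is None:
        return proxy_url, None
    user = unquote(parts.username)
    password = unquote(parts.password or '')
    # Basic auth is base64("user:password")
    token = base64.b64encode(f'{user}:{password}'.encode('utf-8')).decode('ascii')
    host = parts.hostname
    if parts.port:
        host = f'{host}:{parts.port}'
    return f'{parts.scheme}://{host}', f'Basic {token}'
```

For the first entry in the example list, this yields the address `http://proxy1.com:8080` together with the header value `Basic dXNlcjpwYXNz`.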

Proxy pool management system

Once the number of proxies grows, a plain list is no longer enough. A dedicated manager is needed to track each proxy's quality and availability and to hand out proxies efficiently. The following example uses Redis to implement a proxy pool manager.

Redis-based proxy pool

import redis
import json
import time

class RedisProxyPoolManager:
    """Redis-backed proxy pool manager."""

    def __init__(self, redis_host='localhost', redis_port=6379, redis_db=0):
        self.redis_client = redis.Redis(
            host=redis_host, port=redis_port, db=redis_db, decode_responses=True
        )
        self.pool_key = 'proxy_pool:available'   # Sorted set of available proxies
        self.bad_key = 'proxy_pool:bad'          # Set of eliminated proxies

    def add_proxy(self, proxy, proxy_type='http'):
        """Add a proxy to the pool with an initial score of 100."""
        proxy_info = {
            'proxy': proxy,
            'type': proxy_type,
            'added_time': time.time(),
            'success_count': 0,
            'failure_count': 0,
            'score': 100
        }
        # The sorted-set score is the proxy's current quality score
        self.redis_client.zadd(self.pool_key, {json.dumps(proxy_info): proxy_info['score']})

    def get_proxy(self):
        """Return the proxy with the highest current score, or None if the pool is empty."""
        proxies = self.redis_client.zrevrange(self.pool_key, 0, 0, withscores=True)
        if proxies:
            proxy_info = json.loads(proxies[0][0])
            return proxy_info['proxy']
        return None

    def mark_proxy_good(self, proxy):
        """The proxy succeeded: raise its score."""
        self._update_proxy_score(proxy, 5)

    def mark_proxy_bad(self, proxy):
        """The proxy failed: lower its score."""
        self._update_proxy_score(proxy, -20)

    def _update_proxy_score(self, target_proxy, delta):
        """Adjust a proxy's score, clamped to 0-100; a proxy that reaches 0 is eliminated."""
        # Linear scan: members are JSON blobs, so they cannot be looked up by proxy URL
        all_proxies = self.redis_client.zrange(self.pool_key, 0, -1, withscores=True)
        for proxy_str, score in all_proxies:
            proxy_info = json.loads(proxy_str)
            if proxy_info['proxy'] == target_proxy:
                new_score = max(0, min(100, score + delta))
                # The JSON member embeds the score, so the old record is removed first
                self.redis_client.zrem(self.pool_key, proxy_str)
                if new_score <= 0:
                    # Score exhausted: move the proxy to the bad set instead of re-adding it
                    self.redis_client.sadd(self.bad_key, target_proxy)
                else:
                    proxy_info['score'] = new_score
                    self.redis_client.zadd(self.pool_key, {json.dumps(proxy_info): new_score})
                break

Design notes:

A Redis sorted set stores the proxies, with the score serving as a quantitative measure of proxy quality.
Successful proxies gain points and failed proxies lose points. A proxy whose score falls too low sinks to the bottom of the ranking and is effectively phased out.
Each record carries an added_time timestamp set on insertion, which makes it possible to periodically clean out proxies that have been in the pool for a long time.
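
The periodic cleanup mentioned in the last point is not implemented in the manager above. A minimal sketch of the selection step follows, assuming the JSON member format written by `add_proxy`. The helper name `find_stale_members` and the age limit are hypothetical, and because `added_time` records insertion rather than last use, this selects by age in the pool.

```python
import json
import time

def find_stale_members(members, max_age_seconds, now=None):
    """Return the sorted-set members whose added_time is older than max_age_seconds.

    members: iterable of JSON strings as stored in the proxy pool sorted set
    """
    now = time.time() if now is None else now
    stale = []
    for member in members:
        info = json.loads(member)
        # Records without a timestamp are treated as fresh
        if now - info.get('added_time', now) > max_age_seconds:
            stale.append(member)
    return stale
```

A scheduled job could pass the result of `zrange(pool_key, 0, -1)` to this function and, if the returned list is non-empty, remove those members with `zrem(pool_key, *stale)`.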

Dynamic proxy switching strategy

With a proxy pool in place, the crawler still needs a switching strategy so that it automatically favors the best proxies and moves away from a proxy promptly when it fails.

Intelligent proxy switching middleware

The following middleware tracks each proxy's success rate, response time, and number of consecutive failures, computes a dynamic score from these metrics, and then picks a proxy by score-weighted random selection.

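import random
import time
from collections import defaultdict, deque

class SmartProxySwitchMiddleware:
    """Smart proxy switching middleware."""

    def __init__(self, proxy_list=None):
        self.proxy_stats = defaultdict(lambda: {
            'success_count': 0,
            'failure_count': 0,
            'consecutive_failures': 0,
            'score': 100,
            'response_times': deque(maxlen=10)  # Last 10 response times
        })
        # Consecutive-failure threshold; not enforced by the scoring below,
        # which applies a penalty on every consecutive failure
        self.switch_threshold = 3
        # Register the known proxies with the default score; without this the
        # stats table starts empty and no proxy would ever be selected
        for proxy in proxy_list or []:
            _ = self.proxy_stats[proxy]

    @classmethod
    def from_crawler(cls, crawler):
        # Assumes the same PROXY_LIST setting used by ConfigurableProxyMiddleware
        return cls(crawler.settings.getlist('PROXY_LIST', []))

    def process_request(self, request, spider):
        available = self._get_available_proxies()
        if available:
            selected = self._select_proxy(available)
            request.meta['proxy'] = selected
            request.meta['request_start_time'] = time.time()
        return None

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            stats = self.proxy_stats[proxy]
            if response.status == 200:
                stats['success_count'] += 1
                stats['consecutive_failures'] = 0
                if 'request_start_time' in request.meta:
                    rt = time.time() - request.meta['request_start_time']
                    stats['response_times'].append(rt)
            else:
                stats['failure_count'] += 1
                stats['consecutive_failures'] += 1

            self._update_proxy_score(proxy)
        return response

    def process_exception(self, request, exception, spider):
        # Connection errors and timeouts never reach process_response,
        # so count them as failures here
        proxy = request.meta.get('proxy')
        if proxy:
            stats = self.proxy_stats[proxy]
            stats['failure_count'] += 1
            stats['consecutive_failures'] += 1
            self._update_proxy_score(proxy)
        return None

    def _get_available_proxies(self):
        """Return the proxies whose score is at least 30."""
        return [p for p, s in self.proxy_stats.items() if s['score'] >= 30]

    def _select_proxy(self, available):
        """Pick a proxy at random, weighted by score."""
        scores = [self.proxy_stats[p]['score'] for p in available]
        total = sum(scores)
        if total <= 0:
            return random.choice(available)
        weights = [s / total for s in scores]
        return random.choices(available, weights=weights)[0]

    def _update_proxy_score(self, proxy):
        """Recompute the score from success rate, response time, and consecutive failures."""
        stats = self.proxy_stats[proxy]
        total_req = stats['success_count'] + stats['failure_count']
        success_rate = stats['success_count'] / max(1, total_req)

        # Success rate contributes up to 60 points
        success_score = success_rate * 60

        # Response time contributes up to 40 points (lower average time scores higher)
        if stats['response_times']:
            avg_time = sum(stats['response_times']) / len(stats['response_times'])
        else:
            avg_time = 0
        time_score = max(0, 40 - (avg_time * 10))

        # Consecutive-failure penalty: 10 points per failure, capped at 50
        failure_penalty = min(stats['consecutive_failures'] * 10, 50)

        stats['score'] = max(0, success_score + time_score - failure_penalty)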

Key points of the strategy:

Weighted random selection keeps all requests from going to the single "best" proxy, which lowers the risk of that proxy being blocked.
Response time feeds into the score, so proxies that respond too slowly are gradually phased out.
Consecutive failures are penalized quickly, so a failing proxy drops out of the available list fast.
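
To make the scoring rule easier to check by hand, the sketch below restates the formula from `_update_proxy_score` as a standalone function. The name `compute_score` and its argument list are introduced here for illustration only.

```python
def compute_score(success_count, failure_count, consecutive_failures, response_times):
    """Standalone restatement of the scoring formula used by the middleware."""
    total = success_count + failure_count
    success_rate = success_count / max(1, total)
    success_score = success_rate * 60                       # up to 60 points
    avg_time = sum(response_times) / len(response_times) if response_times else 0
    time_score = max(0, 40 - avg_time * 10)                 # up to 40 points
    failure_penalty = min(consecutive_failures * 10, 50)    # capped at 50 points
    return max(0, success_score + time_score - failure_penalty)
```

For a proxy with 8 successes, 2 failures, no current failure streak, and an average response time of 1.5 seconds, the parts are 0.8 × 60 = 48 and 40 − 15 = 25, for a score of about 73. A brand-new proxy that fails twice in a row with no successes gets 0 + 40 − 20 = 20, which is already below the availability cutoff of 30.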

Frequently Asked Questions and Best Practices

FAQ

Proxy connection timeouts: Set a reasonable download_timeout (about 30 seconds is recommended) and combine it with the middleware's retry mechanism. When a timeout occurs, mark the proxy as failed and retry.
The proxy IP is blocked by the target website: Use a proxy quality scoring system to eliminate low-quality proxies promptly. At the same time, control the request rate with DOWNLOAD_DELAY, AutoThrottle, and similar mechanisms to avoid triggering anti-crawling measures.
Proxy switching is too frequent: Set a consecutive-failure threshold (e.g. switch_threshold = 3) so that a single accidental failure does not trigger a switch, which reduces unnecessary overhead.
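
The timeout and rate-control settings mentioned above could be collected in settings.py roughly as follows. The setting names are standard Scrapy settings; the values are assumed starting points to tune per target site, not recommendations from measurement.

```python
# settings.py
DOWNLOAD_TIMEOUT = 30                    # seconds before a download is treated as timed out
DOWNLOAD_DELAY = 1                       # base delay in seconds between requests to one site
AUTOTHROTTLE_ENABLED = True              # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 1             # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30              # upper bound on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0    # average parallel requests per remote site
```

The per-request `download_timeout` key in `request.meta`, as set by the configurable middleware earlier, overrides the global DOWNLOAD_TIMEOUT for that request.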

Best Practices

Small-scale crawlers: Use a small number of paid high-anonymity proxies directly; a full proxy pool is unnecessary.
Medium-scale crawlers: Build a lightweight Redis-based proxy pool with quality checks and automatic elimination.
Large-scale crawlers: Build your own proxy pool cluster with intelligent routing, real-time monitoring, and automatic scaling.
Security: Verify the availability of every new proxy before it enters the pool, and require HTTPS proxies for requests that carry sensitive data.
Performance optimization: Reuse proxy connections (HTTP connection pooling) and use asynchronous requests to reduce the cost of establishing a connection through the proxy.
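
The availability check recommended above could be sketched with the standard library as follows. The function names are hypothetical, the test URL `https://httpbin.org/ip` is an assumed endpoint that you would replace with one you control, and only HTTP and HTTPS proxies are covered because `urllib` does not speak SOCKS.

```python
import http.client
import urllib.request
from urllib.parse import urlsplit

def looks_like_proxy_url(proxy):
    """Cheap syntactic check: http(s) scheme, a host, and a numeric port."""
    parts = urlsplit(proxy)
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return False
    return parts.scheme in ('http', 'https') and bool(parts.hostname) and port is not None

def proxy_is_alive(proxy, test_url='https://httpbin.org/ip', timeout=10):
    """Return True if a request to test_url through the proxy answers with 200."""
    if not looks_like_proxy_url(proxy):
        return False
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts, and connection resets all mean "not usable"
        return False
```

A pool manager would call `proxy_is_alive` before adding a proxy and skip any proxy that fails the check.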

💡 Key takeaway: A proxy IP pool is core infrastructure for large-scale crawlers. With sound management strategies and quality control, it can noticeably improve a crawler's stability and success rate and help it cope with a range of anti-crawling measures.

🔗 Recommended related tutorials

Downloader Middleware – middleware mechanism and customization
Anti-Crawling Countermeasures in Practice – Solutions for various anti-crawling scenarios
AutoThrottle Automatic Rate Limiting – Intelligent control of request frequency to avoid accidental damage
