Complete Guide to Downloader Middleware - A detailed explanation of request/response interception and anti-crawling strategies


In the world of Scrapy, you work hard to write a crawler, only to find that it runs into obstacles as soon as it goes live: it gets stopped by CAPTCHAs, receives garbled responses, or, worse, has its IP blocked outright. This is when the “magic weapon”, Downloader Middleware, takes the stage. It acts like Scrapy's diplomat, stepping in to deal with the website both before a request is sent and after a response is received.

This article takes you from concept to practice so that you can get the most out of Middleware.

Basic concepts of Middleware

Downloader Middleware sits between the Scrapy Engine and the Downloader and is the core hub for anti-crawling countermeasures. Every request (Request) you send and every response (Response) the server returns must pass through it.

Its main work can be summarized in the following four items:

  1. Request interception and disguise: Dress the request up before it is sent. For example, modify the request headers, add authentication tokens, or set a proxy IP so that the crawler looks more like an ordinary user.
  2. Response inspection and processing: After receiving the data returned by the server, don’t rush to use it. The middleware can "inspect the goods" first: has the request been caught by anti-crawling measures, was it redirected to a CAPTCHA page, does encrypted data need to be decrypted?
  3. Exception handling: When network timeouts, DNS resolution failures, and other accidents occur during the download, the middleware can trigger a fallback plan, such as retrying or switching to another proxy and trying again, so that the crawl does not fail outright.
  4. Anti-crawling countermeasures: This is the main battlefield for dealing with a website's defenses. User-Agent rotation, IP proxy pools, request rate limiting, and similar techniques all live here.

Let’s take a look at the complete “journey” of a request in Scrapy:

Engine → Scheduler → Engine
             ↓
             Downloader Middleware
             ↓
             Downloader
             ↓ (the response returns through the Downloader Middleware)
Engine → Spider Middleware → Spider

As you can see, the downloader middleware sits on the only path between requests and responses, so every request and every response passes through it.

Middleware life cycle

Every downloader middleware exposes a set of hook methods that let us step in and intervene at different points in the processing of requests and responses.

Let’s first understand what the most basic middleware looks like:

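class BaseMiddleware:
    """A standard template for a downloader middleware."""

    @classmethod
    def from_crawler(cls, crawler):
        """Factory method: Scrapy calls it to create the middleware instance; the crawler object gives access to the global settings."""
        return cls()

    def process_request(self, request, spider):
        """Handle a request that is about to be sent."""
        # Return None: let the request continue through the rest of the chain.
        # Return a Response: short-circuit; skip the download and use this Response as the result.
        # Return a Request: stop processing the current request and schedule the new one instead.
        # Raise IgnoreRequest: drop the request; no response is returned.
        return None

    def process_response(self, request, response, spider):
        """Handle a downloaded response."""
        # Return a Response (the original or a new one) to pass it on,
        # return a Request to reschedule, or raise IgnoreRequest to drop it.
        return response

    def process_exception(self, request, exception, spider):
        """Handle an exception raised during the download."""
        # Return a Response: the exception is handled; use this response instead of the error.
        # Return a Request: retry by scheduling a new request.
        # Return None: not handled here; let the next middleware deal with it.
        return None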
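
The template's from_crawler simply returns cls(). As a minimal sketch of what the factory method is typically used for, the following variant reads a value from crawler.settings at construction time. ConfiguredMiddleware is a hypothetical name, and MAX_RETRY_TIMES is the custom setting used by the retry examples in this article, not a built-in Scrapy setting.

```python
class ConfiguredMiddleware:
    """Hypothetical middleware that receives its configuration when it is created."""

    def __init__(self, max_retries):
        self.max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings offers typed getters such as getint(name, default)
        return cls(max_retries=crawler.settings.getint('MAX_RETRY_TIMES', 3))

    def process_request(self, request, spider):
        # Nothing to change here; the point is that self.max_retries is already available
        return None
```

Reading settings once in from_crawler keeps the hook methods free of repeated settings lookups.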

After writing it, don’t forget to "register" it in settings.py:

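# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Keys are import paths; values are orders. Lower numbers sit closer to the engine:
    # process_request runs in increasing order, process_response in decreasing order.
    'myproject.middlewares.UserAgentMiddleware': 543,
    'myproject.middlewares.ProxyMiddleware': 544,
    'myproject.middlewares.CookiesMiddleware': 545,
}

# Wondering how to pick the order value? Keep these points in mind:
# · Scrapy's built-in middlewares use orders within the 0-1000 range (see DOWNLOADER_MIDDLEWARES_BASE).
# · Custom middlewares are usually placed between 500 and 1000.
# · Very low orders (around 100 and below) put a middleware ahead of most built-in ones, so use them only on purpose.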
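
The order value matters in both directions: Scrapy calls process_request in increasing order and process_response in decreasing order. A small pure-Python sketch of that rule, using the three orders registered above (it only sorts names and does not touch Scrapy itself):

```python
# Orders as registered in DOWNLOADER_MIDDLEWARES (class names shortened for readability)
orders = {'UserAgentMiddleware': 543, 'ProxyMiddleware': 544, 'CookiesMiddleware': 545}

# process_request runs in increasing order (engine -> downloader)
request_chain = sorted(orders, key=orders.get)
# process_response runs in decreasing order (downloader -> engine)
response_chain = list(reversed(request_chain))

print('request: ', ' -> '.join(request_chain))
print('response:', ' -> '.join(response_chain))
```

So a middleware that must see the response first, before the others have touched it, needs the highest order of the group, not the lowest.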

Core processing method

Below we go into the specific usage of these three core methods. Understand their logic and you will get the key to controlling Middleware.

Request processing

process_request is the last stop before a request is sent and is the best place to add common configuration or authentication information to the request.

class BasicRequestMiddleware:
    """Basic request-processing middleware."""

    def process_request(self, request, spider):
        # Add default header values without overwriting anything already set
        request.headers.setdefault('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
        request.headers.setdefault('Accept-Language', 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3')

        # Read the authentication token from the global settings
        auth_token = spider.crawler.settings.get('AUTH_TOKEN')
        if auth_token:
            request.headers['Authorization'] = f'Bearer {auth_token}'

        return None

Response processing

process_response is the checkpoint before the response reaches the Spider. Anti-crawling messages you often run into, such as "Visited too frequently" and "Please enter the verification code", are best dealt with here.

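from scrapy.http import TextResponse

class RetryResponseMiddleware:
    """Response retry middleware: detects anti-crawling pages and retries automatically."""

    def process_response(self, request, response, spider):
        # Binary responses (images, files) have no text to inspect
        if not isinstance(response, TextResponse):
            return response

        response_text = response.text.lower()

        # Blacklist of common anti-crawling phrases; the Chinese entries mean
        # "visited too frequently", "please retry later", and "verification code"
        anti_crawl_indicators = [
            '访问过于频繁', '请稍后重试', 'blocked', 'forbidden',
            '验证码', 'captcha', 'rate limit', 'too many requests'
        ]

        # A blacklist hit means the request should be retried
        if any(indicator in response_text for indicator in anti_crawl_indicators):
            retry_times = request.meta.get('retry_times', 0)
            max_retries = spider.crawler.settings.getint('MAX_RETRY_TIMES', 3)

            if retry_times < max_retries:
                spider.logger.info(f"Anti-crawling page detected, retrying {request.url} (attempt {retry_times + 1})")
                new_request = request.copy()
                new_request.meta['retry_times'] = retry_times + 1
                new_request.dont_filter = True
                return new_request

        return response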
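
Keyword matching on its own can misfire, for example on an ordinary article that happens to contain the word "forbidden". One possible refinement is to look at the status code first and fall back to keywords only afterwards. The sketch below assumes that 403, 429, and 503 signal blocking on the target site; looks_blocked, BLOCK_STATUSES, and BLOCK_KEYWORDS are hypothetical names, and the lists should be tuned per site.

```python
# Assumed blocking signals; adjust both collections to the target website
BLOCK_STATUSES = {403, 429, 503}
BLOCK_KEYWORDS = ('captcha', 'rate limit', 'too many requests', '验证码', '访问过于频繁')

def looks_blocked(status, body_text):
    """Return True when a response looks like an anti-crawling page."""
    if status in BLOCK_STATUSES:
        return True
    lowered = body_text.lower()
    return any(keyword in lowered for keyword in BLOCK_KEYWORDS)
```

Inside process_response it could be called as looks_blocked(response.status, response.text) for text responses.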

Exception handling

When network fluctuations or server hiccups cause requests to fail, process_exception becomes the lifeline that lets us handle exceptions gracefully.

from twisted.internet.error import TimeoutError, DNSLookupError

class BasicExceptionMiddleware:
    """Basic exception-handling middleware: retries on timeouts and DNS lookup failures."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, DNSLookupError)):
            spider.logger.warning(f"Download failed ({type(exception).__name__}): {request.url}")
            return self.handle_retry(request, exception, spider)

        # Not handled here; let the next middleware deal with it
        return None

    def handle_retry(self, request, exception, spider):
        retry_times = request.meta.get('retry_times', 0)
        max_retries = spider.crawler.settings.getint('MAX_RETRY_TIMES', 2)

        if retry_times < max_retries:
            # Copy the request, bump the counter, and bypass the duplicate filter
            new_request = request.copy()
            new_request.meta['retry_times'] = retry_times + 1
            new_request.dont_filter = True
            return new_request

        # Retries exhausted: give up and let the exception propagate
        return None

User-Agent rotation strategy

The first hurdle in a website's anti-crawling defenses is often your User-Agent (a string that identifies your browser). Sending the same stale UA string on every request is like announcing your home address. Rotating the UA is one of the cheapest disguises to put in place.

import random

class UserAgentMiddleware:
    """Picks a random 'browser outfit' for each request."""

    def __init__(self):
        self.user_agents = [
            # Windows Chrome
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Mac Chrome
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Windows Firefox
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
            # Mac Safari
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
            # iPhone Safari
            'Mozilla/5.0 (iPhone; CPU iPhone OS 17_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Mobile/15E148 Safari/604.1',
        ]

    def process_request(self, request, spider):
        # The key step: change outfits before getting to work
        ua = random.choice(self.user_agents)
        request.headers['User-Agent'] = ua
        return None
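
The UA list above is hard-coded in __init__. A minimal sketch of a configurable variant follows, assuming a custom setting named USER_AGENT_LIST (a made-up name for this example, not a built-in Scrapy setting); ConfigurableUserAgentMiddleware is likewise hypothetical.

```python
import random

class ConfigurableUserAgentMiddleware:
    """Hypothetical variant that reads its User-Agent pool from settings."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError('USER_AGENT_LIST must contain at least one entry')
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```

With this shape, refreshing the UA pool becomes a settings change rather than a code change.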

Proxy IP management

If websites start blocking your IP, you need a set of alternate identities. Proxy IP management is a technical task in its own right: it means maintaining a proxy pool and switching proxies intelligently based on their health. The following example builds a basic proxy middleware that tracks the health of each proxy.

import random
from collections import defaultdict

class ProxyMiddleware:
    """Proxy middleware that remembers which proxies perform well and which keep failing."""

    def __init__(self):
        # Proxy pool; in practice, load it from settings or fetch it from an API
        self.proxy_pool = [
            'http://proxy1:port',
            'http://proxy2:port',
            'http://proxy3:port',
        ]

        # Keep a report card for each proxy
        self.proxy_stats = defaultdict(lambda: {
            'success_count': 0,
            'failure_count': 0,
            'ban_count': 0
        })

    def process_request(self, request, spider):
        if request.meta.get('proxy'):
            return None

        proxy = self._select_proxy()
        if proxy:
            request.meta['proxy'] = proxy

        return None

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            if response.status in [200, 301, 302]:
                self.proxy_stats[proxy]['success_count'] += 1
            elif response.status in [403, 404, 429, 503]:
                self.proxy_stats[proxy]['failure_count'] += 1
                if response.status == 403:
                    self.proxy_stats[proxy]['ban_count'] += 1

        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            self.proxy_stats[proxy]['failure_count'] += 1

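    def _select_proxy(self):
        """Pick only proxies with a good track record."""
        healthy_proxies = []

        # Walk the pool rather than proxy_stats, so proxies that have
        # never been used yet are also candidates
        for proxy in self.proxy_pool:
            stats = self.proxy_stats[proxy]
            if stats['failure_count'] < 5 and stats['ban_count'] < 3:
                healthy_proxies.append(proxy)

        if not healthy_proxies:
            # No healthy proxy left: pick one at random and hope for the best
            return random.choice(self.proxy_pool) if self.proxy_pool else None

        return random.choice(healthy_proxies)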
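
Fixed failure thresholds eventually retire every long-running proxy, because the counters only grow. An alternative worth considering is to judge proxies by success rate instead. The sketch below works on the same success_count and failure_count fields; success_rate, pick_proxy, and the 0.5 cutoff are assumptions made for illustration.

```python
import random

def success_rate(stats):
    """Share of successful requests; proxies with no history get an optimistic 1.0."""
    total = stats['success_count'] + stats['failure_count']
    return stats['success_count'] / total if total else 1.0

def pick_proxy(proxy_pool, proxy_stats, min_rate=0.5):
    """Pick a random proxy whose success rate is at least min_rate."""
    if not proxy_pool:
        return None
    empty = {'success_count': 0, 'failure_count': 0}
    candidates = [proxy for proxy in proxy_pool
                  if success_rate(proxy_stats.get(proxy, empty)) >= min_rate]
    # Fall back to the whole pool rather than stalling the crawl
    return random.choice(candidates or proxy_pool)
```

Using proxy_stats.get keeps the lookup from creating entries when proxy_stats is a defaultdict.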

Cookies Management

Logins, shopping carts, and personalized recommendations all depend on cookies. A good cookies middleware helps you manage this session information.

import time
from urllib.parse import urlparse

class CookiesMiddleware:
    """Manages cookies per domain to maintain session state."""

    def __init__(self):
        self.domain_cookies = {}

    def process_request(self, request, spider):
        domain = self._extract_domain(request.url)
        cookies = self.domain_cookies.get(domain, {})

        if cookies:
            cookie_header = '; '.join([f"{name}={data['value']}"
                                     for name, data in cookies.items()])
            request.headers['Cookie'] = cookie_header

        return None

    def process_response(self, request, response, spider):
        domain = self._extract_domain(request.url)
        set_cookies = response.headers.getlist('Set-Cookie')

        for cookie_header in set_cookies:
            cookie_data = self._parse_cookie(cookie_header.decode('utf-8'))
            if cookie_data:
                if domain not in self.domain_cookies:
                    self.domain_cookies[domain] = {}

                self.domain_cookies[domain][cookie_data['name']] = {
                    'value': cookie_data['value'],
                    'timestamp': time.time()
                }

        return response

    def _extract_domain(self, url):
        parsed = urlparse(url)
        return parsed.netloc

    def _parse_cookie(self, cookie_str):
        parts = cookie_str.split(';')
        if not parts:
            return None

        name_value = parts[0].split('=', 1)
        if len(name_value) != 2:
            return None

        return {
            'name': name_value[0].strip(),
            'value': name_value[1].strip()
        }
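
The hand-written _parse_cookie keeps only the name and value and ignores attributes such as Path. If those attributes matter, Python's standard library can do the parsing. A minimal sketch using http.cookies.SimpleCookie follows; parse_set_cookie is a hypothetical helper, not part of the middleware above.

```python
from http.cookies import SimpleCookie

def parse_set_cookie(header_value):
    """Parse one Set-Cookie header value into a dict with name, value, and path."""
    jar = SimpleCookie()
    jar.load(header_value)
    for name, morsel in jar.items():
        # A Set-Cookie header carries a single cookie, so return the first one
        return {'name': name, 'value': morsel.value, 'path': morsel['path']}
    return None

print(parse_set_cookie('sessionid=abc123; Path=/; HttpOnly'))
```

The returned dict keeps the same name and value keys that the middleware stores, so it could stand in for _parse_cookie with little change.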

Request delay and speed limit

Restraint is the key to a crawler's long-term survival. Appropriate download delays with some randomization make your crawler behave more like a human visitor.

import time
import random
from collections import defaultdict
from urllib.parse import urlparse

class RateLimitMiddleware:
    """Gives each domain its own request pace."""

    def __init__(self):
        self.domain_last_request = defaultdict(float)

    def process_request(self, request, spider):
        domain = self._extract_domain(request.url)

        min_delay = spider.crawler.settings.getfloat('DOWNLOAD_DELAY', 1)
        randomize_delay = spider.crawler.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY', True)

        delay = min_delay
        if randomize_delay:
            # Let the interval float randomly between 0.5x and 1.5x the base delay
            delay = random.uniform(min_delay * 0.5, min_delay * 1.5)

        current_time = time.time()
        wait_time = (self.domain_last_request[domain] + delay) - current_time

        if wait_time > 0:
            # Caution: time.sleep blocks Scrapy's event loop and pauses every
            # in-flight request; fine for a demo, costly at high concurrency
            time.sleep(wait_time)

        self.domain_last_request[domain] = time.time()
        return None

    def _extract_domain(self, url):
        return urlparse(url).netloc
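
Scrapy also ships built-in throttling settings that apply delays without any sleeping inside middleware code. The settings names below are Scrapy's own; the values are illustrative assumptions and should be tuned per target site.

```python
# settings.py (illustrative values)
DOWNLOAD_DELAY = 2                    # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True       # wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap on parallel requests per domain

AUTOTHROTTLE_ENABLED = True           # adapt the delay to the observed latency
AUTOTHROTTLE_START_DELAY = 2          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # average parallel requests per remote site
```

A custom rate-limit middleware is mainly worth writing when these settings cannot express the pacing rule you need.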

Frequently Asked Questions and Solutions

A pitfall guide to help you avoid detours in middleware development.

Problem 1: Middleware does not take effect

Symptom: The middleware is registered in settings.py, but its code never runs.

Solution:

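# 1. First check that the path in settings.py is correct; class names are case-sensitive!
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 543,  # make sure this path can be imported directly
}

# 2. In the corresponding Python file, check that the class name matches and that no dependency import is missing.
# 3. Check the console output when the spider runs: a middleware that raises NotConfigured during initialization is disabled by Scrapy.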
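
To check step 1 quickly, you can try to resolve the dotted path yourself from a Python shell started in the project root. A minimal sketch using only the standard library; resolve_dotted_path is a hypothetical helper, and collections.OrderedDict stands in for your own middleware path.

```python
import importlib

def resolve_dotted_path(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_path)  # ModuleNotFoundError on a wrong module path
    return getattr(module, class_name)             # AttributeError on a wrong class name

# Stand-in example; replace with e.g. 'myproject.middlewares.MyMiddleware'
print(resolve_dotted_path('collections.OrderedDict'))
```

The type of exception tells you which half of the path is wrong: the module part or the class name.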

Problem 2: The request is redirected infinitely

Symptom: The crawler gets stuck on one request: the middleware keeps returning new requests and the loop never seems to end.

Solution:

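class SafeRedirectMiddleware:
    def process_request(self, request, spider):
        # Cap the number of redirects to avoid falling into a bottomless pit
        redirect_times = request.meta.get('redirect_times', 0)
        if redirect_times > 5:
            return None

        # ...your redirect logic goes here; rewrite_url is a hypothetical hook
        new_url = self.rewrite_url(request)
        if not new_url or new_url == request.url:
            return None  # nothing to change, let the request through

        new_request = request.replace(url=new_url)
        new_request.meta['redirect_times'] = redirect_times + 1
        return new_request

    def rewrite_url(self, request):
        """Hypothetical placeholder: return the target URL, or None to leave the request alone."""
        return None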

Problem 3: Proxy switching is not timely

Symptom: A proxy has clearly been blocked by the website, yet subsequent requests keep using it.

Solution:

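class SmartProxyMiddleware:
    def __init__(self):
        # Minimal stand-in bookkeeping: proxies that were banned or throttled
        self.failed_proxies = set()

    def mark_proxy_failed(self, proxy):
        self.failed_proxies.add(proxy)

    def process_response(self, request, response, spider):
        # As soon as a proxy looks banned or throttled, flag it and retry right away
        if response.status in [403, 429, 503]:
            current_proxy = request.meta.get('proxy')
            if current_proxy:
                self.mark_proxy_failed(current_proxy)
                # Build a new request and tag it with a 'change proxy' marker
                new_request = request.copy()
                # Drop the old proxy, otherwise the copied meta keeps reusing it
                new_request.meta.pop('proxy', None)
                new_request.meta['change_proxy'] = True
                new_request.dont_filter = True
                return new_request

        return response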

Best practice recommendations

To build a robust and elegant middleware system, here are some tips for you.

Design principles

  1. Modularity: Don’t cram UA rotation, IP proxying, and cookie management into one middleware. Each middleware should focus on one thing, which makes them easy to combine and troubleshoot.
  2. Configurability: Hard coding is the root of all evil. Any parameter that may change (number of retries, delay in seconds, etc.) should come from settings.py or a custom configuration.
  3. Robustness: Your middleware code should be as reliable as a Swiss Army knife, handling all foreseeable exceptions so that a small middleware error does not bring down the entire crawl.
  4. Performance: process_request and process_response are on the hot path. Avoid complex computation and blocking I/O here.

Security and Compliance

  1. Privacy protection: If the middleware handles sensitive information such as login state or tokens, make sure it never appears in logs.
  2. Compliance: A gentleman knows what to do and what not to do. Always respect the target website’s robots.txt and terms of service, choose an appropriate strategy, and keep your use of these techniques within reasonable bounds.
  3. Rate control: Respect the target server. Keep the request rate reasonable so that your crawler does not turn into a destructive attack on someone else's website.
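
For the compliance point, Scrapy's ROBOTSTXT_OBEY setting makes the crawler honor robots.txt automatically. To check a single URL by hand, Python's standard library has a parser. The sketch below inlines a made-up robots.txt so that it runs without network access; is_allowed and the 'mybot' user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, inlined so the sketch needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_allowed(url, user_agent='mybot'):
    """Return True if the parsed robots.txt rules allow fetching the URL."""
    return parser.can_fetch(user_agent, url)

print(is_allowed('https://example.com/products/1'))
print(is_allowed('https://example.com/private/data'))
```

In a real project the rules would come from the target site's own robots.txt rather than an inline string.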

💡 Core Points: Downloader Middleware is your steering wheel for driving Scrapy through complex anti-crawling environments. From request to response, it gives you fine-grained control. Remember, the most effective strategies come from careful observation of the target website's behavior and the right combination of these tools. Now go ahead and customize a strategy for your cyber diplomat!