Complete Guide to Spider Middleware - Detailed explanation of data pre-processing and post-processing technology



Spider Middleware is a flexible extension point in Scrapy. It sits between the engine and your Spider, which lets you apply shared processing to data as it flows into and out of the Spider. Used well, it covers common needs such as encoding repair, data cleaning, graceful exception handling, and pre-deduplication without copying the same code into every Spider.

This article starts from the basic concepts, walks through the configuration and the three core methods of Spider Middleware, and builds reusable middleware for practical scenarios. It closes with common pitfalls and best practices to help you avoid detours.




Core Basics and Position

In the Scrapy architecture, Spider Middleware is a hook between the engine and Spider. It mainly intercepts two types of data:

  1. Input direction: Response passed from the engine to Spider - preprocessing can be done here, such as cleaning HTML, repairing encoding, and identifying anti-crawling pages.
  2. Output direction: Item and Request generated by Spider - post-processing can be done here, such as supplementing metadata, text cleaning, and filtering duplicate data.

Simplified request flow: Engine → Scheduler → Downloader → ① Spider Middleware (input layer) → Spider → ② Spider Middleware (output layer) → Engine → Pipeline / Scheduler

In other words, Spider Middleware stands right before and after "parsing data" and is very suitable for data pre-processing and post-processing logic that is common to the entire site.


Configuration and minimalist life cycle

Activate middleware

In settings.py, middleware is enabled through the SPIDER_MIDDLEWARES dictionary: the key is the middleware path and the value is the priority (0-1000). For input, the smaller the number, the earlier the response is processed; for output, the larger the number, the earlier the result is processed, because output passes through the middleware chain in the opposite order.

For example, the following configuration:

# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.AntiCrawlEncodingFixMiddleware': 400,
    'myproject.middlewares.DistributedPreDedupMiddleware': 450,
    'myproject.middlewares.ClassifiedExceptionMiddleware': 500,
    # Largest number: closest to the Spider, so its output hook runs first
    'myproject.middlewares.EnrichCleanMiddleware': 600,
}

Execution order:

  • Input phase (before the response reaches the Spider): hooks run in increasing priority order, so AntiCrawlEncodingFixMiddleware (400) goes first. It is also the only middleware in this article that defines process_spider_input.
  • Output phase (after the Spider yields results): hooks run in decreasing priority order, because a larger number places the middleware closer to the Spider. EnrichCleanMiddleware (600) processes the results first and DistributedPreDedupMiddleware (450) afterwards, so _item_md5 already exists when deduplication runs. The order of the output chain is exactly the reverse of the input chain.
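
The ordering rule can be checked without running a crawl. The sketch below is plain Python, not Scrapy code: it only sorts a settings-style dictionary the way the rule describes, ascending for input hooks and descending for output hooks. The middleware paths and numbers are placeholders chosen for illustration.

```python
# Plain-Python illustration of the ordering rule; no Scrapy required.
# The middleware paths and priorities are placeholders.
SPIDER_MIDDLEWARES = {
    "mw.First": 100,
    "mw.Second": 200,
    "mw.Third": 300,
}


def input_order(middlewares):
    """process_spider_input hooks run in increasing priority order."""
    return [path for path, _ in sorted(middlewares.items(), key=lambda kv: kv[1])]


def output_order(middlewares):
    """process_spider_output hooks run in decreasing priority order."""
    return list(reversed(input_order(middlewares)))


print("input :", input_order(SPIDER_MIDDLEWARES))
print("output:", output_order(SPIDER_MIDDLEWARES))
```

If one middleware's output hook needs a field written by another, the writer must appear earlier in the output list, which means it needs the larger number.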

Three methods that must be mastered

Spider Middleware has four hook methods. The three below are the most commonly used; the fourth, process_start_requests, is rarely needed and is not covered in this article.

  1. process_spider_input(response, spider)
    Called before the response enters the Spider.
    • Return None: the response continues to the next middleware.
    • Raise an exception: the input process is interrupted. Scrapy calls the request's errback if one is set; otherwise the exception-handling chain takes over.
  2. process_spider_output(response, result, spider)
    Called after the Spider produces results (Items or Requests); result is the iterable returned by the Spider.
    • Must return an iterable (a generator is recommended to save memory).
    • Output items/requests can be dropped, replaced, or added here.
  3. process_spider_exception(response, exception, spider)
    Called when process_spider_input or the Spider callback itself raises an exception.
    • Return None: the exception continues to propagate.
    • Return an iterable: the exception is treated as handled, and Scrapy continues with the iterable (for example a new Request, or an empty list to skip the page).
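
As a rough illustration of the three hooks, here is a do-nothing skeleton. A spider middleware is an ordinary Python class, so the hooks can be called directly; the stand-in response and spider objects are hypothetical placeholders, not real Scrapy objects.

```python
class SkeletonSpiderMiddleware:
    """Minimal shape of a spider middleware; it changes nothing."""

    def process_spider_input(self, response, spider):
        # Returning None lets the response continue toward the Spider.
        return None

    def process_spider_output(self, response, result, spider):
        # Pass every item/request through unchanged, one at a time.
        for obj in result:
            yield obj

    def process_spider_exception(self, response, exception, spider):
        # Returning None leaves the exception to the next handler.
        return None


# Hypothetical stand-ins so the hooks can be exercised without a crawl.
mw = SkeletonSpiderMiddleware()
fake_response, fake_spider = object(), object()
passed = list(mw.process_spider_output(fake_response, iter([{"title": "a"}]), fake_spider))
print(passed)
```

Each middleware in the following sections fills in one of these hooks and leaves the others undefined; Scrapy skips hooks that a class does not implement.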

Three core methods in practice

The three self-contained middleware classes below cover the most common scenarios: input pre-processing, output post-processing, and exception classification.

3.1 Input preprocessing: anti-crawling detection + encoding correction

Many websites return short anti-crawling pages (such as CAPTCHA pages), or declare an encoding in the response that does not match the actual content. Both problems can be handled in one middleware.

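import chardet
from scrapy.exceptions import IgnoreRequest
from scrapy.http import TextResponse

class AntiCrawlEncodingFixMiddleware:
    """
    Input pre-processing:
    1. Correct the encoding automatically with chardet
    2. Quickly detect anti-crawling pages (small responses only)
    """
    def process_spider_input(self, response, spider):
        # Binary responses have no .text, so leave them untouched
        if not isinstance(response, TextResponse):
            return None

        # ---------- Encoding correction ----------
        # Done before the first access to response.text, because Scrapy caches the decoded text.
        # Override the declared encoding only when chardet's confidence exceeds 0.9.
        detect = chardet.detect(response.body)
        if detect['encoding'] and detect['confidence'] > 0.9:
            # _encoding is a private attribute and may change between Scrapy versions
            response._encoding = detect['encoding']
            response.meta['fixed_encoding'] = detect['encoding']  # recorded for later inspection

        # ---------- Anti-crawling detection ----------
        # Check only small responses to avoid wasting resources
        if len(response.body) < 10240:
            # Keywords are lowercase because the page text is lowercased before matching
            anti_keywords = ['验证码', 'blocked', '访问频繁', 'forbidden']
            if any(kw in response.text.lower() for kw in anti_keywords):
                spider.logger.warning(f"Suspected anti-crawling page, skipping: {response.url}")
                raise IgnoreRequest("Anti-crawling detected")
        return None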

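Key points:

  • Raising IgnoreRequest stops the response before the Spider callback runs, which avoids pointless parsing. An exception raised in process_spider_input is routed to the request's errback if one is set, otherwise to the process_spider_exception chain.
  • Modifying response._encoding changes how Scrapy decodes the body, so that response.text uses the correct encoding. Because the decoded text is cached, the attribute has to be set before response.text is first accessed.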
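
The detection heuristic is easier to tune when it lives in a small pure function that can be unit-tested without Scrapy. The following is a sketch under assumptions: the function name looks_like_block_page is hypothetical, and it measures the decoded text in characters, whereas the middleware above measures the raw body in bytes.

```python
def looks_like_block_page(
    text,
    keywords=("验证码", "blocked", "访问频繁", "forbidden"),
    max_chars=10240,
):
    """Heuristic: a short page that contains any block keyword is suspicious."""
    if len(text) >= max_chars:
        # Large pages are assumed to be real content and are not scanned.
        return False
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)


print(looks_like_block_page("<html>Access Blocked</html>"))
print(looks_like_block_page("<html><h1>Product list</h1></html>"))
```

Keyword matching of this kind produces false positives on legitimate pages that merely mention a keyword, so the keyword list and size threshold usually need adjusting per site.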

3.2 Output post-processing: data enhancement + text cleaning

The data produced by Spider often needs to be cleaned uniformly (such as removing unnecessary line breaks and spaces), and some meta-information (source URL, crawl time, content MD5) should be added. This set of operations is very suitable for placement in output middleware.

import time
import hashlib
import re
from itemadapter import ItemAdapter
from scrapy.item import Item

class EnrichCleanMiddleware:
    """Output post-processing: text cleaning + metadata enrichment"""
    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, (dict, Item)):
                self._clean_text(obj)
                self._enrich_meta(obj, response)
            # Request objects pass through unchanged
            yield obj

    def _clean_text(self, obj):
        adapter = ItemAdapter(obj)
        # Snapshot the pairs so values can be reassigned safely while looping
        for k, v in list(adapter.items()):
            if isinstance(v, str):
                # Collapse line breaks and runs of whitespace into one space, and trim both ends
                adapter[k] = re.sub(r'\s+', ' ', v.strip())

    def _enrich_meta(self, obj, response):
        adapter = ItemAdapter(obj)
        # Stable content MD5 (used later for deduplication), computed before the volatile fields are added
        content_bytes = str(sorted(adapter.asdict().items())).encode()
        # Note: a scrapy.Item subclass must declare these three fields, otherwise assignment raises KeyError
        adapter['_item_md5'] = hashlib.md5(content_bytes).hexdigest()
        # Source URL and crawl time
        adapter['_source_url'] = response.url
        adapter['_crawl_time'] = time.strftime('%Y-%m-%d %H:%M:%S')

In this way, every Item that passes through this middleware automatically gets clean text and standardized auxiliary fields, which is very convenient in practice.
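
To see why the fingerprint is useful for deduplication, the cleaning and hashing steps can be reproduced on plain dictionaries with the standard library only. This is a sketch with hypothetical helper names (clean, content_md5); it mirrors the logic described above rather than calling the middleware itself.

```python
import hashlib
import re


def clean(value):
    """Collapse whitespace in strings; leave other values untouched."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value.strip())
    return value


def content_md5(item):
    """Hash the cleaned key/value pairs in sorted key order."""
    cleaned = {k: clean(v) for k, v in item.items()}
    payload = str(sorted(cleaned.items())).encode()
    return hashlib.md5(payload).hexdigest()


# Same content, different key order and stray whitespace.
a = {"title": "  Hello\n  world ", "price": 10}
b = {"price": 10, "title": "Hello world"}
print(content_md5(a) == content_md5(b))
```

Sorting the pairs makes the hash independent of key order, and hashing after cleaning makes it independent of incidental whitespace. Volatile fields such as the crawl time must stay out of the hashed payload, otherwise no two items would ever match.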


3.3 Exception classification processing: retry vs skip

Network fluctuations and page structure changes are inevitable while a crawler is running. Based on the exception type, we can decide whether to retry the request or skip the page, so that a single bad page does not interrupt the whole crawl.

class ClassifiedExceptionMiddleware:
    """
    Exception classification:
    - Network errors: retry automatically (configurable count)
    - Parsing errors: skip the page
    - Anything else: keep propagating
    """
    def __init__(self, max_retry=2):
        self.max_retry = max_retry

    @classmethod
    def from_crawler(cls, crawler):
        return cls(max_retry=crawler.settings.getint('SM_MAX_RETRY', 2))

    def process_spider_exception(self, response, exception, spider):
        exc_name = type(exception).__name__
        spider.logger.error(f"{exc_name} while processing {response.url} | {exception}")

        # ---------- Network errors: retry ----------
        # Only exceptions raised inside spider code reach this hook; download
        # failures are handled earlier by the downloader and the request errback.
        network_errors = ['ConnectionError', 'TimeoutError', 'ConnectTimeout']
        if exc_name in network_errors:
            retry_cnt = response.meta.get('retry_cnt', 0) + 1
            if retry_cnt <= self.max_retry:
                spider.logger.info(f"Retry {retry_cnt} for {response.url}")
                # Copy the existing meta so other keys survive; dont_filter=True
                # keeps the duplicate filter from dropping the retried request
                new_meta = {**response.meta, 'retry_cnt': retry_cnt}
                return [response.request.replace(dont_filter=True, meta=new_meta)]

        # ---------- Parsing errors: skip ----------
        parse_errors = ['ValueError', 'KeyError', 'IndexError']
        if exc_name in parse_errors:
            spider.logger.warning(f"Parsing failed, skipping page: {response.url}")
            return []       # an empty list marks the exception as handled

        # Anything else is left to Scrapy's default handling
        return None

In this way, your Spider will become very robust: network problems are automatically retried, parsing problems are automatically skipped, and other unknown exceptions retain the original behavior.
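
Matching on class names does not catch subclasses: json.JSONDecodeError, for instance, is a ValueError but has a different name. One possible alternative, sketched here as a standalone helper with hypothetical names (RETRYABLE, SKIPPABLE, classify), is to classify with isinstance. This is a variation for comparison, not the approach used in the middleware above.

```python
import json

# Exception groups are assumptions; adjust them to the errors your callbacks raise.
RETRYABLE = (ConnectionError, TimeoutError)
SKIPPABLE = (ValueError, KeyError, IndexError)


def classify(exception):
    """Map an exception to 'retry', 'skip', or 'propagate', honoring subclasses."""
    if isinstance(exception, RETRYABLE):
        return "retry"
    if isinstance(exception, SKIPPABLE):
        return "skip"
    return "propagate"


try:
    json.loads("{not json")
except Exception as exc:
    decode_error = exc

print(type(decode_error).__name__, "->", classify(decode_error))
print("ConnectionResetError ->", classify(ConnectionResetError()))
```

The trade-off is breadth: isinstance also skips every other ValueError subclass, such as UnicodeDecodeError, which may or may not be what a given project wants.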


High-frequency scenario: distributed pre-deduplication

Many projects deduplicate in an Item Pipeline. The drawback is that a duplicate is only discarded after it has entered the pipeline chain, so every stage that runs before the deduplication step still spends CPU, and possibly I/O, on it. A more efficient approach is pre-deduplication: filtering out duplicates during the output stage of Spider Middleware.

The following middleware uses a Redis Set and deduplicates on the previously generated _item_md5, which prevents duplicate data from entering the Pipeline.

import redis
from itemadapter import ItemAdapter
from scrapy.item import Item

class DistributedPreDedupMiddleware:
    """Distributed pre-deduplication based on a Redis Set and the item MD5"""
    def __init__(self, redis_url, dedup_key, expire_days=7):
        self.redis = redis.from_url(redis_url)
        self.dedup_key = dedup_key
        self.expire_seconds = 86400 * expire_days

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            redis_url=crawler.settings.get('REDIS_URL', 'redis://localhost:6379'),
            dedup_key=crawler.settings.get('SM_DEDUP_KEY', 'spider:item_dedup:md5'),
            expire_days=crawler.settings.getint('SM_DEDUP_EXPIRE', 7)
        )

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, (dict, Item)):
                item_md5 = ItemAdapter(obj).get('_item_md5')
                # Items without a fingerprint cannot be deduplicated; let them through
                if item_md5:
                    # SADD returns 0 when the member already exists, so a single
                    # call checks and records atomically across workers
                    if self.redis.sadd(self.dedup_key, item_md5) == 0:
                        spider.logger.debug(f"Already seen, dropped by pre-deduplication: {response.url}")
                        continue  # drop the duplicate item
                    # The TTL applies to the whole set and is refreshed on every insert
                    self.redis.expire(self.dedup_key, self.expire_seconds)
            yield obj

Configuration reminder: this middleware depends on EnrichCleanMiddleware having added _item_md5 first, so the priority order matters. Because output hooks run in decreasing priority order, the enrichment middleware needs a larger number than the deduplication middleware (for example 600 versus 450), so that the MD5 has been generated before deduplication runs.
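
The deduplication step can be tried without a Redis server. This sketch assumes a hypothetical in-memory stand-in (InMemorySet) that mimics only the return value of Redis SADD, 1 when the member is new and 0 when it already exists, which lets a single call both test and record a fingerprint. The generator drop_duplicates is likewise an illustrative name.

```python
class InMemorySet:
    """Hypothetical stand-in for a Redis client; mimics only SADD's return value."""

    def __init__(self):
        self._members = {}

    def sadd(self, key, member):
        bucket = self._members.setdefault(key, set())
        if member in bucket:
            return 0  # already present, nothing added
        bucket.add(member)
        return 1  # newly added


def drop_duplicates(items, client, key="spider:item_dedup:md5"):
    """Yield items whose fingerprint has not been recorded yet."""
    for item in items:
        fingerprint = item.get("_item_md5")
        # Items without a fingerprint are passed through untouched.
        if fingerprint and client.sadd(key, fingerprint) == 0:
            continue
        yield item


client = InMemorySet()
items = [
    {"_item_md5": "aaa"},
    {"_item_md5": "bbb"},
    {"_item_md5": "aaa"},
    {"name": "no fingerprint"},
]
kept = list(drop_duplicates(items, client))
print(kept)
```

With a real Redis client the same single-call pattern avoids the gap between a separate membership check and insert, during which two workers could both accept the same item.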


Pitfall Avoidance Guide & Best Practices

⚠️ Common pitfalls

  1. Inverted priorities break the output order. Input and output are processed in exactly opposite orders. An easy mistake is to assume that "enrich first, then deduplicate" means giving enrichment the smaller number, for example 450 (enrichment) and 600 (deduplication); deduplication then runs before _item_md5 has been generated. Remember: in the output chain, the middleware with the largest priority number runs first.

  2. Memory leaks.

    • Do not keep large amounts of data (such as a global cache) on the middleware instance, otherwise memory grows steadily over a long crawl. For caching, use a bounded cache such as functools.lru_cache, or external storage such as Redis.
    • In process_spider_output, yield results one by one instead of collecting them into a list first, otherwise memory usage spikes.

  3. Exceptions swallowed by mistake. In process_spider_exception, return an iterable (e.g. []) only when the exception has really been handled. Returning None lets the exception keep propagating; returning an iterable tells Scrapy the exception is handled, which can hide errors you did not anticipate. Always log the details before returning.
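
The memory point above, yielding instead of building a list, can be observed with plain generators. In this sketch the function names and the log list are illustrative only; the log records how much upstream work has happened at each moment.

```python
def spider_results(log):
    """Stand-in for a Spider callback that yields items one at a time."""
    for i in range(3):
        log.append(f"parsed {i}")
        yield {"id": i}


def lazy_output(result):
    # Generator style: nothing upstream runs until a consumer asks for an item.
    for obj in result:
        yield obj


def eager_output(result):
    # List style: every upstream item is produced and held in memory at once.
    return list(result)


lazy_log = []
lazy = lazy_output(spider_results(lazy_log))
first = next(lazy)  # pulls exactly one item through the chain

eager_log = []
eager = eager_output(spider_results(eager_log))

print(lazy_log)
print(eager_log)
```

After one item has been requested, the lazy chain has parsed only that item, while the eager version has already materialized everything before returning.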

🎯 Best Practices

  1. Single responsibility. Each middleware should do only one or two closely related things. For example, the logic above is split into "anti-crawling + encoding", "enrichment + cleaning", "exception classification", and "deduplication" instead of being stuffed into one middleware.

  2. Configurable parameters. Values that are likely to change, such as the retry count and the Redis address, should be read from settings through the from_crawler method, which keeps the middleware reusable.

  3. Pre-deduplication beats post-deduplication. Deduplicate data before it enters the Pipeline whenever possible, which reduces the work wasted in later stages. Combined with a distributed solution such as Scrapy-Redis, duplicates can also be removed at the cluster level.

  4. Keep processing lightweight. Spider Middleware runs inside Scrapy's asynchronous event loop, so it should not perform CPU-intensive operations (such as OCR on large images or complex NLP). Such work is better moved out of the middleware, to a later stage such as a Pipeline or to a task queue such as Celery.

