Complete Guide to Spider Middleware - Detailed explanation of data pre-processing and post-processing technology
Spider Middleware is a very flexible extension point in Scrapy. It sits directly between the engine and your Spider, giving you the opportunity to apply uniform processing as data flows into and out of the Spider. Used well, it lets you implement common functions such as encoding repair, data cleaning, graceful exception handling, and pre-emptive deduplication without copying and pasting the same code into every Spider.
This article starts from the basic concepts and works step by step through the configuration and the three core methods of Spider Middleware, then builds reusable middleware based on real scenarios. It closes with lessons learned and best practices to help you avoid common pitfalls.
Core Basics and Position
In the Scrapy architecture, Spider Middleware is a hook between the engine and Spider. It mainly intercepts two types of data:
- Input direction: Response passed from the engine to Spider - preprocessing can be done here, such as cleaning HTML, repairing encoding, and identifying anti-crawling pages.
- Output direction: Item and Request generated by Spider - post-processing can be done here, such as supplementing metadata, text cleaning, and filtering duplicate data.
Simplified request flow: Engine → Scheduler → Downloader → ① Spider Middleware (input layer) → Spider → ② Spider Middleware (output layer) → Engine → Pipeline / Scheduler
In other words, Spider Middleware stands right before and after "parsing data" and is very suitable for data pre-processing and post-processing logic that is common to the entire site.
Configuration and minimalist life cycle
Activate middleware
In `settings.py`, register middleware through the `SPIDER_MIDDLEWARES` dictionary: the key is the middleware's import path and the value is its priority (0-1000). For input, the smaller the number, the earlier the response is processed; for output, the larger the number, the earlier the result is processed, because output passes through the middleware chain in the opposite order.
For example, the following configuration:
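A minimal sketch of such a configuration, using the four middleware names and priorities discussed in this article. The module path `myproject.middlewares` is an assumption; substitute your own project's path.

```python
# settings.py
# "myproject.middlewares" is a hypothetical module path, not taken from the article.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.AntiCrawlEncodingFixMiddleware": 400,
    "myproject.middlewares.EnrichCleanMiddleware": 450,
    "myproject.middlewares.ClassifiedExceptionMiddleware": 500,
    "myproject.middlewares.DistributedPreDedupMiddleware": 600,
}

# Input hooks run in increasing priority order, output hooks in decreasing order.
input_order = sorted(SPIDER_MIDDLEWARES, key=SPIDER_MIDDLEWARES.get)
output_order = list(reversed(input_order))
```

Sorting by value gives the input order (400 first); reversing that list gives the output order (600 first).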
The execution order works out as follows:
- Input phase (before the response reaches the Spider): the response first passes through `AntiCrawlEncodingFixMiddleware` (priority 400), then `EnrichCleanMiddleware` (450), and so on.
- Output phase (after the Spider generates results): a smaller priority places the middleware further out, toward the engine, so `DistributedPreDedupMiddleware` (600) processes first, then `ClassifiedExceptionMiddleware` (500), and the middleware with the smallest number goes last. The order of the output chain is exactly the reverse of the input chain.
Three methods that must be mastered
Spider Middleware has four hook methods. The three below are the most commonly used; the fourth, `process_start_requests`, is rarely needed and is not covered in this article.
process_spider_input(response, spider)
Fires before the response enters the Spider.
- Return `None`: the response continues on to the next middleware.
- Raise an exception: the input flow is interrupted and the exception-handling chain takes over.
process_spider_output(response, result, spider)
Triggered after the Spider produces a result (Item or Request); `result` is the iterable returned by the Spider.
- Must return an iterable object (it is recommended to use the generator directly to save memory).
- Output items/requests can be deleted, replaced or added here.
process_spider_exception(response, exception, spider)
Triggered when `process_spider_input` or the Spider callback itself raises an exception.
- Return `None`: the exception continues to propagate upward.
- Return an iterable: the exception is considered handled, and Scrapy goes on to process that iterable (for example, a new Request to retry, or an empty list to skip the page).
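The three hooks described above can be seen side by side in a do-nothing middleware. This is only a sketch of the signatures and return conventions given in this section; the class name `PassThroughSpiderMiddleware` is hypothetical.

```python
class PassThroughSpiderMiddleware:
    """Hypothetical middleware that changes nothing; it only shows the hook shapes."""

    def process_spider_input(self, response, spider):
        # Returning None hands the response on to the next middleware.
        return None

    def process_spider_output(self, response, result, spider):
        # Yield lazily so the Spider's results are never materialised as a list.
        for obj in result:
            yield obj

    def process_spider_exception(self, response, exception, spider):
        # None means "not handled here"; returning an iterable would mark
        # the exception as handled and feed that iterable back to Scrapy.
        return None
```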
Three core methods in practice
Let's dive in and use three single-file middlewares to cover the most common scenarios: input pre-processing, output post-processing, and exception classification.
3.1 Input preprocessing: anti-crawling detection + encoding correction
Many websites return short anti-crawling pages (such as CAPTCHA pages), or declare an incorrect encoding for the response. Both problems can be solved in one middleware.
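One possible shape for such a middleware is sketched below. The size threshold, the marker strings, and the list of fallback encodings are all assumptions to be tuned per site, and the helper `pick_encoding` is hypothetical. The encoding check simply tries a strict decode with the declared encoding and falls back to the candidates if that fails.

```python
try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class IgnoreRequest(Exception):
        pass

MIN_BODY_BYTES = 500  # assumed threshold for a "short" page
BLOCK_MARKERS = (b"captcha", b"access denied")  # assumed anti-crawling markers
FALLBACK_ENCODINGS = ("utf-8", "gb18030")  # assumed candidate encodings


def pick_encoding(body, declared):
    """Return the first encoding that decodes body without errors."""
    for enc in (declared, *FALLBACK_ENCODINGS):
        if not enc:
            continue
        try:
            body.decode(enc)
            return enc
        except (UnicodeDecodeError, LookupError):
            continue
    return declared


class AntiCrawlEncodingFixMiddleware:
    def process_spider_input(self, response, spider):
        body = response.body
        # 1) A short page containing a block marker is treated as an anti-crawling page.
        if len(body) < MIN_BODY_BYTES and any(m in body.lower() for m in BLOCK_MARKERS):
            raise IgnoreRequest(f"anti-crawling page: {response.url}")

        # 2) If the declared encoding cannot decode the body, switch to one that can.
        declared = getattr(response, "encoding", None)  # absent on non-text responses
        if declared:
            fixed = pick_encoding(body, declared)
            if fixed != declared:
                # _encoding is a private attribute; set it before response.text
                # is first read, because the decoded text is cached afterwards.
                response._encoding = fixed
        return None
```

A strict-decode check cannot catch every mislabelled page (single-byte encodings such as Latin-1 never fail to decode), so treat it as a starting point rather than a complete detector.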
Key Points:
- Raising `IgnoreRequest` discards the response directly, avoiding pointless parsing downstream.
- Modifying `response._encoding` changes how Scrapy decodes the text, ensuring that `response.text` is rendered in the correct encoding.
3.2 Output post-processing: data enhancement + text cleaning
The data produced by Spider often needs to be cleaned uniformly (such as removing unnecessary line breaks and spaces), and some meta-information (source URL, crawl time, content MD5) should be added. This set of operations is very suitable for placement in output middleware.
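A minimal sketch of such an output middleware follows. It assumes items are plain `dict` objects (other item types would need an adapter), and the field names `_source_url` and `_crawl_time` are hypothetical; only `_item_md5` is named in this article. The MD5 is computed from the cleaned fields before the metadata is added, so the crawl time does not change the fingerprint.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

_WS_RE = re.compile(r"\s+")


def _clean(value):
    # Collapse whitespace runs (including line breaks) in string fields only.
    return _WS_RE.sub(" ", value).strip() if isinstance(value, str) else value


class EnrichCleanMiddleware:
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Assumption: items are dicts; Requests and anything else pass through.
            if not isinstance(obj, dict):
                yield obj
                continue
            item = {key: _clean(value) for key, value in obj.items()}
            payload = json.dumps(item, sort_keys=True, ensure_ascii=False, default=str)
            item["_item_md5"] = hashlib.md5(payload.encode("utf-8")).hexdigest()
            item["_source_url"] = response.url
            item["_crawl_time"] = datetime.now(timezone.utc).isoformat()
            yield item
```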
In this way, every Item that passes through this middleware automatically gets clean text and standardized auxiliary fields, which is very convenient in practice.
3.3 Exception classification processing: retry vs skip
Network fluctuations and page structure changes are inevitable while a crawler is running. We can decide whether to retry the request or skip it based on the exception type, so that a single failing page does not interrupt the entire crawl.
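The classification can be sketched as below. Which exception types count as "retry" and which as "skip" is an assumption here: Python's built-in `TimeoutError` and `ConnectionError` stand in for network-style failures raised inside callbacks, and the retry limit and meta key are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed classification; adjust to the exceptions your callbacks actually raise.
RETRYABLE = (TimeoutError, ConnectionError)
SKIPPABLE = (AttributeError, IndexError, KeyError, ValueError)
MAX_RETRIES = 2  # hypothetical limit
RETRY_KEY = "spider_retry_times"  # hypothetical meta key


class ClassifiedExceptionMiddleware:
    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, RETRYABLE):
            request = response.request
            retries = request.meta.get(RETRY_KEY, 0)
            if retries < MAX_RETRIES:
                logger.warning("Retrying %s (%d/%d): %r",
                               response.url, retries + 1, MAX_RETRIES, exception)
                meta = dict(request.meta, **{RETRY_KEY: retries + 1})
                # dont_filter=True keeps the duplicate filter from dropping the retry.
                return [request.replace(meta=meta, dont_filter=True)]
            logger.error("Giving up on %s after %d retries", response.url, retries)
            return []
        if isinstance(exception, SKIPPABLE):
            logger.warning("Skipping %s: %r", response.url, exception)
            return []
        # Unknown exception: return None so Scrapy keeps its default behaviour.
        return None
```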
In this way, your Spider will become very robust: network problems are automatically retried, parsing problems are automatically skipped, and other unknown exceptions retain the original behavior.
High-frequency scenario: distributed pre-deduplication
Many projects deduplicate in an Item Pipeline, but by that point the item has already been handed to the pipeline chain, and any stage that runs before the deduplication step has spent resources on data that will be discarded. A more efficient approach is pre-emptive deduplication: filter out duplicates in the output stage of Spider Middleware, before they reach the Pipeline.
The following middleware uses a Redis Set and deduplicates on the previously generated `_item_md5`, preventing duplicate data from entering the Pipeline.
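A minimal sketch under stated assumptions: items are plain dicts carrying `_item_md5`, the Redis client comes from the `redis` (redis-py) package, and the setting names `REDIS_URL` and `DEDUP_REDIS_KEY` as well as the default key are hypothetical. The client is injected through the constructor so that anything exposing a Redis-style `sadd()` can be used.

```python
class DistributedPreDedupMiddleware:
    def __init__(self, client, key="dedup:item_md5"):
        self.client = client  # any object exposing a Redis-style sadd()
        self.key = key

    @classmethod
    def from_crawler(cls, crawler):
        import redis  # redis-py, imported lazily so the class has no hard dependency

        # REDIS_URL and DEDUP_REDIS_KEY are hypothetical setting names.
        url = crawler.settings.get("REDIS_URL", "redis://localhost:6379/0")
        key = crawler.settings.get("DEDUP_REDIS_KEY", "dedup:item_md5")
        return cls(redis.from_url(url), key)

    def process_spider_output(self, response, result, spider):
        for obj in result:
            md5 = obj.get("_item_md5") if isinstance(obj, dict) else None
            if md5 is None:
                yield obj  # Requests and items without a fingerprint pass through
            elif self.client.sadd(self.key, md5) == 1:
                yield obj  # SADD returns 1 when the member was not in the set yet
            # SADD returned 0: the fingerprint is already known, so drop the item
```

Because `SADD` both tests and records the fingerprint in one call, two workers that see the same item at the same time cannot both let it through.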
Configuration reminder: this middleware depends on `EnrichCleanMiddleware` having already added `_item_md5`, so pay attention to the priority order. Both steps run in the output chain, where the middleware with the larger priority number runs first, so the enrichment middleware needs a larger value than the deduplication middleware for the MD5 to exist before deduplication. The example values given earlier (450 for enrichment, 600 for deduplication) would therefore have to be swapped.
Pitfall Avoidance Guide & Best Practices
⚠️ Common pitfalls
- Reversed priorities break the output order. Input and output are processed in exactly opposite orders. An easy mistake is to express "enhance first, then deduplicate" as 400 (enhancement) and 500 (deduplication): in the output chain the deduplication middleware then runs while `_item_md5` has not been generated yet. Remember: in the output chain, the middleware with the largest priority number executes first.
- Memory leaks.
  - Do not keep large amounts of data (such as a global cache) on middleware instances, otherwise memory will be exhausted as the crawl runs longer. For caching, use `functools.lru_cache` or external storage such as Redis.
  - In `process_spider_output`, yield results directly instead of converting them to a list and returning it, otherwise memory usage will spike.
- Exceptions swallowed by mistake. In `process_spider_exception`, return an iterable (e.g. `[]`) only if you can actually handle the exception. If you return `None`, the exception keeps propagating upward; if you return an iterable, Scrapy considers the exception handled, which can suppress errors you did not anticipate. Always write a detailed log entry before returning.
🎯 Best Practices
- Single responsibility. Each middleware should focus on one or two closely related tasks. For example, the logic above is split into "anti-crawling + encoding", "enhancement + cleaning", "exception classification", and "deduplication" rather than being stuffed into a single middleware.
- Configurable parameters. Values that are likely to change, such as the number of retries and the Redis address, can be passed in through `settings` and the `from_crawler` method, which keeps the middleware reusable.
- Pre-deduplication beats post-deduplication. Try to deduplicate data before it enters the Pipeline, which can significantly reduce wasted work downstream. Combined with a distributed solution such as Scrapy-Redis, duplicates can also be removed at the cluster level.
- Keep processing lightweight. Spider Middleware runs inside the asynchronous event loop, so avoid CPU-intensive operations there (such as OCR on large images or complex NLP). Such tasks are better handled in a Pipeline or offloaded to an asynchronous task queue such as Celery.
🔗 Recommended related tutorials
- Downloader Middleware - request response processing
- Pipeline in Practice - data processing pipeline
- Selectors - data extraction techniques

