Pipeline Complete Practical Guide - A detailed explanation of data cleaning, validation, deduplication, and storage
📂 Stage: Stage 2 - Data Flow (Data Processing) 🔗 Related chapters: Item and Item Loader · Downloader Middleware
The data that crawlers capture from web pages is often "rough": inconsistently formatted, missing fields, and full of duplicates. Scrapy's Pipeline is your dedicated "finishing crew": it processes messy raw Items into clean, standardized, production-ready data that can be written straight to a database.
This article starts with the most basic concepts, walks you step by step through building a reliable data processing pipeline, and provides practical, reusable code. By the end, standardized data processing should feel simple.
Pipeline Basics Introduction
Function and core value
In Scrapy, Pipeline is the "data processing chain" that follows Spider. After each Item is output from Spider, it will flow through each Pipeline component in sequence in the order you configured. In each component, you can complete:
- Clean Dirty Data: Remove extra whitespace, garbled characters, and HTML tags, and unify date, price, and other formats
- Quality Verification: Check whether required fields are missing, and discard data that does not meet the requirements
- Deduplication: Avoid collecting the same posts, products, or news repeatedly
- Categorized Storage: Write to JSON files, MySQL, MongoDB, Redis, or other backends as needed
Summary in one sentence: **Pipeline turns your scattered data into clean, usable, and traceable assets.**
Simplified version of workflow
The entire process is like a factory assembly line: Items are the raw materials, and each Pipeline is a machine at a different station.
After an Item leaves the Spider, it first enters the Pipeline with the highest priority (the smallest number). After processing, it is passed on to the next Pipeline, until it is either stored or discarded midway.
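The ordering rule can be modeled with a toy example in plain Python. This is only an illustration of the flow, not Scrapy's internal implementation; `run_chain`, `Dropped`, and the two stage classes are hypothetical names.

```python
class Dropped(Exception):
    """Stand-in for scrapy.exceptions.DropItem in this toy model."""


class RequireTitle:
    def process_item(self, item):
        if not (item.get("title") or "").strip():
            raise Dropped("missing title")
        return item


class StripTitle:
    def process_item(self, item):
        item["title"] = item["title"].strip()
        return item


def run_chain(item, pipelines):
    """Pass the item through the stages, smallest priority number first."""
    for stage in sorted(pipelines, key=pipelines.get):
        try:
            item = stage.process_item(item)
        except Dropped:
            return None  # discarded midway; later stages never see it
    return item


# Same shape as ITEM_PIPELINES: stage -> priority number.
pipelines = {StripTitle(): 200, RequireTitle(): 100}

print(run_chain({"title": "  Hello  "}, pipelines))  # {'title': 'Hello'}
print(run_chain({"title": "   "}, pipelines))        # None
```

Although `StripTitle` is listed first in the dictionary, `RequireTitle` runs first because its number is smaller.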
Configure priority and life cycle
1. Priority configuration (settings.py)
This is where Pipeline is most prone to pitfalls - the smaller the number, the higher the priority and the earlier it will be executed. Many beginners put storage at the forefront. As a result, dirty data is written into the database before cleaning, which wastes resources and pollutes the data.
A reasonable sequence should be: Verification → Cleaning → Deduplication → Storage.
💡 Tips: It is customary to configure the priority in intervals of 100, so that you can insert a new Pipeline in the middle later without having to readjust all the numbers.
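A configuration following this order might look like the sketch below. The module path `myproject.pipelines` and the four class names are hypothetical; substitute the import paths of your own classes.

```python
# settings.py
# Smaller number = runs earlier. Values are conventionally kept in the 0-1000 range.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 100,  # 1. verify required fields
    "myproject.pipelines.CleaningPipeline": 200,    # 2. normalize values
    "myproject.pipelines.DedupPipeline": 300,       # 3. drop duplicates
    "myproject.pipelines.JsonWriterPipeline": 400,  # 4. store the result
}
```

With gaps of 100, a new step (for example at 250) can be slotted in without renumbering anything else.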
2. Core life cycle methods
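Every Pipeline class must define the `process_item` method; the other three methods are optional. They are called at the following times:
- `from_crawler(cls, crawler)`: a class method called once to create the Pipeline instance; it gives access to the crawler's settings.
- `open_spider(self, spider)`: called once when the Spider is opened; open files or database connections here.
- `process_item(self, item, spider)`: called for every Item; return the Item to pass it on, or raise `DropItem` to discard it.
- `close_spider(self, spider)`: called once when the Spider is closed; flush buffers and release resources here.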
Used together, these four methods let you finely control when resources are opened and released over the crawler's life cycle, avoiding problems such as connections being re-established for every Item or files never being closed.
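The skeleton below sketches how the four hooks fit together. The class name and the `MY_OUTPUT_PATH` setting are hypothetical, and the method signatures are the classic ones that take a `spider` argument.

```python
class LifecyclePipeline:
    """Appends the repr of every item to a log file (illustrative only)."""

    def __init__(self, output_path):
        self.output_path = output_path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # Called once to build the pipeline; settings are available here.
        return cls(crawler.settings.get("MY_OUTPUT_PATH", "items.log"))

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources.
        self.file = open(self.output_path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item: do the work, then return the item.
        self.file.write(repr(item) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources.
        self.file.close()
```

The file is opened once and closed once, no matter how many Items pass through `process_item`.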
Core Actual Combat Scenario
The following four Pipelines are the most common in crawler projects. You can copy them directly into the project according to your needs and use them with slight modifications.
Scenario 1: Data Cleaning Pipeline
Text, prices, and URLs on the Internet are often ridiculously dirty. For example, the title contains a bunch of line breaks, the price contains currency symbols and Chinese characters, and the URL is a relative path. This Pipeline is specially designed to help you clean them up:
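A minimal sketch of such a cleaning Pipeline follows. The field names `title`, `price`, and `url`, the class name, and `BASE_URL` are assumptions made for illustration; adapt them to your own Item definition.

```python
import re
from html import unescape
from urllib.parse import urljoin

try:
    from itemadapter import ItemAdapter
except ImportError:
    # Dict-only fallback so the sketch also runs where Scrapy is not installed.
    def ItemAdapter(item):
        return item


class CleaningPipeline:
    BASE_URL = "https://example.com/"  # assumed site root for relative URLs

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        title = adapter.get("title")
        if isinstance(title, str):
            title = unescape(re.sub(r"<[^>]+>", "", title))  # drop tags, decode entities
            adapter["title"] = re.sub(r"\s+", " ", title).strip()

        price = adapter.get("price")
        if isinstance(price, str):
            # Keep the first number; currency symbols and other text are ignored.
            match = re.search(r"\d+(?:\.\d+)?", price.replace(",", ""))
            adapter["price"] = float(match.group()) if match else None

        url = adapter.get("url")
        if isinstance(url, str):
            adapter["url"] = urljoin(self.BASE_URL, url.strip())

        return item
```

For example, a price string such as `"¥1,299.00元"` becomes the float `1299.0`, and `"/item/42"` becomes `"https://example.com/item/42"`.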
This Pipeline uses `ItemAdapter`, which works with both dictionaries and Item objects, so the Pipeline code is not bound to a specific item type.
Scenario 2: Required field verification + discarding unqualified items
Keeping news without titles and products without URLs will only waste storage space. This Pipeline intercepts data as soon as it enters the system:
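One way to sketch this check is shown below. `REQUIRED_FIELDS` and the class name are assumptions; list whichever fields your project cannot do without.

```python
try:
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallbacks so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass

    def ItemAdapter(item):
        return item


class ValidationPipeline:
    REQUIRED_FIELDS = ("title", "url")  # assumed required fields

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        for field in self.REQUIRED_FIELDS:
            value = adapter.get(field)
            # Treat None and whitespace-only strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                raise DropItem(f"Missing required field: {field}")
        return item
```

Whitespace-only strings are rejected here on purpose, because this check runs before the cleaning step and would otherwise let blank titles through.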
⚠️ After `DropItem` is raised, the current Item does not enter any subsequent Pipeline. Scrapy logs each dropped Item (at WARNING level by default) and counts it in the `item_dropped_count` stat, which makes the discard rate easy to monitor.
Scenario 3: MD5 deduplication Pipeline
Is the same page being crawled repeatedly? Generate MD5 fingerprints using core fields and discard duplicates:
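A minimal sketch of the idea, assuming `title` and `url` are the core fields that identify an Item (both the field choice and the class name are hypothetical):

```python
import hashlib

try:
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallbacks so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass

    def ItemAdapter(item):
        return item


class DedupPipeline:
    FINGERPRINT_FIELDS = ("title", "url")  # assumed identifying fields

    def __init__(self):
        self.seen = set()  # fingerprints seen during this run (in memory)

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        raw = "|".join(str(adapter.get(f, "")) for f in self.FINGERPRINT_FIELDS)
        fingerprint = hashlib.md5(raw.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            raise DropItem(f"Duplicate item: {fingerprint}")
        self.seen.add(fingerprint)
        return item
```

MD5 is used here only as a compact fingerprint, not for security, so its known cryptographic weaknesses do not matter in this role.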
This method is simple and effective, but if the amount of data is very large (millions of items), it is recommended to replace `self.seen` with a Redis Set to avoid running out of memory.
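The Redis variant could be sketched as follows, assuming the third-party `redis` (redis-py) package and dict-like Items with a `url` field. The class name, the `REDIS_URL` setting, and the key name `seen:fingerprints` are hypothetical.

```python
import hashlib

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class RedisDedupPipeline:
    def __init__(self, redis_url="redis://localhost:6379/0", key="seen:fingerprints"):
        self.redis_url = redis_url
        self.key = key
        self.client = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("REDIS_URL", "redis://localhost:6379/0"))

    def open_spider(self, spider):
        import redis  # imported lazily; requires the redis-py package

        self.client = redis.Redis.from_url(self.redis_url)

    def process_item(self, item, spider):
        fingerprint = hashlib.md5(str(item["url"]).encode("utf-8")).hexdigest()
        # SADD returns how many members were newly added; 0 means already present.
        if self.client.sadd(self.key, fingerprint) == 0:
            raise DropItem(f"Duplicate item: {fingerprint}")
        return item
```

Because the check and the insert happen in a single `SADD` call, several crawler processes sharing the same Redis key do not race each other.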
Scenario 4: JSON storage Pipeline
Finally, save the processed data as a standardized JSON file to facilitate handover to the data analysis team:
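A sketch of such a writer is shown below. The class name and the `items_<timestamp>.json` file-name pattern are assumptions, and `scraped_at` is added to the written record rather than to the Item itself.

```python
import json
from datetime import datetime, timezone

try:
    from itemadapter import ItemAdapter
except ImportError:
    # Dict-only fallback so the sketch also runs where Scrapy is not installed.
    def ItemAdapter(item):
        return item


class JsonWriterPipeline:
    def open_spider(self, spider):
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.path = f"items_{stamp}.json"  # assumed naming pattern
        self.file = open(self.path, "w", encoding="utf-8")
        self.file.write("[\n")
        self.first = True  # no comma is needed before the first element

    def process_item(self, item, spider):
        record = dict(ItemAdapter(item))
        record["scraped_at"] = datetime.now(timezone.utc).isoformat()
        if not self.first:
            self.file.write(",\n")  # separator goes before every later element
        self.first = False
        self.file.write(json.dumps(record, ensure_ascii=False, default=str))
        return item

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()
```

Writing the comma before each element except the first keeps the array valid even though the total number of Items is unknown while the crawl is running.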
This Pipeline writes the results of each crawl into a JSON file with a timestamped name, and automatically adds a `scraped_at` field so that the collection time of each record can be traced.
Performance and error optimization
When a crawler processes a large amount of data, inserting records into the database one at a time can seriously slow it down. Unhandled exceptions can also cause Items to be lost. The following two optimizations make your Pipeline more robust.
1. Batch storage optimization (applicable to MySQL/MongoDB)
Buffering Items and committing them in one batch significantly reduces the per-record round-trip and commit overhead:
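The sketch below illustrates the buffering pattern. To stay self-contained it uses the standard-library `sqlite3` module as a stand-in for MySQL or MongoDB; the table layout, class name, `BATCH_SIZE`, and `FLUSH_INTERVAL` are all assumptions.

```python
import sqlite3
import time


class BatchStoragePipeline:
    BATCH_SIZE = 100       # flush when this many items are buffered
    FLUSH_INTERVAL = 5.0   # ...or when this many seconds have passed

    def __init__(self, db_path="items.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)")
        self.buffer = []
        self.last_flush = time.monotonic()

    def process_item(self, item, spider):
        # Assumes dict-like items with `title` and `url` fields.
        self.buffer.append((item.get("title"), item.get("url")))
        overdue = time.monotonic() - self.last_flush >= self.FLUSH_INTERVAL
        if len(self.buffer) >= self.BATCH_SIZE or overdue:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            # One executemany and one commit per batch instead of one per item.
            self.conn.executemany("INSERT INTO items VALUES (?, ?)", self.buffer)
            self.conn.commit()
            self.buffer.clear()
        self.last_flush = time.monotonic()

    def close_spider(self, spider):
        self.flush()  # write whatever is left in the buffer
        self.conn.close()
```

Note that the time check only runs when a new Item arrives, so the final `flush()` in `close_spider` is what guarantees the last partial batch is written.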
The core idea: store Items in an in-memory buffer first, and perform a batch write when the count reaches a set value or a set interval has elapsed. As long as the remaining buffer is also flushed in `close_spider`, no data is lost, and the database's throughput is put to full use.
2. Basic error handling
Adding try/except to the Pipeline can prevent individual bad data from interrupting the entire process:
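As a minimal sketch, the Pipeline below converts a `price` field to a float and drops only the offending Item when the conversion fails. The class name and field name are assumptions, and Items are assumed to be dict-like.

```python
import logging

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass

logger = logging.getLogger(__name__)


class SafePricePipeline:
    def process_item(self, item, spider):
        try:
            item["price"] = float(item["price"])
        except (KeyError, TypeError, ValueError) as exc:
            # Log the bad record and discard just this item; the crawl goes on.
            logger.warning("Bad price in %r: %s", item, exc)
            raise DropItem(f"Invalid price: {exc!r}") from exc
        return item
```

Catching only the expected exception types keeps genuine programming errors visible instead of silently swallowing them.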
In this way, even if there is a problem with a single piece of data, it will not affect the normal flow of other data.
FAQ Quick Check
Q1: Pipeline is not effective?
Mostly due to these two reasons:
- Class name and path are inconsistent: Confirm that each key of `ITEM_PIPELINES` in `settings.py` exactly matches the full import path of your Pipeline class.
- An uncaught exception occurred midway: When a Pipeline raises an exception in `process_item`, the rest of the chain is skipped for that Item. It is recommended to add exception handling to key Pipelines.
Q2: Are there extra commas at the end of the JSON file?
The implementation in "Scenario 4" above uses a `self.first` flag to control when commas are written, so the generated JSON array is valid and has no trailing comma.
Q3: What should I do if the deduplication Pipeline runs out of memory?
When the amount of data is large, storing fingerprints directly in a Python `set` takes up a lot of memory. In that case, you can upgrade to Redis-based deduplication by replacing `self.seen` with a Redis Set. This also gives distributed crawlers global deduplication.
Q4: How to transfer data between multiple Pipelines?
Each Pipeline receives the object returned by the previous Pipeline's `process_item`. As long as every Pipeline returns the Item, changes made in an earlier Pipeline are automatically visible to the later ones. This fits the "verify, clean, deduplicate, then store" order recommended above.
💡 Core Points: Pipeline design should follow the "Single Responsibility" principle: each class does exactly one thing, and the classes are chained by priority into a complete processing flow. This keeps the code easy to read and convenient to maintain and extend. You can now copy this code into your Scrapy project and have your data processing pipeline running in no time!

