A Complete Guide to Scrapy Data Cleaning and Verification
**Nine out of ten raw records scraped from web pages contain stray whitespace, garbled characters, format errors, or even duplicate content.** Load them straight into the database and, at best, your analysis will be biased; at worst, the entire data warehouse will be polluted. Scrapy's Item Pipeline is a flexible mechanism for centralizing cleaning, verification, and deduplication before data is stored. This article walks through building a lightweight, efficient data production line: common cleaning scenarios for text, numbers, and dates; core verification techniques for required fields, ranges, and consistency; and suggestions for performance optimization and pitfall avoidance that should cover most everyday data quality problems.
1. Core architecture of Pipeline data processing
In Scrapy, every Item a Spider yields flows through the configured Pipeline classes in turn. To keep the logic clear and maintainable, give each Pipeline class a single responsibility and run them in the order "Cleaning → Verification → Deduplication", as configured in the `ITEM_PIPELINES` setting.
By adjusting the priority numbers in `ITEM_PIPELINES` (lower values run first), you control the processing order. Because each Pipeline does only one thing, later maintenance and extension stay straightforward.
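As a minimal sketch of that ordering, a `settings.py` entry might look like the following. The module path `myproject.pipelines` and the four class names are hypothetical placeholders, not names taken from a real project.

```python
# settings.py
# Module path and class names are hypothetical placeholders.
# Scrapy runs pipelines from the lowest number to the highest.
ITEM_PIPELINES = {
    "myproject.pipelines.TextCleaningPipeline": 100,         # cleaning: text
    "myproject.pipelines.NumericDateCleaningPipeline": 200,  # cleaning: numbers, dates
    "myproject.pipelines.ValidationPipeline": 300,           # verification
    "myproject.pipelines.DeduplicationPipeline": 400,        # deduplication
}
```

Leaving gaps between the numbers makes it easy to slot in another stage later without renumbering everything.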
2. Basic practice of data cleaning
2.1 Text data cleaning - the most frequent dirty work
Text fields are usually the most error-prone: leading and trailing whitespace, line breaks, tabs, leftover HTML tags, HTML entities (e.g. `&nbsp;`), inconsistent Unicode forms, and more. A dedicated `TextCleaningPipeline` can handle roughly 80% of text cleaning needs.
Core idea:
- Use `getattr` to dynamically read the text fields declared on the Item, so that each Spider can specify its own.
- Apply Unicode normalization, entity decoding, tag removal, and whitespace collapsing in sequence to recover clean plain text.
- Keep a "remove non-essential characters" option that can be enabled on demand, which guards against over-cleaning.
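The steps above can be sketched as follows. This is an illustrative reconstruction rather than the article's original code: the `text_fields` attribute name, the `strip_non_essential` switch, and the `ArticleItem` stand-in (a plain `dict` subclass used so the sketch runs without Scrapy) are all assumptions.

```python
import html
import re
import unicodedata

# Patterns are compiled once at import time, not on every item.
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")
NON_ESSENTIAL_RE = re.compile(r"[^\w\s.,;:!?'()-]")


class TextCleaningPipeline:
    # Assumption: aggressive character stripping is off by default.
    strip_non_essential = False

    def clean_text(self, value):
        if not isinstance(value, str):
            return value  # leave non-text values untouched
        text = unicodedata.normalize("NFKC", value)  # unify Unicode forms
        text = html.unescape(text)                   # &amp; -> &, &nbsp; -> U+00A0
        text = TAG_RE.sub(" ", text)                 # drop leftover HTML tags
        if self.strip_non_essential:
            text = NON_ESSENTIAL_RE.sub("", text)    # optional, use with care
        return WHITESPACE_RE.sub(" ", text).strip()  # collapse whitespace last

    def process_item(self, item, spider):
        # Only fields the Item declares in `text_fields` are cleaned.
        for field in getattr(item, "text_fields", ()):
            if field in item:
                item[field] = self.clean_text(item[field])
        return item


class ArticleItem(dict):
    """Hypothetical stand-in for an Item that declares its text fields."""
    text_fields = ("title",)
```

Fields that are not listed in `text_fields` pass through unchanged, which keeps the cleaning opt-in per Item.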
2.2 Numeric and date data cleaning - from strings to standard types
To extract real numbers and dates from strings cluttered with symbols and units, it is best to write a dedicated pipeline for the job.
Benefits of doing this:
- Numbers become `int` or `float` and dates become `datetime`, which makes later database writes and comparisons straightforward.
- Relative-time handling converts expressions such as "just released" and "3 days ago" into concrete timestamps, so time-based fields stay usable.
- Fields that fail to be cleaned are set to `None`, so a single bad value does not interrupt processing of the whole Item.
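A minimal sketch of such a pipeline is shown below. The helper names `parse_number` and `parse_date`, the `numeric_fields` and `date_fields` attributes, the English relative-time phrases, and the list of date formats are assumptions chosen for illustration; a real site will need its own patterns.

```python
import re
from datetime import datetime, timedelta

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")
RELATIVE_RE = re.compile(r"(\d+)\s*(minute|hour|day)s?\s+ago")
UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y/%m/%d")  # assumed formats


def parse_number(raw):
    """Return the first number in `raw` as int or float, or None on failure."""
    if raw is None:
        return None
    match = NUMBER_RE.search(str(raw).replace(",", ""))  # drop thousands separators
    if not match:
        return None
    text = match.group()
    return float(text) if "." in text else int(text)


def parse_date(raw, now=None):
    """Return a datetime for absolute or relative expressions, or None."""
    if not raw:
        return None
    now = now or datetime.now()
    text = str(raw).strip().lower()
    if text in ("just now", "just released"):
        return now
    match = RELATIVE_RE.fullmatch(text)
    if match:
        seconds = int(match.group(1)) * UNIT_SECONDS[match.group(2)]
        return now - timedelta(seconds=seconds)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
    return None  # unparseable values become None instead of raising


class NumericDateCleaningPipeline:
    def process_item(self, item, spider):
        for field in getattr(item, "numeric_fields", ()):
            if field in item:
                item[field] = parse_number(item[field])
        for field in getattr(item, "date_fields", ()):
            if field in item:
                item[field] = parse_date(item[field])
        return item
```

Passing `now` explicitly makes relative dates reproducible, which is handy when re-running a crawl's output through the pipeline or writing tests.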
3. Data verification in practice
After the data is cleaned, you still need to make sure it is usable. The verification pipeline checks required fields, value ranges, and logical consistency; Items that fail are either discarded or flagged.
A few practical principles:
- Only discard truly invalid data: for fields that can be salvaged (such as out-of-range values), log a warning and correct the value instead of raising `DropItem`.
- Configure through Item attributes: different Spiders can define their own `required_fields` and `range_map`, so one set of Pipelines is reused in many places.
- Logs matter: `spider.logger.warning` lets you spot data anomalies quickly and adjust the cleaning strategy in time.
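These principles can be sketched as a single verification pipeline. `required_fields` and `range_map` come from the text above, but their exact shape here (a tuple of names and a dict of `(low, high)` pairs) is an assumption, as is the choice to reset out-of-range values to `None` rather than clamp them. The `ImportError` fallback exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so the sketch runs without Scrapy
    class DropItem(Exception):
        pass


class ValidationPipeline:
    def process_item(self, item, spider):
        # 1. Required fields: missing or empty values make the Item unusable.
        missing = [
            name for name in getattr(item, "required_fields", ())
            if item.get(name) in (None, "")
        ]
        if missing:
            raise DropItem(f"missing required fields: {missing}")

        # 2. Ranges: salvageable, so warn and reset instead of dropping.
        for name, (low, high) in getattr(item, "range_map", {}).items():
            value = item.get(name)
            if not isinstance(value, (int, float)):
                continue  # nothing numeric to check
            if not low <= value <= high:
                spider.logger.warning(
                    "%s=%r outside [%s, %s]; resetting to None",
                    name, value, low, high,
                )
                item[name] = None
        return item
```

Consistency checks between fields would follow the same pattern: drop only when the Item cannot be used at all, and otherwise log and flag.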
4. Performance optimization tips
The Pipeline is the choke point of the data flow, and careless handling easily turns it into a performance bottleneck. Several simple and effective optimizations:
- Avoid I/O-intensive operations in the Pipeline: do not query the database or send HTTP requests directly. If they are unavoidable, hand them to an asynchronous thread pool, or move them out of the Pipeline and complete them asynchronously through a message queue.
- Prefer native string methods and introduce heavyweight libraries with caution: Python string methods are the fastest option for simple whitespace stripping and replacement; reserve regular expressions for complex patterns; do not parse entire HTML documents with lxml unless absolutely necessary.
- Run garbage collection at sensible points: calling `gc.collect()` in `close_spider`, or after every N Items are processed, can help limit memory growth in long-running crawls.
- Replace indiscriminate traversal with field annotations: clean and verify only the fields you have declared instead of running the same logic on every field in the Item, which improves performance and reduces accidental damage.
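The garbage-collection tip can be sketched as a small pass-through pipeline. The class name `PeriodicGcPipeline` and the `gc_every` threshold are hypothetical; whether an explicit collection helps at all depends on the workload, so measure memory before and after enabling it.

```python
import gc


class PeriodicGcPipeline:
    # Hypothetical threshold: collect after every 10,000 items.
    gc_every = 10_000

    def open_spider(self, spider):
        self.seen = 0  # items processed so far in this crawl

    def process_item(self, item, spider):
        self.seen += 1
        if self.seen % self.gc_every == 0:
            gc.collect()  # explicit full collection every N items
        return item       # the item itself is passed through unchanged

    def close_spider(self, spider):
        gc.collect()  # final sweep when the crawl ends
```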
5. Guide to common pitfalls
- Over-cleaning, where the more you wash, the dirtier it gets: blindly deleting characters with broad regular expressions may cut out legitimate content. Log the abnormal values first, then decide how to handle them.
- Forgetting time zones and mixing up dates and times: if the target website spans multiple time zones, a naive `datetime` is not enough. Use `pytz` or the `zoneinfo` module built into Python 3.9+ so that every date carries time zone information.
- Abusing `DropItem`: raising `DropItem` stops the Item from passing through the remaining Pipelines, so raise it only when the data is completely unusable. When only some fields are invalid, set them to `None` or flag them for downstream processing.
- Confused Pipeline order: clean first, then verify. Otherwise a verification failure is likely caused by insufficient cleaning rather than by the data itself.
- Large-text performance: for Items containing long text, precompile regular expressions with `re.compile` so that the compilation work is not repeated inside `process_item`.
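For the time zone pitfall in the list above, one common approach is to attach the source site's zone to each naive `datetime` and store everything in UTC. The sketch below assumes Python 3.9+; the helper names `to_utc` and `site_timezone` and the default zone `Asia/Shanghai` are illustrative assumptions.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+


def site_timezone(name="Asia/Shanghai"):
    """Look up the (assumed) time zone of the scraped site by IANA name."""
    return ZoneInfo(name)


def to_utc(dt, source_tz):
    """Return `dt` as an aware UTC datetime.

    Naive values are interpreted as wall-clock time in `source_tz`;
    aware values are simply converted.
    """
    if dt.tzinfo is None:
        # replace() is the right call for zoneinfo and fixed-offset zones;
        # pytz zones need tz.localize(dt) instead.
        dt = dt.replace(tzinfo=source_tz)
    return dt.astimezone(timezone.utc)
```

A typical call would be `to_utc(parsed, site_timezone())`. Note that `ZoneInfo` needs time zone data from the operating system or the `tzdata` package.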
Related recommendations
- Pipelines in Practice: basics of data processing
- Data Deduplication and Incremental Updates: data management strategy
- Downloader Middleware: request and response processing

