A Complete Guide to Scrapy Data Cleaning and Verification


**Most raw data scraped from web pages contains stray whitespace, garbled characters, format errors, or even duplicate content.** If you load it straight into the database, the analysis results will be skewed at best, and the entire data warehouse polluted at worst. Scrapy's Pipeline is a flexible mechanism for centralizing cleaning, verification, and deduplication before the data is stored. This article walks through building a lightweight, efficient data production line: common cleaning scenarios for text, numbers, and dates; the core checks for required fields, value ranges, and consistency; and suggestions for performance optimization and pitfall avoidance that should cover the large majority of everyday data quality problems.


1. Core architecture of Pipeline data processing

In Scrapy, every item a Spider yields flows through the configured Pipeline classes in turn. To keep the logic clear and maintainable, give each stage its own Pipeline class and run them in the order "Cleaning → Verification → Deduplication":

# settings.py: Pipeline priorities (lower numbers run first)
ITEM_PIPELINES = {
    'myproject.pipelines.TextCleaningPipeline': 100,    # 1. Text cleaning
    'myproject.pipelines.NumericDateTimePipeline': 150, # 2. Numeric/date handling
    'myproject.pipelines.CoreValidationPipeline': 200,  # 3. Core validation
    # 'myproject.pipelines.DuplicatesPipeline': 250,    # 4. (Optional) deduplication
}

By adjusting the numbers, you control the order of processing, and because each Pipeline does only one thing, later maintenance and extension stay easy.
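
To make the ordering rule concrete, the sketch below sorts a priority mapping by its values, which mirrors the ascending order in which Scrapy calls the pipelines. The `execution_order` helper is hypothetical and exists only to illustrate the rule; it is not part of Scrapy.

```python
# Same entries as in settings.py, deliberately listed out of order.
ITEM_PIPELINES = {
    'myproject.pipelines.CoreValidationPipeline': 200,
    'myproject.pipelines.TextCleaningPipeline': 100,
    'myproject.pipelines.NumericDateTimePipeline': 150,
}

def execution_order(pipelines):
    """Return pipeline paths sorted the way they would run: lowest number first."""
    return sorted(pipelines, key=pipelines.get)

for path in execution_order(ITEM_PIPELINES):
    # Prints the priority followed by the class name.
    print(ITEM_PIPELINES[path], path.rsplit('.', 1)[-1])
# 100 TextCleaningPipeline
# 150 NumericDateTimePipeline
# 200 CoreValidationPipeline
```

The position of an entry in the dictionary does not matter; only the number does.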


2. Basic practice of data cleaning

2.1 Text data cleaning - the most frequent dirty work

Text fields are usually the most error-prone: leading and trailing whitespace, line breaks, tabs, leftover HTML tags, entity encoding (e.g. &nbsp;), inconsistent Unicode forms, and more. The `TextCleaningPipeline` below can handle most everyday text cleaning needs:

import re
import unicodedata
from html import unescape

class TextCleaningPipeline:
    def process_item(self, item, spider):
        # Only clean the declared text fields, to avoid altering other fields by mistake
        text_fields = getattr(item, 'text_fields', ['title', 'content', 'desc'])

        for field in text_fields:
            if field not in item or not isinstance(item[field], str):
                continue

            raw = item[field]
            # 1. Unicode normalization (unifies full-width/half-width variants)
            raw = unicodedata.normalize('NFKC', raw)
            # 2. Decode HTML entities (&nbsp; → non-breaking space, &lt; → <)
            raw = unescape(raw)
            # 3. Strip HTML tags (a regex is enough for lightweight cases)
            raw = re.sub(r'<[^>]+>', '', raw)
            # 4. Collapse all whitespace (newlines/tabs become a single space)
            raw = re.sub(r'\s+', ' ', raw).strip()

            # 5. Optional: keep only Chinese, English, digits, and common punctuation
            # raw = re.sub(r"""[^\w\s\u4e00-\u9fff.,!?;:()"'\[\]{}\-\u2014]""", '', raw)

            item[field] = raw
        return item

Core idea:

  • Use `getattr` to read the text fields declared on the Item dynamically, so that different Spiders can specify their own.
  • Apply normalization, entity decoding, tag removal, and whitespace collapsing in that order to recover clean plain text.
  • The final "remove non-essential characters" step is left commented out; enable it only when needed, to avoid over-cleaning.
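
The four cleaning steps can be tried on individual strings before wiring them into a crawl. The following is a minimal sketch that applies the same steps in the same order; the `clean_text` function name and the sample strings are assumptions made for illustration.

```python
import re
import unicodedata
from html import unescape

def clean_text(raw):
    """Apply the four cleaning steps of TextCleaningPipeline to one string."""
    raw = unicodedata.normalize('NFKC', raw)   # full-width -> half-width
    raw = unescape(raw)                        # &nbsp;, &amp; -> characters
    raw = re.sub(r'<[^>]+>', '', raw)          # drop HTML tags
    return re.sub(r'\s+', ' ', raw).strip()    # collapse whitespace

samples = [
    '  <p>Hello&nbsp;&amp;   world</p>\n\t',
    'ＡＢＣ１２３',  # full-width letters and digits
]
for s in samples:
    print(repr(clean_text(s)))
# 'Hello & world'
# 'ABC123'
```

Note that the non-breaking space produced by `&nbsp;` is matched by `\s` in Python 3, so step 4 folds it into an ordinary space.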

2.2 Numeric and date data cleaning - from strings to standard types

To extract real numbers and dates from strings cluttered with symbols and units, a dedicated pipeline is recommended:

from datetime import datetime, timedelta
import re

class NumericDateTimePipeline:
    def __init__(self):
        # Common date formats (adjust to the actual website)
        self.date_fmts = ['%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y', '%b %d, %Y']
        # Relative-time patterns ("3天前" = "3 days ago", "2 hours ago", etc.)
        self.relative_re = [
            (r'(\d+)\s*天前', lambda m: datetime.now() - timedelta(days=int(m.group(1)))),
            (r'(\d+)\s*小时前', lambda m: datetime.now() - timedelta(hours=int(m.group(1)))),
            (r'(\d+)\s*days?\s*ago', lambda m: datetime.now() - timedelta(days=int(m.group(1)))),
            (r'(\d+)\s*hours?\s*ago', lambda m: datetime.now() - timedelta(hours=int(m.group(1)))),
        ]

    def process_item(self, item, spider):
        num_fields = getattr(item, 'num_fields', ['price', 'rating', 'stock'])
        date_fields = getattr(item, 'date_fields', ['pub_time', 'update_time'])

        for field in num_fields:
            if field not in item:
                continue
            item[field] = self._clean_num(item[field])

        for field in date_fields:
            if field not in item:
                continue
            item[field] = self._clean_date(item[field])

        return item

    def _clean_num(self, raw):
        if isinstance(raw, (int, float)):
            return raw
        if not isinstance(raw, str):
            return None

        # Strip currency symbols, units, and noise such as "元" (yuan), "个" (pieces), "评分" (rating)
        raw = re.sub(r'[¥$,€£₹%\s元个评分]', '', raw)
        try:
            return float(raw) if '.' in raw else int(raw)
        except ValueError:
            return None

    def _clean_date(self, raw):
        if isinstance(raw, datetime):
            return raw
        if not isinstance(raw, str):
            return None

        # Try relative times first (e.g. "5天前" = "5 days ago")
        for pattern, func in self.relative_re:
            match = re.search(pattern, raw, re.IGNORECASE)
            if match:
                return func(match)

        # Then try the fixed formats
        for fmt in self.date_fmts:
            try:
                return datetime.strptime(raw.strip(), fmt)
            except ValueError:
                continue
        return None

Benefits of doing this:

  • Numbers become `int` or `float` and dates become `datetime`, which makes writing to the database and comparing values much easier.
  • Relative-time handling converts expressions such as "3天前" ("3 days ago") into approximate timestamps based on the crawl time. Expressions without a number, such as "just released", would need a pattern of their own.
  • Fields that fail to clean are set to `None`, so one bad value does not interrupt processing of the whole Item.
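
A few concrete inputs show what these conversions produce. The sketch below uses two standalone helpers, `clean_number` and `parse_relative`; both names are hypothetical. Unlike the pipeline, `parse_relative` takes the reference time as an explicit argument, an assumption made here so that the result is reproducible.

```python
import re
from datetime import datetime, timedelta

def clean_number(raw):
    """Strip currency symbols and units, then convert to int or float."""
    cleaned = re.sub(r'[¥$,€£₹%\s元个评分]', '', raw)
    try:
        return float(cleaned) if '.' in cleaned else int(cleaned)
    except ValueError:
        return None  # not a number after cleaning

def parse_relative(raw, now):
    """Resolve 'N天前' / 'N days ago' against an explicit reference time."""
    match = re.search(r'(\d+)\s*(?:天前|days?\s*ago)', raw, re.IGNORECASE)
    if match:
        return now - timedelta(days=int(match.group(1)))
    return None

reference = datetime(2024, 1, 10, 12, 0)  # stand-in for the crawl time
print(clean_number('¥1,299.00'))          # 1299.0
print(clean_number('15个'))               # 15
print(clean_number('N/A'))                # None
print(parse_relative('3天前', reference))  # 2024-01-07 12:00:00
```

Passing the reference time in, rather than calling `datetime.now()` inside the helper, also makes this kind of logic easier to unit test.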

3. Data verification in practice

After the data is cleaned, you still need to make sure it is usable. The verification pipeline checks required fields, value ranges, and logical consistency; items that fail are either discarded or flagged.

from scrapy.exceptions import DropItem

class CoreValidationPipeline:
    def process_item(self, item, spider):
        # 1. Required-field check
        required = getattr(item, 'required_fields', ['title', 'price', 'pub_time'])
        missing = [f for f in required if f not in item or item[f] is None]
        if missing:
            raise DropItem(f"Missing required fields: {', '.join(missing)}")

        # 2. Numeric range check
        range_map = getattr(item, 'range_map', {'price': (0, 1000000), 'rating': (0, 5)})
        for f, (min_v, max_v) in range_map.items():
            if f not in item or not isinstance(item[f], (int, float)):
                continue
            if not (min_v <= item[f] <= max_v):
                spider.logger.warning(f"Field {f} out of range: {item[f]}")
                # Optional: drop the item, or clamp the value to the boundary
                # item[f] = max(min_v, min(item[f], max_v))

        # 3. Simple consistency check (e.g. the discount price must not exceed the original price)
        original = item.get('original_price')
        discount = item.get('discount_price')
        # Compare only when both values are numbers; a None left by cleaning would raise TypeError
        if (isinstance(original, (int, float)) and isinstance(discount, (int, float))
                and original and discount > original):
            spider.logger.warning(f"Discount > original: {discount} > {original}")
            # Depending on business needs, raise DropItem or swap the two values

        return item

A few practical principles:

  • Only discard truly invalid data: for fields that can be salvaged (such as out-of-range values), log a warning and correct the value instead of raising `DropItem` right away.
  • Configure through Item attributes: different Spiders can define their own `required_fields` and `range_map`, so one set of Pipelines is reused in many places.
  • Logs matter: `spider.logger.warning` lets you spot data anomalies quickly and adjust the cleaning strategy in time.
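
The attribute-based configuration can be sketched as follows. To keep the example free of dependencies, `ProductItem` is a plain `dict` subclass standing in for a Scrapy item; this is an assumption for illustration, and the same class attributes should also work on a `scrapy.Item` subclass next to its `Field()` declarations. The `missing_fields` helper is hypothetical and simply repeats the lookup done in the validation pipeline.

```python
class ProductItem(dict):
    """Stand-in item: a dict subclass carrying per-item validation config."""
    required_fields = ['title', 'price']
    range_map = {'price': (0, 50000)}

def missing_fields(item, default=('title', 'price', 'pub_time')):
    """List required fields that are absent or None, using the item's own config if any."""
    required = getattr(item, 'required_fields', default)
    return [f for f in required if f not in item or item[f] is None]

product = ProductItem(title='Keyboard', price=299)
plain = {'title': 'Keyboard', 'price': 299}

print(missing_fields(product))  # [] because the item declares its own required_fields
print(missing_fields(plain))    # ['pub_time'] because the pipeline default applies
```

The same item data passes or fails depending on the configuration attached to its class, which is what lets one pipeline serve several Spiders.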

4. Performance optimization tips

The Pipeline is the chokepoint of the data flow, and careless handling can easily turn it into a performance bottleneck. Several simple and effective optimizations:

  1. Avoid I/O-intensive operations in the Pipeline: do not query the database or send HTTP requests directly. If they are unavoidable, hand them to an asynchronous thread pool, or move them outside the Pipeline and complete them asynchronously through a message queue.

  2. Prefer native string methods and introduce heavyweight libraries with caution: for simple whitespace stripping and replacement, Python string methods are the fastest; use regular expressions only for complex patterns; do not parse the entire HTML with lxml unless absolutely necessary.

  3. Trigger garbage collection when it helps: calling `gc.collect()` in `close_spider`, or after every N Items, can help limit memory growth in long-running crawls.

  4. Replace indiscriminate traversal with field declarations: clean and verify only the fields you have declared instead of running the same logic over every field in the Item, which both improves performance and reduces accidental damage.
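
As an illustration of the second tip, the sketch below collapses whitespace with `str.split` and `str.join` and keeps a regular expression only for tag stripping, compiled once at module level. The names `TAG_RE` and `fast_clean` are assumptions made for this example.

```python
import re

# Compiled once at import time rather than on every call.
TAG_RE = re.compile(r'<[^>]+>')

def fast_clean(raw):
    """Strip tags with the precompiled pattern, then collapse whitespace with str methods."""
    without_tags = TAG_RE.sub('', raw)
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so no regex or strip() is needed here.
    return ' '.join(without_tags.split())

print(repr(fast_clean('  <b>Hi</b>\n there\t')))  # 'Hi there'
```

Python's `re` module already caches recently used patterns, so the gain from precompiling is usually modest; measure on real data before optimizing further.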


5. Guide to common pitfalls

  1. Over-cleaning, where the more you wash, the dirtier it gets: blindly deleting characters with broad regular expressions may cut out legitimate content. Log the abnormal values first, then decide how to handle them.

  2. Forgetting time zones and mixing up dates and times: if the target websites span multiple time zones, a naive `datetime` is not enough. Use `pytz` or the `zoneinfo` module built into Python 3.9+ so that every date carries time zone information.

  3. Abusing DropItem: `DropItem` stops the item from reaching any later Pipeline, so raise it only when the data is completely unusable. When only some fields are invalid, set them to `None` or flag them for downstream processing.

  4. Confused Pipeline order: clean first, then verify. Otherwise a verification failure is likely to come from insufficient cleaning rather than from a problem in the data itself.

  5. Slow processing of large text: for Items containing long text, precompile regular expressions with `re.compile` so that they are not recompiled inside `process_item`.
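
For the time zone pitfall, one possible approach is to attach the site's time zone to each parsed date and store everything in UTC. The sketch below assumes Python 3.9+ and a site that publishes times in `Asia/Shanghai`; the `SITE_TZ` constant and `to_utc` helper are hypothetical names. The fixed-offset fallback is only a convenience for systems without a time zone database.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

try:
    SITE_TZ = ZoneInfo('Asia/Shanghai')  # assumed time zone of the target site
except ZoneInfoNotFoundError:
    # No tz database available on this system; fall back to a fixed UTC+8 offset.
    SITE_TZ = timezone(timedelta(hours=8))

def to_utc(naive):
    """Treat a naive datetime as site-local time and convert it to UTC."""
    return naive.replace(tzinfo=SITE_TZ).astimezone(timezone.utc)

print(to_utc(datetime(2024, 1, 15, 8, 30)).isoformat())
# 2024-01-15T00:30:00+00:00
```

Storing aware UTC datetimes keeps items from different sites comparable; conversion back to local time can happen at display time.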

