Data quality assurance: keeping the data Scrapy crawls clean and reliable
📂 Stage: Stage 4 - Practical Exercise (Project Development)
Many newcomers to Scrapy run into the same scene: after working hard to crawl 3,000 records, they open the CSV or Excel file and their heart sinks. Half the titles are empty, the price field is full of garbled strings, and after ten-odd minutes of crawling the spider suddenly hangs and never moves again.
Dirty data and instability drag down everything downstream: analysis, reports, and front-end display. So from the first day of a real project, two "standard parts" belong in the crawler: data validation and exception handling. One acts as the "quality inspection workshop", the other as the "airbag". Next we implement both directly in Scrapy.
1. Data validation: use a Pipeline for the first "workshop quality inspection"
Scrapy's Item Pipeline is the natural place to process scraped data, and it is well suited to quality-inspection work such as keeping valid items, discarding garbage, and doing preliminary cleaning.
Here is a ValidationPipeline that can be reused directly in e-commerce projects. It is commented throughout, so you can copy it into your project and fine-tune it:
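from scrapy.exceptions import DropItem


class ValidationPipeline:
    """
    General-purpose validation pipeline for e-commerce data.

    Checks performed:
    1. Required fields are present and non-empty
    2. Price is converted to a float and sanity-checked
    3. URL format gets a basic check
    """

    def process_item(self, item, spider):
        # ---------- Step 1: required-field check ----------
        # Typical required fields in e-commerce: unique product URL, title, price
        required_fields = ["product_url", "product_title", "product_price"]
        for field in required_fields:
            value = item.get(field)
            # Check that the field exists and that its value is meaningful.
            # A numeric price of 0 is kept; only None and empty values count as missing.
            if value in (None, "", [], ()) or str(value).strip() == "":
                raise DropItem(
                    f"⚠️ Dropping invalid item: {field} is missing or empty "
                    f"| original item (first 100 chars): {str(item)[:100]}..."
                )

        # ---------- Step 2: price conversion and sanity check ----------
        raw_price = item["product_price"]  # keep the raw value for the log message
        try:
            # Strip currency signs (half-width and full-width) and thousands
            # separators, then convert to a float
            clean_price = (
                str(raw_price)
                .replace("¥", "")
                .replace("¥", "")
                .replace(",", "")
                .strip()
            )
            price = float(clean_price)
            # Sanity check: a price cannot be negative
            if price < 0:
                raise ValueError("negative price")
        except ValueError:
            raise DropItem(
                f"⚠️ Dropping invalid item: product_price is not a valid price "
                f"| original price: {raw_price} "
                f"| URL: {item['product_url']}"
            )
        item["product_price"] = price

        # ---------- Step 3: basic URL format check ----------
        # Only the most basic rule: it must start with http:// or https://
        if not str(item["product_url"]).startswith(("http://", "https://")):
            raise DropItem(
                f"⚠️ Dropping invalid item: product_url is not a valid HTTP/HTTPS link "
                f"| original link: {item['product_url']}"
            )

        # All checks passed: hand the item to the next pipeline
        return item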
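The price-cleaning step is the part of the pipeline most likely to need tuning per site, so it can help to try it in isolation before wiring it into a crawl. The following is a minimal sketch of that step as a standalone function; the name `normalize_price` and the sample strings are illustrative assumptions, not part of the pipeline above or of Scrapy.

```python
def normalize_price(raw):
    """Convert a scraped price such as '¥1,299.00' to a float.

    Hypothetical helper for illustration. Raises ValueError when the
    text is not a non-negative number.
    """
    text = str(raw)
    # Remove half-width and full-width currency signs and thousands separators
    for symbol in ("¥", "¥", ","):
        text = text.replace(symbol, "")
    price = float(text.strip())  # raises ValueError on text like 'N/A'
    if price < 0:
        raise ValueError(f"negative price: {raw!r}")
    return price


# Made-up sample values covering the accepted and rejected cases
samples = ["¥1,299.00", " 59.9 ", "N/A", "-5"]
for sample in samples:
    try:
        print(repr(sample), "->", normalize_price(sample))
    except ValueError as exc:
        print(repr(sample), "-> rejected:", exc)
```

Because the function has no Scrapy dependency, it can be covered by ordinary unit tests, and the pipeline could call it inside its `try` block if you prefer that layout.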
📝 Pipeline usage tip: after writing the pipeline, remember to enable it in settings.py. It is best placed after the deduplication pipeline and before the storage pipeline.
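# Configuration fragment for settings.py
ITEM_PIPELINES = {
    # Assuming you already have a deduplication pipeline at priority 100
    # (pipelines run in ascending order, so lower numbers run first)
    # "myproject.pipelines.DuplicatesPipeline": 100,
    "myproject.pipelines.ValidationPipeline": 200,  # enable validation at priority 200
    # "myproject.pipelines.SaveToMySQLPipeline": 300,
}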
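To make the ordering rule concrete, here is a small sketch that sorts a pipeline mapping the way the numbers imply, lowest value first. The helper `execution_order` is a hypothetical illustration and the three paths simply mirror the fragment above with all entries enabled; Scrapy does this ordering internally.

```python
# Pipeline paths mirror the settings fragment above (all three enabled here)
ITEM_PIPELINES = {
    "myproject.pipelines.SaveToMySQLPipeline": 300,
    "myproject.pipelines.ValidationPipeline": 200,
    "myproject.pipelines.DuplicatesPipeline": 100,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by their number, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With these values, deduplication runs first, validation second, and storage last, so an item dropped by validation never reaches the database.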
2. Exception handling: use a Middleware as the "airbag" that saves you in a crash
The Pipeline looks after whether the data is valid, while the Middleware looks after how long the crawler can run stably. It deals with the accidents that happen during network requests and responses: timeouts, 404s, 500 server errors, and so on.
The following ErrorHandlingMiddleware retries or gives up depending on the situation. Like an airbag, it lets the crawler try to recover from a bump first; if recovery fails, it stops and records the reason:
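from scrapy.exceptions import IgnoreRequest
from twisted.internet import error as tx_error


class ErrorHandlingMiddleware:
    """
    Downloader middleware for handling network request errors in Scrapy.

    What it does:
    1. Automatically retries timeouts and transient 5xx errors (502/503/504;
       500 is excluded because it usually signals a server-side logic error)
    2. Gives up immediately on 4xx client errors (retrying 404/403 does not
       help) and on any other status >= 400
    3. Logs every error in detail to make later troubleshooting easier
    """

    # Status codes that may be retried automatically
    ALLOWED_RETRY_STATUS_CODES = [502, 503, 504]
    # Exception types that may be retried automatically. Scrapy's default
    # downloader surfaces Twisted's exception classes, not the Python
    # built-ins that share their names.
    ALLOWED_RETRY_EXCEPTIONS = (
        tx_error.TimeoutError,  # request timed out
        tx_error.TCPTimedOutError,  # TCP connection attempt timed out
        tx_error.ConnectionRefusedError,  # connection refused
    )
    # Own retry cap. Scrapy's built-in RetryMiddleware (RETRY_TIMES) uses the
    # same request.meta["retry_times"] counter, so the two share one count
    # if both are enabled.
    MAX_RETRY_TIMES = 3

    def process_response(self, request, response, spider):
        """
        Handle responses that came back normally but with a bad status code.
        """
        if response.status in self.ALLOWED_RETRY_STATUS_CODES:
            retry_times = request.meta.get("retry_times", 0)
            if retry_times < self.MAX_RETRY_TIMES:
                retry_times += 1
                spider.logger.warning(
                    f"🔄 Status {response.status} triggered retry #{retry_times} "
                    f"| URL: {request.url}"
                )
                new_request = request.copy()
                new_request.meta["retry_times"] = retry_times
                # dont_filter=True keeps the scheduler's duplicate filter
                # from discarding the retried request
                new_request.dont_filter = True
                return new_request
            else:
                spider.logger.error(
                    f"❌ Giving up after {self.MAX_RETRY_TIMES} retries: "
                    f"status {response.status} | URL: {request.url}"
                )
                raise IgnoreRequest
        # 4xx client errors and non-retryable 5xx errors: give up immediately
        elif response.status >= 400:
            spider.logger.warning(
                f"🚫 Dropping request (client error or unrecoverable server error): "
                f"status {response.status} | URL: {request.url}"
            )
            raise IgnoreRequest
        # Status is fine: pass the response on to the spider
        return response

    def process_exception(self, request, exception, spider):
        """
        Handle exceptions raised while the request was being downloaded.
        """
        if isinstance(exception, self.ALLOWED_RETRY_EXCEPTIONS):
            retry_times = request.meta.get("retry_times", 0)
            if retry_times < self.MAX_RETRY_TIMES:
                retry_times += 1
                spider.logger.warning(
                    f"🔄 Exception triggered retry: {type(exception).__name__} "
                    f"| attempt {retry_times} | URL: {request.url}"
                )
                new_request = request.copy()
                new_request.meta["retry_times"] = retry_times
                new_request.dont_filter = True
                return new_request
            else:
                spider.logger.error(
                    f"❌ Giving up after {self.MAX_RETRY_TIMES} retries: "
                    f"{type(exception).__name__} | URL: {request.url}"
                )
                # Returning None lets Scrapy continue its normal exception
                # handling (other middlewares, then the request's errback)
                return None
        # Any other exception: log it and give up
        spider.logger.error(
            f"🚫 Dropping request (unexpected exception): {type(exception).__name__} "
            f"| details: {exception} | URL: {request.url}"
        )
        return None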
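The branching in process_response is easy to get subtly wrong, for example by letting 500 fall into the retry path. As a rough sketch, the same decision table can be written as a pure function and checked without running a crawl; `decide` and `RETRY_STATUS_CODES` are hypothetical names used only for this illustration.

```python
RETRY_STATUS_CODES = frozenset({502, 503, 504})


def decide(status, retry_times, max_retries=3):
    """Return 'retry', 'give_up', or 'pass' for an HTTP status code.

    Illustrative helper mirroring the branches of process_response above;
    it is not part of Scrapy.
    """
    if status in RETRY_STATUS_CODES:
        # Transient server errors are retried until the cap is reached
        return "retry" if retry_times < max_retries else "give_up"
    if status >= 400:
        # 4xx and non-retryable 5xx (including 500) are dropped immediately
        return "give_up"
    return "pass"


for status, tries in [(200, 0), (503, 0), (503, 3), (404, 0), (500, 0)]:
    print(status, tries, decide(status, tries))
```

Keeping the decision separate from the logging and request-copying code is one possible design; the middleware above inlines it instead, which is also fine for a class of this size.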
Register it under DOWNLOADER_MIDDLEWARES in settings.py and it takes effect automatically.
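A registration might look like the fragment below. The module path `myproject.middlewares` and the priority 543 are assumptions chosen to match the `myproject` naming used earlier; adjust both to your project. Scrapy's built-in RetryMiddleware is enabled by default and tracks attempts in the same `retry_times` meta key, so the commented-out line shows one way to switch it off if you want only this middleware to decide on retries.

```python
# settings.py fragment (module path and priority are assumptions)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ErrorHandlingMiddleware": 543,
    # Optional: disable the built-in retry middleware so that only
    # ErrorHandlingMiddleware handles retries.
    # "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}
```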
3. Summary: "Golden 3 Steps" for Quality Assurance
Whatever the size of the crawler project, the core logic of quality assurance comes down to these three steps; what differs in Scrapy is which module implements each one:
💡 Remember one thing: data quality determines analysis quality. Spend 10% of your time on validation and exception handling up front, and the cost of later data analysis, monitoring, and maintenance drops sharply. Don't wait until the data is dirty to go back and clean it; stop bad data at the source.
🔗 Extended reading