🚀 A complete guide to the Scrapy crawler framework

📂 Stage: Python crawler · Scrapy framework and distributed crawler 🔗 Related chapters: crawler-basics · Ajax analysis and dynamically rendered page crawling


📢 Read this first

This is a 2026 overview of the Scrapy ecosystem for readers at every level:

  • Newcomers can follow the learning path step by step and finish with the hands-on projects;
  • Experienced developers can jump straight to "Distributed Architecture", "Anti-crawling Confrontation" and "Containerized Deployment" to fill in gaps in their knowledge;
  • At the end of the article, a complete index of technology stack choices and practical/advanced chapters is attached for reference as needed.

Learning Path

We break learning Scrapy into six stages that follow the arc "from 0 to 1: get it running → from 1 to N: optimize → from N to cluster: distribute". Each stage has a table listing its core content and importance.

Stage 1: First steps (core framework)

Understand the logic of asynchronous architecture, build a standardized project directory, and write the first crawler that can run.

| Chapter | Title | Core content | Importance |
| --- | --- | --- | --- |
| 01 | Why choose Scrapy? | Core differences between synchronous and asynchronous crawling, Twisted's underlying logic, and Scrapy's position in the 2026 ecosystem | ⭐⭐⭐⭐⭐ |
| 02 | Scrapy's five core components | How Engine/Scheduler/Downloader/Spider/Pipeline work together | ⭐⭐⭐⭐⭐ |
| 03 | Create the first project | The scrapy startproject command and directory structure analysis | ⭐⭐⭐⭐ |
| 04 | Spider practice | Writing crawling rules, parsing the Response, yielding data and requests | ⭐⭐⭐⭐⭐ |
| 05 | Selector | Advanced XPath/CSS syntax, techniques for mixing the two | ⭐⭐⭐⭐⭐ |

Stage 2: Data flow (data processing)

Standardize the data structure and realize automated cleaning, verification, storage and multimedia association.

| Chapter | Title | Core content | Importance |
| --- | --- | --- | --- |
| 06 | Item and ItemLoader | Defining structured fields, batch data preprocessing | ⭐⭐⭐⭐ |
| 07 | Item Pipeline in practice | JSON/CSV local storage, MySQL/MongoDB/Elasticsearch persistence | ⭐⭐⭐⭐⭐ |
| 08 | Images/FilesPipeline | Batch download of images/files, custom paths, deduplication and association | ⭐⭐⭐⭐ |
| 09 | Data cleaning and validation | Dirty data filtering, field validity checks, missing value handling | ⭐⭐⭐⭐ |
| 10 | Data deduplication and incremental updates | Fingerprint generation, Redis local deduplication, timestamp-based incremental logic | ⭐⭐⭐⭐⭐ |

Stage 3: Attack and defense drills (middleware and anti-crawling)

Learn to disguise your identity, simulate human behavior, and deal with mainstream anti-crawling strategies.

| Chapter | Title | Core content | Importance |
| --- | --- | --- | --- |
| 11 | Downloader Middleware | UA pool/cookie pool management, request header spoofing | ⭐⭐⭐⭐⭐ |
| 12 | Proxy IP pool integration | Free/paid proxy integration, dynamic switching logic, dead-IP detection | ⭐⭐⭐⭐⭐ |
| 13 | AutoThrottle automatic rate limiting | Human behavior simulation, adaptive adjustment based on response latency | ⭐⭐⭐⭐ |
| 14 | Selenium/Playwright integration | Headless/headed browser integration, JS dynamic rendering, CAPTCHA preprocessing tips | ⭐⭐⭐⭐⭐ |
| 15 | Anti-crawling in practice | Anti-detection browser (Playwright stealth), slider CAPTCHA (optional third-party API), cookie persistence | ⭐⭐⭐⭐⭐ |

Stage 4: Hands-on practice (project development)

Complete two projects, vertical e-commerce crawling and social media monitoring, in a realistic, complex environment.

| Chapter | Title | Core content | Importance |
| --- | --- | --- | --- |
| 16 | Project 1: Vertical e-commerce full-site crawling | Multi-level category traversal, deep pagination, custom CrawlSpider rules | ⭐⭐⭐⭐⭐ |
| 17 | Project 2: Social media monitoring | Keyword filtering, crawling specified users/topics, real-time incremental triggering | ⭐⭐⭐⭐⭐ |
| 18-20 | Incremental/quality/performance optimization | End-to-end exception handling, log analysis, concurrency tuning, memory management | ⭐⭐⭐⭐ |

Stage 5: Scaling up (distributed and advanced)

Break through the single-machine bottleneck and scale elastically to crawl tens of millions of records.

| Chapter | Title | Core content | Importance |
| --- | --- | --- | --- |
| 21 | Scrapy-Redis architecture | Shared request queue/deduplication set, Master/Slave mode | ⭐⭐⭐⭐⭐ |
| 22-25 | Middleware/API/deduplication/optimization | Signal interception, Scrapyrt HTTP calls, Bloom filter, bandwidth/IO optimization | ⭐⭐⭐⭐ |

Stage 6: Operations, maintenance and monitoring (engineering)

Standardize the deployment environment and monitor crawler health 24/7.

| Chapter | Title | Core content | Importance |
| --- | --- | --- | --- |
| 26-28 | Deployment/containerization/monitoring | Scrapyd one-click deployment/logging, Docker/Docker Compose standardization, Prometheus + Grafana monitoring dashboards | ⭐⭐⭐⭐⭐ |

The latest technology stack in 2026

The tool set below covers tools with a production usage rate of ≥ 70%, along with their core uses and alternatives.

| Tool/Library | Core usage | Alternative | Enterprise application reference |
| --- | --- | --- | --- |
| Scrapy | Asynchronous crawler base framework | PySpider (lightweight) | Zhihu, Meituan, Firefox crawler modules |
| Twisted | Underlying asynchronous network engine | Native asyncio (Scrapy 3.x compatible version optional) | Official Scrapy dependency |
| Playwright | JS dynamic rendering | Selenium (compatibility first) | ByteDance crawler testing |
| Redis | Distributed queue/deduplication | RabbitMQ (complex task priorities) | Standard configuration for mainstream distributed crawlers |
| MongoDB | Semi-structured data storage | Elasticsearch (preferred for full-text search) | E-commerce/media data storage |
| ScrapydWeb | Visual crawler management | Gerapy (developed in China) | Production environments at small and medium-sized businesses |
| Docker Compose | Standardized single-machine cluster deployment | K8s (large-scale cloud native) | Standardized development/testing/production environments |

Get started in 3 minutes

Run a minimal Scrapy crawler and export the data as JSON.

Installation and initialization

# 1. Installing into a virtual environment is recommended
python -m venv scrapy_env
source scrapy_env/bin/activate   # Linux/Mac
scrapy_env\Scripts\activate      # Windows

# 2. Install Scrapy
pip install scrapy

# 3. Create the project and a basic spider
scrapy startproject first_scrapy
cd first_scrapy
scrapy genspider quotes quotes.toscrape.com   # free test site

Modify crawler code

Open first_scrapy/spiders/quotes.py and replace its contents with the following:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract the quote text, author, and tags
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Pagination: follow the "next" link if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
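
The pagination step works because response.follow accepts a relative href and resolves it against the URL of the current page. A minimal sketch of that resolution using only the standard library; the helper name resolve_next and the sample href "/page/2/" are assumptions for illustration, not part of Scrapy:

```python
from urllib.parse import urljoin


def resolve_next(current_url: str, next_href: str) -> str:
    """Resolve a relative 'next page' href against the current page URL.

    Hypothetical helper: it only illustrates the URL joining that
    response.follow performs before scheduling the next request.
    """
    return urljoin(current_url, next_href)


if __name__ == "__main__":
    # Assumed sample values in the shape of the pagination links on the test site
    print(resolve_next("https://quotes.toscrape.com/", "/page/2/"))
    print(resolve_next("https://quotes.toscrape.com/page/2/", "/page/3/"))
```

If you build requests with scrapy.Request directly instead, you would need to pass an absolute URL yourself, for example via response.urljoin.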

Run and save data

# Run the spider (export JSON)
scrapy crawl quotes -o quotes.json

# Override settings at run time (e.g. change the UA)
scrapy crawl quotes -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
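
The command-line options above can also be made persistent in the project settings. A minimal sketch, assuming you want the same output file and User-Agent on every run; USER_AGENT and FEEDS are standard Scrapy settings, while the concrete values simply mirror the commands above:

```python
# first_scrapy/settings.py (excerpt); illustrative values, adjust as needed

# Same User-Agent string as passed with -s on the command line
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)

# Roughly equivalent to `-o quotes.json`: export scraped items as JSON
FEEDS = {
    "quotes.json": {"format": "json", "encoding": "utf8"},
}
```

Values passed with -s on the command line still take precedence over settings.py, so the file can hold defaults while one-off runs override them.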

Scrapy Core Competencies

Compared with the "building block" combination of requests + BeautifulSoup, Scrapy's core advantages are engineering discipline, high performance, and scalability.

1. True asynchronous architecture

Built on the Twisted event loop, Scrapy keeps many requests in flight at once (bounded by settings such as CONCURRENT_REQUESTS). For a batch of concurrent requests, total time approaches the time of the slowest request instead of the sum of all of them.

# Comparison: synchronous "building block" approach (inefficient, serial)
import requests
from bs4 import BeautifulSoup

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]
for url in urls:
    resp = requests.get(url)   # blocks until this request completes
    soup = BeautifulSoup(resp.text, "lxml")
    # parse data...
# Total time = sum of the 5 request times
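
For contrast, the timing behavior of the asynchronous model can be sketched without any network access. The example below is only an illustration under stated assumptions: it uses the standard library's asyncio rather than Twisted, and the latencies are made-up numbers standing in for real requests.

```python
import asyncio
import time

# Simulated per-page latencies in seconds (made-up numbers, not measurements)
LATENCIES = [0.05, 0.10, 0.15, 0.20, 0.25]


async def fake_fetch(page: int, delay: float) -> str:
    """Stand-in for a network request: it just waits `delay` seconds."""
    await asyncio.sleep(delay)
    return f"page {page} done"


async def crawl_concurrently() -> list:
    # All "requests" are in flight at once; gather returns when the slowest finishes
    tasks = [fake_fetch(i, d) for i, d in enumerate(LATENCIES, start=1)]
    return await asyncio.gather(*tasks)


def timed_run():
    """Run the simulated crawl and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(crawl_concurrently())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run()
    print(results)
    print(f"concurrent: {elapsed:.2f}s, serial would be about {sum(LATENCIES):.2f}s")
```

The elapsed time should land near the largest latency rather than the sum, which is the property the synchronous loop above lacks.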

2. Complete standardized pipeline

From request generation to data storage, Scrapy provides a clear chain of responsibility, with extensible hooks at every step.

flowchart LR
    A[Spider<br/>Generate initial requests] --> B[Scheduler<br/>Queue management + dedup]
    B --> C[Downloader Middleware<br/>UA/proxy/header spoofing]
    C --> D[Downloader<br/>Send requests]
    D --> E[Downloader Middleware<br/>Response preprocessing]
    E --> F[Spider Middleware<br/>Response interception]
    F --> G[Spider<br/>Parse data/generate new requests]
    G --> H[Item Pipeline<br/>Clean/validate/store]
    G --> B
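
The last hop in the flow, the Item Pipeline, can be sketched as a small class. This is a minimal illustration, assuming items are plain dicts like those yielded by the quotes spider; the class name RequiredFieldsPipeline and its field list are hypothetical, while process_item and DropItem are the standard Scrapy pipeline hook and exception.

```python
try:
    from scrapy.exceptions import DropItem  # the real Scrapy exception
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    """Hypothetical pipeline: trim string fields, drop items missing required ones."""

    REQUIRED = ("text", "author")

    def process_item(self, item, spider):
        # Cleaning: strip surrounding whitespace from every string field
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()

        # Validation: discard the item if any required field is empty or absent
        missing = [field for field in self.REQUIRED if not item.get(field)]
        if missing:
            raise DropItem(f"missing fields: {missing}")

        # Returning the item hands it to the next pipeline (e.g. storage)
        return item


# To enable it, settings.py would contain something like (module path assumed):
# ITEM_PIPELINES = {"first_scrapy.pipelines.RequiredFieldsPipeline": 300}
```

Lower order numbers run first, so a cleaning pipeline like this would usually sit before any storage pipeline.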

3. Enterprise-level scalability

Through three mechanisms (middleware, extensions, and signals), Scrapy can adapt to complex requirements without any change to the framework source code.
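
As one illustration of the middleware mechanism, here is a minimal sketch of a downloader middleware that sets a random User-Agent on each request. The class name RandomUserAgentMiddleware and the UA strings are assumptions for illustration; process_request(request, spider) is the standard downloader middleware hook, and returning None lets Scrapy continue processing the request.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a User-Agent per request."""

    # Sample UA strings for illustration only; use your own pool in practice
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request reaches the downloader
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None means: keep passing the request down the chain


# To enable it, settings.py would contain something like (module path assumed):
# DOWNLOADER_MIDDLEWARES = {"first_scrapy.middlewares.RandomUserAgentMiddleware": 543}
```

Because the class only touches request.headers, it can be exercised in isolation with any object that exposes a dict-like headers attribute.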


Enterprise-level architecture evolution

The crawler architecture has gone through three stages from stand-alone to cloud-native.

Stage 1: Stand-alone architecture (small-scale data, <1 million items/day)

Suitable for lightweight data collection in personal projects and at small and medium-sized businesses.

┌──────────────────────────────────────────────────────┐
│              Single-machine Scrapy node              │
│ ┌────────────┐  ┌───────────────┐  ┌───────────────┐ │
│ │ Spider     │→ │ Pipeline      │→ │ DataStore     │ │
│ │ Middleware │  │ (clean/store) │  │ (MySQL/Mongo) │ │
│ └────────────┘  └───────────────┘  └───────────────┘ │
└──────────────────────────────────────────────────────┘

Stage 2: Redis distributed architecture (medium-scale data, <100 million items/day)

Suitable for full-site collection and aggregation of multiple data sources in vertical industries.

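flowchart TB
    subgraph RedisCluster[Redis cluster]
        A[Request queue<br/>FIFO/LIFO/Priority]
        B[Dedup set<br/>Set/BloomFilter]
        C[Crawler state<br/>Fetched/failed/progress]
    end

    subgraph CrawlerCluster[Crawler cluster]
        D[Slave node 1<br/>Scrapy+Middleware]
        E[Slave node 2<br/>Scrapy+Middleware]
        F[Slave node N<br/>Scrapy+Middleware]
    end

    subgraph DataCenter[Data center]
        G[Data storage<br/>MongoDB/ES]
        H[Monitoring and alerting<br/>ScrapydWeb/Grafana]
    end

    RedisCluster --> CrawlerCluster
    CrawlerCluster --> RedisCluster
    CrawlerCluster --> DataCenter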
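
The core idea of the diagram is that every node pushes to and pulls from one shared request queue, guarded by one shared deduplication set. The sketch below mimics that with in-memory structures. It is an illustration under stated assumptions: SharedScheduler and fingerprint are hypothetical names, a deque and a Python set stand in for the Redis queue and set, and the fingerprint is a simplified SHA-1 of method and URL rather than Scrapy's full request fingerprint.

```python
import hashlib
from collections import deque


def fingerprint(method: str, url: str) -> str:
    """Simplified request fingerprint: SHA-1 over the method and URL."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SharedScheduler:
    """In-memory stand-in for the Redis-backed queue and dedup set."""

    def __init__(self):
        self.queue = deque()  # would be a Redis list or sorted set
        self.seen = set()     # would be a Redis set or a Bloom filter

    def enqueue(self, method: str, url: str) -> bool:
        """Add a request unless an identical one was already seen."""
        fp = fingerprint(method, url)
        if fp in self.seen:
            return False  # duplicate: some node already scheduled it
        self.seen.add(fp)
        self.queue.append((method.upper(), url))
        return True

    def next_request(self):
        """Pop the oldest pending request (FIFO), or None when empty."""
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    scheduler = SharedScheduler()
    # Two nodes discover the same URL; only the first enqueue succeeds
    print(scheduler.enqueue("GET", "https://quotes.toscrape.com/page/1/"))
    print(scheduler.enqueue("GET", "https://quotes.toscrape.com/page/1/"))
    print(scheduler.next_request())
```

Moving both structures into Redis is what lets any number of crawler nodes share the work without fetching the same page twice.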
