🚀 A complete guide to the Scrapy crawler framework

📂 Stage: Python crawler · Scrapy framework and distributed crawler 🔗 Related chapters: crawler-basics · Ajax analysis and dynamically rendered page crawling

📢 Read this first

This is a 2026 overview of the Scrapy ecosystem for readers at every level:

Newcomers can follow the learning path step by step and finish with the engineering projects;
Experienced users can jump straight to "Distributed Architecture", "Anti-crawling Countermeasures" and "Containerized Deployment" to fill gaps in their knowledge;
The end of the article includes a technology stack selection guide and a full index of the practical and advanced chapters for quick reference.

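Learning Path
The latest technology stack in 2026
3 minutes to get started quickly
Scrapy Core Competencies
Enterprise-level architecture evolution
Chapter index and tag cloud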

Learning Path

We break learning Scrapy into six stages along the path "from 0 to 1: get it running → from 1 to N: optimize → from N to cluster: distribute". Each stage has a table that lists its chapters, their core content, and their importance.

The first stage: fledgling (core framework)

Understand the logic of asynchronous architecture, build a standardized project directory, and write the first crawler that can run.

Chapter number	Title link	Core content	Importance
01	Why choose Scrapy?	Core differences between synchronous and asynchronous crawling, Twisted's underlying logic, and Scrapy's place in the 2026 ecosystem	⭐⭐⭐⭐⭐
02	Scrapy's five core components	Engine/Scheduler/Downloader/Spider/Pipeline collaboration process	⭐⭐⭐⭐⭐
03	Create the first project	The `scrapy startproject` command and a walkthrough of the directory structure	⭐⭐⭐⭐
04	Spider practice	Writing crawling rules, parsing Response, yield data and requests	⭐⭐⭐⭐⭐
05	Selector	XPath/CSS advanced syntax, mixed usage techniques	⭐⭐⭐⭐⭐

The second stage: data flow (data processing)

Standardize the data structure and realize automated cleaning, verification, storage and multimedia association.

Chapter number	Title link	Core content	Importance
06	Item and ItemLoader	Define structured fields, batch data preprocessing	⭐⭐⭐⭐
07	Item Pipeline in practice	JSON/CSV local storage, MySQL/MongoDB/Elasticsearch persistence	⭐⭐⭐⭐⭐
08	Images/FilesPipeline	Batch download of images/files, path customization, deduplication association	⭐⭐⭐⭐
09	Data cleaning and verification	Dirty data filtering, field legality verification, missing value processing	⭐⭐⭐⭐
10	Data deduplication and incremental update	Fingerprint generation, Redis local deduplication, timestamp incremental logic	⭐⭐⭐⭐⭐

The third stage: offensive and defensive drills (middleware and anti-crawling)

Learn to disguise your identity, simulate human behavior, and deal with mainstream anti-crawling strategies.

Chapter number	Title link	Core content	Importance
11	Downloader Middleware	UA pool/Cookie pool management, request header spoofing	⭐⭐⭐⭐⭐
12	Proxy IP pool integration	Free/paid proxy integration, dynamic switching logic, IP failure detection	⭐⭐⭐⭐⭐
13	AutoThrottle automatic rate limiting	Human behavior simulation, adaptive adjustment to response latency	⭐⭐⭐⭐
14	Selenium/Playwright integration	Headless/headed browser integration, rendering of JS-driven pages, CAPTCHA preprocessing tips	⭐⭐⭐⭐⭐
15	Anti-crawling in practice	Anti-detection browser (Playwright stealth), slider CAPTCHA (optional third-party API), cookie persistence	⭐⭐⭐⭐⭐

The fourth stage: practical exercise (project development)

Complete two projects, vertical e-commerce crawling and social media monitoring, in a realistic and complex environment.

Chapter number	Title link	Core content	Importance
16	Project 1: full-site crawl of a vertical e-commerce site	Multi-level category traversal, deep pagination, CrawlSpider rule customization	⭐⭐⭐⭐⭐
17	Project 2: social media monitoring	Keyword filtering, crawling of specified users/topics, real-time incremental triggering	⭐⭐⭐⭐⭐
18-20	Incremental crawling/quality/performance optimization	End-to-end exception handling, log analysis, concurrency tuning, memory management	⭐⭐⭐⭐

The fifth stage: leveling up (distributed and advanced)

Break through the single-machine bottleneck and crawl tens of millions of records elastically.

Chapter number	Title link	Core content	Importance
21	Scrapy-Redis architecture	Shared request queue/deduplication set, Master/Slave mode	⭐⭐⭐⭐⭐
22-25	Middleware/API/Deduplication/Optimization	Signal interception, ScrapyRT HTTP API calls, Bloom filter, bandwidth/IO optimization	⭐⭐⭐⭐

The sixth stage: operation, maintenance, and monitoring (engineering)

Standardize the deployment environment and monitor crawler health 24/7.

Chapter number	Title link	Core content	Importance
26-28	Deployment/containerization/monitoring	Scrapyd one-click deployment/logging, Docker/Docker Compose standardization, Prometheus+Grafana monitoring dashboard	⭐⭐⭐⭐⭐

The latest technology stack in 2026

We have compiled a toolset with a production usage rate of ≥ 70%, listing each tool's core use and its alternatives.

Tools/Libraries	Core Usage	Alternatives	Enterprise Application Reference
Scrapy	Asynchronous crawler basic framework	PySpider (lightweight)	Zhihu, Meituan, Firefox crawler modules
Twisted	Underlying asynchronous network engine	asyncio native (Scrapy 3.x compatible version optional)	Scrapy official dependency
Playwright	JS dynamic rendering processing	Selenium (compatibility first)	ByteDance crawler test
Redis	Distributed queue/deduplication	RabbitMQ (complex task priority)	Standard configuration for mainstream distributed crawlers
MongoDB	Semi-structured data storage	Elasticsearch (full-text retrieval is preferred)	E-commerce/media data storage
ScrapydWeb	Visual crawler management	Gerapy (developed in China)	Production environments at small and medium-sized companies
Docker Compose	Single-machine cluster standardized deployment	K8s (large-scale cloud native)	Standardized development/testing/production environment

3 minutes to get started quickly

Run a minimal Scrapy crawler and export the scraped data as JSON.

Installation and initialization

# 1. Install inside a virtual environment (recommended); run only the activate line for your OS
python -m venv scrapy_env
source scrapy_env/bin/activate   # Linux/Mac
scrapy_env\Scripts\activate      # Windows

# 2. Install Scrapy
pip install scrapy

# 3. Create the project and a basic spider
scrapy startproject first_scrapy
cd first_scrapy
scrapy genspider quotes quotes.toscrape.com   # free practice site

Modify crawler code

Open first_scrapy/spiders/quotes.py and replace its contents with the following:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract the quote text, author, and tags
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Pagination: follow the "next" link if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Run and save data

# Run the spider and export JSON (-o appends to an existing file; use -O to overwrite it)
scrapy crawl quotes -o quotes.json

# Override a setting at run time (here, the User-Agent)
scrapy crawl quotes -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

Scrapy Core Competencies

Compared with the "building block" combination of requests + BeautifulSoup, Scrapy's core advantages are engineering discipline, high performance, and scalability.

1. True asynchronous architecture

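Built on the Twisted event loop, Scrapy keeps many requests in flight at once instead of waiting for each one in turn. In the ideal case, the total time for a batch approaches the time of its slowest request rather than the sum of all request times; in practice it is bounded by the configured concurrency limits and download delays.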

# For comparison: the synchronous "building block" approach (inefficient, serial)
import requests
from bs4 import BeautifulSoup

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]
for url in urls:
    resp = requests.get(url)   # blocks until this request completes
    soup = BeautifulSoup(resp.text, "lxml")
    # parse the data...
# Total time = the sum of all 5 request times
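
For contrast with the serial loop, here is a minimal sketch of the concurrent model using only the standard library. It is not Scrapy's Twisted-based implementation: `fake_fetch`, `crawl_concurrently`, `run_demo`, and the delay values are hypothetical stand-ins for network requests, chosen so the timing difference is visible without touching a real site.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float) -> str:
    """Hypothetical stand-in for a network request that takes `delay` seconds."""
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking it
    return f"fetched {url}"


async def crawl_concurrently(delays: list[float]) -> tuple[list[str], float]:
    """Start every simulated request at once and wait for all of them."""
    urls = [
        f"https://quotes.toscrape.com/page/{i}/"
        for i in range(1, len(delays) + 1)
    ]
    started = time.perf_counter()
    results = await asyncio.gather(
        *(fake_fetch(url, delay) for url, delay in zip(urls, delays))
    )
    return list(results), time.perf_counter() - started


def run_demo() -> tuple[list[str], float]:
    # Assumed latencies for five pages: the sum is 0.6 s, the slowest is 0.2 s.
    return asyncio.run(crawl_concurrently([0.1, 0.2, 0.1, 0.1, 0.1]))


if __name__ == "__main__":
    pages, elapsed = run_demo()
    print(f"{len(pages)} pages in {elapsed:.2f}s")
```

With these assumed delays the elapsed time should land near the slowest simulated request rather than near the sum. In a real Scrapy project the engine and downloader provide this behavior, and the number of requests in flight is capped by settings such as `CONCURRENT_REQUESTS`.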

2. Complete standardized pipeline

From request generation to data storage, Scrapy provides a clear chain of responsibility, with extensible hooks at every step.

flowchart LR
    A[Spider<br/>generates initial requests] --> B[Scheduler<br/>queue management + dedup]
    B --> C[Downloader Middleware<br/>UA/proxy/request header spoofing]
    C --> D[Downloader<br/>sends requests]
    D --> E[Downloader Middleware<br/>response preprocessing]
    E --> F[Spider Middleware<br/>response interception]
    F --> G[Spider<br/>parses data/yields new requests]
    G --> H[Item Pipeline<br/>cleaning/validation/storage]
    G --> B
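
To make one of these hooks concrete, the following is a rough sketch of a downloader middleware that picks a User-Agent for each outgoing request. Only the `process_request(request, spider)` hook name and its "return None to continue" convention are taken from Scrapy; the class name, the `USER_AGENTS` list, and `FakeRequest` (a stand-in so the sketch runs without Scrapy installed) are assumptions made for illustration.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: choose a User-Agent per request."""

    # Example strings only; a real pool would be maintained separately.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request reaches the downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets the request continue through the remaining
        # middlewares and on to the downloader.
        return None


class FakeRequest:
    """Minimal stand-in for a Scrapy request, used only to try the sketch."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


if __name__ == "__main__":
    request = FakeRequest("https://quotes.toscrape.com/")
    RandomUserAgentMiddleware().process_request(request, spider=None)
    print(request.headers["User-Agent"])
```

In a project, a class like this would live in the project's middlewares module and be enabled through the `DOWNLOADER_MIDDLEWARES` setting; the hook signature is worth checking against the documentation for the installed Scrapy version.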

3. Enterprise-level scalability

Through three mechanisms (middleware, extensions, and the signal system), Scrapy can adapt to complex requirements without any change to the framework source code.
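
As a sketch of how such components plug in without touching framework code, a project's settings.py registers each one by import path plus an order number. The setting names below are Scrapy's own; every `first_scrapy.*` class path is hypothetical and stands for code you would write yourself.

```python
# settings.py (fragment). All "first_scrapy.*" paths are hypothetical.

# Downloader middlewares: lower numbers sit closer to the engine,
# higher numbers closer to the downloader.
DOWNLOADER_MIDDLEWARES = {
    "first_scrapy.middlewares.RandomUserAgentMiddleware": 543,
}

# Item pipelines: items pass through in ascending order of these values
# (by convention the values are kept in the 0-1000 range).
ITEM_PIPELINES = {
    "first_scrapy.pipelines.CleanQuotePipeline": 300,
    "first_scrapy.pipelines.JsonStorePipeline": 800,
}

# Extensions: loaded once at startup; they typically connect to signals
# such as spider_opened or item_scraped.
EXTENSIONS = {
    "first_scrapy.extensions.CrawlStatsLogger": 500,
}
```

Because components are referenced by path, swapping or disabling one is a settings change rather than a code change in the spider.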

Enterprise-level architecture evolution

The crawler architecture has gone through three stages from stand-alone to cloud-native.

The first stage: stand-alone architecture (small-scale data, <1 million items/day)

Suited to lightweight data collection for personal projects and small businesses.

┌─────────────────────────────────────────────────────────┐
│               Single-machine Scrapy node                │
│ ┌───────────────┐  ┌───────────────┐  ┌───────────────┐ │
│ │ Spider        │→ │ Pipeline      │→ │ DataStore     │ │
│ │ Middleware    │  │ (clean/store) │  │ (MySQL/Mongo) │ │
│ └───────────────┘  └───────────────┘  └───────────────┘ │
└─────────────────────────────────────────────────────────┘

The second stage: Redis distributed architecture (medium-scale data, <100 million items/day)

Suitable for full-site collection and aggregation of multiple data sources in vertical industries.

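flowchart TB
    subgraph RedisCluster["Redis cluster"]
        A[Request queue<br/>FIFO/LIFO/Priority]
        B[Dedup set<br/>Set/BloomFilter]
        C[Crawler state<br/>fetched/failed/progress]
    end

    subgraph CrawlerCluster["Crawler cluster"]
        D[Slave node 1<br/>Scrapy+Middleware]
        E[Slave node 2<br/>Scrapy+Middleware]
        F[Slave node N<br/>Scrapy+Middleware]
    end

    subgraph DataCenter["Data center"]
        G[Data storage<br/>MongoDB/ES]
        H[Monitoring and alerting<br/>ScrapydWeb/Grafana]
    end

    RedisCluster --> CrawlerCluster
    CrawlerCluster --> RedisCluster
    CrawlerCluster --> DataCenter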
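
On each crawler node, the shared queue and dedup set in this diagram are usually wired up through settings. The fragment below is a sketch based on the setting names in the scrapy-redis README, not a verified production configuration; it assumes the scrapy-redis package is installed, and the Redis address is a placeholder.

```python
# settings.py (fragment) for one crawler node using scrapy-redis.
# Assumption: scrapy-redis is installed; the Redis URL is a placeholder.

# Replace the default scheduler so pending requests are queued in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis so every node shares one dedup set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Queue discipline; scrapy-redis also ships FifoQueue and LifoQueue.
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"

# Keep the queue and dedup set in Redis when a spider stops,
# so a later run can resume instead of starting over.
SCHEDULER_PERSIST = True

# Hypothetical address of the shared Redis instance.
REDIS_URL = "redis://localhost:6379/0"
```

Every node runs the same spider code with the same settings; which node handles a given request is decided by whichever worker pops it from the shared queue first.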

🔗 Quick jump

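Scrapy · crawler framework · distributed crawler · anti-crawling strategy · data scraping · crawler middleware · Scrapy-Redis · crawler deployment · data cleaning · web crawler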
