title: Detailed explanation of Scrapy's five core components - Engine, Scheduler, Downloader, Spiders, Pipeline in-depth analysis | Daoman PythonAI
description: Deeply understand the architectural design, working principles and collaborative working mechanism of Scrapy's five core components: Engine, Scheduler, Downloader, Spiders, and Pipeline. Master the underlying architecture of the Scrapy crawler framework.
keywords: [Scrapy, core components, Engine, Scheduler, Downloader, Spiders, Pipeline, crawler architecture, crawler framework]
date: 2026-04-10
updated: 2024-01-15
author: DaomanPythonAI
tags: [Scrapy, crawler architecture, core components, crawler framework]

Scrapy's five core components explained - how do the five "vital organs" work together?

Hello everyone, I am Daoman PythonAI! In the last article we covered why to choose Scrapy (asynchronous, highly concurrent, and easy to extend). Today we take it apart piece by piece: the five-piece core set is not mysterious. After reading this, you will understand how Scrapy crawls data quickly, steadily, and cleverly.

📂 Stage: Stage 1 - fledgling (core framework) 🔗 Related chapters: Why choose Scrapy? · Create your first project



The global architecture in one minute

Scrapy uses an event-driven, asynchronous architecture in which five core components each do their own job. The Engine acts as the "commander-in-chief" that connects every stage, so the data pipeline forms automatically:

flowchart LR
    A[Spider<br>Analytical logician] -->|Requests| B[Engine<br>Core conductor]
    B -->|Enqueue Requests| C[Scheduler<br>Task scheduling manager]
    C -->|Next Request by priority| B
    B -->|Send Requests| D[Downloader<br>Content porter]
    D -->|HTTP request| E[Target website]
    E -->|HTTP response| D
    D -->|Responses| B
    B -->|Responses| A
    A -->|New Requests/Items| B
    B -->|Items| F[Pipeline<br>Data processing factory]
    F --> G[(Data storage)]

Minimal data flow

Spider (URLs to crawl / parsed data) → Engine (dispatch and coordination) → target component

In the Spider you only write what to parse and which new links to follow; scheduling, downloading, and storage are all handled by the framework.


Engine (core conductor)

The Engine is the "brain" of Scrapy. It does no crawling itself; it decides when each component acts, what it does, and which component receives the result next.

Core Responsibilities

  1. Event loop driver: runs on the Twisted asynchronous library, which sets the execution rhythm of the whole process
  2. Component life cycle management: start, run, pause, and stop the crawler
  3. Signal Distribution and Processing: Coordinate status notifications between components
  4. Data flow monitoring: Ensure that Requests/Responses/Items are correctly transferred between components

Common life cycle signals

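# Signals commonly used in middlewares or spiders (little code is needed; just register a handler)
from scrapy import signals

# spider_opened: the spider has just started (a good place to open a database connection)
# spider_closed: the spider has closed (close connections, print statistics)
# response_received: an HTTP response has just arrived (suits generic monitoring)
# item_scraped: fired after an Item has passed through every Pipeline without being dropped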

Configuration Notes

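# Switches in settings.py that bound how long the Engine keeps a crawl running
# (implemented by the built-in CloseSpider extension)

# Close the spider once it has been open for this many seconds
# (0 disables the limit; setting it is recommended in production)
CLOSESPIDER_TIMEOUT = 3600      # stop automatically after 1 hour

# Stop automatically after this many responses have been crawled
CLOSESPIDER_PAGECOUNT = 1000

# Stop automatically after this many items have been scraped
CLOSESPIDER_ITEMCOUNT = 10000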

These parameters can help you control the crawling scale and resource consumption to avoid infinite running.


Scheduler (Task Scheduling Manager)

The Scheduler is a "warehouse keeper plus priority manager". It manages the queue of Requests waiting to be crawled and provides three functions: deduplication, queuing, and resumable crawling.

Core functions

  1. Deduplication: by default, RFPDupeFilter generates a request fingerprint (a hash value) to decide whether a request has already been seen
  2. Queue: supports FIFO, LIFO, and priority queues; within the same priority the default is LIFO (last in, first out), which gives a depth-first crawl order
  3. Persistence: once JOBDIR is set, an interrupted crawl can be resumed from where it stopped
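
To make deduplication and priority queuing concrete, here is a toy in-memory sketch in plain Python. It is not Scrapy's implementation: `ToyScheduler`, its SHA-1 fingerprint over method and URL, and the heap layout are assumptions chosen for illustration, and persistence is left out.

```python
import hashlib
import heapq
import itertools


class ToyScheduler:
    """Toy scheduler: fingerprint-based dedup plus a priority queue (not Scrapy's code)."""

    def __init__(self):
        self._seen = set()                 # fingerprints of requests already enqueued
        self._heap = []                    # entries: (-priority, -sequence, url)
        self._counter = itertools.count()

    @staticmethod
    def fingerprint(method, url):
        # Hash the parts that identify a request (simplified to method + URL here)
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def enqueue(self, url, priority=0, method="GET"):
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return False                   # duplicate: filtered out
        self._seen.add(fp)
        # heapq is a min-heap, so negate the priority to pop larger priorities first;
        # negating the sequence number makes ties pop newest-first in this sketch
        heapq.heappush(self._heap, (-priority, -next(self._counter), url))
        return True

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Enqueuing a list page with `priority=1` and a detail page with `priority=5` makes `next_request()` return the detail page first, and enqueuing the same URL a second time returns `False`.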

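Configuration and tips

# Scheduler-related configuration in settings.py

# Duplicate filter (change this class to customize the deduplication logic)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Directory for persisted crawl state (do not delete it casually;
# after an interruption, rerunning the same command resumes the crawl)
JOBDIR = 'crawls/my_spider_20260410/'

# Priority queue class used by the Scheduler. The priority itself is set per request
# with Request(..., priority=N): larger numbers are dequeued first, and the default is 0.
# For example, give detail pages priority 5 and list pages priority 1.
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'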

Customized deduplication (simple version)

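# Subclass the default duplicate filter and override how the fingerprint is computed.
# Example: ignore query parameters such as timestamps so the same page is not crawled twice.
from scrapy.dupefilters import RFPDupeFilter

class IgnoreTimestampDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Copy the request with the whole query string removed. This drops every
        # parameter, not only the timestamp, so ?page=1 and ?page=2 would collide.
        stripped = request.replace(url=request.url.split('?')[0])
        # Delegate to the parent class so the configured fingerprinter is used
        # (the standalone scrapy.utils.request.request_fingerprint helper is deprecated).
        return super().request_fingerprint(stripped)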
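
The filter above strips the entire query string, which also removes meaningful parameters such as pagination. If only a cache-busting parameter should be ignored, a narrower normalization keeps the rest of the query intact. The sketch below is plain Python and assumes the volatile parameters are named `_t` and `timestamp`; both names are hypothetical, so adjust them to the target site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names of parameters that change on every visit
VOLATILE_PARAMS = frozenset({"_t", "timestamp"})


def strip_volatile_params(url, volatile=VOLATILE_PARAMS):
    """Return url without the volatile query parameters, keeping all others in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in volatile
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Inside a custom filter, the result could be passed to `request.replace(url=...)` before the fingerprint is computed.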

Downloader (content porter)

The Downloader is a "high-speed courier". It sends HTTP requests, receives responses, and brings pages back reliably by relying on connection reuse, automatic throttling, and downloader middleware.

Core subsystem

  1. Downloader Middleware: Intercepts Requests/Responses, can modify User-Agent, set proxy, handle redirects and retries
  2. Connection Pool: Reuse TCP connections and reduce three-way handshake overhead
  3. AutoThrottle: Dynamically adjust the request delay according to the response speed of the target website
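
A downloader middleware is an ordinary class whose `process_request` hook sees each request before it is downloaded. The sketch below rotates the User-Agent header; the class name, the two User-Agent strings, and the `myproject.middlewares` path in the note are illustrative assumptions.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that sets a random User-Agent per request."""

    # Example strings only; replace them with the User-Agents you actually want to send
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to keep processing this request normally
        return None
```

It would be enabled with something like `DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 400}`; the module path is assumed.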

Production-level configuration

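# Core anti-blocking and performance configuration in settings.py

# Concurrency control (important: do not raise it too far at once, or you may get banned)
CONCURRENT_REQUESTS = 16               # global number of concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # concurrent requests per domain (2 to 8 suggested)

# Delay control (more stable when combined with AutoThrottle)
DOWNLOAD_DELAY = 1                     # base delay of 1 second
RANDOMIZE_DOWNLOAD_DELAY = True        # boolean: wait a random 0.5x to 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True            # strongly recommended: enable automatic throttling
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average number of parallel requests to aim for per remote site
AUTOTHROTTLE_MAX_DELAY = 10            # cap the delay at 10 seconds

# Retry control
RETRY_TIMES = 3                        # retry at most 3 times
RETRY_HTTP_CODES = [500, 502, 503, 408, 429]  # retry only these status codes

# Timeout control
DOWNLOAD_TIMEOUT = 180                 # wait at most 180 seconds per request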

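To see how `DOWNLOAD_DELAY`, `AUTOTHROTTLE_TARGET_CONCURRENCY`, and `AUTOTHROTTLE_MAX_DELAY` interact, the following plain-Python sketch follows the adjustment rule described in the AutoThrottle documentation. It is a simplification: the function and argument names are made up here, and the real extension tracks the delay per download slot.

```python
def next_throttle_delay(prev_delay, latency, status=200,
                        target_concurrency=2.0, min_delay=1.0, max_delay=10.0):
    """Return the delay to use after one response (simplified AutoThrottle rule)."""
    # A server answering in `latency` seconds can keep N requests in flight
    # if one request is sent every latency / N seconds
    target_delay = latency / target_concurrency
    # Move halfway toward the target, but never go below the target itself
    new_delay = max(target_delay, (prev_delay + target_delay) / 2.0)
    # Clamp between DOWNLOAD_DELAY (min_delay) and AUTOTHROTTLE_MAX_DELAY (max_delay)
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to lower the delay
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay
```

With the values from the configuration above, a 4-second latency raises a 1-second delay to 2 seconds, while an error response that arrives quickly leaves the current delay unchanged.
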
Spiders (Analytical Logician)

Spiders are the only place where you need to write business logic. Here you define the start URLs, parse HTTP responses, and produce new requests and Item data.

The two most commonly used Spiders

  1. scrapy.Spider: Flexible and lightweight, suitable for customized extraction logic
  2. CrawlSpider: works with Rule + LinkExtractor to discover links automatically, suitable for crawling an entire site

The simplest basic Spider example

import scrapy

class DoubanTop250Spider(scrapy.Spider):
    name = "douban_top250"                  # must be unique; this is the name used to run the spider
    allowed_domains = ["movie.douban.com"]  # only crawl pages under this domain
    start_urls = ["https://movie.douban.com/top250?start=0"]

    def parse(self, response):
        # Parse the movie entries on the current page
        for movie in response.css("ol.grid_view li"):
            yield {
                "title": movie.css("span.title::text").get(),
                "rating": movie.css("span.rating_num::text").get(),
                "quote": movie.css("span.inq::text").get(),
            }

        # Extract the next-page link; follow() resolves the relative URL and calls back self.parse
        next_page = response.css("div.paginator a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Key points: a yielded Request is handed to the Engine for further scheduling, and a yielded dictionary is treated as an Item and passed through the Pipeline.


Pipeline (data processing factory)

Pipeline is a "quality inspection + warehousing pipeline", which cleans, verifies, and deduplicates the Items produced by Spider, and then safely saves them to a database or file.

Pipeline life cycle

Of the methods below, only process_item is required; the other two are optional hooks:

  1. open_spider(self, spider): runs when the crawler starts (for example, to open a MySQL connection)
  2. process_item(self, item, spider): required. Processes each Item and returns it (or raises DropItem to discard it)
  3. close_spider(self, spider): runs when the crawler stops (for example, to close the connection)
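
The three hooks can be sketched with a pipeline that writes each item as one JSON line. The class name and the default output file name are assumptions for illustration, and items are assumed to be dict-like.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes every item as one JSON line."""

    def __init__(self, path="items.jsonl"):
        self.path = path        # output file name is an assumption for this sketch
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item: return the item so later pipelines receive it
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider stops: release resources here
        self.file.close()
```

Opening the file in `open_spider` rather than in `process_item` means the resource is acquired once per crawl instead of once per item.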

Execution sequence configuration

# In settings.py, pipelines run in ascending order of their number (validate first, then clean, then store)
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 300,
    'myproject.pipelines.CleaningPipeline': 400,
    'myproject.pipelines.MongoDBPipeline': 500,
}

Complete Collaborative Workflow (Must See!)

7 steps to walk through the entire process

  1. Initialization: start the crawler → Engine loads the Spider → Scheduler initializes the queue → Downloader prepares the connection pool

  2. Initial requests: Spider generates Requests from start_urls → Engine forwards them to the Scheduler → Scheduler deduplicates them and adds them to the queue

  3. Get a request: Engine takes the highest-priority Request from the Scheduler → forwards it to the Downloader

  4. Download: Downloader sends the HTTP request → gets the response → hands it to the Engine

  5. Parsing: Engine sends the response to the Spider's corresponding callback → Spider parses out new Requests and Items

  6. Looping and storage:

  • New Requests → Engine → Scheduler → repeat steps 3-5
  • Items → Engine → run all Pipelines in sequence → save to the target location

  7. Stop: the Scheduler queue is empty and no new Requests are being generated → Engine shuts down all components → the crawler ends
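
The loop can be mimicked in a few lines of plain Python. This is a synchronous toy, not how Scrapy is built (the real Engine is asynchronous and runs on Twisted); `FAKE_SITE`, `download`, `parse`, and `run_engine` are all invented for the sketch, and a plain FIFO queue is used for brevity.

```python
from collections import deque

# Fake "website": URL -> (items found on the page, links found on the page)
FAKE_SITE = {
    "/list/1": ([], ["/detail/a", "/detail/b", "/list/2"]),
    "/list/2": ([], ["/detail/b", "/detail/c"]),
    "/detail/a": ([{"id": "a"}], []),
    "/detail/b": ([{"id": "b"}], []),
    "/detail/c": ([{"id": "c"}], []),
}


def download(url):
    """Downloader stand-in: returns the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Spider stand-in: yields items (dicts) and new requests (URL strings)."""
    items, links = response
    yield from items
    yield from links


def run_engine(start_urls, pipelines):
    queue, seen, stored = deque(), set(), []

    def schedule(url):                      # Scheduler: deduplicate, then enqueue
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in start_urls:                  # initial requests
        schedule(url)
    while queue:                            # stop when the queue is empty
        url = queue.popleft()               # get the next request
        response = download(url)            # download
        for result in parse(response):      # parse
            if isinstance(result, str):
                schedule(result)            # new request goes back to the Scheduler
            else:
                for process in pipelines:   # item goes through every pipeline in order
                    result = process(result)
                stored.append(result)
    return stored
```

Calling `run_engine(["/list/1"], [some_function])` visits each page once, even though `/detail/b` is linked from both list pages, and returns the processed items.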

Optimization Tips

Performance optimization

  1. Set concurrency reasonably: 16 to 64 globally and 2 to 8 per domain; going higher risks getting blocked
  2. Enable automatic throttling: set AUTOTHROTTLE_ENABLED = True and let the framework adapt the delay to the target site
  3. Reuse middleware logic: do not pile all the anti-blocking code into the Spider; moving it into middleware is easier to maintain

Memory optimization

  1. Discard invalid items promptly: raise DropItem in the Pipeline when an item does not meet the conditions
  2. Limit the crawl scale: use CLOSESPIDER_TIMEOUT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ITEMCOUNT

FAQ Quick Answer

Q1: How does Spider pass parameters to Pipeline?

A: You can read settings through spider.settings.get(), or define a custom attribute on the Spider (such as spider.custom_param) and access it through the spider object that is passed to the Pipeline.
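
Both routes can be sketched in one small pipeline. `TaggingPipeline`, the `SOURCE_NAME` setting, and the output field names are made up for this example; `custom_param` is the attribute from the answer above.

```python
class TaggingPipeline:
    """Hypothetical pipeline that reads one value from settings and one from the spider."""

    def __init__(self, source_name):
        self.source_name = source_name

    @classmethod
    def from_crawler(cls, crawler):
        # SOURCE_NAME is a made-up custom setting for this sketch
        return cls(source_name=crawler.settings.get("SOURCE_NAME", "unknown"))

    def process_item(self, item, spider):
        item["source"] = self.source_name
        # custom_param is an attribute the spider is assumed to define; fall back to None
        item["param"] = getattr(spider, "custom_param", None)
        return item
```

Arguments given on the command line with `-a custom_param=value` become attributes on the spider, so they can be read the same way.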

Q2: How to resume the interrupted task?

A: Set JOBDIR = 'crawls/xxx/' in settings.py. The crawler saves its intermediate state once it starts, and after an interruption, running the same command again continues the crawl.

Q3: How to make the details page crawled before the list page?

A: Add priority=5 when creating the Request for a detail page. List-page requests keep the default priority of 0 (or a low value such as 1); the larger the number, the higher the priority, and the Scheduler dequeues requests by priority.

