title: Detailed explanation of Scrapy's five core components - Engine, Scheduler, Downloader, Spiders, Pipeline in-depth analysis | Daoman PythonAI
description: Deeply understand the architectural design, working principles and collaborative working mechanism of Scrapy's five core components: Engine, Scheduler, Downloader, Spiders, and Pipeline. Master the underlying architecture of the Scrapy crawler framework.
keywords: [Scrapy, core components, Engine, Scheduler, Downloader, Spiders, Pipeline, crawler architecture, crawler framework]
date: 2026-04-10
updated: 2024-01-15
author: DaomanPythonAI
tags: [Scrapy, crawler architecture, core components, crawler framework]
Detailed explanation of the five core components of Scrapy - How do its "five vital organs" work together?
Hello everyone, I am Daoman PythonAI! In the last article, we covered why to choose Scrapy (asynchronous, high concurrency, and strong extensibility). Today we take it apart: the five core components are not mysterious. After reading this, you will understand how Scrapy crawls data quickly, reliably, and cleverly!
📂 Stage: Stage 1 - Beginner (core framework) 🔗 Related chapters: Why choose Scrapy? · Create your first project
Table of contents
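- Look at the global architecture in 1 minute
- Engine (core conductor)
- Scheduler (Task Scheduling Manager)
- Downloader (content porter)
- Spiders (Analytical Logician)
- Pipeline (data processing factory)
- Complete Collaborative Workflow (Must See!)
- Optimization Tips
- FAQ Quick Answer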
Look at the global architecture in 1 minute
Scrapy adopts an event-driven asynchronous architecture in which five core components each do their own job. The Engine acts as the "general commander" that connects all the links, automatically forming a data pipeline:
Minimalist data flow direction: Spider → Engine → Scheduler → Engine → Downloader → Engine → Spider, after which new Requests go back to the Scheduler and Items go to the Pipeline.
You only need to write what to parse and which new links to follow in the Spider; the framework takes care of the scheduling, downloading, and storage.
Engine (core conductor)
Engine is the "brain" of Scrapy. It does no crawling work itself; it decides when things happen, what gets done, and which component each object is handed to.
Core Responsibilities
- Event loop driver: Runs on the Twisted asynchronous library, which sets the execution rhythm of the entire process
- Component life cycle management: start, run, pause, and stop the crawler
- Signal Distribution and Processing: Coordinate status notifications between components
- Data flow monitoring: Ensure that Requests/Responses/Items are correctly transferred between components
Common life cycle signals
Configuration Notes
The CLOSESPIDER_* settings (CLOSESPIDER_TIMEOUT, CLOSESPIDER_PAGECOUNT, CLOSESPIDER_ITEMCOUNT) help you cap the crawling scale and resource consumption so that a crawl cannot run forever.
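A minimal settings.py sketch of such limits might look like the following. The numeric values are illustrative assumptions, not recommendations from the original article; a value of 0 (the default) disables the corresponding limit.

```python
# settings.py (illustrative values, tune them per project)

# Close the spider after it has been open for this many seconds.
CLOSESPIDER_TIMEOUT = 3600

# Close the spider after this many responses have been crawled.
CLOSESPIDER_PAGECOUNT = 10000

# Close the spider after this many items have passed the pipelines.
CLOSESPIDER_ITEMCOUNT = 5000
```

Whichever limit is reached first stops the crawl.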
Scheduler (Task Scheduling Manager)
Scheduler is a "warehouse administrator + priority manager". It manages the queue of Requests waiting to be crawled and provides three major functions: deduplication, queuing, and resumable crawling.
Core functions
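- Deduplication: By default, RFPDupeFilter generates a request fingerprint (hash value) to determine whether a request has already been seen
- Queue: Supports FIFO / LIFO / priority queues. Requests are dequeued by priority first; among requests of equal priority the default is LIFO (last in, first out, i.e. depth-first order), and FIFO has to be configured explicitly
- Persistence: Set JOBDIR, and an interrupted task can be resumed from the breakpoint!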
Configuration and techniques
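The scheduler-related options could be sketched in settings.py as follows. The JOBDIR path is a hypothetical placeholder, and the three queue settings at the end are only needed if you want breadth-first (FIFO) order instead of the default depth-first (LIFO) order.

```python
# settings.py (a sketch; the JOBDIR path is a hypothetical example)

# Persist the request queue and seen fingerprints so a crawl can be resumed.
JOBDIR = "crawls/myspider-run1"

# The default duplicate filter, shown explicitly for clarity.
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

# Log every filtered duplicate request instead of only the first one.
DUPEFILTER_DEBUG = True

# Optional: switch from the default LIFO (depth-first) order
# to FIFO (breadth-first) order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```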
Customized deduplication (simple version)
Downloader (content porter)
Downloader is a "high-speed courier", responsible for sending HTTP requests, receiving responses, and steadily moving web pages back by relying on connection reuse, automatic speed limiting, and downloader middleware.
Core subsystem
- Downloader Middleware: Intercepts Requests/Responses, can modify User-Agent, set proxy, handle redirects and retries
- Connection Pool: Reuse TCP connections and reduce three-way handshake overhead
- AutoThrottle: Dynamically adjust the request delay according to the response speed of the target website
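To make the downloader middleware idea concrete, here is a minimal sketch that rotates the User-Agent header. The class name and the two User-Agent strings are made up for illustration.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that rotates the User-Agent header."""

    # Placeholder strings; replace them with real browser User-Agents.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Mozilla/5.0 (Macintosh) ExampleBrowser/2.0",
    ]

    def process_request(self, request, spider):
        # Called for each request on its way to the Downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to keep processing this request.
        return None
```

It would be registered in DOWNLOADER_MIDDLEWARES under a hypothetical path such as myproject.middlewares.RandomUserAgentMiddleware. An order value below 500 should make it run before the built-in UserAgentMiddleware, which only sets the header when it is still missing.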
Production-level configuration
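A production-oriented settings.py could be sketched like this. The setting names are standard Scrapy settings; the values are assumptions chosen within the ranges suggested in the optimization tips of this article and should be tuned against the target site.

```python
# settings.py (illustrative values, not universal recommendations)

# Concurrency: global limit and per-domain limit.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Base delay between requests to the same domain, in seconds.
DOWNLOAD_DELAY = 0.5

# Give up on a response after this many seconds.
DOWNLOAD_TIMEOUT = 30

# Retry failed requests a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3

# AutoThrottle: adapt the delay to the target site's response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```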
Spiders (Analytical Logician)
Spiders are where you write your crawling business logic! Here you define the starting URLs, parse the HTTP responses, and produce new requests and Item data.
The two most commonly used Spiders
- scrapy.Spider: Flexible and lightweight, suitable for customized extraction logic
- CrawlSpider: Works together with Rule + LinkExtractor to discover links automatically, suitable for crawling an entire site
Basic Spider simplest example
Key points: a Request that you yield is handed to the Engine, which passes it on for scheduling. A dictionary that you yield is treated as an item and passed through the Pipeline.
Pipeline (data processing factory)
Pipeline is a "quality inspection + warehousing pipeline", which cleans, verifies, and deduplicates the Items produced by Spider, and then safely saves them to a database or file.
Pipeline life cycle
Each Pipeline must implement process_item; the other two methods below are optional (and there are more optional hooks):
- open_spider(self, spider): Executed when the crawler starts (such as initializing the MySQL connection)
- process_item(self, item, spider): Required! Process each Item and return the Item (or raise DropItem to discard it)
- close_spider(self, spider): Executed when the crawler stops (such as closing the connection)
Execution sequence configuration
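The execution order is controlled by the integer assigned to each pipeline in ITEM_PIPELINES: items pass through pipelines from the lowest number to the highest, and values are conventionally kept in the 0 to 1000 range. The module paths and class names below are hypothetical.

```python
# settings.py (module paths and class names are hypothetical)
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 100,  # runs first
    "myproject.pipelines.DedupPipeline": 200,       # runs second
    "myproject.pipelines.SQLitePipeline": 300,      # runs last, stores the item
}
```

Putting validation first means invalid items are dropped before any storage work is done.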
Complete Collaborative Workflow (Must See!)
7 steps to walk through the entire process
1. Initialization: Start the crawler → Engine loads Spider → Scheduler initializes the queue → Downloader prepares the connection pool
2. Initiate initial requests: Spider generates Requests from start_urls → Engine forwards them to Scheduler → Scheduler removes duplicates and then enqueues them
3. Get request: Engine takes the highest priority Request from Scheduler → forwards it to Downloader
4. Download: Downloader sends HTTP request → gets response → hands it to Engine
5. Analysis: Engine sends the response to the Spider's corresponding callback → Spider parses out new Requests and Items
6. Looping and storage:
   - New Requests → Engine → Scheduler → Repeat steps 3-5
   - Items → Engine → Execute all Pipelines in sequence → Save to the target location
7. Stop: The Scheduler queue is empty and no new Request is generated → Engine shuts down all components → The crawler ends
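To see the loop in miniature, here is a toy, synchronous simulation of the steps above in plain Python. It is only a mental model: real Scrapy is asynchronous and far more elaborate, and every name here (FAKE_SITE, download, parse, run) is invented for this sketch.

```python
import heapq
import itertools

# Hypothetical in-memory "website": URL -> (links on the page, item on the page)
FAKE_SITE = {
    "/list": (["/detail/1", "/detail/2", "/list"], None),
    "/detail/1": ([], {"id": 1}),
    "/detail/2": (["/detail/1"], {"id": 2}),
}


def download(url):
    """Downloader stand-in: return the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Spider stand-in: yield items (dict) and new request URLs (str)."""
    links, item = response
    if item is not None:
        yield item
    yield from links


def run(start_urls):
    seen = set()                 # Scheduler: duplicate filter
    queue = []                   # Scheduler: priority queue
    counter = itertools.count()  # tie-breaker for equal priorities
    stored = []                  # Pipeline: final storage

    def schedule(url, priority=0):
        if url in seen:
            return               # duplicate, silently dropped
        seen.add(url)
        heapq.heappush(queue, (-priority, next(counter), url))

    for url in start_urls:                    # step 2: initial requests
        schedule(url)

    while queue:                              # step 7: stop when the queue is empty
        _, _, url = heapq.heappop(queue)      # step 3: take the next request
        response = download(url)              # step 4: download
        for result in parse(response):        # step 5: parse
            if isinstance(result, dict):
                stored.append(result)         # step 6: items go to storage
            else:
                schedule(result)              # step 6: new requests loop back
    return stored


if __name__ == "__main__":
    print(run(["/list"]))
```

Each URL is downloaded once even though /list and /detail/1 are linked more than once, which is the deduplication step at work.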
Optimization Tips
Performance optimization
- Set concurrency reasonably: global 16~64, single domain name 2~8; not too high, to avoid being blocked
- Automatic speed limit must be enabled: AUTOTHROTTLE_ENABLED = True, and let the framework adapt the request delay to the target site's response speed
- Reuse middleware logic: Don't pile all the anti-crawling code in the Spider. Extracting it into middleware is easier to maintain.
Memory optimization
- Discard invalid items in time: If an item does not meet the conditions in the Pipeline, raise DropItem
- Limit crawling scale: Use CLOSESPIDER_TIMEOUT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ITEMCOUNT to control it
FAQ Quick Answer
Q1: How does Spider pass parameters to Pipeline?
A: Read settings directly with spider.settings.get(), or define custom attributes on the Spider (such as spider.custom_param) and access them through the spider object that is passed to the Pipeline methods.
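Both approaches can be sketched in one small pipeline. SOURCE_NAME is a hypothetical custom setting, and TaggingPipeline and the source/tag fields are names invented for this example.

```python
class TaggingPipeline:
    """Hypothetical pipeline that reads parameters from the spider."""

    def open_spider(self, spider):
        # Option 1: read a value from the project or spider settings.
        self.source = spider.settings.get("SOURCE_NAME", "unknown")
        # Option 2: read a custom attribute defined on the spider.
        self.custom_param = getattr(spider, "custom_param", None)

    def process_item(self, item, spider):
        item["source"] = self.source
        item["tag"] = self.custom_param
        return item
```

Spider arguments passed on the command line with -a (for example -a custom_param=news) become attributes on the spider by default, so they can be read the same way.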
Q2: How to resume the interrupted task?
A: In settings.py, set JOBDIR = 'crawls/xxx/'. The crawler saves its intermediate state once it has started. After an interruption, restart it with the same command to continue crawling!
Q3: How to make the details page crawled before the list page?
A: When generating the Request for a details page, add priority=5 (the default priority of a Request is 0, and the larger the number, the higher the priority). The scheduler dequeues requests according to priority.

