Scrapy-Redis distributed architecture - building a high-performance distributed crawler cluster
When the crawl workload exceeds what a single machine can handle, scaling CPU or bandwidth vertically eventually hits a ceiling. Scrapy-Redis gets past that ceiling by keeping a shared request queue and a shared deduplication set in Redis, so that multiple Scrapy instances on multiple machines can work on the same crawl like a well-trained team. This article goes from understanding the architecture to hands-on deployment, building a stable and scalable crawler cluster step by step.
1. Architecture Overview: How distributed crawlers “divide labor”
The core idea of Scrapy-Redis is simple: use Redis as the dispatch center and deduplication center, and each crawler node only crawls and processes data. All nodes share the same request queue and crawled URL fingerprint set, thus avoiding repeated crawling and naturally achieving load balancing.
A complete distributed crawling process (shown as a flow chart in the original) rests on the following three points:
- Shared request queue: stored in Redis. All nodes take tasks from it and put the new links they discover back into it.
- Shared deduplication set: also stored in Redis, holding the fingerprint of each request. Before a request is queued, its fingerprint is checked against the set, so a request that any node has already scheduled is not crawled again.
- Decentralized scheduling: Redis itself does not assign tasks. Each node pulls the next request when it is ready for one, so the load ends up shared roughly in proportion to processing speed.
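As a rough mental model of this division of labor, the sketch below replaces Redis with in-process Python structures: a deque for the shared queue and a set for the fingerprints. The link graph, node names, and function names are all hypothetical. A real cluster keeps both structures in Redis and runs the nodes as separate processes.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: url -> links found on it.
PAGES = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a", "d"],
    "d": [],
}

shared_queue = deque()   # stands in for the Redis request queue
seen = set()             # stands in for the Redis fingerprint set


def schedule(url):
    """Enqueue url only if no node has scheduled it before."""
    if url in seen:
        return False
    seen.add(url)
    shared_queue.append(url)
    return True


def run_cluster(seed, nodes):
    """Let the nodes take turns pulling from the shared queue."""
    crawled = {node: [] for node in nodes}
    schedule(seed)
    turn = 0
    while shared_queue:
        node = nodes[turn % len(nodes)]
        url = shared_queue.popleft()      # the node takes a task
        crawled[node].append(url)
        for link in PAGES[url]:           # the node puts back new links
            schedule(link)
        turn += 1
    return crawled


result = run_cluster("a", ["node-1", "node-2"])
print(result)
```

Every page is crawled exactly once even though pages link back to each other, because the fingerprint check happens before a link enters the queue.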
2. Get started quickly: Set up your first distributed cluster in three steps
2.1 Install dependencies
Make sure all crawler nodes have the same version of dependencies installed.
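The packages are typically installed from PyPI as scrapy and scrapy-redis (the redis client comes along as a dependency). To check that every node ended up with the same versions, a small standard-library script like the following could be run on each machine. The package list and function name are assumptions made for illustration.

```python
from importlib import metadata

# PyPI distribution names a Scrapy-Redis node typically needs.
PACKAGES = ["scrapy", "scrapy-redis", "redis"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Comparing the printed lines across machines is a quick way to spot a node that drifted to a different release.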
2.2 Configure Scrapy project
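In your settings.py, make three replacements, which are the core changes in going from stand-alone to distributed: switch to the Scrapy-Redis scheduler, switch to its Redis-backed duplicate filter, and point the project at your Redis server.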
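A minimal sketch of such a settings.py fragment follows. The class paths are the ones shipped with scrapy-redis; the Redis address is an assumption and should be replaced with your own.

```python
# settings.py (fragment)

# 1. Replace the scheduler so requests are queued in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# 2. Replace the duplicate filter so fingerprints are stored in Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# 3. Point every node at the same Redis server (address is an assumption).
REDIS_URL = "redis://127.0.0.1:6379/0"

# Optional: keep the queue and fingerprints in Redis between runs.
SCHEDULER_PERSIST = True
```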
Tip:
SCHEDULER_PERSIST = True keeps the queue data in Redis from being lost, which suits long-running crawlers; if you want to start fresh every time, set it to False.
2.3 Transform Spider
The native Spider must be changed so that it obtains its starting URLs from Redis and no longer uses start_urls, for example by subclassing RedisSpider and setting its redis_key.
NOTE:
RedisSpider ignores the start_urls class variable; the starting tasks must be injected manually through a Redis key (such as demo_distributed:start_urls).
3. Start the cluster: one command, multiple terminals in parallel
3.1 Start the Redis service
Make sure all crawler nodes can access Redis. It is recommended to use Sentinel Mode or Redis Cluster in the production environment to avoid single points of failure.
3.2 Inject starting URL
Open the Redis command line (redis-cli) and push the first seed URL onto the spider's start key, for example with LPUSH on demo_distributed:start_urls.
Multiple seeds can be pushed at one time, each seed corresponding to one or more starting pages.
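The same injection can be scripted. The sketch below assumes the redis-py client, a Redis server on localhost, and placeholder seed URLs; push_seeds and main are hypothetical helper names.

```python
SEED_KEY = "demo_distributed:start_urls"


def push_seeds(client, urls, key=SEED_KEY):
    """Push seed URLs onto the Redis list that the spider reads from."""
    urls = [u.strip() for u in urls if u.strip()]
    if not urls:
        return 0
    # LPUSH accepts several values at once and returns the new list length.
    return client.lpush(key, *urls)


def main():
    import redis  # redis-py; imported here so push_seeds has no hard dependency

    client = redis.Redis.from_url("redis://127.0.0.1:6379/0")  # assumed address
    total = push_seeds(client, [
        "https://example.com/",             # placeholder seeds
        "https://example.com/sitemap.xml",
    ])
    print(f"{SEED_KEY} now holds {total} seed(s)")
```

Calling main() on any machine that can reach Redis is enough; the seeds do not have to be pushed from a crawler node.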
3.3 Start the crawler node
Run the same command on each node: scrapy crawl followed by the spider's name.
There is no limit on the number of nodes; you can add new machines at any time, and the crawler will automatically share the tasks. The node_id in each node's output identifies which machine handled a page, which lets you verify that the distributed collaboration has taken effect.
4. In-depth analysis of the core mechanism
4.1 Shared request queue: Selection strategies for three queues
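Through SCHEDULER_QUEUE_CLASS, Scrapy-Redis supports three queue implementations, which differ essentially in the Redis data structure they use: PriorityQueue (the default, backed by a sorted set), and FifoQueue and LifoQueue (both backed by a list).
To switch to a first-in-first-out queue, set SCHEDULER_QUEUE_CLASS to scrapy_redis.queue.FifoQueue.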
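As a settings.py fragment, the switch could look like the sketch below; the two commented-out lines show the other queue classes shipped with scrapy-redis for comparison.

```python
# settings.py (fragment)

# Default: priority queue backed by a Redis sorted set.
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"

# First-in-first-out queue backed by a Redis list.
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"

# Last-in-first-out (stack) queue, also backed by a Redis list.
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.LifoQueue"
```

Roughly speaking, a FIFO queue gives breadth-first crawling and a LIFO queue depth-first crawling, while the priority queue honors each request's priority value.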
4.2 Shared fingerprint deduplication: How does your crawler remember “places visited”
The default RFPDupeFilter hashes each request's method + url + body into a SHA1 fingerprint and stores it in a Redis Set. A request counts as new only if adding its fingerprint to the Set actually inserts a new member.
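The simplified logic can be sketched as follows. This is not the library's actual code: the fingerprint here skips the URL canonicalization that Scrapy performs, and InMemorySet is a hypothetical stand-in that only mimics the return value of Redis SADD so the example runs without a server.

```python
import hashlib


def request_fingerprint(method, url, body=b""):
    """SHA1 over method, URL and body (a simplification of the real filter)."""
    sha1 = hashlib.sha1()
    sha1.update(method.upper().encode("utf-8"))
    sha1.update(url.encode("utf-8"))
    sha1.update(body or b"")
    return sha1.hexdigest()


class InMemorySet:
    """Stand-in for a Redis client; only mimics SADD's return value."""

    def __init__(self):
        self._members = {}

    def sadd(self, key, member):
        members = self._members.setdefault(key, set())
        if member in members:
            return 0      # already present: nothing was added
        members.add(member)
        return 1          # newly added


def request_seen(server, key, method, url, body=b""):
    """Return True if some node has already scheduled this request."""
    added = server.sadd(key, request_fingerprint(method, url, body))
    return added == 0


server = InMemorySet()
key = "demo_distributed:dupefilter"
print(request_seen(server, key, "GET", "https://example.com/"))  # False
print(request_seen(server, key, "GET", "https://example.com/"))  # True
```

Checking and recording happen in a single SADD call, which is what keeps two nodes from both deciding that the same request is new.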
Because all nodes share the same Set, after any link is processed once, other nodes will not crawl again.
4.3 Load balancing: automatically "the able do more"
Load balancing in Scrapy-Redis is passive and requires no additional configuration:
- Each node pulls new requests from the shared queue itself; the list-based queues can wait for them with blocking commands such as BRPOP.
- Nodes that process quickly take tasks more often, while slow nodes or nodes that are temporarily stuck naturally receive fewer requests.
- If you need finer-grained sharding by domain name or other rules, you need to customize the scheduler or add routing logic.
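The effect of pull-based scheduling can be illustrated with a small event-driven model. It involves neither Redis nor Scrapy; the node names and per-task times are made up, and each node simply pulls the next task as soon as it becomes free.

```python
import heapq


def simulate(task_count, seconds_per_task):
    """Model nodes that each pull the next task when they become free.

    seconds_per_task maps node name -> time that node needs per task.
    Returns node name -> number of tasks it ended up handling.
    """
    handled = {node: 0 for node in seconds_per_task}
    # Heap of (time the node becomes free, node name); all start free at 0.
    free_at = [(0.0, node) for node in sorted(seconds_per_task)]
    heapq.heapify(free_at)
    remaining = task_count
    while remaining > 0:
        now, node = heapq.heappop(free_at)   # next node to become idle
        handled[node] += 1                   # it pulls one task
        remaining -= 1
        heapq.heappush(free_at, (now + seconds_per_task[node], node))
    return handled


counts = simulate(100, {"fast-node": 1.0, "slow-node": 3.0})
print(counts)
```

With one node three times faster than the other, the fast node ends up with three quarters of the 100 tasks, without any component ever computing an assignment.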
5. Production deployment and operations
5.1 Redis connection and retry strategy
In settings.py, it is recommended to use REDIS_PARAMS for fine-grained control over connection behavior, such as timeouts and retries.
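A possible fragment is sketched below. The keys are keyword arguments of the redis-py client, to which scrapy-redis forwards REDIS_PARAMS; the address and all numeric values are illustrative assumptions, and health_check_interval requires a reasonably recent redis-py.

```python
# settings.py (fragment); all values are illustrative assumptions.

REDIS_URL = "redis://127.0.0.1:6379/0"

# Forwarded as keyword arguments to the redis-py client.
REDIS_PARAMS = {
    "socket_timeout": 30,           # seconds to wait for a reply
    "socket_connect_timeout": 30,   # seconds to wait when connecting
    "retry_on_timeout": True,       # retry a command after a timeout
    "health_check_interval": 30,    # check idle connections before reuse
}
```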
5.2 Resuming an interrupted crawl and cleaning the queue
- Resume crawling: just keep SCHEDULER_PERSIST = True; after the crawler is interrupted and restarted, it continues with the remaining tasks.
- Start over: if you want to clear the history and crawl again, manually delete the relevant keys in Redis (with the default key names, the spider's requests and dupefilter keys).
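A reset could be scripted along the following lines. The sketch assumes the default key patterns of scrapy-redis (spider name followed by :requests and :dupefilter); if SCHEDULER_QUEUE_KEY or SCHEDULER_DUPEFILTER_KEY is overridden in your project, the names differ. The function names and the Redis address are hypothetical.

```python
def crawl_keys(spider_name):
    """Redis keys that Scrapy-Redis uses by default for one spider."""
    return [
        f"{spider_name}:requests",     # pending request queue
        f"{spider_name}:dupefilter",   # fingerprint set
    ]


def reset_crawl(client, spider_name):
    """Delete the queue and fingerprints so the next run starts from scratch."""
    # DEL returns how many of the given keys actually existed.
    return client.delete(*crawl_keys(spider_name))


def main():
    import redis  # redis-py

    client = redis.Redis.from_url("redis://127.0.0.1:6379/0")  # assumed address
    removed = reset_crawl(client, "demo_distributed")
    print(f"removed {removed} key(s)")
```

Stop all crawler nodes before calling main(); otherwise a running node may immediately repopulate the keys.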
5.3 Monitor Redis memory
Distributed crawlers keep writing requests and fingerprints to Redis. It is recommended to set a reasonable memory limit and eviction policy in the Redis configuration (the maxmemory and maxmemory-policy directives), keeping in mind that Redis evicts whole keys, so a policy that evicts arbitrary keys could discard the entire queue or fingerprint set.
Watch the queue length and the size of the fingerprint set regularly; if necessary, expired deduplication fingerprints can be cleaned up periodically (this requires your own extension).
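One way to watch these numbers is a small helper that is called periodically, for example from a cron job. The sketch assumes the redis-py client and the default key names; crawl_stats, too_long, and the threshold are hypothetical.

```python
def crawl_stats(client, spider_name):
    """Collect queue length, fingerprint count and Redis memory usage."""
    queue_key = f"{spider_name}:requests"
    dupe_key = f"{spider_name}:dupefilter"
    # The default priority queue is a sorted set; FIFO/LIFO queues are lists.
    # TYPE returns bytes unless the client decodes responses.
    if client.type(queue_key) in (b"zset", "zset"):
        queue_len = client.zcard(queue_key)
    else:
        queue_len = client.llen(queue_key)
    return {
        "queue_length": queue_len,
        "fingerprints": client.scard(dupe_key),
        "used_memory_bytes": client.info("memory")["used_memory"],
    }


def too_long(stats, max_queue_length=1_000_000):
    """Flag a queue that has grown past a chosen threshold."""
    return stats["queue_length"] > max_queue_length
```

The returned dictionary can be logged as is or exported to whatever monitoring system the cluster already uses.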
6. Best practices and common pitfalls
6.1 Best Practices
- Limit concurrency and delay: set CONCURRENT_REQUESTS, DOWNLOAD_DELAY, and CONCURRENT_REQUESTS_PER_DOMAIN according to what the target website can bear, to avoid being blocked.
- Enable Redis persistence: combine RDB and AOF to prevent task loss due to unexpected restarts.
- Plan the seed strategy: the starting URLs should cover key pages as much as possible, and can be pushed into Redis in batches from a Sitemap or a database.
- Logging and monitoring: add a unique identifier for each node (such as node_id), and connect centralized logging (ELK/Loki) and monitoring (Prometheus) so that the status of each node and the crawling progress are easy to view.
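For the first point, a settings.py fragment might look like the sketch below. The numbers are placeholders rather than recommendations. These limits apply to each node separately, so the total pressure on a target site grows with the number of nodes.

```python
# settings.py (fragment); numbers are placeholders to tune per target site.

CONCURRENT_REQUESTS = 16             # total concurrent requests per node
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap per target domain, per node
DOWNLOAD_DELAY = 0.5                 # seconds between requests to one site
```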
6.2 Common pitfalls to avoid
- ❌ Do not configure both start_urls and redis_key: once you use RedisSpider, start_urls is ignored and all starting tasks must be injected from Redis.
- ❌ Don't hardcode Redis key names in your code: build them dynamically from the spider's name variable or a configuration file to avoid key name conflicts between crawler projects.
- ❌ Don't ignore Redis connection exceptions: configure reasonable timeout and retry parameters, and combine them with health checks so that dead connections do not stall the crawler.
- ❌ Don't let the request queue grow without bound: check the queue length regularly and filter or limit URL rules that grow abnormally, to keep Redis from running out of memory.
7. Ideas for extension
After completing the above infrastructure, you can further expand according to actual needs:
- Customized deduplication logic: subclass RFPDupeFilter and override request_fingerprint to achieve more flexible URL deduplication (such as ignoring parameter ordering).
- Dynamic seed injection: write a script that obtains seeds from a database or API and pushes them into Redis regularly, to achieve continuous incremental crawling.
- Automatic node scaling: combined with Docker/Kubernetes, increase or decrease crawler instances automatically based on the queue length.
- Distributed data pipeline: use the RedisPipeline that ships with scrapy-redis, or write data directly to Kafka or MongoDB, to decouple fetching from processing.
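For the first idea, the fingerprint itself can be prototyped with the standard library before it is wired into a filter. The sketch below shows one possible normalization (sorted query parameters, no fragment); normalize_url and normalized_fingerprint are hypothetical names. An RFPDupeFilter subclass could return such a value from its request_fingerprint override, using the request's method, url, and body.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Sort query parameters and drop the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


def normalized_fingerprint(method, url, body=b""):
    """SHA1 fingerprint that ignores query-parameter order."""
    sha1 = hashlib.sha1()
    sha1.update(method.upper().encode("utf-8"))
    sha1.update(normalize_url(url).encode("utf-8"))
    sha1.update(body or b"")
    return sha1.hexdigest()


a = normalized_fingerprint("GET", "https://example.com/list?page=2&sort=asc")
b = normalized_fingerprint("GET", "https://example.com/list?sort=asc&page=2")
print(a == b)  # True
```

Keeping the normalization in a plain function makes it easy to unit-test the rules before any node starts relying on them.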
You have now covered the complete path of Scrapy-Redis from principle to deployment. Following the steps in this article, you can build a stable, horizontally scalable distributed crawler cluster in a short time, with throughput that grows as you add nodes.

