Scrapy-Redis distributed architecture - building a high-performance distributed crawler cluster


When the crawl workload exceeds what a single machine can handle, scaling up CPU or bandwidth vertically eventually hits a ceiling. Scrapy-Redis gets past this ceiling: it implements a shared request queue and a distributed deduplication set on top of Redis, so that multiple Scrapy instances on multiple machines can work together like a well-trained team. This article takes you from understanding the architecture to hands-on deployment, building a stable and scalable crawler cluster step by step.


1. Architecture Overview: How distributed crawlers “divide labor”

The core idea of Scrapy-Redis is simple: use Redis as the scheduling center and deduplication center, while each crawler node only fetches pages and processes data. All nodes share the same request queue and the same set of URL fingerprints, which avoids repeated crawling and spreads the load across nodes naturally.

The flow chart below shows a complete distributed crawling process:

graph TB
    A[Seed URLs<br/>Redis key] --> R
    subgraph R[Redis cluster]
        Q[Shared request queue<br/>priority / FIFO / LIFO]
        DF[Shared deduplication set<br/>RFPDupeFilter]
    end
    Q --> C1[Crawler node 1]
    Q --> C2[Crawler node 2]
    Q --> C3[Crawler node N]
    C1 & C2 & C3 --> DF
    C1 & C2 & C3 -->|enqueue new URLs| Q
  • Shared request queue: stored in Redis; all nodes take tasks from it and push the new links they discover back into it.
  • Shared deduplication set: also stored in Redis, it holds the fingerprint of each request. Before a request is enqueued, the scheduler checks whether its fingerprint already exists, so the same URL is not crawled twice no matter which node found it.
  • Centerless scheduling: Redis does not actively assign tasks. Each node pulls requests on its own whenever it has capacity, so the load is shared according to processing speed.
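
To make the division of labor concrete, here is a minimal in-memory sketch of the idea. It is not Scrapy-Redis code: a `deque` stands in for the shared request queue, a `set` stands in for the shared fingerprint set, and `LINKS` is a hypothetical link graph used only for illustration. The nodes simply take turns pulling from the queue.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (illustration only).
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c", "/"],
    "/c": [],
}


def crawl(seeds, node_names):
    queue = deque()   # stands in for the shared Redis request queue
    seen = set()      # stands in for the shared Redis fingerprint set
    crawled_by = {}   # url -> name of the node that fetched it

    def enqueue(url):
        # Deduplication happens before a request enters the queue.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in seeds:
        enqueue(url)

    turn = 0
    while queue:
        node = node_names[turn % len(node_names)]  # nodes take turns pulling
        turn += 1
        url = queue.popleft()
        crawled_by[url] = node
        for link in LINKS.get(url, []):
            enqueue(link)  # newly discovered links go back to the shared queue
    return crawled_by


result = crawl(["/"], ["node-1", "node-2"])
for url, node in sorted(result.items()):
    print(url, "fetched by", node)
```

Every page appears once in `result` even though `/b`, `/c`, and `/` are linked from several pages, and the work ends up split between the two nodes.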

2. Get started quickly: Set up your first distributed cluster in three steps

2.1 Install dependencies

pip install scrapy-redis redis

Make sure all crawler nodes have the same version of dependencies installed.

2.2 Configure Scrapy project

In your settings.py, make the following changes. The first three are the core steps for moving from a single machine to a distributed setup:

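# settings.py
import os

# 1. Enable the Redis scheduler in place of the default in-memory scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# 2. Enable Redis fingerprint deduplication in place of the default in-memory filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# 3. Configure the Redis connection (for Sentinel/Cluster, see section 5.1)
REDIS_URL = os.getenv('REDIS_URL', 'redis://localhost:6379/0')
# The connection can also be configured field by field:
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379
# REDIS_PASSWORD = 'your_pwd'
# REDIS_DB = 0

# 4. Strongly recommended: enable resumable crawling
SCHEDULER_PERSIST = True  # keep the queue after the spider closes so a restart can continue

# 5. Choose the queue type (the priority queue is the default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'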

Tip: SCHEDULER_PERSIST = True keeps the queue data in Redis from being cleared when the spider closes, which suits long-running crawlers. If you want to start fresh every time, set it to False.

2.3 Transform Spider

The native Spider needs to be changed so that it reads its starting URLs from Redis instead of start_urls. Here is an example using RedisSpider:

# spiders/distributed_spider.py
import socket
import time

from scrapy.http import Request
from scrapy_redis.spiders import RedisSpider


class DemoDistributedSpider(RedisSpider):
    name = 'demo_distributed'
    redis_key = f'{name}:start_urls'   # shared seed-URL key that every node listens on

    custom_settings = {
        'CONCURRENT_REQUESTS': 32,     # per-node concurrency; tune to what the target site can bear
        'DOWNLOAD_DELAY': 1,
    }

    def parse(self, response):
        # Extract data
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'node_id': self._get_node_id(),
            'timestamp': time.time()
        }

        # Extract new links and yield requests (they pass through the Redis queue and dedup automatically)
        for link in response.css('a::attr(href)').getall()[:10]:  # the example takes only the first 10
            absolute_url = response.urljoin(link)
            yield Request(url=absolute_url, callback=self.parse)

    def _get_node_id(self):
        # The hostname identifies which node produced the item
        return socket.gethostname()

Note: RedisSpider ignores the start_urls class attribute. Starting tasks must be injected manually through a Redis key (such as demo_distributed:start_urls).


3. Start the cluster: one command, multiple terminals in parallel

3.1 Start the Redis service

Make sure all crawler nodes can access Redis. It is recommended to use Sentinel Mode or Redis Cluster in the production environment to avoid single points of failure.

3.2 Inject starting URL

Open the Redis command line and push the first seed URL onto the start key:

redis-cli lpush demo_distributed:start_urls https://example.com

LPUSH accepts several values, so multiple seeds can be pushed at once; each seed is one starting URL.
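
When the seed list is long, pushing it from a script is more convenient than typing it into redis-cli. The following is a rough sketch, assuming a redis-py style client whose `lpush(name, *values)` method is used. `push_seeds` and `batch_size` are hypothetical names chosen for this example, not part of Scrapy-Redis.

```python
def push_seeds(client, key, urls, batch_size=500):
    """Push seed URLs onto a Redis list in batches; return how many were pushed."""
    pushed = 0
    batch = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines, e.g. from a text file of seeds
        batch.append(url)
        if len(batch) >= batch_size:
            client.lpush(key, *batch)  # one LPUSH call carries the whole batch
            pushed += len(batch)
            batch = []
    if batch:
        client.lpush(key, *batch)
        pushed += len(batch)
    return pushed


# Usage with redis-py (assumes a reachable Redis server):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379/0")
#   push_seeds(client, "demo_distributed:start_urls", ["https://example.com"])
```

Batching keeps the number of round trips small without building one enormous command for very large seed lists.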

3.3 Start the crawler node

Execute the same command on each node:

scrapy crawl demo_distributed

There is no limit on the number of nodes: you can add new machines at any time, and the crawlers share the tasks automatically. In the output you will see items carrying node_id values from different machines, which confirms that the distributed collaboration is working.


4. In-depth analysis of the core mechanism

4.1 Shared request queue: Selection strategies for three queues

Through SCHEDULER_QUEUE_CLASS, Scrapy-Redis supports three queue implementations, each built on a different Redis data structure:

| Queue type | Redis implementation | Behavior | Applicable scenarios |
| --- | --- | --- | --- |
| PriorityQueue (default) | ZSet (score = -priority) | Requests with a higher priority value get a lower score and are popped first; ordering among equal priorities is not guaranteed to follow insertion order | Important pages that need to be crawled first, such as the homepage and core directories |
| FifoQueue | List (LPUSH + RPOP, or BRPOP when a pop timeout is set) | First in, first out, like an ordinary queue | Conventional breadth-first crawling, stable and orderly |
| LifoQueue | List (LPUSH + LPOP, or BLPOP when a pop timeout is set) | Last in, first out, like a stack | Quickly following newly discovered links; take care not to crawl too deep |

Example of switching to a first-in-first-out queue:

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
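
The "score = -priority" rule of the priority queue is easy to get backwards, so here is a small in-memory illustration. It is a sketch only: `ScoredQueue` is a hypothetical class that uses `heapq` in place of a Redis sorted set, and the insertion counter used to break ties is a choice made for this example (a real sorted set orders equal scores by member value).

```python
import heapq
import itertools


class ScoredQueue:
    """In-memory stand-in for a sorted-set queue: the lowest score pops first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for this sketch only

    def push(self, url, priority=0):
        score = -priority  # negate so that a higher priority gives a lower score
        heapq.heappush(self._heap, (score, next(self._counter), url))

    def pop(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


q = ScoredQueue()
q.push("https://example.com/detail/1", priority=0)
q.push("https://example.com/", priority=10)
q.push("https://example.com/category", priority=5)
order = [q.pop(), q.pop(), q.pop()]
print(order)
```

The homepage, pushed with the highest priority, comes out first even though it was pushed second.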

4.2 Shared fingerprint deduplication: how your crawler remembers the places it has visited

By default, RFPDupeFilter computes a SHA1 fingerprint from each request's method + url + body and stores it in a Redis Set. Its simplified logic is as follows:

import hashlib

def default_fingerprint(request):
    fp = hashlib.sha1()
    fp.update(request.method.encode('utf-8'))
    fp.update(request.url.encode('utf-8'))
    fp.update(request.body or b'')
    return fp.hexdigest()

Because all nodes share the same Set, once a request for a link has been scheduled by any node, no other node will crawl it again.
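
The check itself can be sketched in a few lines. This is an in-memory illustration, not the library's code: `InMemoryDupeFilter` is a hypothetical class whose Python `set` plays the role of the shared Redis Set, and `fingerprint` repeats the simplified logic above so that the block is self-contained.

```python
import hashlib


def fingerprint(method, url, body=b""):
    """Simplified fingerprint over method + url + body, as described above."""
    fp = hashlib.sha1()
    fp.update(method.encode("utf-8"))
    fp.update(url.encode("utf-8"))
    fp.update(body or b"")
    return fp.hexdigest()


class InMemoryDupeFilter:
    """Stand-in for the shared Redis Set; `seen` plays the role of the set."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url, body=b""):
        fp = fingerprint(method, url, body)
        if fp in self.seen:
            return True      # already scheduled by some node
        self.seen.add(fp)    # with Redis, a single SADD both checks and adds
        return False


df = InMemoryDupeFilter()
first = df.request_seen("GET", "https://example.com/")
second = df.request_seen("GET", "https://example.com/")
post = df.request_seen("POST", "https://example.com/", b"q=1")
print(first, second, post)
```

With Redis, the reply of SADD (1 when the member is new, 0 when it already exists) tells the caller whether the request was seen before, so the check and the insert do not need to be two separate steps.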

4.3 Load balancing: faster nodes automatically do more work

Scrapy-Redis's load balancing is passive and requires no additional configuration:

  • Each node pulls requests from the shared queue on its own. With the list-based queues this is RPOP/LPOP, or the blocking BRPOP/BLPOP when a pop timeout is configured.
  • Nodes that process quickly come back for tasks sooner, while slow or temporarily stuck nodes naturally receive fewer requests.
  • If you need finer-grained sharding by domain name or other rules, you have to customize the scheduler or add routing logic.
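
The effect of pull-based scheduling can be shown with a tiny deterministic simulation. This is a sketch under simplifying assumptions: every request costs a fixed time on a given node, and the names `simulate` and `seconds_per_task` are hypothetical.

```python
def simulate(task_count, seconds_per_task):
    """seconds_per_task maps node name -> time that node needs for one request."""
    free_at = {node: 0.0 for node in seconds_per_task}  # when each node is idle next
    done = {node: 0 for node in seconds_per_task}
    for _ in range(task_count):
        # The node that becomes idle first pulls the next request
        # (ties are broken by node name to keep the run deterministic).
        node = min(free_at, key=lambda n: (free_at[n], n))
        free_at[node] += seconds_per_task[node]
        done[node] += 1
    return done


counts = simulate(30, {"fast": 1.0, "slow": 2.0})
print(counts)
```

No component assigns work here; the node that is twice as fast simply returns to the queue twice as often and ends up handling about two thirds of the requests.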

5. Production deployment and operations

5.1 Redis connection and retry strategy

It is recommended to use REDIS_PARAMS in settings.py for fine-grained control over connection behavior:

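REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': 'utf-8',
    'health_check_interval': 30,  # periodically check that connections are still alive
    'max_connections': 20,        # connection pool size per node
    # Sentinel example (uncomment and fill in real values). These keys are not
    # plain redis-py client arguments, so a Sentinel-aware connection extension is assumed.
    # 'sentinel': [('sentinel1', 26379), ('sentinel2', 26379)],
    # 'sentinel_service_name': 'mymaster',
    # 'sentinel_password': 'sentinel_pwd',
}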

5.2 Resuming interrupted crawls and clearing queues

  • Resume crawling: as long as SCHEDULER_PERSIST = True is set, a crawler that is interrupted and restarted continues with the remaining tasks.
  • Start over: to clear the history and crawl again, manually delete the relevant keys in Redis:
    redis-cli
    > DEL demo_distributed:requests     # clear the request queue
    > DEL demo_distributed:dupefilter   # clear the deduplication set
    > DEL demo_distributed:start_urls   # clear the seed URLs
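
For scripted resets, the three DEL commands above can be wrapped in a small helper. This is a sketch, assuming a redis-py style client that exposes `delete(*names)` and the key names shown above. `spider_keys` and `reset_spider` are hypothetical helper names.

```python
def spider_keys(spider_name):
    """Return the Redis keys used in this article for one spider."""
    return [
        f"{spider_name}:requests",    # request queue
        f"{spider_name}:dupefilter",  # fingerprint set
        f"{spider_name}:start_urls",  # seed URLs
    ]


def reset_spider(client, spider_name):
    """Delete the spider's keys; return the count reported by the client."""
    return client.delete(*spider_keys(spider_name))


# Usage with redis-py (assumes a reachable Redis server):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379/0")
#   reset_spider(client, "demo_distributed")
```

Deriving the key names from the spider name in one place also avoids typos when several spiders share the same Redis instance.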

5.3 Monitor Redis memory

Distributed crawlers keep writing requests and fingerprints to Redis. It is recommended to set a reasonable memory limit and eviction policy in the Redis configuration:

maxmemory <bytes>
maxmemory-policy allkeys-lru

Keep in mind that eviction removes whole keys, so under allkeys-lru Redis may drop the entire request queue or fingerprint set when memory runs out. If losing them is unacceptable, consider noeviction combined with memory alerts. Monitor the queue length and the size of the fingerprint set regularly, and clean up expired fingerprints when necessary (this requires your own extension).
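
A periodic size check might look like the sketch below. It assumes a redis-py style client (the `zcard`, `llen`, and `scard` methods) and the key names used in section 5.2. `queue_stats`, `check_limits`, and the thresholds are hypothetical and would need tuning for a real deployment.

```python
def queue_stats(client, spider_name, queue_is_zset=True):
    """Read the pending-request count and fingerprint count for one spider."""
    requests_key = f"{spider_name}:requests"
    if queue_is_zset:
        pending = client.zcard(requests_key)  # priority queue: sorted set
    else:
        pending = client.llen(requests_key)   # FIFO/LIFO queue: list
    fingerprints = client.scard(f"{spider_name}:dupefilter")
    return {"pending_requests": pending, "fingerprints": fingerprints}


def check_limits(stats, max_pending, max_fingerprints):
    """Return a list of human-readable warnings; empty when all is well."""
    warnings = []
    if stats["pending_requests"] > max_pending:
        warnings.append(
            f"pending_requests={stats['pending_requests']} exceeds {max_pending}"
        )
    if stats["fingerprints"] > max_fingerprints:
        warnings.append(
            f"fingerprints={stats['fingerprints']} exceeds {max_fingerprints}"
        )
    return warnings
```

Running this from a scheduled job and forwarding the warnings to your alerting channel gives early notice before Redis approaches its memory limit.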


6. Best practices and common pitfalls

6.1 Best Practices

  • Limit concurrency and delay: set CONCURRENT_REQUESTS, DOWNLOAD_DELAY, and CONCURRENT_REQUESTS_PER_DOMAIN according to what the target website can bear, to avoid being blocked.
  • Enable Redis persistence: combine RDB and AOF to prevent task loss after an unexpected restart.
  • Plan the seed strategy: the starting URLs should cover key pages as much as possible, and can be pushed into Redis in batches from a sitemap or a database.
  • Logging and monitoring: add a unique identifier for each node (such as node_id), and connect to centralized logging (ELK/Loki) and monitoring (Prometheus) so that the status of each node and the crawl progress are easy to inspect.

6.2 Common pitfalls to avoid

  • ❌ Do not configure both start_urls and redis_key: once RedisSpider is used, start_urls is ignored and all starting tasks must be injected through Redis.
  • ❌ Don't hardcode Redis key names in your code: build them dynamically from the name variable or a configuration file to avoid key conflicts between crawler projects.
  • ❌ Don't ignore Redis connection exceptions: configure reasonable timeout and retry parameters, together with health checks, so that dead connections do not stall the crawler.
  • ❌ Don't let the request queue grow without bound: check the queue length regularly and filter or cap URL patterns that grow abnormally, so that Redis does not run out of memory.

7. Ideas for extension

After completing the above infrastructure, you can further expand according to actual needs:

  • Custom deduplication logic: subclass RFPDupeFilter and override request_fingerprint to implement more flexible URL deduplication (such as ignoring parameter order).
  • Dynamic seed injection: write a script that fetches seeds from a database or API and pushes them into Redis periodically, for continuous incremental crawling.
  • Automatic node scaling: with Docker/Kubernetes, add or remove crawler instances automatically based on the queue-length metric.
  • Distributed data pipeline: use the RedisPipeline that ships with scrapy-redis, or write data directly to Kafka or MongoDB, to decouple fetching from processing.
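
As an illustration of the first idea, the sketch below computes a fingerprint that ignores query-parameter order, relative to the simplified fingerprint shown in section 4.2. It uses only the standard library. `canonical_url` and `order_insensitive_fingerprint` are hypothetical names, and wiring the function into an RFPDupeFilter subclass is left out because the exact hook can differ between library versions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Sort query parameters, lowercase the host, and drop the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )


def order_insensitive_fingerprint(method, url, body=b""):
    fp = hashlib.sha1()
    fp.update(method.upper().encode("utf-8"))
    fp.update(canonical_url(url).encode("utf-8"))
    fp.update(body or b"")
    return fp.hexdigest()


a = order_insensitive_fingerprint("GET", "https://example.com/list?page=2&sort=asc")
b = order_insensitive_fingerprint("GET", "https://example.com/list?sort=asc&page=2#top")
print(a == b)
```

Both URLs reduce to the same canonical form, so the second request would be treated as a duplicate of the first.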

You now have the complete path from principles to deployment of Scrapy-Redis. By following the steps in this article, you can build a stable, horizontally scalable distributed crawler cluster in a short time, with throughput that grows as you add nodes, within the limits of Redis and the target site.