Distributed deduplication and scheduling - detailed explanation of efficient deduplication algorithm and distributed coordination mechanism
📂 Stage: Stage 5 - Combat Power Upgrade (Distributed and Advanced) 🔗 Related chapters: Scrapy-Redis分布式架构 · Spider中间件深度定制 · 大规模爬虫优化
Table of contents
The core challenges and selections of distributed deduplication
When the crawler expands from a single machine to multiple nodes working together, repeated crawling becomes the biggest cost black hole. Not only do you have to ensure that the same node does not process the same URL repeatedly, but you also need to ensure that there is no "crash" between nodes. This presents several real challenges.
Three core challenges
Comparison of mainstream deduplication solutions
The following table lists several common solutions. After reading this, you will know when to choose which one.
What false positive means: the filter may occasionally tell you "this URL already exists", when in fact it is new. This is usually acceptable with large data volumes, because we can also confirm it twice through other means.
Bloom filter: the king of space efficiency
Bloom Filter (Bloom Filter) is a very smart probabilistic data structure. It has two unwavering properties:
- The judgment of "does not exist" is 100% accurate - As long as it says that the URL has not been processed, then you can rest assured.
- There is a very low misjudgment rate in judging "existence" - Occasionally, a new URL will be "unjustly accused", but the probability can be controlled.
Its working principle can be understood as: each URL is mapped to a very long binary array by multiple hash functions; when adding, the corresponding positions are set to 1, and when querying, check whether these positions are all 1.
Simplified version of Bloom filter
The following implementation omits complex parameter derivation and directly uses preconfigured values (bit array length and number of hash functions) that have been verified in actual combat. You can think of it as an out-of-the-box tool.
The running results are similar:
Redis distributed deduplication landing
Local bloom filters run very fast, but only live in the memory of a single process. To make it serve the entire crawler cluster, we need to move the bit array to Redis - all nodes read and write the same Bloom filter through Redis.
Redis version of Bloom filter (single instance)
Using RedissetbitandgetbitOperation, it only takes up a very small amount of memory, but can support the judgment of billions of URLs.
Distributed lock: ensure concurrency security
If multiple crawler nodes add URLs to Redis at the same time, or fetch new task seeds at the same time, it is easy for data competition to occur. For example: both nodes thought they were the first to get a certain seed, and ended up crawling a URL twice.
At this time, distributed lock is needed to control concurrency. Provided by RedisSET key value NX EXThe command can atomically implement "only set when the key does not exist, and automatically set the expiration time".
The simplest and most practical Redis distributed lock
Although this implementation is simple, it has solved the two core problems:
- Lock atomicity – multiple nodes will not acquire locks at the same time
- Automatic Expiration - If the node holding the lock crashes, the lock will not be permanently occupied
The Lua script is used to release the lock to ensure the atomicity of the two-step operations of "checking the identification" and "deleting" to avoid accidentally deleting the locks just acquired by other nodes.
Consistent Hash: Load Balancing Artifact
When we assign crawler tasks to different nodes, we generally use the hash modulus method (hash(task) % N). But once the number of nodes changes (expansion, shrinkage, failure), almost all tasks need to be reallocated - which will cause large-scale cache invalidation and state migration.
Consistent hashing maps both nodes and tasks onto a ring, with each task assigned clockwise to the first reached node. When nodes are added or deleted, only the tasks of adjacent nodes need to be migrated, and most others remain unchanged.
Simplified version of consistent hash ring
Output example:
Each node handles approximately one-third of the load. And when you add a fourth node, only about a quarter of the original tasks will be migrated to the new node, greatly reducing system jitter.
Core Optimization Strategy
Now that you have mastered the basic tools, here are some optimization tips for immediate results in production environments.
1. Batch processing
Every network round trip to Redis is overhead. existRedisBloomFilter.add()We have used pipeline to combine multiplesetbitCombined into one communication. Similarly, when querying batch URLs, you can also use pipeline to get all the bits at once.
2. Local cache
For URLs with a high recurrence rate (such as list pages and directory page links), you can add a layer of LRU cache in the process to avoid querying the Redis Bloom filter every time.
3. Segmented design
- Segment lock: Split shared resources into multiple segments, each segment has a lock to reduce competition
- Segmented Bloom: A single large Redis bloom may become a performance bottleneck and can be split into multiple bloom filters by URL prefix or hash value, scattered on different Redis instances
Best Practice Summary
Design principles at a glance
Quick location of frequently asked questions
This "duplication + scheduling" combination has supported multiple crawler clusters with an average of hundreds of millions of URLs per day. You can directly apply it according to your own business, or you can replace the components as needed (for example, replace Redis with Redis Cluster or alternatives). The key is to master the idea of layering, the combination of probability and accuracy, and the atomicity guarantee of distributed coordination.
🔗 Recommended related tutorials
🏷️ tag cloud:分布式去重 布隆过滤器 分布式锁 一致性哈希 Redis 爬虫优化

