When Crawlers Meet RabbitMQ: Efficient, Decoupled Distributed Crawling in Practice
Fire up a single Python script with requests + beautifulsoup4 to crawl a small website and the efficiency is acceptable. But once you need to crawl tens of thousands of pages of dynamic data and mixed content across multiple domains, problems pile up: a single thread is too slow, task allocation across threads or processes gets confusing, duplicate crawling is hard to control, and a node that goes down leaves you scrambling.
This is where a message queue (MQ) becomes the "rescue scheduler" of distributed crawlers: it completely decouples task production from task consumption. You just throw URLs into the queue, and the crawling nodes grab tasks, do the work, and report results on their own. Even better, an MQ also brings advanced capabilities such as task priority, persistence, and automatic redelivery of unacknowledged tasks. Among the many mature MQs, RabbitMQ has become a first choice for distributed crawling from entry level to production, thanks to its multi-protocol support, intuitive web management interface, and simple Python client (pika).
Starting from rapid deployment, this article builds, step by step, a distributed crawler message queue architecture that is runnable, reliable, and priority-aware.
1. Quick overview of core features: choosing the right features for your crawler
RabbitMQ has many features, but for crawlers only a handful are key to efficiency and stability. The ones used in the rest of this article are message persistence, manual acknowledgement, priority queues, and dead letter queues.
2. Set up the local environment in 3 minutes
2.1 Start RabbitMQ with one command (good news for Docker users)
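Docker is the easiest option for local development. There is no need to install Erlang separately, and the official image variant that bundles the management plugin (for example the rabbitmq:3-management tag) brings up the management interface as well. When starting the container, map port 5672 for AMQP and port 15672 for the management UI.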
After startup, open **http://localhost:15672**. The default username and password are both guest; once logged in you will see a clean management console.
2.2 Install Python client
Use pika, the officially recommended lightweight library, to connect to RabbitMQ, and install it together with the basic crawler libraries, for example with pip install pika requests beautifulsoup4.
3. Getting the basic architecture running: producers send URLs, consumers crawl them
The basic architecture has only two roles:
- Task Producer: generates the URLs to be crawled (or wrapped task objects) and puts them into the MQ's "to-be-crawled task queue";
- Crawler Consumer: takes tasks from the "to-be-crawled task queue", manually acknowledges each one after the crawl completes, and then fetches the next.
3.1 Producer code: just drop the URL into the queue
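A minimal producer sketch, assuming a broker on localhost with default credentials. The queue name crawl_tasks, the JSON task shape, and the helper names encode_task and publish_urls are illustrative assumptions, not part of pika or of any particular project.

```python
import json


def encode_task(url: str) -> bytes:
    """Serialize one crawl task as UTF-8 JSON (the message body)."""
    return json.dumps({"url": url}).encode("utf-8")


def publish_urls(urls, queue_name="crawl_tasks", host="localhost"):
    """Publish each URL as a persistent message to a durable queue."""
    # Imported here so encode_task stays usable without the client installed.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        channel = connection.channel()
        # durable=True: the queue definition survives a broker restart.
        channel.queue_declare(queue=queue_name, durable=True)
        for url in urls:
            channel.basic_publish(
                exchange="",             # default exchange routes by queue name
                routing_key=queue_name,
                body=encode_task(url),
                # delivery_mode=2 marks the message itself as persistent.
                properties=pika.BasicProperties(delivery_mode=2),
            )
    finally:
        connection.close()
```

Calling publish_urls(["https://example.com/"]) would enqueue one task. JSON is used for the body so that consumers written in any language can read it.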
3.2 Consumer code: acknowledge after completion
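A minimal consumer sketch with manual acknowledgement. It assumes the same hypothetical crawl_tasks queue and JSON task shape as the producer, and takes the actual crawling logic as a crawl(url) callable supplied by the caller (for example a function that fetches the page with requests and extracts the title with BeautifulSoup). The names make_callback and consume are illustrative.

```python
import json


def make_callback(crawl):
    """Build a pika message callback that acks only after crawl(url) succeeds."""

    def on_message(ch, method, properties, body):
        url = json.loads(body)["url"]
        try:
            crawl(url)
        except Exception:
            # Reject without requeueing so a broken task cannot loop forever.
            # Without a dead letter exchange the broker discards the message.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            # Manual ack: the broker removes the task only now.
            ch.basic_ack(delivery_tag=method.delivery_tag)

    return on_message


def consume(crawl, queue_name="crawl_tasks", host="localhost"):
    """Block forever, handing each queued URL to crawl()."""
    # Imported here so make_callback stays usable without the client installed.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name, durable=True)
    # Hand each node one unacknowledged task at a time (fair dispatch).
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue=queue_name, on_message_callback=make_callback(crawl))
    channel.start_consuming()
```

Because the acknowledgement is sent only after crawl returns, a node that crashes mid-task leaves the message unacknowledged, and RabbitMQ redelivers it to another consumer.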
4. Advanced optimization for crawlers
The basic architecture is up and running, but production-level projects usually also need priority control, handling of failed tasks, and reporting of crawl results (such as writing the title to a database or sending it to another queue).
4.1 Priority Queue: Crawl the hot list first, then the history
Give each task a priority from 1 to 10 (the larger the number, the higher the priority) so that hot data can jump the queue and be consumed first.
Producer Adjustments:
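A sketch of the producer-side change, assuming an already opened pika channel is passed in. RabbitMQ enables priorities through the x-max-priority queue argument, and an existing queue cannot be redeclared with different arguments, so this sketch uses a separate, hypothetical queue name crawl_tasks_priority. The helpers clamp_priority and publish_with_priority are illustrative names.

```python
import json

MAX_PRIORITY = 10  # must equal the queue's x-max-priority argument


def clamp_priority(priority: int) -> int:
    """Keep a task priority inside the 1..MAX_PRIORITY range."""
    return max(1, min(MAX_PRIORITY, priority))


def publish_with_priority(channel, url, priority, queue_name="crawl_tasks_priority"):
    """Publish one URL with a per-message priority to a priority-enabled queue."""
    # Imported here so clamp_priority stays usable without the client installed.
    import pika

    # Priorities only work on queues declared with x-max-priority.
    channel.queue_declare(
        queue=queue_name,
        durable=True,
        arguments={"x-max-priority": MAX_PRIORITY},
    )
    channel.basic_publish(
        exchange="",
        routing_key=queue_name,
        body=json.dumps({"url": url}).encode("utf-8"),
        properties=pika.BasicProperties(
            delivery_mode=2,                    # persistent message
            priority=clamp_priority(priority),  # higher number is served first
        ),
    )
```

For example, a hot-list URL could be published with priority 9 and an archive URL with priority 1; the broker then delivers waiting hot-list tasks first.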
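The consumer's processing logic does not need to change. As long as its queue declaration matches the producer's (including the priority argument), the priority mechanism takes effect automatically. Keep the prefetch count small, because messages already handed to a consumer are no longer reordered.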
4.2 Dead Letter Queue (DLQ): Give failed tasks a home
Tasks that fail after multiple retries (such as network timeouts, page structure changes) should not remain stuck in the queue. You can use RabbitMQ's Dead Letter Queue to store them separately to facilitate manual inspection and decide whether to re-deliver.
Simplified version of the implementation idea:
- When declaring the main queue, specify the x-dead-letter-exchange and x-dead-letter-routing-key arguments;
- Create an independent "dead letter queue" and bind it to that exchange with that routing key;
- When the consumer fails to process a task, call ch.basic_nack(delivery_tag=..., requeue=False) to reject it, and RabbitMQ will automatically route it to the dead letter queue.
This way, the normal retry mechanism (for example pushing a task back to the queue with requeue=True) stays separate from the final destination of tasks that have definitively failed.
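The three steps above can be sketched as follows, assuming an already opened pika channel is passed in. All exchange, queue, and routing-key names here are hypothetical, and the main queue gets its own name because an existing queue cannot be redeclared with new arguments.

```python
DLX_NAME = "crawl.dlx"            # hypothetical dead letter exchange
DLQ_NAME = "crawl_tasks.dead"     # hypothetical dead letter queue
MAIN_QUEUE = "crawl_tasks_dlq"    # hypothetical main queue with dead-lettering
DEAD_ROUTING_KEY = "dead"


def main_queue_arguments() -> dict:
    """Queue arguments that tell RabbitMQ where rejected messages go."""
    return {
        "x-dead-letter-exchange": DLX_NAME,
        "x-dead-letter-routing-key": DEAD_ROUTING_KEY,
    }


def declare_topology(channel):
    """Declare the dead letter exchange and queue, then the main queue."""
    channel.exchange_declare(exchange=DLX_NAME, exchange_type="direct", durable=True)
    channel.queue_declare(queue=DLQ_NAME, durable=True)
    channel.queue_bind(queue=DLQ_NAME, exchange=DLX_NAME, routing_key=DEAD_ROUTING_KEY)
    channel.queue_declare(queue=MAIN_QUEUE, durable=True, arguments=main_queue_arguments())


def reject_to_dlq(channel, method):
    """Reject a failed delivery; requeue=False triggers dead-lettering."""
    channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
```

A consumer would call declare_topology(channel) once at startup and reject_to_dlq(ch, method) inside its message callback when processing fails for good.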
5. Performance and stability tips
- Use connections and channels sensibly: a consumer process should establish only one TCP connection, and channels on it are lightweight. Note that pika connections are not thread-safe, so in a multi-threaded consumer give each thread its own connection instead of sharing one.
- Enable heartbeat: production networks usually drop connections that stay idle for a long time. Adding heartbeat=60 to ConnectionParameters avoids being killed by mistake.
- Avoid message accumulation: check the queue length in the management interface regularly, and if necessary increase the number of consumers or enable multi-process consumption.
- Task deduplication: If necessary, Redis or Bloom filters can be introduced on the producer side to avoid repeated URLs being queued.
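The deduplication tip can be sketched on the producer side as follows. This is an in-memory illustration only: the SeenUrls class stands in for a shared store such as a Redis set or a Bloom filter, and the normalization rules and all names are assumptions chosen for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host and drop the fragment so trivial variants collapse."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def fingerprint(url: str) -> str:
    """Fixed-length key for a URL, suitable as a set member."""
    return hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()


class SeenUrls:
    """In-memory stand-in for a shared store such as a Redis set."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL and return True only if it was not seen before."""
        key = fingerprint(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def filter_new(urls, seen: SeenUrls):
    """Keep only URLs that have not been queued yet, preserving order."""
    return [url for url in urls if seen.add_if_new(url)]
```

A producer would pass its candidate URLs through filter_new before publishing. With several producer processes, the set has to live in a shared store so that all of them see the same history.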
6. Monitoring and Management
Open **http://localhost:15672** in a browser and make good use of the following pages:
- Queues: View each queue's current message count, consumer count, and message delivery/acknowledgement/rejection statistics;
- Connections / Channels: Check the current active connections and channels, and quickly troubleshoot abnormal disconnections;
- Overview: Master global message traffic and node health status.
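The same queue figures can also be read programmatically from the management plugin's HTTP API, which is handy for alerting on message accumulation. The sketch below uses only the standard library and assumes the default guest account on localhost, the default "/" virtual host, and a hypothetical queue name; the helper names are illustrative.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def queue_api_url(queue_name, vhost="/", base="http://localhost:15672"):
    """Build the management API URL for one queue."""
    # Vhost and queue name must be percent-encoded; "/" becomes "%2F".
    return f"{base}/api/queues/{quote(vhost, safe='')}/{quote(queue_name, safe='')}"


def summarize(queue_info: dict) -> dict:
    """Pick the fields a crawler operator usually watches."""
    return {
        "messages": queue_info.get("messages", 0),
        "consumers": queue_info.get("consumers", 0),
    }


def fetch_queue_stats(queue_name, user="guest", password="guest"):
    """Query the management API with HTTP basic auth and return a summary."""
    request = urllib.request.Request(queue_api_url(queue_name))
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return summarize(json.load(response))
```

A periodic job could call fetch_queue_stats("crawl_tasks") and start more consumers when the message count keeps growing.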
7. Further reading
Through this article, you have built from scratch a distributed crawler architecture that is highly decoupled, persistent, and priority-aware. In real projects you can extend it further with a "deduplication queue" and a "data cleaning queue", or wrap tasks in classes and serialize them with pickle (note: pickle should only be used in trusted environments). When crawlers meet RabbitMQ, you say goodbye to the pain of manual scheduling and let your crawler cluster truly "breathe on its own".

