Scrapyd and ScrapydWeb: A Detailed Guide to a Distributed Crawler Deployment and Monitoring Platform
When your crawlers are no longer one or two scripts run by hand in a terminal, but dozens or hundreds of projects that must be scheduled regularly, monitored, and changed at any time, watching the command line is no longer enough. Scrapyd and ScrapydWeb address this pain point: the former handles scheduling quietly in the background, and the latter provides a web interface for managing everything in one place. This article walks through building a production-ready crawler operations platform with them as quickly as possible.
Tool Overview
Scrapyd
Scrapyd is a lightweight HTTP daemon maintained by the Scrapy project. It listens on port 6800 by default and exposes a JSON API over HTTP. Its responsibilities are narrowly focused:
- Manage multiple projects and versions and support deploying new versions at any time
- Start and cancel crawler jobs (the API has no pause operation)
- Run each job in its own Scrapy process and track its state; restarting the daemon itself after a crash is left to a supervisor such as systemd
- Automatically save crawler running logs and metadata
You can think of it as a "task scheduler + process manager" specially built for Scrapy.
ScrapydWeb
ScrapydWeb is a third-party web UI that can connect to one or more Scrapyd servers. Its highlights are:
- Schedule crawlers graphically; filling in parameters is as easy as filling out a form
- View, search, and filter crawler logs in real time
- Manage multi-node clusters from one place; a single operation can be sent to all nodes
- Send email alerts when a task finishes or fails
- Automatically parse statistics from logs (such as the number of items and the number of requests)
The combination of the two is like installing an automated cockpit for your crawler team.
Quick installation and core configuration
1. Scrapyd daemon
Install
It is recommended to create a separate virtual environment for the Scrapy ecosystem to avoid package conflicts, then install the scrapyd and scrapyd-client packages into it with pip.
scrapyd-client provides the scrapyd-deploy command, which packages and uploads projects.
Core configuration
Create a configuration file. Linux production environments usually place it at /etc/scrapyd/scrapyd.conf; on Windows or in a test environment it can live in the project directory.
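As a rough illustration of what such a file can contain, the sketch below generates a minimal scrapyd.conf with Python's configparser. The option names are standard Scrapyd settings, but every value here is an assumption chosen for illustration and should be adapted to your server.

```python
import configparser
import io


def build_scrapyd_conf() -> str:
    """Return the text of a minimal scrapyd.conf (illustrative values only)."""
    parser = configparser.ConfigParser()
    parser["scrapyd"] = {
        # Listen on localhost only; widen this deliberately, never by accident.
        "bind_address": "127.0.0.1",
        "http_port": "6800",
        # Where uploaded eggs, job logs, and queue databases are stored.
        "eggs_dir": "eggs",
        "logs_dir": "logs",
        "dbs_dir": "dbs",
        # Finished jobs (logs and items) kept per spider.
        "jobs_to_keep": "5",
        # 0 means "derive the limit from the CPU count".
        "max_proc": "0",
        "max_proc_per_cpu": "4",
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()
```

Writing the returned text to /etc/scrapyd/scrapyd.conf (or to scrapyd.conf in the working directory for a test setup) gives Scrapyd a starting configuration.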
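Note: max_proc caps the number of Scrapy processes that Scrapyd runs concurrently across all projects; it is not the number of crawlers you have deployed. Each scheduled job runs as one Scrapy process, so with max_proc set to 4, a fifth job scheduled at the same time waits in the pending queue until a slot frees up. If max_proc is unset or 0, Scrapyd uses the number of CPUs multiplied by max_proc_per_cpu. Child processes that a crawler starts on its own are not counted against the limit but still consume CPU and memory, so adjust the value to the actual hardware.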
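The way the limit is derived can be sketched in a few lines. This is a simplified model of the rule described in Scrapyd's documentation, not Scrapyd's own code; the function name and parameters are hypothetical.

```python
def effective_max_proc(max_proc: int, max_proc_per_cpu: int, cpu_count: int) -> int:
    """Model how many Scrapy processes may run concurrently.

    An explicit max_proc wins; 0 (or unset) falls back to
    cpu_count * max_proc_per_cpu.
    """
    if max_proc > 0:
        return max_proc
    return cpu_count * max_proc_per_cpu
```

For example, on a 2-core server with max_proc left at 0 and max_proc_per_cpu at 4, up to 8 jobs run at once and the rest wait as pending.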
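Startup
For a quick test, run the scrapyd command directly in the foreground.
In production, run Scrapyd under systemd so that it starts at boot and is restarted automatically after a failure. Create a unit file at /etc/systemd/system/scrapyd.service.
Then reload systemd and enable and start the service (systemctl daemon-reload, followed by systemctl enable --now scrapyd).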
2. ScrapydWeb Management Panel
Install
Install the scrapydweb package with pip, either in the same virtual environment or in a new one.
Configuration
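On first start, ScrapydWeb generates a settings file in the current directory (called config.py in this article; current releases name it scrapydweb_settings_vN.py, where N is the settings version). The options that most need attention are the bind address and port, the login credentials, and the list of Scrapyd servers.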
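Because the settings file is plain Python, the relevant part can be sketched as below. The setting names follow ScrapydWeb's default settings file as far as I know; the values, the SCRAPYDWEB_PASSWORD environment variable, and the second server address are assumptions for illustration.

```python
import os

# Address and port the ScrapydWeb panel listens on.
SCRAPYDWEB_BIND = "0.0.0.0"
SCRAPYDWEB_PORT = 5000

# Require a login for the panel. Reading the password from an
# environment variable is an assumption of this sketch, not a
# ScrapydWeb convention.
ENABLE_AUTH = True
USERNAME = "admin"
PASSWORD = os.environ.get("SCRAPYDWEB_PASSWORD", "change-me")

# Scrapyd nodes managed from this panel, as "host:port" strings.
SCRAPYD_SERVERS = [
    "127.0.0.1:6800",
    "10.0.0.12:6800",  # hypothetical second node
]
```

Restart ScrapydWeb after editing the file so that the new values take effect.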
ScrapydWeb can be daemonized the same way, with a systemd unit at /etc/systemd/system/scrapydweb.service.
Then start it and enable it at boot (for example, systemctl enable --now scrapydweb).
Now open http://<your-IP>:5000 in a browser and enter the username and password to reach the management interface.
Deploy your first project in 10 minutes
Step 1: Modify local project configuration
In the root directory of your local Scrapy project, open scrapy.cfg and add a deployment target.
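To make the shape of a deployment target concrete, the sketch below embeds a sample scrapy.cfg and reads its targets back with configparser. The target name production, the host scrapyd.internal, and the project name myproject are hypothetical; the [deploy:NAME] section with url and project keys is the format scrapyd-deploy reads.

```python
import configparser

# Sample scrapy.cfg content; host, target, and project names are made up.
SCRAPY_CFG = """\
[settings]
default = myproject.settings

[deploy:production]
url = http://scrapyd.internal:6800/
project = myproject
"""


def deploy_targets(cfg_text: str) -> dict:
    """Return {target_name: {option: value}} for every [deploy:NAME] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section.startswith("deploy:"):
            name = section.split(":", 1)[1]
            targets[name] = dict(parser[section])
    return targets
```

Running deploy_targets(SCRAPY_CFG) is a quick way to confirm that the target name you plan to pass to scrapyd-deploy actually exists in the file.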
Step 2: Install dependencies in advance
Scrapyd does not read your requirements.txt automatically, so log in to the Scrapyd server beforehand and install the dependencies manually in the corresponding virtual environment (for example, pip install -r requirements.txt).
Step 3: Package and upload
Return to the local project directory and run scrapyd-deploy <target> -p myproject, where <target> is the name of the deployment target defined in scrapy.cfg.
If the terminal prints a message such as Deployed myproject:v202604101230, the deployment succeeded.
Daily operation and maintenance: scheduling, viewing, canceling
It is highly recommended to use ScrapydWeb for daily operations, which turns boring API calls into a few mouse clicks:
- Select the Scrapyd node you want to operate on at the top
- Click "Schedule" on the left, select the project and crawler, fill in the parameters, and start with one click
- Switch to the "Jobs" page and you can see the Pending (queuing), Running (running), and Finished (completed) tasks in real time.
- Click "Log" on the right side of the task to view the log in real time
- When you need to cancel, just click "Cancel"
If you need to write automated scripts, you can also call Scrapyd's HTTP API directly; commonly used endpoints include schedule.json, listjobs.json, cancel.json, listprojects.json, and daemonstatus.json.
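A minimal sketch of such a script, using only the standard library, might look like the following. The endpoint names and their parameters come from Scrapyd's API; the base URL, the helper function names, and the project and spider names in the usage note are assumptions.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:6800"  # assumption: Scrapyd on the local machine


def build_request(endpoint, params=None, post=False, base_url=BASE_URL):
    """Build a GET or POST request for one Scrapyd endpoint."""
    url = f"{base_url}/{endpoint}"
    encoded = urllib.parse.urlencode(params or {})
    if post:
        return urllib.request.Request(url, data=encoded.encode("utf-8"), method="POST")
    if encoded:
        url = f"{url}?{encoded}"
    return urllib.request.Request(url, method="GET")


def call(request, timeout=10):
    """Send a prepared request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def daemon_status():
    return build_request("daemonstatus.json")


def list_projects():
    return build_request("listprojects.json")


def list_jobs(project):
    return build_request("listjobs.json", {"project": project})


def schedule(project, spider, **spider_args):
    # Extra keyword arguments are passed to the spider as arguments.
    params = {"project": project, "spider": spider, **spider_args}
    return build_request("schedule.json", params, post=True)


def cancel(project, job_id):
    return build_request("cancel.json", {"project": project, "job": job_id}, post=True)
```

With a running daemon, call(schedule("myproject", "myspider")) returns a dictionary that includes the status and, on success, the new job ID, which can later be passed to cancel.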
Production environment reinforcement
Firewall Policy
- Never expose Scrapyd's port 6800 directly on the public network, because it requires no authentication by default.
- Open only ScrapydWeb's port 5000 (or 443 behind an Nginx proxy), and only to operations staff or the office VPN.
- Keep Scrapyd's port 6800 reachable only from the intranet, the local machine, or the server where ScrapydWeb runs.
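Alongside firewall rules, it can help to verify that Scrapyd itself is not configured to listen on every interface. The sketch below is one hypothetical pre-flight check: it reads bind_address from the text of a scrapyd.conf and flags wildcard addresses. The function names are made up, and the fallback of 127.0.0.1 reflects Scrapyd's documented default.

```python
import configparser

# Addresses that mean "listen on all interfaces".
WILDCARD_ADDRESSES = {"0.0.0.0", "::"}


def scrapyd_bind_address(conf_text: str) -> str:
    """Return bind_address from scrapyd.conf text, or the localhost default."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("scrapyd", "bind_address", fallback="127.0.0.1").strip()


def listens_on_all_interfaces(conf_text: str) -> bool:
    """True when the daemon would accept connections on every interface."""
    return scrapyd_bind_address(conf_text) in WILDCARD_ADDRESSES
```

A deployment script could refuse to continue when listens_on_all_interfaces returns True and no firewall rule restricts port 6800.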
Nginx reverse proxy + HTTPS
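Put ScrapydWeb behind Nginx and terminate HTTPS there, so that traffic between operators and the panel is encrypted. A common setup, assuming the certificate was obtained with Certbot, has Nginx listen on port 443 with that certificate and forward requests to ScrapydWeb on 127.0.0.1:5000.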
Docker one-click deployment (optional)
If you prefer containers, a docker-compose.yml that defines one service for Scrapyd and one for ScrapydWeb gives a quick setup.
Pitfall avoidance guide and best practices
⚠️ Guide to avoid pitfalls
- max_proc is not the number of crawlers: it is the global upper limit on concurrent processes. Use it together with max_proc_per_cpu to control resources and avoid saturating the server.
- Forgetting to install dependencies: be sure to run pip install on the server before deployment, otherwise the crawler fails immediately because of missing modules.
- Exposing the port directly: Scrapyd requires no login by default, so do not open port 6800 to the public network, otherwise anyone can schedule and cancel your crawlers.
- Path permission errors: Scrapyd needs read and write permissions on eggs_dir, logs_dir, and related directories. Remember to chown them once when deploying for the first time.
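The permission pitfall can be caught before the first deployment with a small check like the one below. It is a sketch with a hypothetical function name, and it only gives a meaningful answer when run as the same user that runs Scrapyd.

```python
import os


def inaccessible_dirs(paths):
    """Return the paths that are missing or lack read/write/enter permission."""
    needed = os.R_OK | os.W_OK | os.X_OK
    problems = []
    for path in paths:
        if not os.path.isdir(path) or not os.access(path, needed):
            problems.append(path)
    return problems
```

Passing the configured eggs_dir, logs_dir, and dbs_dir paths and getting an empty list back means the directories exist and are usable by the current user.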
✅ Best Practices
- Prepare an independent virtual environment or Docker image for each project to eliminate dependency conflicts at the source.
- Manage configuration files per environment: development, testing, and production use different scrapy.cfg and config.py files, switched via environment variables or deployment scripts.
- Regularly clean up old versions and expired logs: remove unused versions through the delversion.json API, and rely on jobs_to_keep (named logs_to_keep in older Scrapyd releases) to rotate logs automatically.
- Monitoring and alerting: enable ScrapydWeb's email alerts, or use a simple script that periodically requests http://scrapyd:6800/daemonstatus.json to monitor service status.
- Resource planning: set max_proc and max_proc_per_cpu according to the number of CPU cores on the server and the actual crawler load, leaving headroom for the system and other services.
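The monitoring script mentioned above could be sketched as follows. The daemonstatus.json endpoint and its status and pending fields come from Scrapyd's API; the threshold of 50 pending jobs and the function names are assumptions, and the URL reuses the placeholder host from the list above.

```python
import json
import urllib.request

STATUS_URL = "http://scrapyd:6800/daemonstatus.json"


def evaluate_status(payload: dict, max_pending: int = 50) -> list:
    """Turn a daemonstatus.json payload into a list of problem descriptions."""
    problems = []
    if payload.get("status") != "ok":
        problems.append(f"daemon status is {payload.get('status')!r}")
    pending = payload.get("pending", 0)
    if pending > max_pending:
        problems.append(f"{pending} jobs pending (limit {max_pending})")
    return problems


def check(url: str = STATUS_URL, timeout: float = 5.0) -> list:
    """Fetch the status endpoint; an unreachable daemon counts as a problem."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError) as exc:  # network errors and invalid JSON
        return [f"daemon unreachable or invalid reply: {exc}"]
    return evaluate_status(payload)
```

Run check() from cron or a systemd timer and send an alert whenever the returned list is not empty.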
💡 Core Points: Scrapyd handles back-end scheduling, and ScrapydWeb handles front-end visualization. Together they let you build a production-grade crawler operations platform quickly. As long as the production environment covers the three security essentials of firewall, authentication, and HTTPS, the platform can run stably.

