Scrapyd and ScrapydWeb - A detailed guide to deploying and monitoring distributed crawlers

Once your crawlers grow from one or two scripts run by hand in a terminal into dozens or hundreds of projects that must be scheduled, monitored, and updated regularly, watching the command line no longer scales. Scrapyd and ScrapydWeb address exactly this problem: Scrapyd schedules and runs jobs quietly in the background, and ScrapydWeb provides a web interface for managing everything in one place. This article cuts through the details and helps you build a production-ready crawler operations platform as quickly as possible.

Table of contents

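  1. Tool overview
  2. Quick installation and core configuration
  3. Deploy your first project in 10 minutes
  4. Daily operation and maintenance: scheduling, viewing, canceling
  5. Production environment reinforcement
  6. Pitfall avoidance guide and best practices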

Tool Overview

Scrapyd

Scrapyd is a lightweight HTTP daemon from the Scrapy project. It listens on port 6800 by default and exposes a set of JSON APIs over HTTP. Its responsibilities are narrowly focused:

  • Manage multiple projects and versions and support deploying new versions at any time
  • Schedule and cancel crawler jobs (there is no pause operation)
  • Spawn and track one process per job; restarting the daemon itself after a crash is left to a supervisor such as systemd
  • Automatically save crawler running logs and metadata

You can think of it as a "task scheduler + process manager" specially built for Scrapy.

ScrapydWeb

ScrapydWeb is a third-party web management panel that can connect to one or several Scrapyd servers. Its highlights are:

  • Schedule crawlers graphically: filling in parameters is as easy as filling out a form
  • View, search, and filter crawler logs in real time
  • Manage multi-node clusters from one place: a single operation can be sent to all nodes
  • Send email alerts when a task completes or fails
  • Automatically parse statistical data in logs (such as number of items, number of requests, etc.)
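
To make the log-parsing bullet concrete, here is a minimal sketch of pulling integer counters out of the stats block that Scrapy logs when a spider closes. This is not how ScrapydWeb or its log parser is actually implemented; the function name `extract_stats` and the sample log text are made up for illustration.

```python
import re

# Matches entries such as 'item_scraped_count': 42 in the stats dump
# that Scrapy writes to the log when a spider finishes.
STAT_PATTERN = re.compile(r"'([\w/]+)':\s*(\d+)(?=[,}\s])")

# Hypothetical log excerpt, shortened for the example.
SAMPLE_LOG = """2024-01-01 12:00:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_count': 120,
 'downloader/response_status_count/200': 118,
 'elapsed_time_seconds': 35.2,
 'finish_reason': 'finished',
 'item_scraped_count': 95}
"""


def extract_stats(log_text: str) -> dict[str, int]:
    """Return the integer counters found after 'Dumping Scrapy stats:'."""
    _, marker, tail = log_text.partition("Dumping Scrapy stats:")
    if not marker:
        return {}  # spider still running, or the log was truncated
    return {key: int(value) for key, value in STAT_PATTERN.findall(tail)}
```

Calling `extract_stats(SAMPLE_LOG)` keeps the three integer counters and skips the float and string entries, which is enough for a simple items-versus-requests summary.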

The combination of the two is like installing an automated cockpit for your crawler team.


Quick installation and core configuration

1. Scrapyd daemon

Install

It is recommended to create a separate virtual environment for the Scrapy ecosystem to avoid package conflicts:

python -m venv scrapy_env
source scrapy_env/bin/activate        # Linux / Mac
# .\scrapy_env\Scripts\activate       # Windows

pip install scrapyd scrapyd-client

scrapyd-client provides the scrapyd-deploy command, which packages and uploads projects.

Core configuration

Create a configuration file. On Linux production servers it usually lives at /etc/scrapyd/scrapyd.conf; on Windows or in a test environment it can sit in the project directory:

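[scrapyd]
# Bind address. In production, start with 127.0.0.1 and expose the service
# later through the intranet or a proxy
bind_address = 127.0.0.1
http_port = 6800

# Maximum number of Scrapy processes running at once.
# 0 means "number of CPUs x max_proc_per_cpu"
max_proc = 4
# Processes allowed per CPU, used only when max_proc is 0
max_proc_per_cpu = 2

# Number of finished jobs per spider whose logs are kept
jobs_to_keep = 14
# Number of finished jobs kept in the job list
finished_to_keep = 10

# Storage paths (the user running Scrapyd needs read/write access)
eggs_dir = /var/lib/scrapyd/eggs
logs_dir = /var/log/scrapyd
dbs_dir = /var/lib/scrapyd/dbs

# Keep debug mode off in production
debug = off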

Note: every scheduled job runs as its own Scrapy process, so max_proc caps the number of jobs running at the same time on this node, not the number of spiders you have deployed. Jobs beyond the limit wait in the pending queue. When max_proc is 0, Scrapyd uses the number of CPUs multiplied by max_proc_per_cpu, so size the limit to the CPU and memory actually available.
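
The process limit can be sketched in a few lines. The helper below mirrors the documented rule that Scrapyd falls back to the CPU count multiplied by max_proc_per_cpu (default 4) when max_proc is 0. The function names and the running/pending split are illustrative assumptions, not Scrapyd's internal code.

```python
import os
from typing import Optional


def effective_max_proc(max_proc: int, max_proc_per_cpu: int = 4,
                       cpus: Optional[int] = None) -> int:
    """Return how many Scrapy processes may run at once on one node."""
    if max_proc > 0:
        return max_proc  # an explicit limit wins
    cpu_count = cpus or os.cpu_count() or 1
    return cpu_count * max_proc_per_cpu


def split_jobs(job_ids: list[str], limit: int) -> tuple[list[str], list[str]]:
    """Split scheduled jobs into (running, pending) under the given limit."""
    return job_ids[:limit], job_ids[limit:]
```

For example, `effective_max_proc(0, 2, cpus=8)` gives 16, while `effective_max_proc(4)` simply returns 4, and any job scheduled beyond that number would stay pending until a slot frees up.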

Startup

For a quick test, run Scrapyd in the foreground. It reads /etc/scrapyd/scrapyd.conf (and a scrapyd.conf in the current directory) automatically, so no extra flag is needed:

scrapyd

For production, run it under systemd so that it starts on boot and restarts automatically after a failure. Create /etc/systemd/system/scrapyd.service:

[Unit]
Description=Scrapyd Service
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/var/lib/scrapyd
ExecStart=/path/to/scrapy_env/bin/scrapyd
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Then execute:

sudo systemctl daemon-reload
sudo systemctl enable --now scrapyd

2. ScrapydWeb Management Panel

Install

Install in the same virtual environment or a new environment:

pip install scrapydweb

Configuration

After the first start, ScrapydWeb generates a settings file in the current directory. Current releases name it scrapydweb_settings_vN.py, where N depends on the version; this article calls it config.py for short. The options to focus on are:

import os

# Secret key: replace with a random string
SECRET_KEY = 'your-random-secret-key-here'

# Address and port ScrapydWeb binds to
SCRAPYDWEB_BIND = '0.0.0.0'
SCRAPYDWEB_PORT = 5000

# Enable login authentication (mandatory in production)
ENABLE_AUTH = True
USERNAME = 'admin'
PASSWORD = 'your-strong-password-here'

# Scrapyd nodes to manage; multiple nodes are supported
SCRAPYD_SERVERS = [
    'localhost:6800',           # local node
    # '192.168.1.100:6800',     # remote node 1
    # '192.168.1.101:6800',     # remote node 2
]

# Enable automatic log parsing
ENABLE_LOGPARSER = True

# Email alerts (optional)
ENABLE_EMAIL_ALERT = False
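
The placeholder secret and password above should be replaced with random values. One possible way to produce them, using only Python's standard `secrets` module, is sketched below. The helper name `generate_credentials` is hypothetical, and the chosen lengths are a judgment call rather than a ScrapydWeb requirement.

```python
import secrets


def generate_credentials(password_bytes: int = 24) -> dict[str, str]:
    """Create random values to paste into the settings file."""
    return {
        # 32 random bytes rendered as 64 hexadecimal characters
        "SECRET_KEY": secrets.token_hex(32),
        # URL-safe text, so it is easy to copy without escaping
        "PASSWORD": secrets.token_urlsafe(password_bytes),
    }
```

Run it once in a Python shell, copy the two values into the settings file, and store the password in your team's password manager.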

ScrapydWeb can be run under systemd in the same way. Create /etc/systemd/system/scrapydweb.service:

[Unit]
Description=ScrapydWeb Service
After=network.target scrapyd.service

[Service]
Type=simple
User=your_user
# ScrapydWeb loads its settings file from the working directory
WorkingDirectory=/path/to/config/dir
ExecStart=/path/to/scrapy_env/bin/scrapydweb
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Start and set up autostart:

sudo systemctl daemon-reload
sudo systemctl enable --now scrapydweb

Now open http://<your-ip>:5000 in your browser and enter the username and password to reach the management interface.


Deploy your first project in 10 minutes

Step 1: Modify local project configuration

In the root directory of your local Scrapy project, open scrapy.cfg and add a deployment target:

[settings]
default = myproject.settings

[deploy:local]
# Address of your Scrapyd server
url = http://localhost:6800/
# Project name; keep it unique
project = myproject
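
One detail worth knowing when editing scrapy.cfg: Python's `configparser` does not strip inline comments unless it is told to, so a comment placed after a value on the same line becomes part of that value. The sketch below demonstrates this with a default `ConfigParser`, assuming (as Scrapy's own config loading appears to do) that no inline comment prefixes are configured. The helper name and sample text are illustrative.

```python
from configparser import ConfigParser

# Sample text with a comment on the same line as the value.
CFG_WITH_INLINE_COMMENT = """
[deploy:local]
url = http://localhost:6800/    # my Scrapyd address
project = myproject
"""


def read_deploy_targets(cfg_text: str) -> dict[str, dict[str, str]]:
    """Return {target_name: options} for every [deploy] / [deploy:name] section."""
    parser = ConfigParser()  # default settings: inline comments are kept
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section.startswith("deploy"):
            name = section.partition(":")[2] or "default"
            targets[name] = dict(parser.items(section))
    return targets
```

With the sample text, the `url` option of the `local` target still contains the `# my Scrapyd address` suffix, which would produce an invalid upload URL. Keeping comments on their own lines avoids the problem.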

Step 2: Install dependencies in advance

Scrapyd does not read your requirements.txt automatically, so log in to the Scrapyd server beforehand and install the dependencies manually in the corresponding virtual environment:

source /path/to/scrapy_env/bin/activate
pip install -r /path/to/project/requirements.txt

Step 3: Package and upload

Return to the local project directory and run the deployment command:

scrapyd-deploy local -p myproject

If the output ends with a server response that contains "status": "ok" together with the project name and a version string, the deployment succeeded.
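
Each successful deployment adds one more version of the project on the server. As a rough housekeeping sketch, the helpers below pick the versions that could later be removed through Scrapyd's delversion.json endpoint. They assume the input list comes from listversions.json, which returns versions in order with the active one last. The function names and the default keep count are illustrative choices, not part of Scrapyd.

```python
def versions_to_delete(versions: list[str], keep: int = 3) -> list[str]:
    """Return every version except the newest `keep` entries."""
    if keep < 1:
        raise ValueError("keep must be at least 1 so the active version survives")
    return versions[:-keep]


def delversion_payloads(project: str, versions: list[str]) -> list[dict[str, str]]:
    """Build the form fields for one POST to /delversion.json per version."""
    return [{"project": project, "version": version} for version in versions]
```

A cleanup script would fetch the version list, pass it through `versions_to_delete`, and send one POST per payload. Keeping a few recent versions leaves room for a quick rollback.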


Daily operation and maintenance: scheduling, viewing, canceling

It is highly recommended to use ScrapydWeb for daily operations, which turns boring API calls into a few mouse clicks:

  1. Select the Scrapyd node you want to operate on at the top
  2. Click "Schedule" on the left, select the project and crawler, fill in the parameters, and start with one click
  3. Switch to the "Jobs" page and you can see the Pending (queuing), Running (running), and Finished (completed) tasks in real time.
  4. Click "Log" on the right side of the task to view the log in real time
  5. When you need to cancel, just click "Cancel"

If you need to write automated scripts, you can also call Scrapyd's HTTP API directly. Here are examples of the 5 most commonly used interfaces:

# 1. List deployed projects
curl http://localhost:6800/listprojects.json

# 2. List all spiders in a project
curl "http://localhost:6800/listspiders.json?project=myproject"

# 3. Start a spider (any extra -d name=value is passed to the spider as an argument)
curl -X POST http://localhost:6800/schedule.json \
  -d project=myproject \
  -d spider=myspider \
  -d start_url=https://example.com

# 4. List all jobs in a project
curl "http://localhost:6800/listjobs.json?project=myproject"

# 5. Cancel a specific job
curl -X POST http://localhost:6800/cancel.json \
  -d project=myproject \
  -d job=xxx-your-job-id-xxx
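
For automation scripts written in Python rather than shell, the same calls can be made with the standard library alone. The following is a minimal sketch, not an official client: the helper names are invented here, and extra keyword arguments are simply sent as additional form fields, which Scrapyd hands to the spider as arguments.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url: str, project: str, spider: str,
                           **spider_args: str) -> Request:
    """Prepare a POST to schedule.json; extra fields become spider arguments."""
    params = {"project": project, "spider": spider, **spider_args}
    return Request(
        f"{base_url.rstrip('/')}/schedule.json",
        data=urlencode(params).encode("utf-8"),
        method="POST",
    )


def build_cancel_request(base_url: str, project: str, job_id: str) -> Request:
    """Prepare a POST to cancel.json for one job."""
    params = {"project": project, "job": job_id}
    return Request(
        f"{base_url.rstrip('/')}/cancel.json",
        data=urlencode(params).encode("utf-8"),
        method="POST",
    )


def send(request: Request, timeout: float = 10.0) -> dict:
    """Send a prepared request and decode Scrapyd's JSON reply."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A typical call would be `send(build_schedule_request("http://localhost:6800", "myproject", "myspider", start_url="https://example.com"))`. Separating request construction from sending keeps the helpers easy to unit-test without a running Scrapyd.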

Production environment reinforcement

Firewall Policy

  • Never expose Scrapyd's port 6800 directly to the public internet: authentication is disabled by default (recent releases offer optional HTTP basic auth through the username and password options), so anyone who can reach the port can deploy, schedule, and cancel jobs.
  • Open ScrapydWeb's port 5000 (or 443 behind the Nginx proxy) only to operations staff or the office VPN.
  • Keep Scrapyd's port 6800 reachable only from the local machine, the intranet, or the server that runs ScrapydWeb.
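
As a small aid for reviewing these rules, the sketch below classifies a bind address with Python's standard `ipaddress` module, so a deployment script could refuse to start Scrapyd on anything broader than intended. The function name and the four labels are assumptions made for this example.

```python
import ipaddress


def exposure_level(bind_address: str) -> str:
    """Classify how widely a service bound to this address can be reached."""
    if bind_address == "localhost":
        return "loopback"
    ip = ipaddress.ip_address(bind_address)
    if ip.is_unspecified:   # 0.0.0.0 or :: means every interface
        return "all-interfaces"
    if ip.is_loopback:      # 127.0.0.0/8 or ::1
        return "loopback"
    if ip.is_private:       # RFC 1918 ranges such as 192.168.0.0/16
        return "private"
    return "public"
```

A pre-flight check could, for instance, allow "loopback" and "private" for Scrapyd and require an explicit override for "all-interfaces" or "public".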

Nginx reverse proxy + HTTPS

Put ScrapydWeb behind Nginx and terminate HTTPS there, so that login credentials are never sent in clear text. Below is a common Nginx configuration example (assuming you have already obtained the certificate with Certbot):

upstream scrapydweb_backend {
    server localhost:5000;
}

server {
    listen 80;
    server_name scrapyd.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name scrapyd.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/scrapyd.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/scrapyd.yourdomain.com/privkey.pem;
    include /etc/letsencrypt/options-ssl-nginx.conf;
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;

    # Basic security headers
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;
    add_header X-XSS-Protection "1; mode=block";
    client_max_body_size 10M;

    location / {
        proxy_pass http://scrapydweb_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
    }
}

Docker one-click deployment (optional)

If you prefer containers, the following docker-compose.yml gives you a quick setup:

version: '3.8'

services:
  scrapyd:
    image: python:3.9-slim
    container_name: scrapyd-server
    volumes:
      # scrapyd.conf must set bind_address = 0.0.0.0 inside the container;
      # the port mapping below is what restricts access to this host
      - ./scrapyd.conf:/etc/scrapyd/scrapyd.conf
      - ./logs:/var/log/scrapyd
      - ./eggs:/var/lib/scrapyd/eggs
    ports:
      - "127.0.0.1:6800:6800"          # reachable from this host only
    command: >
      bash -c "
      pip install scrapyd scrapyd-client pymongo requests &&
      scrapyd
      "
    restart: unless-stopped

  scrapydweb:
    image: my8100/scrapydweb:latest
    container_name: scrapydweb-ui
    volumes:
      - ./config.py:/app/config.py
      - ./data:/app/data
      - ./logs:/app/logs
    ports:
      - "0.0.0.0:5000:5000"
    depends_on:
      - scrapyd    # in config.py, list this node as 'scrapyd:6800', not 'localhost:6800'
    restart: unless-stopped

Pitfall avoidance guide and best practices

⚠️ Guide to avoid pitfalls

  1. max_proc is not the number of spiders: it is the global cap on concurrently running Scrapy processes (one per job). Tune it together with max_proc_per_cpu so that jobs cannot exhaust the server.
  2. Forgetting to install dependencies: run pip install on the server before deploying, otherwise the spider fails immediately with missing-module errors.
  3. Exposing the port directly: Scrapyd ships with authentication disabled, so never open port 6800 to the public internet, otherwise anyone can schedule and cancel your jobs.
  4. Path permission errors: Scrapyd needs read and write access to eggs_dir, logs_dir, and the other data directories. Remember to chown them to the service user on the first deployment.
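
The permission pitfall is easy to check before the first deployment. The sketch below reports directories that are missing or not accessible to the current user. It is meant to be run as the same user that runs Scrapyd; the function name and the paths in the usage note are taken from this article's example configuration and are otherwise assumptions.

```python
import os


def unwritable_dirs(paths: list[str]) -> list[tuple[str, str]]:
    """Return (path, reason) for each directory the current user cannot use."""
    problems = []
    for path in paths:
        if not os.path.isdir(path):
            problems.append((path, "missing"))
        elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
            problems.append((path, "no read/write access"))
    return problems
```

For the configuration shown earlier, the call would be `unwritable_dirs(["/var/lib/scrapyd/eggs", "/var/log/scrapyd", "/var/lib/scrapyd/dbs"])`; an empty list means the directories are ready.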

✅ Best Practices

  1. Prepare a separate virtual environment or Docker image for each project to eliminate dependency conflicts at the source.
  2. Manage configuration per environment: use different scrapy.cfg and config.py files for development, testing, and production, and switch between them through environment variables or deployment scripts.
  3. Regularly clean up old versions and expired logs: remove unused versions through the delversion.json API and let Scrapyd's log retention setting (jobs_to_keep) prune old logs automatically.
  4. Monitoring and alerting: enable ScrapydWeb's email alerts, or use a simple scheduled script that requests http://scrapyd:6800/daemonstatus.json to monitor service status.
  5. Resource planning: set max_proc and max_proc_per_cpu according to the server's CPU core count and the actual crawler load, leaving headroom for the system and other services.
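
The "simple script" for monitoring mentioned in the list could look roughly like this. It is a sketch under stated assumptions: daemonstatus.json is expected to return a JSON object with "status", "pending", "running", and "finished" fields, and the pending threshold and function names are arbitrary choices for illustration.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """Download and decode daemonstatus.json from one Scrapyd node."""
    url = f"{base_url.rstrip('/')}/daemonstatus.json"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def find_problems(status: dict, max_pending: int = 20) -> list[str]:
    """Turn a daemonstatus payload into human-readable warnings."""
    problems = []
    if status.get("status") != "ok":
        problems.append(f"daemon reported status {status.get('status')!r}")
    pending = status.get("pending", 0)
    if pending > max_pending:
        problems.append(f"{pending} jobs pending (threshold {max_pending})")
    return problems
```

Run from cron or a systemd timer, a wrapper would call `find_problems(fetch_status("http://scrapyd:6800"))`, treat a connection error as "node down", and send any resulting messages to your alert channel.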

💡 Core Points: Scrapyd handles back-end scheduling and ScrapydWeb handles front-end visualization. Together they let you build a production-grade crawler operations platform quickly. As long as the production setup covers the three security essentials (firewall, authentication, and HTTPS), it can run stably and with confidence.