Crawler Monitoring Dashboard: A Detailed Guide to Real-Time Monitoring and Alerting for Crawler Systems

Monitoring system overview

If a stably running crawler system is compared to a ship at sea, monitoring is the captain's radar and instrument panel. Through metric collection, log aggregation, visualization, and automated alerting, it turns abstract system state into intuitive charts, so you can keep track of the crawler's every move at any time.

Why do production-grade crawlers need monitoring?

A small stand-alone script does not need monitoring, but once a crawler becomes distributed, long-running, and highly available, an anomaly at any moment can mean:

  • Data loss: a crawl job crashes silently, and only days later is the data found to be incomplete;
  • Wasted resources: a memory leak drives the server out of memory (OOM), or idle tasks hog bandwidth;
  • Difficult troubleshooting: with hundreds of gigabytes of raw logs, locating a single problem takes half a day.

Good monitoring addresses these pain points directly: early warning before an incident, fast localization during it, and review afterwards.

A minimal, practical architecture

Many tutorials recommend the full ELK + Prometheus + Grafana stack, but its deployment cost is too high for small and medium-sized crawler projects. Here we recommend a lightweight "Three Musketeers" architecture that uses free cloud services and does not require you to install any server components yourself:

┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│  Scrapy/Scrapyd     │───▶│  Metrics/log export │───▶│  Grafana Cloud      │
│  (crawler instances)│    │  (middleware + SDK) │    │  (hosted Prom/Loki) │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
  • Grafana Cloud offers a free tier that bundles Prometheus (metrics), Loki (logs), and Grafana (visualization), usable right after registration;
  • The crawler side only needs to integrate the official prometheus_client library and expose a small set of core metrics;
  • For local testing, a tunneling tool such as ngrok lets the cloud-side Prometheus scrape local metrics.

Next we set up this stack step by step.


Core components in practice

1. Use the Prometheus client to expose crawler metrics

Scrapy has no built-in Prometheus exporter, so we use a downloader middleware to instrument it. The following middleware updates Prometheus counters, gauges, and histograms at each stage of the request/response/exception lifecycle:

# scrapy_project/middlewares.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import time

# ---------- Global metric definitions ----------
REQUESTS = Counter(
    'scrapy_requests_total', 'Total number of requests',
    ['spider', 'status']
)
ERRORS = Counter(
    'scrapy_errors_total', 'Total number of errors',
    ['spider', 'error_type']
)
ACTIVE_CONCURRENCY = Gauge(
    'scrapy_active_concurrency', 'Number of requests currently in flight',
    ['spider']
)
RESPONSE_TIME = Histogram(
    'scrapy_response_time_seconds', 'Response time distribution',
    ['spider'],
    buckets=[0.1, 0.5, 1, 3, 10]
)

class PrometheusMetricsMiddleware:
    """
    Scrapy downloader middleware that exposes crawl metrics to Prometheus.
    It starts an HTTP server dedicated to being scraped by the Prometheus server.
    """
    def __init__(self, settings):
        self.port = settings.getint('PROMETHEUS_METRICS_PORT', 8000)
        # Start the metrics HTTP server (it runs inside the same process as Scrapy)
        start_http_server(self.port)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        ACTIVE_CONCURRENCY.labels(spider=spider.name).inc()
        request.meta['start_time'] = time.time()

    def process_response(self, request, response, spider):
        if 'start_time' in request.meta:
            duration = time.time() - request.meta['start_time']
            RESPONSE_TIME.labels(spider=spider.name).observe(duration)
        ACTIVE_CONCURRENCY.labels(spider=spider.name).dec()
        REQUESTS.labels(
            spider=spider.name, status=str(response.status)
        ).inc()
        return response

    def process_exception(self, request, exception, spider):
        ACTIVE_CONCURRENCY.labels(spider=spider.name).dec()
        error_type = type(exception).__name__
        ERRORS.labels(spider=spider.name, error_type=error_type).inc()

Then enable the middleware in settings.py and specify a port (if there are multiple crawler processes, each must use a different port):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_project.middlewares.PrometheusMetricsMiddleware': 100,
}
PROMETHEUS_METRICS_PORT = 8001

After starting the crawler, visit http://localhost:8001/metrics and you will see raw data similar to the following:

# HELP scrapy_requests_total Total number of requests
# TYPE scrapy_requests_total counter
scrapy_requests_total{spider="example",status="200"} 152.0
scrapy_requests_total{spider="example",status="404"} 3.0
# TYPE scrapy_response_time_seconds histogram
scrapy_response_time_seconds_bucket{spider="example",le="0.1"} 12.0
...

This is the endpoint that Prometheus scrapes. Next, we let Grafana Cloud pull this data.
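To sanity-check the endpoint without opening a browser, the raw text can be parsed with a few lines of Python. The sketch below is a minimal, standard-library-only parser that handles only the simple name{labels} value lines shown above; it is not a full implementation of the Prometheus text format, and the helper names parse_samples and status_share are made up for this illustration. It sums scrapy_requests_total per status code and reports each code's share of all requests.

```python
import re
from collections import defaultdict

# Matches simple sample lines such as:
#   scrapy_requests_total{spider="example",status="200"} 152.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group('labels')))
            yield match.group('name'), labels, float(match.group('value'))


def status_share(text, metric='scrapy_requests_total'):
    """Return {status_code: fraction of all requests} for the given counter."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric and 'status' in labels:
            totals[labels['status']] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {status: count / grand_total for status, count in totals.items()}


if __name__ == '__main__':
    # Sample text copied from the output above (in practice, fetch it from /metrics).
    sample = '''
# TYPE scrapy_requests_total counter
scrapy_requests_total{spider="example",status="200"} 152.0
scrapy_requests_total{spider="example",status="404"} 3.0
'''
    for status, share in sorted(status_share(sample).items()):
        print(f'{status}: {share:.1%}')
```

If the share of 4xx/5xx codes looks wrong here, the problem is in the crawler itself rather than in the Grafana Cloud configuration.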

2. Connect to the free Grafana Cloud hosted service

Why choose Grafana Cloud? Built-in alerting, high availability, multi-tenancy, and a permanent free tier are enough for small and medium-sized teams, and they spare you the trouble of building and maintaining the stack yourself.

The steps are as follows:

  1. Register an account. Visit Grafana Cloud and sign up quickly with GitHub or Google.

  2. Create a Prometheus data source connection. Go to "Connections" on the left → "Add new connection", search for Prometheus, and select "Hosted Prometheus metrics" → "Create a Prometheus data source". You will get a remote write URL and credentials (username/API key), which can later be used to push local Prometheus metrics to the cloud. A simpler approach is to let the cloud-side Prometheus pull the local /metrics endpoint directly.

  3. Configure remote pull (requires a metrics port reachable from the public network). In a local development environment, you can use ngrok to expose localhost:8001 to the public network:

    ngrok http 8001

After running it you will get a public address such as https://xxxx.ngrok.io. Back in Grafana Cloud, in the data source configuration, keep Scrape interval at the default 30 seconds, leave Custom HTTP Headers empty (we do not need authentication for now), and set Prometheus scrape target to your ngrok address.

In production, secure the connection with a private line or VPC peering, and do not expose crawler metrics directly to the public network.

  4. Import a dashboard template with one click. The Grafana community provides a large number of ready-made templates. Go to "Dashboards" → "New" → "Import", enter a dashboard template ID (763 is used here only as an example), and select the newly configured Prometheus data source to get a real-time monitoring dashboard. Once its queries point at the metrics defined above, you can see:
  • Total requests curve (success/failure trend)
  • Error rate panel
  • Real-time active concurrency
  • Response time quantiles (P50/P95/P99)

With that, a zero-maintenance, zero-cost crawler monitoring dashboard is complete.
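The P50/P95/P99 panel is derived from the histogram's cumulative _bucket series; in PromQL this is typically done with histogram_quantile(). To make the idea concrete, here is a simplified Python sketch of the same linear-interpolation estimate. The bucket bounds match the middleware above, but the counts are invented for illustration, and the function is a simplification rather than Prometheus's exact implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count).
    """
    if not 0 <= q <= 1:
        raise ValueError('q must be between 0 and 1')
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError('buckets must end with the +Inf bucket')
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - prev_count
            if upper == math.inf or in_bucket == 0:
                # Cannot interpolate inside the +Inf bucket (or an empty one):
                # fall back to the lower edge of this bucket.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


if __name__ == '__main__':
    # Hypothetical cumulative counts for buckets=[0.1, 0.5, 1, 3, 10] plus +Inf.
    observed = [(0.1, 12), (0.5, 80), (1, 130), (3, 150), (10, 155), (math.inf, 155)]
    for q in (0.5, 0.95, 0.99):
        print(f'P{int(q * 100)} ~ {histogram_quantile(q, observed):.3f}s')
```

Because the estimate interpolates within a bucket, its precision depends on the bucket layout: if most responses fall between 1 and 3 seconds, adding a boundary at 2 would tighten the P95 estimate.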


Troubleshooting and Diagnosis

When the dashboard shows an anomaly, we need to locate the root cause quickly. Below is a lightweight diagnostic toolbox that can be integrated into the crawler project to run health checks at any time.

Diagnostic tool implementation

# diagnostic_tools.py
import psutil
import requests
import socket
import time
from datetime import datetime
import logging
from typing import Dict, List, Optional
import json

class DiagnosticTools:
    """Crawler fault diagnosis tools (lightweight version)."""

    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def check_system_resources(self) -> Dict:
        """Check system resources; anything above 85% is flagged with a warning."""
        cpu = psutil.cpu_percent(interval=1)
        mem = psutil.virtual_memory().percent
        disk = psutil.disk_usage('/').percent
        procs = len(psutil.pids())

        warnings = []
        if cpu > 85:
            warnings.append('CPU usage too high')
        if mem > 85:
            warnings.append('Memory usage too high')
        if disk > 85:
            warnings.append('Low disk space')

        return {
            'cpu_percent': cpu,
            'memory_percent': mem,
            'disk_percent': disk,
            'process_count': procs,
            'warnings': warnings
        }

    def check_network_connectivity(self, urls: List[str]) -> Dict:
        """Test network reachability of target sites or middleware services."""
        results = {}
        for url in urls:
            try:
                start = time.time()
                resp = requests.head(url, timeout=5, allow_redirects=True)
                results[url] = {
                    'status': 'success',
                    'code': resp.status_code,
                    'latency_seconds': round(time.time() - start, 3)
                }
            except Exception as e:
                results[url] = {
                    'status': 'failed',
                    'reason': str(e)
                }
        return results

    def check_port_availability(self, host: str, ports: List[int]) -> Dict:
        """Check key ports (Scrapyd 6800, Redis 6379, metrics 8001, etc.)."""
        status = {}
        for port in ports:
            sock = None
            try:
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                sock.settimeout(2)
                sock.connect((host, port))
                status[port] = 'open'
            except Exception:
                status[port] = 'closed'
            finally:
                if sock:
                    sock.close()
        return status

    def run_full_diagnosis(self, spider_hosts: Optional[List[str]] = None) -> str:
        """Generate a complete, human-readable diagnostic report."""
        report = [f"# Crawler diagnostic report - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"]
        report.append("\n## System resources")
        sys_res = self.check_system_resources()
        report.append(json.dumps(sys_res, indent=2, ensure_ascii=False))
        # Check key ports on every given host; default to the local machine.
        for host in spider_hosts or ['localhost']:
            report.append(f"\n## Key ports ({host})")
            port_res = self.check_port_availability(host, [6800, 6379, 8001])
            report.append(json.dumps(port_res, indent=2))
        return "\n".join(report)

# Quick self-check from the command line
if __name__ == "__main__":
    diag = DiagnosticTools()
    print(diag.run_full_diagnosis())

This report can be sent directly to a WeCom (Enterprise WeChat) or DingTalk alert channel to automate diagnosis.
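As a rough illustration of that last step, the sketch below posts a report to a chat webhook using only the standard library. Several things here are assumptions rather than facts from this tutorial: the webhook URL is a placeholder you obtain from your chat tool, the text-message JSON shape should be confirmed against the WeCom or DingTalk robot documentation, and the 2000-character cap is an arbitrary safety limit. Some robots also require a keyword or signature, which this sketch does not handle.

```python
import json
import urllib.request

MAX_CHARS = 2000  # assumed safety limit; check your chat tool's real limit


def build_text_payload(report, max_chars=MAX_CHARS):
    """Wrap a report in a simple text-message JSON body (assumed shape)."""
    if len(report) > max_chars:
        report = report[:max_chars] + '\n... (truncated)'
    return {'msgtype': 'text', 'text': {'content': report}}


def send_report(webhook_url, report, timeout=5):
    """POST the report to the webhook and return the HTTP status code."""
    body = json.dumps(build_text_payload(report), ensure_ascii=False).encode('utf-8')
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={'Content-Type': 'application/json; charset=utf-8'},
        method='POST',
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call might look like send_report(WEBHOOK_URL, DiagnosticTools().run_full_diagnosis()), where WEBHOOK_URL is the hypothetical robot address configured for your alert group.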

Common faults cheat sheet

Symptom: Fetch success rate plummets
Troubleshooting steps:
  1. View the Prometheus metric scrapy_requests_total as curves grouped by status code, paying particular attention to the share of 4xx/5xx responses.
  2. Run check_network_connectivity against the target site.
  3. Check whether anti-crawling measures (CAPTCHA, IP ban) have been triggered.

Symptom: The crawler stops running for no apparent reason
Troubleshooting steps:
  1. Check the system resource panel to confirm whether memory or CPU is exhausted.
  2. Check the Scrapyd job logs (logs/<project>/<spider>/...) to find the exception stack trace.
  3. Check whether an uncaught exception caused the reactor to exit.

Symptom: Grafana shows no data
Troubleshooting steps:
  1. On the local machine, open http://localhost:8001/metrics to confirm that the endpoint works.
  2. Check whether the scrape configuration in Grafana Cloud is correct.
  3. If using ngrok, check whether the tunnel is stable (a restart is sometimes required).

Monitoring Best Practices

  1. Security first, intranet first. In production, do not expose the metrics port to the public network. Push data to the cloud with Prometheus's remote write feature, or go through a private VPC channel. Using ngrok for local debugging is only a temporary solution.

  2. Sensible alert rules

  • Error rate: alert only when the error rate stays above 5% for 5 consecutive minutes, to avoid false alarms from momentary fluctuations.
  • Resources: trigger when memory usage stays above 90% for 3 minutes.
  • Heartbeat: each crawler should report a regular heartbeat metric (for example, an up value once a minute); alert if it is missing for more than 2 minutes.
  3. Layered dashboards. A three-level set of dashboards is recommended:
  • Overview layer: for operations staff and owners, showing QPS, error rate, and system health for the whole platform.
  • Project layer: for the developers of a specific crawl task, showing the request volume, latency, data volume, and so on of a single spider.
  • Single-task layer: used when troubleshooting, with detailed log streams, the latest run parameters, and the status of dependent services.
  4. Clean up data regularly. The free tier of Grafana Cloud has limited retention for logs and metrics (cited here as 13 months for Prometheus and 30 days for Loki; verify against the current plan limits), so be careful not to pour unlimited test data into it. You can use distinct job_name values in the Prometheus scrape_configs settings to keep sources separate and easier to manage.

  5. Evolve gradually. While the team is small, the lightweight "Three Musketeers" stack described above is entirely sufficient. As the crawler fleet grows, you can consider moving to self-hosted Prometheus + Grafana, replacing ELK with Loki for unified logging, and adding Tempo for distributed tracing, so the transition stays smooth and over-engineering is avoided.
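The alert thresholds listed above (error rate, resource usage, heartbeat) all share one idea: fire only when a condition has held continuously for some time. In Prometheus or Grafana this is normally expressed in the alert rule itself, for example with a rule's for duration. The Python sketch below only illustrates the logic; the class and function names are invented for this example, and the readings are simulated.

```python
import time


class SustainedCondition:
    """Fires only after a condition has held continuously for `hold_seconds`."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._since = None  # time at which the condition first became true

    def update(self, is_breaching, now=None):
        """Record one evaluation; return True when an alert should fire."""
        now = time.monotonic() if now is None else now
        if not is_breaching:
            self._since = None  # any recovery resets the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_seconds


def heartbeat_lost(last_seen, now, max_silence=120):
    """True when no heartbeat has arrived for more than `max_silence` seconds."""
    return now - last_seen > max_silence


if __name__ == '__main__':
    error_rate_alert = SustainedCondition(hold_seconds=5 * 60)  # >5% for 5 minutes
    # Simulated (timestamp_seconds, error_rate) readings, one per minute.
    readings = [(0, 0.08), (60, 0.09), (120, 0.02), (180, 0.07), (240, 0.07),
                (300, 0.08), (360, 0.09), (420, 0.10), (480, 0.06)]
    for ts, rate in readings:
        if error_rate_alert.update(rate > 0.05, now=ts):
            print(f't={ts}s: error rate above 5% for 5 minutes')
```

In the simulated series the early spike does not alert, because the rate recovers at 120 s and resets the timer. This is the behavior that filters out momentary fluctuations.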

Monitoring is not a static decoration; it should keep evolving with the business. I hope this tutorial helps you quickly set up a practical, low-cost crawler monitoring system, so that your crawl jobs no longer fly blind!