Crawler Monitoring Dashboard - A detailed guide to real-time monitoring and alerting for crawler systems
📂 Stage: Stage 6 - Operations and Monitoring (Engineering) 🔗 Related chapters: Scrapyd and ScrapydWeb · Dockerized Crawlers · Scrapy-Redis Distributed Architecture
Monitoring system overview
If a stable crawler system is compared to a ship at sea, monitoring is the captain's radar and instrument panel. Through metric collection, log aggregation, visualization, and automated alerting, it turns abstract system state into readable charts, so you can see what the crawler is doing at any time.
Why do production-grade crawlers need monitoring?
A small stand-alone script does not need monitoring, but once a crawler becomes distributed, long-running, and highly available, every minute of undetected anomaly can mean:
- Data loss: the crawling task crashed silently, and the incomplete data was only discovered days later;
- Wasted resources: a memory leak drives the server into OOM, or idle tasks keep consuming bandwidth;
- Hard troubleshooting: with hundreds of gigabytes of raw logs, locating a single problem takes half a day.
Good monitoring addresses these pain points at the root: early warning before an incident, fast localization during it, and review afterwards.
A minimal, practical architecture
Many tutorials recommend the full ELK + Prometheus + Grafana stack, whose deployment cost is too high for small and medium-sized crawler projects. Here we recommend a lightweight "Three Musketeers" architecture that relies on free cloud services and requires no self-hosted server components:
- Grafana Cloud offers a free tier that bundles Prometheus (metrics), Loki (logs), and Grafana (visualization), ready to use right after registration;
- The crawler side only needs to integrate the official prometheus_client library and expose a small set of core metrics;
- For local testing, a tunneling tool such as ngrok lets the cloud-side Prometheus scrape local metrics.
Next, we build this stack step by step.
Core components in practice
1. Use the Prometheus client to expose crawler metrics
Scrapy does not export Prometheus metrics out of the box, so we can use a downloader middleware for instrumentation. Such a middleware updates Prometheus counters, gauges, and histograms at each stage of the request/response/exception lifecycle.
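A minimal sketch of such a middleware, assuming the prometheus_client package is installed. The class name PrometheusMiddleware, the metric names prefixed with crawler_, and the PROMETHEUS_PORT setting are all illustrative choices, not names defined by Scrapy or by this tutorial's original code.

```python
import time

from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, start_http_server,
)


class PrometheusMiddleware:
    """Downloader middleware that records request metrics (illustrative sketch)."""

    def __init__(self, port=None, registry=None):
        # A private registry keeps the sketch isolated from global state.
        self.registry = registry if registry is not None else CollectorRegistry()
        self.requests = Counter(
            "crawler_requests_total", "Requests sent to the downloader",
            ["spider"], registry=self.registry)
        self.responses = Counter(
            "crawler_responses_total", "Responses received, by HTTP status",
            ["spider", "status"], registry=self.registry)
        self.exceptions = Counter(
            "crawler_exceptions_total", "Download exceptions, by exception type",
            ["spider", "type"], registry=self.registry)
        self.in_flight = Gauge(
            "crawler_requests_in_flight", "Requests currently being downloaded",
            ["spider"], registry=self.registry)
        self.latency = Histogram(
            "crawler_response_seconds", "Download latency in seconds",
            ["spider"], registry=self.registry)
        if port is not None:
            # Serves /metrics on the given port in a background thread.
            start_http_server(port, registry=self.registry)

    @classmethod
    def from_crawler(cls, crawler):
        # PROMETHEUS_PORT is a hypothetical custom setting name.
        return cls(port=crawler.settings.getint("PROMETHEUS_PORT", 8001))

    def process_request(self, request, spider):
        request.meta["_prom_start"] = time.monotonic()
        self.requests.labels(spider.name).inc()
        self.in_flight.labels(spider.name).inc()
        return None  # let Scrapy continue processing the request

    def _finish(self, request, spider):
        start = request.meta.pop("_prom_start", None)
        if start is not None:
            self.in_flight.labels(spider.name).dec()
            self.latency.labels(spider.name).observe(time.monotonic() - start)

    def process_response(self, request, response, spider):
        self._finish(request, spider)
        self.responses.labels(spider.name, str(response.status)).inc()
        return response

    def process_exception(self, request, exception, spider):
        self._finish(request, spider)
        self.exceptions.labels(spider.name, type(exception).__name__).inc()
        return None  # let other middlewares handle the exception
```

The sketch uses the classic three-argument hook signatures with a spider parameter; check them against the Scrapy version you run. Labeling by status code and exception type keeps cardinality low, which matters on a free metrics quota.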
Then activate the middleware in settings.py and specify a port (if several crawler processes run on the same host, each needs a different port).
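A settings fragment might look like the following. The module path myproject.middlewares.PrometheusMiddleware, the priority 543, and the PROMETHEUS_PORT setting name are assumptions for illustration; adapt them to wherever your middleware class actually lives.

```python
# settings.py (fragment)

# Register the metrics middleware; the dotted path and priority are hypothetical.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.PrometheusMiddleware": 543,
}

# Custom setting read by the middleware; use a different port per process.
PROMETHEUS_PORT = 8001
```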
After starting the crawler, visit http://localhost:8001/metrics to see the raw metrics in the Prometheus text exposition format: one sample per line, preceded by # HELP and # TYPE comment lines.
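To check the endpoint from a script rather than a browser, a small standard-library helper can fetch and parse the text. This is a simplified sketch: it ignores optional timestamps and assumes label values contain no spaces. The SAMPLE text uses hypothetical metric names and made-up values purely to show the shape of the format; it is not output from a real crawl.

```python
import urllib.request


def parse_metrics(text):
    """Parse exposition-format text into {"name{labels}": value} (simplified)."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def fetch_metrics(url="http://localhost:8001/metrics", timeout=5):
    """Fetch the metrics endpoint and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative input only: names and values are invented for this example.
SAMPLE = """\
# HELP crawler_requests_total Requests sent to the downloader
# TYPE crawler_requests_total counter
crawler_requests_total{spider="demo"} 128.0
# HELP crawler_requests_in_flight Requests currently being downloaded
# TYPE crawler_requests_in_flight gauge
crawler_requests_in_flight{spider="demo"} 4.0
"""

if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE).items():
        print(name, value)
```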
This is the endpoint that Prometheus scrapes. Next, we let Grafana Cloud pull this data.
2. Connect to the free Grafana Cloud hosted service
**Why choose Grafana Cloud?** It comes with built-in alerting, high availability, and multi-tenancy, and its free-forever tier is enough for small and medium-sized teams, which spares you the work of building and maintaining the stack yourself.
The steps are as follows:
- Register an account: visit Grafana Cloud and sign up quickly with GitHub or Google.
- Create a Prometheus data source connection: in the left menu, go to "Connections" → "Add new connection", search for Prometheus, then select "Hosted Prometheus metrics" → "Create a Prometheus data source". You will get a remote write URL and credentials (username/API key), which can later be used to push local Prometheus metrics to the cloud. A simpler approach is to let the cloud-side Prometheus pull the local /metrics endpoint directly.
- Configure remote pull (requires a metrics port reachable from the public network): in a local development environment, you can expose localhost:8001 to the public network with ngrok by running ngrok http 8001. Once it starts, you will get a public address like https://xxxx.ngrok.io.
Back in Grafana Cloud, in the data source configuration, keep Scrape interval at the default 30 seconds, leave Custom HTTP Headers empty (we do not need authentication for now), and set Prometheus scrape target to your ngrok address.
In production, secure the connection through a private link or VPC peering, and do not expose crawler metrics directly to the public network.
- Import a dashboard template with one click: the Grafana community provides many ready-made templates. Go to "Dashboards" → "New" → "Import", enter the dashboard template ID 763 (example), and select the newly configured Prometheus data source to get a real-time monitoring panel. You can see:
- Total requests curve (success/failure trend)
- Error Rate Panel
- Active concurrency count real-time value
- Response time quantiles (P50/P95/P99)
With that, a zero-maintenance, zero-cost crawler monitoring dashboard is complete.
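The P50/P95/P99 panels are estimated from histogram buckets, not computed from individual response times. The sketch below shows the usual approach of linear interpolation inside the bucket that contains the target rank. It is a simplified illustration of the idea, not Prometheus's exact implementation, and the bucket values in the usage example are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    The lowest bucket is assumed to start at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or empty bucket): fall back to the lower edge.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper edge.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Invented example: 100 requests, cumulative counts per latency bound (seconds).
    demo = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    for q in (0.5, 0.95, 0.99):
        print(f"P{int(q * 100)} ~ {histogram_quantile(q, demo):.2f}s")
```

Because the estimate depends on bucket edges, choose histogram buckets that bracket the latencies you care about; very wide buckets make the high quantiles coarse.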
Troubleshooting and Diagnosis
When the dashboard shows an anomaly, we need to locate the root cause quickly. A lightweight diagnostic toolbox integrated into the crawler project lets you run health checks at any time.
Diagnostic tool implementation
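A minimal sketch of such a toolbox, using only the standard library. The specific checks (metrics endpoint, a Redis port as used by Scrapy-Redis setups, free disk space), the function names, and the default addresses are all assumptions chosen for illustration; replace them with the dependencies your crawler actually has.

```python
import http.client
import shutil
import socket
import time
import urllib.request


def check_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_metrics_endpoint(url, timeout=3.0):
    """Return True if the metrics endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        return False


def check_disk(path="/", min_free_ratio=0.1):
    """Return True if at least min_free_ratio of the disk is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_ratio


def run_diagnostics(metrics_url="http://localhost:8001/metrics",
                    redis_addr=("localhost", 6379)):
    """Run all checks and return a report dictionary."""
    checks = {
        "metrics_endpoint": check_metrics_endpoint(metrics_url),
        "redis_reachable": check_tcp(*redis_addr),
        "disk_space_ok": check_disk(),
    }
    return {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "healthy": all(checks.values()),
        "checks": checks,
    }


def format_report(report):
    """Render the report as plain text suitable for a chat alert message."""
    lines = [f"Crawler health report ({report['timestamp']})"]
    for name, ok in report["checks"].items():
        lines.append(f"[{'OK' if ok else 'FAIL'}] {name}")
    lines.append("Overall: " + ("healthy" if report["healthy"] else "unhealthy"))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(run_diagnostics()))
```

Each check returns a plain boolean, so adding one (proxy pool, database, queue depth) only means adding one entry to the checks dictionary.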
The diagnostic report can be sent directly to a WeCom (enterprise WeChat) or DingTalk alert channel for automated diagnosis.
Common faults cheat sheet
Monitoring Best Practices
- Security first, private network first: in production, do not expose the metrics port to the public network. Use Prometheus's remote write feature to push data to the cloud, or go through a private VPC channel. Local debugging with ngrok is only a temporary solution.
- Reasonable alert rules:
- Error rate: alert only when the error rate stays above 5% for 5 consecutive minutes, to avoid false alarms from momentary spikes.
- Resources: trigger when memory usage stays above 90% for 3 minutes.
- Heartbeat: each crawler should report a regular heartbeat metric (for example, an up value once per minute); if it is missing for more than 2 minutes, raise an alert.
- Layered dashboards: build three levels of dashboards:
- Overview layer: for operations staff and owners, showing QPS, error rate, and overall system health for the whole platform.
- Project layer: for the developers of specific crawling tasks, showing request volume, latency, and data volume for a single spider.
- Single-task layer: for troubleshooting, including detailed log streams, the latest run parameters, and the status of dependent services.
- Clean up data regularly: the free version of Grafana Cloud has fixed retention periods for logs and metrics (13 months for Prometheus and 30 days for Loki). Take care not to dump unlimited test data into it. You can separate sources by job_name in the Prometheus scrape_configs settings to make them easier to manage.
- Evolve gradually: while the team is small, the lightweight "Three Musketeers" stack above is entirely sufficient. As the crawler fleet grows, consider upgrading to self-hosted Prometheus + Grafana, using Loki instead of ELK for unified logging, and adding Tempo for distributed tracing, so the transition stays smooth and you avoid over-engineering.
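The alert rules in the list above share one idea: fire only when a condition has held continuously for some duration. In practice these rules would live in your alerting system; the Python sketch below only illustrates the logic. The class and function names are hypothetical, and timestamps are plain seconds.

```python
class SustainedThreshold:
    """Fires once a value has stayed above a threshold for a full window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.breach_started = None  # timestamp when the current breach began

    def update(self, timestamp, value):
        """Feed one sample; return True if the alert should fire."""
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = timestamp
        else:
            # Any sample back under the threshold resets the timer.
            self.breach_started = None
        return (self.breach_started is not None
                and timestamp - self.breach_started >= self.window_seconds)


def heartbeat_lost(last_seen, now, max_silence_seconds=120):
    """Return True if no heartbeat arrived within the allowed silence."""
    return now - last_seen > max_silence_seconds


# Thresholds taken from the rules above.
error_rate_rule = SustainedThreshold(threshold=0.05, window_seconds=5 * 60)
memory_rule = SustainedThreshold(threshold=0.90, window_seconds=3 * 60)

if __name__ == "__main__":
    # Invented samples: (seconds since start, error rate).
    for ts, rate in [(0, 0.08), (60, 0.09), (300, 0.07), (360, 0.01)]:
        print(ts, rate, error_rate_rule.update(ts, rate))
```

Resetting the timer on the first healthy sample is what suppresses alerts from short spikes, at the cost of a delay equal to the window before a real incident is reported.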
Monitoring is not a static decoration; it should keep evolving with the business. I hope this tutorial helps you quickly set up a practical, low-cost crawler monitoring system, so that your crawlers no longer fly blind.

