Incremental crawling practice: Redis fingerprint verification, bandwidth optimization
1. What is an incremental crawler?
Before getting to the code, let's clarify the difference between an incremental crawler and a "normal" full crawler. A full crawler re-crawls every target page each time it starts, regardless of whether the content has changed. An incremental crawler only crawls pages that appear for the first time or whose content has been updated.
Beginners may think the difference doesn't matter, but once a project goes live (for example, monitoring competitors' prices or aggregating trending topics in real time), the cost of full crawling becomes considerable:
- Server bandwidth and computing resources are wasted
- Crawling is slow, so the update frequency cannot keep up
- The target website's anti-crawling mechanisms are triggered very easily
So today we use Scrapy + Redis to implement the most basic and versatile kind of incremental crawler: one based on URL fingerprints.
2. Hands-on code: an incremental crawler based on URL fingerprint checks
2.1 Preparation
Install the dependencies first: Scrapy and the Redis client for Python (for example, pip install scrapy redis).
Make sure a local or remote Redis service is running (default port 6379). No password is needed for this demonstration, but authentication must be enabled in a production environment.
2.2 Complete code analysis
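The following is a minimal sketch of the URL fingerprint check, not the article's original listing. It assumes a SHA-1 hash of a lightly normalized URL and a Redis set whose key name, crawler:seen_urls, is hypothetical. The class names SeenUrls and InMemorySetClient are also illustrative; the in-memory client only mimics the return value of Redis SADD so that the sketch runs without a server.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


class InMemorySetClient:
    """Stand-in for a Redis client so the sketch runs without a server.

    Mirrors the return convention of SADD: 1 if the member was newly
    added, 0 if it was already in the set.
    """

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def url_fingerprint(url):
    """Return a fixed-length SHA-1 hex digest of a lightly normalized URL."""
    parts = urlsplit(url.strip())
    # Lowercase the scheme and network location, and drop the fragment,
    # which is never sent to the server.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenUrls:
    """Records URL fingerprints in a Redis-style set."""

    def __init__(self, client, key="crawler:seen_urls"):  # hypothetical key name
        self.client = client
        self.key = key

    def is_new(self, url):
        """Record the URL in one SADD call; True only the first time it is seen."""
        return self.client.sadd(self.key, url_fingerprint(url)) == 1


seen = SeenUrls(InMemorySetClient())
first_run = [u for u in ["https://example.com/a", "https://example.com/b"] if seen.is_new(u)]
second_run = [u for u in ["https://example.com/a", "https://example.com/c"] if seen.is_new(u)]
print(first_run)   # both URLs are new on the first pass
print(second_run)  # only the URL that was not seen before
```

In a real project the stand-in would presumably be replaced by redis.Redis(host="localhost", port=6379), and the spider would call seen.is_new(url) before yielding each detail-page request. One design caveat of this sketch: the fingerprint is stored before the page is downloaded, so a request that later fails would not be retried on the next run unless its fingerprint is removed.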
2.3 Start crawler test
In the terminal, enter the Scrapy project root directory and run the spider twice:
- First execution: the entire site is crawled and a quotes.json file containing author information is generated.
- Second execution: only the home page and pagination requests appear in the terminal, with no detail page requests, which shows that incremental crawling is working.
3. Further optimization directions
3.1 More than just URL fingerprinting
On some websites the URL stays the same while the content is updated dynamically (news home pages, e-commerce product pages, and so on), so URL fingerprinting alone is not enough. In that case, consider:
- Response content hash: hash the entire page or its key fields (take care to filter out volatile parts such as advertisements and timestamps)
- HTTP response headers: use Last-Modified or ETag to determine whether the page has been modified
- Key information comparison: store business fields such as price and inventory in MySQL or Elasticsearch, and only capture data that has changed
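The first two ideas above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names, the example URL, and the field names price and stock are hypothetical, and a plain dict stands in for whatever store (for example a Redis hash) keeps the previous hashes.

```python
import hashlib
import json


def content_fingerprint(fields):
    """Hash only the business fields that matter, in a stable key order."""
    canonical = json.dumps(
        fields, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(stored, url, fields):
    """Compare with the stored hash, update it, and report whether it changed."""
    new_hash = content_fingerprint(fields)
    changed = stored.get(url) != new_hash
    stored[url] = new_hash
    return changed


def conditional_headers(etag=None, last_modified=None):
    """Build conditional GET headers from validators saved on the last crawl."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


stored_hashes = {}  # plain dict here; a persistent store would be used in practice
url = "https://example.com/item/1"  # hypothetical product page

# Volatile parts (ads, timestamps) are simply left out of the hashed fields.
print(has_changed(stored_hashes, url, {"price": "19.99", "stock": 5}))  # first sight
print(has_changed(stored_hashes, url, {"stock": 5, "price": "19.99"}))  # same data
print(has_changed(stored_hashes, url, {"price": "17.99", "stock": 5}))  # new price
print(conditional_headers(etag='"abc123"'))
```

When the server honors these validators, it answers an unchanged page with 304 Not Modified and no body, so the crawler can skip parsing and save bandwidth. Not every site sends ETag or Last-Modified, which is why the content hash remains useful as a fallback.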
3.2 Production environment Redis recommendations
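- Allocate a separate database (e.g. db=1) to the crawler's fingerprints so they are not mixed with other data
- Require an authentication password, and use SSL/TLS if necessary
- For clustered deployments, use Redis Cluster instead of a single instance (Redis Cluster supports only database 0, so separate the fingerprints with a key prefix there)
- When memory is tight, enable maxmemory-policy allkeys-lru to evict cold data automatically (URLs whose fingerprints are evicted will be crawled again)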
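As a rough illustration of the first two recommendations, the connection settings can be read from the environment rather than hard-coded. The environment variable names below (REDIS_HOST, REDIS_PORT, REDIS_DB, REDIS_PASSWORD, REDIS_SSL) and the function name are assumptions made for this sketch; the returned keys match keyword arguments accepted by the redis-py client.

```python
import os


def redis_settings(env=None):
    """Collect Redis connection keyword arguments from environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("REDIS_HOST", "localhost"),
        "port": int(env.get("REDIS_PORT", "6379")),
        # A dedicated database for fingerprints; db=1 follows the example above.
        "db": int(env.get("REDIS_DB", "1")),
        # None means no authentication, which is acceptable only for local demos.
        "password": env.get("REDIS_PASSWORD"),
        "ssl": env.get("REDIS_SSL", "0") == "1",
    }


# Example with an explicit mapping instead of the real environment.
print(redis_settings({"REDIS_PASSWORD": "example-only", "REDIS_SSL": "1"}))
```

With the redis package installed, the result could be passed on as redis.Redis(**redis_settings()). Keeping the password out of the source code also makes it easier to use different credentials for the demo and production environments.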
4. Summary
The core logic of incremental crawling fits in one sentence: first check whether a page has already been crawled, then decide whether to request it.
The URL fingerprint version of the incremental crawler demonstrated in this article is basic but widely applicable. It is especially suitable for scenarios where a fixed list page keeps linking to new detail pages:
- Monitor news sites for new articles
- Capture new forum posts
- Synchronize new products on e-commerce platforms
💡 Remember: in production-level crawler projects, incremental crawling is among the most cost-effective optimizations available.

