Incremental crawling in practice: Redis fingerprint checks to save bandwidth



1. What is an incremental crawler?

Before getting to the code, let's clarify how an incremental crawler differs from an ordinary full crawler. A full crawler re-fetches every target page each time it starts, whether or not the content has changed. An incremental crawler fetches only pages that appear for the first time or whose content has been updated.
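
To make the difference concrete, here is a minimal sketch that uses an in-memory set in place of a real fingerprint store. The function name `plan_requests` and the example.com URLs are illustrative assumptions, and the sketch only covers the "appears for the first time" case, not updated content.

```python
def plan_requests(target_urls, seen, incremental=True):
    """Return the URLs one run would fetch, and record them in `seen`.

    `seen` stands in for a persistent fingerprint store (hypothetical here).
    """
    if incremental:
        to_fetch = [u for u in target_urls if u not in seen]
    else:
        to_fetch = list(target_urls)  # a full crawl ignores history
    seen.update(to_fetch)
    return to_fetch


seen_urls = set()
pages = ["https://example.com/a", "https://example.com/b"]

first_run = plan_requests(pages, seen_urls)    # both pages are new
pages.append("https://example.com/c")          # one page appears later
second_run = plan_requests(pages, seen_urls)   # only the new page is fetched
full_run = plan_requests(pages, set(), incremental=False)  # everything again
```

The second incremental run issues one request where the full crawl issues three, and that gap grows with the size of the site.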

Beginners may think the difference does not matter, but once a project goes live (for example, monitoring competitors' prices or aggregating trending topics in real time), the cost of full crawling becomes hard to ignore:

  • It wastes server bandwidth and computing resources
  • Crawling is slow, so the update frequency cannot keep up
  • It easily triggers the target website's anti-crawling mechanisms

So today we will use Scrapy + Redis to implement the most basic and widely applicable kind of incremental crawler: one based on URL fingerprints.


2. Practical code: an incremental crawler with URL fingerprint checks

2.1 Preparation

Install the dependencies first:

pip install scrapy redis

Make sure a local or remote Redis service is running (default port 6379). No password is needed for this demonstration, but authentication must be enabled in production.

2.2 Complete code walkthrough

import scrapy
import redis
import hashlib
from scrapy.http import Request   # Request is used in parse(), so import it explicitly

class IncrementalSpider(scrapy.Spider):
    name = "incremental_demo"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Connect to Redis (in production, keep the connection settings in settings.py)
        self.redis_client = redis.Redis(
            host="127.0.0.1",
            port=6379,
            db=0,
            decode_responses=True   # return str instead of bytes, so no manual .decode()
        )

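    def parse(self, response):
        for quote_card in response.css("div.quote"):
            # The "(about)" link sits next to <small class="author">, not inside it
            detail_url = quote_card.css("span a::attr(href)").get()
            if not detail_url:
                continue  # no author link on this card
            full_detail_url = response.urljoin(detail_url)

            # Build the URL fingerprint (MD5 is fast and sufficient here;
            # switch to SHA-256 for very strict scenarios)
            url_fingerprint = hashlib.md5(
                full_detail_url.encode("utf-8")
            ).hexdigest()

            redis_key = f"crawled_urls:{url_fingerprint}"

            # Key absent -> new URL -> send the request and record the fingerprint
            if not self.redis_client.exists(redis_key):
                yield Request(
                    url=full_detail_url,
                    callback=self.parse_quote_detail,
                    # The author page does not contain the quote itself, so the
                    # fields taken from the list page travel along in meta
                    meta={
                        "author": quote_card.css("span small.author::text").get(),
                        "quote": quote_card.css("span.text::text").get(default="").strip("“”"),
                        "tags": quote_card.css("div.tags a.tag::text").getall(),
                    },
                )
                # Store the fingerprint with a 7-day expiry so Redis memory stays bounded.
                # It is recorded before the response arrives, so a failed request
                # is not retried until the key expires.
                self.redis_client.set(redis_key, "1", ex=604800)  # 604800 s = 7 days

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_quote_detail(self, response):
        yield {
            "author": response.meta["author"],
            "author_birth_date": response.css("span.author-born-date::text").get(),
            "author_birth_place": response.css("span.author-born-location::text").get(),
            "quote": response.meta["quote"],
            "tags": response.meta["tags"],
        }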
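
The `exists` check followed by `set` costs two round trips, and two crawler processes sharing one Redis could both pass the check for the same URL. One possible alternative, sketched below, collapses both steps into a single `SET ... NX EX` command, which redis-py exposes as `set(key, value, nx=True, ex=ttl)`. The helper name `claim_url` is hypothetical, and `client` is assumed to be a `redis.Redis` instance like `self.redis_client` above.

```python
import hashlib

FINGERPRINT_TTL = 604800  # 7 days, matching the spider above


def claim_url(client, url, ttl=FINGERPRINT_TTL):
    """Return True if `url` was not yet recorded, recording it atomically.

    `client` is assumed to be a redis.Redis instance (or any object exposing
    a compatible `set` method). This helper is illustrative, not part of
    Scrapy or redis-py.
    """
    fingerprint = hashlib.md5(url.encode("utf-8")).hexdigest()
    key = f"crawled_urls:{fingerprint}"
    # With nx=True, SET only writes when the key is absent; redis-py returns
    # True on success and None when the key already exists.
    return bool(client.set(key, "1", nx=True, ex=ttl))
```

Inside `parse`, the condition would then read `if claim_url(self.redis_client, full_detail_url):`, with the separate `set` call removed.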

2.3 Run and test the crawler

In a terminal, go to the Scrapy project root directory and run:

scrapy crawl incremental_demo -o quotes.json

  • First execution: the whole site is crawled and a quotes.json file containing the author information is generated
  • Second execution: only the home page and pagination requests appear in the terminal, with no detail page requests, which shows that incremental crawling is working
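
To confirm that the second run skipped pages because of the stored fingerprints, and to reset the state before repeating the first-run test, a small helper along these lines may be convenient. It is a sketch that assumes the `crawled_urls:` key prefix from the spider and a `redis.Redis` client; the function names are made up for illustration.

```python
def count_fingerprints(client, prefix="crawled_urls:"):
    """Count stored fingerprint keys.

    Uses SCAN (redis-py: scan_iter) rather than KEYS, since KEYS can stall
    a busy server while it walks the whole keyspace.
    """
    return sum(1 for _ in client.scan_iter(match=f"{prefix}*"))


def clear_fingerprints(client, prefix="crawled_urls:"):
    """Delete all fingerprint keys so the next run behaves like a first run."""
    deleted = 0
    for key in client.scan_iter(match=f"{prefix}*"):
        deleted += client.delete(key)  # delete() returns the number of keys removed
    return deleted


# Intended use (requires the redis package and a running server):
#   import redis
#   client = redis.Redis(host="127.0.0.1", port=6379, db=0, decode_responses=True)
#   print(count_fingerprints(client))
```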

3. Directions for further optimization

3.1 More than just URL fingerprinting

Some URLs stay the same while their content changes (news home pages, e-commerce product pages, and so on), so URL fingerprinting alone is not enough. In that case, consider:

  • Response content hash: hash the whole page or only the key fields (filter out volatile parts such as advertisements and timestamps first)
  • HTTP response headers: use Last-Modified or ETag to determine whether the page has been modified
  • Key-field comparison: store business fields such as price and inventory in MySQL or Elasticsearch, and only process data that has changed
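
The first two ideas can be sketched with the standard library alone. The snippet below is an illustration under assumptions: the field names (`price`, `stock`), the function names, and the choice of SHA-256 are placeholders, not taken from any particular library. `content_fingerprint` hashes only the selected business fields, so ads or timestamps elsewhere on the page do not affect it. `conditional_headers` builds the standard `If-None-Match` and `If-Modified-Since` request headers from validators saved on a previous visit.

```python
import hashlib
import json


def content_fingerprint(fields):
    """Hash a dict of key fields in a stable, order-independent way."""
    canonical = json.dumps(fields, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(fields, previous_fingerprint):
    """True when the fields differ from what was stored last time."""
    return content_fingerprint(fields) != previous_fingerprint


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that let the server skip unchanged pages."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


old = content_fingerprint({"price": "19.90", "stock": 12})
same = has_changed({"stock": 12, "price": "19.90"}, old)     # only key order differs
changed = has_changed({"price": "17.90", "stock": 12}, old)  # price was updated
headers = conditional_headers(etag='"abc123"')
```

A server that honours these headers answers `304 Not Modified` with no body when nothing changed. In Scrapy, a 304 response is normally filtered out before it reaches the callback unless the spider lists it in `handle_httpstatus_list`, so that attribute would likely need to include 304 for this approach.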

3.2 Redis recommendations for production

  • Allocate a separate database (e.g. db=1) for crawler fingerprints so they do not mix with other data
  • Require password authentication, and use SSL/TLS where necessary
  • For clustered deployments, use Redis Cluster instead of a single instance (Redis Cluster only supports database 0, so separate fingerprints with a key prefix there)
  • When memory is tight, set maxmemory-policy allkeys-lru so that cold data is evicted automatically (an evicted fingerprint simply means that URL gets crawled again)
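
As a sketch of the first two recommendations, connection settings can be read from the environment instead of being hard-coded in the spider. The environment variable names below (`REDIS_HOST`, `REDIS_PASSWORD`, and so on) and the helper name are assumptions for illustration; `host`, `port`, `db`, `password`, `ssl`, and `decode_responses` are keyword arguments accepted by `redis.Redis`.

```python
import os


def redis_kwargs_from_env(env=None):
    """Collect keyword arguments for redis.Redis from environment variables.

    The variable names are hypothetical; adapt them to your deployment.
    """
    env = os.environ if env is None else env
    kwargs = {
        "host": env.get("REDIS_HOST", "127.0.0.1"),
        "port": int(env.get("REDIS_PORT", "6379")),
        "db": int(env.get("REDIS_DB", "1")),  # dedicated database for fingerprints
        "decode_responses": True,
    }
    password = env.get("REDIS_PASSWORD")
    if password:
        kwargs["password"] = password
    if env.get("REDIS_SSL", "").lower() in ("1", "true", "yes"):
        kwargs["ssl"] = True
    return kwargs


# Intended use (requires the redis package and a reachable server):
#   import redis
#   client = redis.Redis(**redis_kwargs_from_env())
```

Keeping the password out of the source tree also means the same spider code can run unchanged against the local demo instance and the production one.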

4. Summary

The core logic of incremental crawling fits in one sentence: first check whether a page has already been crawled, then decide whether to crawl it.

The URL fingerprint crawler demonstrated in this article is basic but widely applicable. It is especially suitable for scenarios where a stable list page keeps linking to new detail pages:

  • Monitor news sites for new articles
  • Capture new forum posts
  • Synchronize new products on e-commerce platforms

💡 Remember: in production crawler projects, incremental crawling is among the most cost-effective optimizations you can make.

