Large-scale crawler optimization - detailed explanation of memory management, network optimization and performance tuning


When your crawler grows from a small single-machine job into large-scale collection across the whole web, you will almost certainly hit three obstacles: memory blowups, banned requests, and data backlogs that bring the pipeline down. Today Daoman walks you through the optimization logic behind a genuinely enterprise-grade large-scale crawler. It does not rely on blindly piling on hardware, but on careful resource management, intelligent adaptive strategies, and all-round observability. The article is condensed, and every code sample is meant to be usable in practice👇


1. Set goals first, then take action - four core optimization goals

Many people raise the concurrency and change the delay as soon as they start, and end up either blocked or out of memory. Before optimizing, let's first clarify what we want to achieve:

  • Stability: runs unattended 24/7 and recovers automatically from errors without interrupting the task
  • Efficiency: crawls the most data per unit of time with the fewest resources
  • Scalability: adding machines or nodes later requires little or no change to the code
  • Observability: you can see at a glance where it is slow, where it hangs, and where it is being throttled

All subsequent optimization measures are centered around these four goals.


2. Memory management - where things most often go wrong

Several scenarios where crawlers are most likely to run out of memory:

  • The URL queue holds tens of millions of links waiting to be crawled
  • Every response page is cached in memory for parsing
  • Item data containing extremely long text (such as product descriptions of several thousand words) is never truncated

To address these, Daoman offers fixes that pay off immediately👇

2.1 Scrapy native configuration: the three go-to moves

First, don't pick the memory limit off the top of your head. Calculate it dynamically from the memory currently available on the machine:

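# scrapy_settings.py
import psutil

# Use 70% of the memory currently available on this machine, capped at 8 GB,
# so the crawler never takes over the whole machine
available_mem = psutil.virtual_memory().available // 1024 // 1024   # in MB
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = min(int(available_mem * 0.7), 8192)   # 70% of available memory, capped at 8 GB
MEMUSAGE_WARNING_MB = int(MEMUSAGE_LIMIT_MB * 0.75)       # warn at 75% of the limit
# Note: the memory usage extension already closes the spider once
# MEMUSAGE_LIMIT_MB is exceeded. CLOSESPIDER_MEMUSAGE is not among Scrapy's
# documented built-in settings; it only has an effect if your own code reads it.
CLOSESPIDER_MEMUSAGE = MEMUSAGE_LIMIT_MB

# Limit the size of a single response so large files cannot exhaust memory
DOWNLOAD_MAXSIZE = 50 * 1024 * 1024   # 50 MB (adjust as needed)
DOWNLOAD_WARNSIZE = 10 * 1024 * 1024  # log a warning above 10 MB
# Intended to unify the encoding. RESPONSE_ENCODING is likewise not a documented
# built-in Scrapy setting, so it is inert unless your own code reads it.
RESPONSE_ENCODING = 'utf-8'

# Spill the huge URL queue to disk to relieve memory pressure.
# Disk queues are only used when JOBDIR is set, for example:
#   scrapy crawl myspider -s JOBDIR=crawls/myspider-1
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'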
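
The limit calculation above can be pulled into a small pure function, which makes it easy to check without depending on the real machine. This is only a sketch: the name `compute_memusage_limits` and its parameters are illustrative and not part of Scrapy, and it uses integer percentages instead of the float multiplication in the settings file.

```python
def compute_memusage_limits(available_mb, percent=70, cap_mb=8192, warn_percent=75):
    """Return (limit_mb, warning_mb) for a given amount of available memory in MB."""
    # Take a percentage of the available memory, but never exceed the hard cap
    limit_mb = min(available_mb * percent // 100, cap_mb)
    # Warn once a given percentage of the limit has been used
    warning_mb = limit_mb * warn_percent // 100
    return limit_mb, warning_mb


if __name__ == "__main__":
    for available in (4000, 16000):
        print(available, compute_memusage_limits(available))
```

On a machine with little free memory the percentage wins; on a large machine the 8 GB cap takes over.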

2.2 “Use and discard” in middleware

Responses and Items stay in memory for as long as something still holds a reference to them. A spider middleware lets us proactively trim or release data that is no longer needed:

# memory_optimization_middleware.py
import gc
import weakref
from scrapy import signals
from itemadapter import ItemAdapter

class MemoryOptimizationMiddleware:
    def __init__(self):
        # Track Items through weak references: once no strong reference remains,
        # the garbage collector can reclaim them automatically
        self.item_weakref = weakref.WeakSet()
        # Keep at most 1000 entries in the temporary response cache.
        # Note: this snippet never adds entries itself; fill the cache where you need it.
        self.response_tmp = {}
        self.tmp_limit = 1000

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_spider_input(self, response, spider):
        # Keep the cache bounded: once the limit is reached, drop the oldest entry
        # (dicts preserve insertion order, so the first key is the oldest)
        if len(self.response_tmp) >= self.tmp_limit:
            del self.response_tmp[next(iter(self.response_tmp))]
        return None

    def process_spider_output(self, response, result, spider):
        for item_or_req in result:
            # Only scrapy.Item objects expose `fields`; Requests pass through untouched
            if hasattr(item_or_req, 'fields'):
                adapter = ItemAdapter(item_or_req)
                # Truncate overlong text, e.g. cap an article body at 10000 characters
                for k, v in list(adapter.items()):
                    if isinstance(v, str) and len(v) > 10000:
                        adapter[k] = v[:10000] + "..."
                self.item_weakref.add(item_or_req)
            yield item_or_req

    def spider_closed(self, spider):
        # Force one garbage-collection pass when the spider finishes
        gc.collect()
        spider.logger.info("Memory optimization finished, GC executed")

After configuring things this way, the memory curve should become much smoother, without the pattern of creeping up slowly at first and then spiking sharply.
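
The "drop the oldest entry once the cache is full" idea from the middleware can be isolated into a tiny helper. A minimal sketch, assuming a hypothetical class name `BoundedCache` (not part of Scrapy) built on the standard-library `OrderedDict`:

```python
from collections import OrderedDict


class BoundedCache:
    """Keep at most `limit` entries, evicting the oldest one first (hypothetical helper)."""

    def __init__(self, limit=1000):
        self.limit = limit
        self._data = OrderedDict()

    def put(self, key, value):
        # Re-inserting an existing key moves it to the newest position
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        # Evict from the front (oldest) until we are back within the limit
        while len(self._data) > self.limit:
            self._data.popitem(last=False)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def __len__(self):
        return len(self._data)


if __name__ == "__main__":
    cache = BoundedCache(limit=2)
    for url in ("page-a", "page-b", "page-c"):
        cache.put(url, 200)
    print(len(cache), cache.get("page-a"), cache.get("page-c"))
```

Storing only lightweight values (a status code, a checksum) rather than whole response objects keeps such a cache from becoming a memory problem of its own.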


3. Network optimization and adaptive speed limit - crawl fast without being blocked

Higher concurrency is not always better. If you crank it up without thinking, your IP may be temporarily banned, or the whole network segment may be blocked. What we want is to sense how much load the target site can take, so that the request rate always stays close to the limit the site can tolerate.

3.1 Basic network configuration

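# scrapy_settings.py
# Total concurrency: 32 is a fairly safe starting value
CONCURRENT_REQUESTS = 32
# Per-domain concurrency: 8 usually does not trigger anti-crawling measures;
# measure the real threshold against the target site
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Per-IP concurrency (can be raised moderately when a proxy pool is used).
# Note: when this is non-zero, Scrapy applies the limit per IP and ignores
# CONCURRENT_REQUESTS_PER_DOMAIN; set it to 0 to limit per domain instead.
CONCURRENT_REQUESTS_PER_IP = 4

# DNS cache, so the domain is not resolved again for every request
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
DNS_TIMEOUT = 30

# Retry only recoverable status codes (temporary server errors, rate limiting)
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
RETRY_PRIORITY_ADJUST = -1    # retried requests get lower priority, so normal requests are not held up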
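
To get a feel for what these concurrency numbers mean, a back-of-the-envelope estimate helps: with `c` requests in flight to one domain and an average latency of `t` seconds, throughput cannot exceed roughly `c / t` requests per second, and a download delay of `d` seconds caps it further at about `1 / d`. The helper below is an illustrative sketch (the name `estimate_max_rps` is an assumption, not a Scrapy API) and ignores retries, randomized delays, and bandwidth limits.

```python
def estimate_max_rps(concurrency, avg_latency_s, download_delay_s=0.0):
    """Rough upper bound on requests per second to a single domain."""
    if concurrency <= 0 or avg_latency_s <= 0:
        raise ValueError("concurrency and avg_latency_s must be positive")
    # Concurrency bound: each in-flight slot completes one request per latency period
    bound = concurrency / avg_latency_s
    if download_delay_s > 0:
        # Delay bound: consecutive requests to the domain are spaced by the delay
        bound = min(bound, 1.0 / download_delay_s)
    return bound


if __name__ == "__main__":
    print(estimate_max_rps(8, 0.5))        # 8 concurrent requests, 500 ms latency
    print(estimate_max_rps(8, 0.5, 0.25))  # same, with a 250 ms download delay
```

If the estimate is far above what the target site can plausibly serve to one client, that is a hint to lower the per-domain concurrency before the site lowers it for you.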

3.2 “Smarter” adaptive speed limit

Scrapy's built-in AutoThrottle adjusts the delay primarily from response latency, but low latency sometimes just means the site is returning empty content or an error page. Adding two more dimensions, success rate and response-time trend, makes the throttle smarter:

# adaptive_throttle.py
import statistics
from collections import defaultdict, deque
from urllib.parse import urlparse

# AutoThrottle is a Scrapy extension, not a downloader middleware
from scrapy.extensions.throttle import AutoThrottle

class SmartAutoThrottle(AutoThrottle):
    def __init__(self, crawler):
        super().__init__(crawler)
        # For each domain, keep the last 50 requests (latency + success flag)
        self.stats = defaultdict(lambda: deque(maxlen=50))

    def _adjust_delay(self, slot, latency, response):
        # This overrides a private AutoThrottle hook, so check it against your Scrapy version.
        # The domain is derived from the response URL rather than from the slot object.
        domain = urlparse(response.url).hostname or ''
        self.stats[domain].append({
            'latency': latency,
            'success': 200 <= response.status < 400,
        })

        # Start adjusting only after at least 10 samples have been collected
        if len(self.stats[domain]) < 10:
            return

        recent = list(self.stats[domain])[-10:]
        # 1. Adjust by the success rate of the 10 most recent requests
        success_rate = sum(1 for r in recent if r['success']) / 10
        if success_rate < 0.8:       # below 80%: slow down
            slot.delay = min(slot.delay * 1.3, 60)
        elif success_rate > 0.95:    # above 95%: try speeding up
            slot.delay = max(slot.delay * 0.9, 0.5)

        # 2. Adjust by the trend of the average response time
        avg_latency = sum(r['latency'] for r in recent) / 10
        overall_latency = statistics.mean(r['latency'] for r in self.stats[domain])
        if avg_latency > overall_latency * 1.2:   # responses are getting slower: slow down
            slot.delay = min(slot.delay * 1.2, 60)
        elif avg_latency < overall_latency * 0.8: # responses are getting faster: speed up
            slot.delay = max(slot.delay * 0.9, 0.5)

# In settings.py, replace the built-in extension with our SmartAutoThrottle.
# AutoThrottle is registered under EXTENSIONS and only activates when
# AUTOTHROTTLE_ENABLED is True.
AUTOTHROTTLE_ENABLED = True
EXTENSIONS = {
    'scrapy.extensions.throttle.AutoThrottle': None,
    'your_project.adaptive_throttle.SmartAutoThrottle': 0,
}

In this way, the crawler behaves like an experienced driver, adjusting the throttle to road conditions: it neither runs red lights (gets blocked) nor crawls along too slowly (wastes throughput).
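
The adjustment rules can also be tried outside Scrapy. The sketch below re-implements them as a plain function, so you can feed in synthetic samples and watch the delay move. The function name `next_delay` and the `(latency, success)` sample format are illustrative assumptions; the thresholds mirror the ones used in `SmartAutoThrottle` above.

```python
def next_delay(delay, samples, min_delay=0.5, max_delay=60.0):
    """Apply the success-rate and latency-trend rules to a list of (latency, success) samples."""
    if len(samples) < 10:
        return delay  # not enough data yet, leave the delay unchanged

    recent = samples[-10:]

    # Rule 1: success rate over the 10 most recent samples
    success_rate = sum(1 for _, ok in recent if ok) / len(recent)
    if success_rate < 0.8:
        delay = min(delay * 1.3, max_delay)
    elif success_rate > 0.95:
        delay = max(delay * 0.9, min_delay)

    # Rule 2: recent average latency versus the average over all samples
    recent_avg = sum(lat for lat, _ in recent) / len(recent)
    overall_avg = sum(lat for lat, _ in samples) / len(samples)
    if recent_avg > overall_avg * 1.2:
        delay = min(delay * 1.2, max_delay)
    elif recent_avg < overall_avg * 0.8:
        delay = max(delay * 0.9, min_delay)

    return delay


if __name__ == "__main__":
    healthy = [(0.2, True)] * 20
    failing = [(0.2, True)] * 10 + [(0.2, False)] * 10
    print(next_delay(2.0, healthy))   # all recent requests succeeded
    print(next_delay(2.0, failing))   # all recent requests failed
```

Note that the two rules can stack: a site that is both failing and slowing down increases the delay twice in one step, which is a deliberate bias toward backing off quickly.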


4. Observability and fault tolerance - the last line of defense for crawlers

Has your crawler ever crashed after tens of thousands of records, leaving you with nothing but a vague error log? This is where health checks and alerting come in. With a few simple Prometheus metrics plus log alerts, we can monitor the crawler's health in real time.

4.1 Health check based on Prometheus

# health_check.py
import time
import logging
from collections import deque

import psutil
from scrapy import signals
from prometheus_client import Gauge, start_http_server

logger = logging.getLogger(__name__)

# Three gauges, exposed on local port 8000
MEM_GAUGE = Gauge('scrapy_mem_rss_mb', 'Memory RSS usage', ['spider'])
CPU_GAUGE = Gauge('scrapy_cpu_percent', 'CPU usage', ['spider'])
ERR_GAUGE = Gauge('scrapy_error_rate', 'Error rate (last 100 requests)', ['spider'])

class HealthCheckExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.spider_name = None
        self.last_errs = deque(maxlen=100)   # success (True) / failure (False) of the last 100 requests
        self.check_interval = 60             # aggregate every 60 seconds
        self.last_check = time.time()
        self.process = psutil.Process()      # reused so cpu_percent() can measure between calls

        # Start the Prometheus HTTP server; metrics are served at http://localhost:8000
        start_http_server(8000)
        # Connect through the crawler's own signal manager instead of the global dispatcher
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.response_received, signal=signals.response_received)
        crawler.signals.connect(self.request_dropped, signal=signals.request_dropped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        self.spider_name = spider.name

    def response_received(self, response, request, spider):
        # Record whether the request succeeded (2xx and 3xx count as success)
        self.last_errs.append(200 <= response.status < 400)
        self._check_and_alert()

    def request_dropped(self, request, spider):
        # The request_dropped signal carries no response; a dropped request counts as a failure
        self.last_errs.append(False)
        self._check_and_alert()

    def _check_and_alert(self):
        current = time.time()
        if current - self.last_check < self.check_interval:
            return
        self.last_check = current

        # Update the memory and CPU gauges; interval=None avoids blocking the event loop
        MEM_GAUGE.labels(spider=self.spider_name).set(self.process.memory_info().rss // 1024 // 1024)
        CPU_GAUGE.labels(spider=self.spider_name).set(self.process.cpu_percent(interval=None))

        # Compute the error rate over the window once at least 50 samples exist
        if len(self.last_errs) >= 50:
            err_rate = 1 - sum(self.last_errs) / len(self.last_errs)
            ERR_GAUGE.labels(spider=self.spider_name).set(err_rate)

            if err_rate > 0.1:
                # In a real project, send the alert by email or a DingTalk / WeCom bot instead
                logger.error(f"🚨 Alert: error rate too high! Current error rate: {err_rate:.2%}")

Now you not only get good-looking dashboards in Grafana; once the log alert is wired to a notification channel, you are also told when errors spike, so you no longer have to get up in the middle of the night to restart the crawler by hand.
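
The error-rate bookkeeping in the extension is easier to unit-test when it is separated from Scrapy and Prometheus. A minimal sketch, assuming a hypothetical helper class `ErrorRateWindow` with the same window size (100), minimum sample count (50), and alert threshold (10%) as the code above:

```python
from collections import deque


class ErrorRateWindow:
    """Sliding window over the outcome of the most recent requests (hypothetical helper)."""

    def __init__(self, size=100, min_samples=50, threshold=0.1):
        self.outcomes = deque(maxlen=size)  # True = success, False = failure
        self.min_samples = min_samples
        self.threshold = threshold

    def record(self, success):
        self.outcomes.append(bool(success))

    def error_rate(self):
        # Return None until enough samples exist for the rate to be meaningful
        if len(self.outcomes) < self.min_samples:
            return None
        return 1 - sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        rate = self.error_rate()
        return rate is not None and rate > self.threshold


if __name__ == "__main__":
    window = ErrorRateWindow()
    for _ in range(40):
        window.record(True)
    for _ in range(10):
        window.record(False)
    print(window.error_rate(), window.should_alert())
```

Requiring a minimum number of samples keeps a single failed request at startup from triggering an alert.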


5. Daoman’s “best practices” for production environments

Core configuration list

Combining all the optimizations above gives a baseline configuration suitable for production use. Adjust the parameters as needed, but remember these three iron rules:

  1. Start small and ramp up gradually. When a new crawler first goes live, set the total concurrency to 4 and the per-domain concurrency to 2, run it for 2 hours to confirm it is not being blocked, then increase step by step. Better slow than blacklisted.

  2. Trust the data, not "conventions". Do not rely entirely on the delay specified in robots.txt; some sites in practice enforce stricter limits. Use SmartAutoThrottle to let the crawler "feel" how much the target can tolerate.

  3. Support resuming from breakpoints, and persist deduplication state. URL queues and deduplication sets must live in persistent storage (such as Redis or disk queues), so that even if the machine suddenly loses power, the crawl can continue from where it left off after a restart.
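
Rule 1 can be turned into an explicit ramp-up plan. The sketch below multiplies the concurrency at each stage, starting from the 4 / 2 values mentioned above and stopping at the values used in section 3.1 (32 total, 8 per domain). The function name `ramp_up_schedule` and the doubling factor are illustrative assumptions; each stage is meant to run for a while (for example the 2 hours suggested above) before moving on.

```python
def ramp_up_schedule(start_total=4, start_per_domain=2,
                     target_total=32, target_per_domain=8, factor=2):
    """Return a list of (CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN) stages."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    total, per_domain = start_total, start_per_domain
    stages = [(total, per_domain)]
    # Grow both values until each has reached its target, never overshooting
    while total < target_total or per_domain < target_per_domain:
        total = min(total * factor, target_total)
        per_domain = min(per_domain * factor, target_per_domain)
        stages.append((total, per_domain))
    return stages


if __name__ == "__main__":
    for total, per_domain in ramp_up_schedule():
        print(f"CONCURRENT_REQUESTS={total} CONCURRENT_REQUESTS_PER_DOMAIN={per_domain}")
```

Each stage can be applied without editing the settings file by passing overrides on the command line, for example `scrapy crawl myspider -s CONCURRENT_REQUESTS=8 -s CONCURRENT_REQUESTS_PER_DOMAIN=4`, where `myspider` is a placeholder spider name.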

