Complete Guide to Scrapy AutoThrottle


What is AutoThrottle?

AutoThrottle is a built-in Scrapy extension designed to replace a hand-picked static DOWNLOAD_DELAY. It adjusts the wait between requests to each site according to how quickly that site is actually responding, which in turn controls how many requests are in flight at once, so your crawler behaves like a "polite human":

  • It does not sit idle on a rigid timer when the site could handle more.
  • It does not flood the server with requests, overloading it or triggering anti-scraping defenses.

You can think of it as adaptive cruise control: it speeds up when the road is clear, slows down automatically in a traffic jam, and always keeps a safe distance.

Why not just use static delay?

If every request uses the same fixed delay, three awkward situations arise:

  1. Too slow off-peak: the site is lightly loaded and responds quickly, yet your crawler still idles for 5 seconds between requests, which badly hurts throughput.
  2. Too pushy at peak: the site is already struggling, but a fixed delay does not make the crawler back off. It keeps the same request rhythm, which can easily trigger rate limiting or a ban.
  3. New sites are hard to predict: every time you switch targets you have to guess a "suitable" delay. Guess too low and you get blocked; guess too high and you waste time.

AutoThrottle addresses exactly these problems by letting the crawler pick its speed according to the situation.


Working Principle (Simplified Version)

The core of AutoThrottle is not complicated. Its logic comes down to three steps:

  1. Starting stage: the first requests are sent using AUTOTHROTTLE_START_DELAY (the initial delay) as the wait between them.

  2. Continuous measurement: during the crawl, AutoThrottle measures the latency of every response (the time between establishing the connection and receiving the HTTP headers), tracked per download slot, which by default means per domain. In effect it keeps learning "how fast this website is responding right now."

  3. Dynamic speed adjustment: after each response the delay is recalculated as follows:
       target delay = latency ÷ AUTOTHROTTLE_TARGET_CONCURRENCY
       new delay = average of the previous delay and the target delay (or the target delay itself if that is larger)
     The target concurrency defaults to 1.0, which for a single site roughly means "wait for the previous response before sending the next request." The result is clamped between DOWNLOAD_DELAY (the lower bound) and AUTOTHROTTLE_MAX_DELAY (the upper bound). A non-200 response is never allowed to lower the delay, because error pages are usually small and fast.

AutoThrottle also respects the global and per-domain concurrency limits (CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN), so it never lets the crawler run wild.

Tip: the larger the target concurrency, the more requests a single site receives at the same time, which is faster but riskier; the smaller the value, the more conservative the crawl.
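
The adjustment rule above can be sketched as a small standalone function. This is a simplified model for building intuition, not Scrapy's own code: the function name `next_delay` and its parameter names are made up for illustration, and `min_delay` / `max_delay` stand in for DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the delay to use after one response (simplified model)."""
    # To keep N requests in flight against a server that needs `latency`
    # seconds per response, send one request every latency / N seconds.
    target = latency / target_concurrency
    # Move halfway toward the target, but never drop below the target itself.
    new_delay = max(target, (current_delay + target) / 2.0)
    # Clamp the result to the [min_delay, max_delay] range.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error pages are usually small and fast, so they may not lower the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


if __name__ == "__main__":
    delay = 5.0  # hypothetical starting delay
    for latency in [0.5, 0.5, 0.5, 3.0, 0.5]:
        delay = next_delay(delay, latency)
        print(f"latency={latency:.1f}s -> delay={delay:.3f}s")
```

With a start delay of 5.0 and a site answering in 0.5 seconds, the first step already halves the gap (5.0 becomes 2.75), which is why the delay moves a lot during the first responses and then settles.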


Core configuration parameters

Getting started with AutoThrottle is easy: turn on the switch in settings.py and fine-tune a few key parameters.

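# 1. Main switch (strongly recommended both while debugging and in production)
AUTOTHROTTLE_ENABLED = True

# 2. Delay range (the two most important parameters)
AUTOTHROTTLE_START_DELAY = 3   # Initial delay in seconds (Scrapy default: 5). Ordinary sites: 1-2; heavily protected sites: 5-10
AUTOTHROTTLE_MAX_DELAY = 60    # Maximum delay in seconds; prevents unbounded waits when something goes wrong

# 3. Target concurrency (decides whether you are "aggressive" or "steady")
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# For example, 2.0 aims for about 2 requests in flight per site on average: roughly double the throughput, but a higher risk of being banned

# 4. Debug switch (turn it on during development to watch the throttling in real time)
AUTOTHROTTLE_DEBUG = True
# When enabled, the log shows the current delay, latency and other stats for every response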

Pitfalls to avoid

  • Do not leave a stray DOWNLOAD_DELAY in place
    With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound: the delay is never adjusted below it. A large leftover value therefore caps your speed no matter how fast the site responds. Leave it at the default (0) unless you deliberately want a minimum interval.

  • Keep the target concurrency at or below 1/3 of the global concurrency: for example, with a global CONCURRENT_REQUESTS = 16, keep the target concurrency at 5 or less so that a single domain cannot eat up all connection resources. CONCURRENT_REQUESTS_PER_DOMAIN remains the hard per-domain cap, so a target above it has no effect.

  • Do not set the maximum delay too high: if the delay climbs past 120 seconds, the server has most likely blacklisted you. At that point it is better to pause the task and change IP than to keep waiting.
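
These rules of thumb are easy to check mechanically before a crawl starts. The following is a minimal sketch, assuming the settings are available as a plain dict; `check_throttle_settings` is a hypothetical helper, not part of Scrapy, and the fallback values are Scrapy's documented defaults.

```python
def check_throttle_settings(settings):
    """Return a list of warnings for a dict of Scrapy-style settings."""
    warnings = []
    target = settings.get("AUTOTHROTTLE_TARGET_CONCURRENCY", 1.0)
    total = settings.get("CONCURRENT_REQUESTS", 16)
    max_delay = settings.get("AUTOTHROTTLE_MAX_DELAY", 60.0)
    download_delay = settings.get("DOWNLOAD_DELAY", 0)

    if download_delay > 0:
        warnings.append(
            f"DOWNLOAD_DELAY={download_delay} is set; make sure that is "
            "intentional, because it also affects the request interval"
        )
    if target > total / 3:
        warnings.append(
            f"AUTOTHROTTLE_TARGET_CONCURRENCY={target} exceeds 1/3 of "
            f"CONCURRENT_REQUESTS={total}"
        )
    if max_delay > 120:
        warnings.append(
            f"AUTOTHROTTLE_MAX_DELAY={max_delay} is above 120 seconds; "
            "pausing and changing IP is usually better than waiting that long"
        )
    return warnings


if __name__ == "__main__":
    for message in check_throttle_settings(
        {"CONCURRENT_REQUESTS": 16, "AUTOTHROTTLE_TARGET_CONCURRENCY": 8}
    ):
        print("WARNING:", message)
```

The thresholds (1/3 and 120 seconds) are the heuristics from the list above, not limits enforced by Scrapy.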


Scenario-based advanced configuration

Different websites have very different levels of tolerance. The templates below can be applied according to the target's characteristics.

1. Conservative mode (suitable for heavily protected websites: Weibo, Zhihu, Xiaohongshu, etc.)

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 8
AUTOTHROTTLE_MAX_DELAY = 120
AUTOTHROTTLE_TARGET_CONCURRENCY = 0.5   # Very conservative: send a request, wait about 2x the response time, then send the next
CONCURRENT_REQUESTS = 2                 # At most 2 requests in flight globally
CONCURRENT_REQUESTS_PER_DOMAIN = 1      # Strictly serial per domain

2. Balanced mode (suitable for ordinary websites: news sites, blog sites, small and medium-sized e-commerce)

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # Aim for about 2 requests in flight per site
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4

3. Aggressive mode (suitable for public APIs, data interfaces, and when you are confident in your own IP pool)

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 15
AUTOTHROTTLE_TARGET_CONCURRENCY = 5.0   # Aim for about 5 requests in flight per site
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8

Whichever mode you choose, start with conservative parameters and relax them gradually once the crawl proves stable.
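
One way to keep the three templates in a single place is a small profile table. This is only a sketch: `THROTTLE_PROFILES` and `throttle_settings` are hypothetical names, and the values are copied from the templates above.

```python
THROTTLE_PROFILES = {
    "conservative": {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 8,
        "AUTOTHROTTLE_MAX_DELAY": 120,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 0.5,
        "CONCURRENT_REQUESTS": 2,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    },
    "balanced": {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 2,
        "AUTOTHROTTLE_MAX_DELAY": 60,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
        "CONCURRENT_REQUESTS": 16,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    },
    "aggressive": {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 0.5,
        "AUTOTHROTTLE_MAX_DELAY": 15,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 5.0,
        "CONCURRENT_REQUESTS": 32,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    },
}


def throttle_settings(profile="conservative", **overrides):
    """Return a copy of the named profile with optional per-crawl overrides."""
    settings = dict(THROTTLE_PROFILES[profile])  # copy, so the table stays intact
    settings.update(overrides)
    return settings


if __name__ == "__main__":
    print(throttle_settings("balanced", AUTOTHROTTLE_DEBUG=True))
```

The returned dict can be assigned to a spider's `custom_settings` class attribute, so each spider picks its own profile while settings.py keeps the conservative defaults.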


The simplest custom speed limit

If the built-in AutoThrottle does not meet your needs, for example you want to treat ordinary pages and AJAX endpoints differently, or add some random jitter to each request to avoid an overly regular access pattern, you do not need to write a throttle from scratch. Subclass the built-in AutoThrottle extension and override its delay policy method, _adjust_delay. This is a private method, so check it against the Scrapy version you actually run.

The example below shows two common techniques: lowering the delay for API requests and adding ±20% random jitter to every adjusted delay.

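import random
from urllib.parse import urlparse

from scrapy.extensions.throttle import AutoThrottle


class CustomAutoThrottle(AutoThrottle):
    def _adjust_delay(self, slot, latency, response):
        """
        Add custom logic on top of the built-in delay calculation:
        1. If the response came from a host containing 'api' or 'ajax', lower the delay
        2. Apply ±20% random jitter to the final delay so requests do not settle
           into a fixed frequency that anti-scraping systems can recognize
        """
        previous_delay = slot.delay

        # Let the built-in policy run first; it updates slot.delay in place and returns nothing
        super()._adjust_delay(slot, latency, response)
        if slot.delay == previous_delay:
            # The built-in policy kept the old delay (for example after an error response)
            return

        delay = slot.delay
        host = urlparse(response.url).hostname or ''

        # 1. Lower the delay for special hosts
        if 'api' in host or 'ajax' in host:
            delay *= 0.6   # API requests keep only 60% of the delay

        # 2. Random jitter (one of the most effective ways to break a regular pattern)
        delay *= random.uniform(0.8, 1.2)

        # 3. Re-apply the safety bounds (DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY)
        slot.delay = max(self.mindelay, min(delay, self.maxdelay))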

Enable the custom extension

In settings.py, disable the built-in version and register your own under EXTENSIONS (AutoThrottle is an extension, not a downloader middleware), keeping the same order value:

EXTENSIONS = {
    # Disable the built-in AutoThrottle
    'scrapy.extensions.throttle.AutoThrottle': None,
    # Enable your customized version (the built-in one is registered with order 0)
    'myproject.middlewares.CustomAutoThrottle': 0,
}

Frequently Asked Questions and Best Practices

FAQ

1. The crawler is still very slow. What should I do?

Check these points one by one:

  • Is a DOWNLOAD_DELAY setting still in place? AutoThrottle never goes below it, so remove it or lower it.
  • Is AUTOTHROTTLE_TARGET_CONCURRENCY set too conservatively? Try raising it to somewhere in the 2-5 range first.
  • Is the global CONCURRENT_REQUESTS the bottleneck? For example, you may still be on the default of 16. If the machine can handle it and the website can bear the load, raise it to 32 or even 64.

2. AutoThrottle is on, but the IP still gets blocked?

  • Do not expect AutoThrottle to solve every anti-scraping problem on its own. Combine it with a proxy IP pool, random User-Agent rotation, a cookie pool and similar measures.
  • Check whether you have crossed a red line: watch the delay in the debug log. If it stays pinned near AUTOTHROTTLE_MAX_DELAY for a long time, the server is clearly unhappy. Pause the task for 10 to 30 minutes, change IP, and then resume.
  • If the website is extremely sensitive to request frequency, switch decisively to conservative mode: lower the target concurrency to 0.5 and set CONCURRENT_REQUESTS_PER_DOMAIN = 1 so that each domain is crawled strictly serially.
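
The "delay pinned near the maximum" check can be automated. Below is a minimal sketch, assuming you already collect the recent delay values for a domain somewhere (for instance by parsing the debug log); `delay_pinned`, the 20-sample window and the 90% ratio are all illustrative choices, not Scrapy features.

```python
def delay_pinned(delays, max_delay=60.0, window=20, ratio=0.9):
    """Return True if the last `window` delays all sit at or above ratio * max_delay."""
    if len(delays) < window:
        # Not enough samples yet to call it a sustained condition.
        return False
    return all(d >= ratio * max_delay for d in delays[-window:])


if __name__ == "__main__":
    recent = [1.2, 1.5, 3.0] + [58.0] * 20  # hypothetical delay history in seconds
    if delay_pinned(recent):
        print("Delay is stuck near the maximum: pause, change IP, then resume.")
```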

3. The AutoThrottle delay keeps jumping around. Is this normal?

Totally normal! The delay is recalculated after every response from that response's latency, so it swings most at the start, while it is still moving from the start delay toward the site's real response time. After twenty or thirty pages it settles into a relatively stable range.

Best Practices

  1. Use conservative mode and turn on debug logging early in development: first get a thorough feel for how the target website responds, then consider speeding up.

  2. Always add random jitter: whether you use the built-in or a customized version, making the interval between requests slightly irregular is the cheapest way to avoid fixed-frequency detection. Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting, enabled by default, already waits a random 0.5x to 1.5x of the current delay.

  3. Try to keep concurrent connections per domain at 4 or fewer: unless the target is explicitly offered as a public API, high concurrency easily gets you blocked as a scripted tool.

  4. Save the crawl state regularly: enable Scrapy's JOBDIR so that even if you are suddenly banned and have to pause, you can resume from the checkpoint after changing IP instead of starting over.

  5. Monitor response status codes closely: a run of 429 (Too Many Requests) or 403 (Forbidden) responses means the other side has started to push back, and you should pause and adjust the whole strategy immediately.
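
For the status-code monitoring in point 5, a sliding-window counter is usually enough. The sketch below is framework-independent; `BlockRateMonitor`, the window size and the 20% threshold are assumptions chosen for illustration.

```python
from collections import deque


class BlockRateMonitor:
    """Track the share of 'blocked' status codes among recent responses."""

    def __init__(self, window=50, threshold=0.2, blocked_statuses=(403, 429)):
        self.recent = deque(maxlen=window)  # True for blocked, False otherwise
        self.threshold = threshold
        self.blocked_statuses = set(blocked_statuses)

    def record(self, status):
        self.recent.append(status in self.blocked_statuses)

    def block_rate(self):
        if not self.recent:
            return 0.0
        return sum(self.recent) / len(self.recent)

    def should_pause(self):
        # Require a full window so a single early 403 does not stop the crawl.
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.block_rate() >= self.threshold


if __name__ == "__main__":
    monitor = BlockRateMonitor(window=10, threshold=0.25)
    for status in [200] * 7 + [429] * 3:  # hypothetical response statuses
        monitor.record(status)
    print("pause crawl:", monitor.should_pause())
```

In a Scrapy project, `record(response.status)` could be called from a downloader middleware's `process_response` method, with the spider closed or paused once `should_pause()` returns True.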


💡 Key takeaways: AutoThrottle is one of the most practical and cheapest anti-blocking aids in Scrapy. Do not start by inventing a complex rate-limiting algorithm; use and tune the built-in AutoThrottle first. Combined with proxies, UA rotation and a cookie pool, it is enough to handle roughly 80% of the anti-scraping problems in everyday crawling. For more unusual scenarios, the most efficient route is to add a little custom logic through subclassing and proceed step by step.