Complete Guide to Scrapy AutoThrottle
What is AutoThrottle?
AutoThrottle is a built-in Scrapy extension designed to replace a traditional static delay (DOWNLOAD_DELAY). It dynamically adjusts the waiting time between requests, and through it the effective number of simultaneous requests, based on how quickly the target website is responding, so that your crawler behaves like a "polite human":
- It does not wait rigidly and waste time for nothing.
- It does not fire off requests so fast that it overwhelms the server or triggers anti-scraping measures.
You can think of it as adaptive cruise control that adjusts the vehicle's speed to road conditions: it goes faster when the road is clear, slows down automatically in a traffic jam, and always keeps a safe distance.
Why not just use static delay?
If all requests use a fixed delay, three awkward situations arise:
- Too slow during off-peak periods: The website load is light and the response is fast, but your crawler still idles for 5 seconds between requests, which badly hurts efficiency.
- Adding to the chaos at peak periods: The website is already struggling, but a fixed delay does not make the crawler slow down. It keeps the original request rhythm, which can easily trigger rate limiting or bans.
- New websites are hard to predict: Every time you change the target site, you have to guess a "suitable" delay. Guess too low and you are easily blocked; guess too high and you waste time.
AutoThrottle addresses exactly these problems by letting the crawler choose its speed "depending on the situation".
Working Principle (Simplified Version)
The core of AutoThrottle is not complicated. You can quickly understand its operating logic using the following three steps:
- Starting stage: The first request is sent using the initial delay you set in AUTOTHROTTLE_START_DELAY.
- Continuous learning: During the crawl, AutoThrottle measures the latency of every response and folds it into a running delay for each domain, as if learning "how fast this website usually responds."
- Dynamic speed adjustment: For each response, a target delay is computed as target delay ≈ response latency ÷ target concurrency, and the next delay is set to the average of the previous delay and this target delay. (The target concurrency defaults to 1.0, which for a single site roughly means "wait until the previous request has a response before sending the next one.") Responses with a non-200 status are not allowed to lower the delay. The final delay is always kept between DOWNLOAD_DELAY (the floor) and AUTOTHROTTLE_MAX_DELAY (the ceiling).
In addition, AutoThrottle still respects the global and per-domain concurrency limits (CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN), so it will not let your crawler run wild.
Tip: The larger the target concurrency, the more requests are sent to a single site at the same time, which is more efficient but also riskier; the smaller the value, the more conservative the crawl.
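The adjustment rule above can be sketched as a small standalone function. This is a simplified model for building intuition, not Scrapy's actual code; the name next_delay and its parameters are made up for this illustration, and the "never undershoot the target" step reflects my reading of the extension's behavior.

```python
def next_delay(prev_delay, latency, target_concurrency=1.0,
               min_delay=0.0, max_delay=60.0, status=200):
    """Return the next per-domain delay for one observed response (simplified model)."""
    # Delay that would give `target_concurrency` parallel requests on average.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target delay...
    new_delay = (prev_delay + target_delay) / 2.0
    # ...but never undershoot the target, so a latency spike raises the delay at once.
    new_delay = max(target_delay, new_delay)
    # Clamp between DOWNLOAD_DELAY (floor) and AUTOTHROTTLE_MAX_DELAY (ceiling).
    new_delay = min(max(min_delay, new_delay), max_delay)
    # A non-200 response is not allowed to lower the delay.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay


delay = 5.0  # plays the role of AUTOTHROTTLE_START_DELAY
history = []
for latency in [0.5, 0.5, 0.5, 0.5]:  # a site that answers in half a second
    delay = next_delay(delay, latency)
    history.append(delay)

print(history)  # [2.75, 1.625, 1.0625, 0.78125]
```

With a steady 0.5 s latency the delay halves its distance to the target on every response, which is why the first few requests of a crawl look slow and then speed up.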
Core configuration parameters
Getting started with AutoThrottle is easy: turn on the switch in settings.py and fine-tune a few key parameters.
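A minimal settings.py sketch of those parameters might look like the following. The values shown for the delays and target concurrency are Scrapy's documented defaults as far as I know; treat them as a starting point rather than a recommendation.

```python
# settings.py
AUTOTHROTTLE_ENABLED = True            # master switch; the extension is off by default
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log throttling stats per response (False by default)
```

Leaving AUTOTHROTTLE_DEBUG on while tuning makes it much easier to see how the delay moves; it can be switched off again once the crawl is stable.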
Tips for avoiding pitfalls
- Don't carry over an old DOWNLOAD_DELAY: AutoThrottle takes over the request delay, but it never goes below DOWNLOAD_DELAY. A large leftover value therefore keeps the crawler slow no matter how fast the site responds, and the resulting speed becomes hard to reason about.
- Keep the target concurrency at or below 1/3 of the global concurrency: For example, with a global CONCURRENT_REQUESTS = 16, keep the target concurrency at 5 or less so that a single domain cannot eat up all connection resources.
- Do not set the maximum delay too high: If the delay climbs above 120 seconds, the server has very likely "blacklisted" you. At that point it is better to pause the task and change IP than to keep waiting.
Scenario-based advanced configuration
Different websites have very different levels of tolerance, so pick one of the following modes according to the target's characteristics.
1. Conservative mode (suitable for high-protection websites: Weibo, Zhihu, Xiaohongshu, etc.)
2. Balanced mode (suitable for ordinary websites: news sites, blog sites, small and medium-sized e-commerce)
3. Aggressive mode (suitable for public APIs, data interfaces, and when you are confident in your own IP pool)
No matter which mode you choose, it is recommended to run with conservative parameters first, and then gradually relax after observing stability.
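As a rough illustration, the three modes could be written as settings dictionaries like the ones below. Every number here is an assumption chosen to respect the rules of thumb in this guide (target concurrency at most 1/3 of the global concurrency, maximum delay well under 120 seconds); none of them is a measured or official value.

```python
# Illustrative profiles only; tune each value against the real target site.
CONSERVATIVE = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5.0,
    "AUTOTHROTTLE_MAX_DELAY": 60.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 0.5,  # strictly serial, with a gap between requests
    "CONCURRENT_REQUESTS": 8,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
}

BALANCED = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 2.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
    "CONCURRENT_REQUESTS": 16,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
}

AGGRESSIVE = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 0.5,
    "AUTOTHROTTLE_MAX_DELAY": 10.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 8.0,
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
}
```

One way to apply a profile to a single spider is to assign it to the spider's custom_settings class attribute, for example custom_settings = CONSERVATIVE, which leaves the project-wide settings.py untouched.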
The simplest custom speed limit
Sometimes the built-in AutoThrottle can't meet your needs. For example, you may want to treat requests for ordinary pages and AJAX interfaces differently, or add a little "random jitter" to each request to avoid overly regular access behavior. In that case you don't need to rewrite the whole extension from scratch: just inherit from the built-in AutoThrottle and fine-tune its core method.
Two common techniques are reducing the delay for API requests and adding ±20% random jitter to every request.
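Both techniques can be sketched as a mixin that wraps the built-in delay adjustment. This is a hedged sketch, not the guide's original code: it relies on the private _adjust_delay hook and the mindelay/maxdelay attributes of Scrapy's AutoThrottle as I understand them, which may change between versions, and the names JitterThrottleMixin, api_marker, api_factor and jitter are invented for illustration.

```python
import random


class JitterThrottleMixin:
    """Hypothetical mixin; list it before AutoThrottle in a subclass's bases."""

    api_marker = "/api/"  # assumption: API endpoints contain this path fragment
    api_factor = 0.5      # assumption: API calls may use half the computed delay
    jitter = 0.2          # +/-20% random variation

    def _adjust_delay(self, slot, latency, response):
        # Let the built-in logic compute and store the new delay first.
        super()._adjust_delay(slot, latency, response)
        if response.status != 200:
            # Keep the built-in result for error responses so the delay is never
            # lowered after a sign of trouble.
            return
        factor = random.uniform(1 - self.jitter, 1 + self.jitter)
        if self.api_marker in response.url:
            factor *= self.api_factor
        # Re-apply the floor and ceiling after scaling.
        slot.delay = min(max(self.mindelay, slot.delay * factor), self.maxdelay)


# In a real project (assumes Scrapy is installed):
# from scrapy.extensions.throttle import AutoThrottle
#
# class CustomThrottle(JitterThrottleMixin, AutoThrottle):
#     pass
```

Note that the delay lives on the per-domain download slot, not on individual requests, so the shorter API delay applies to whichever request that slot sends next. If pages and API calls share a domain, the effect is a blend rather than a strict per-type limit.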
Enable the custom extension
In settings.py, disable the built-in version under EXTENSIONS and register your subclass with the same priority.
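A minimal sketch of that swap is shown below. The module path myproject.extensions.CustomThrottle is hypothetical; point it at wherever your subclass actually lives. The order value 0 matches the built-in extension's default order as far as I know.

```python
# settings.py
EXTENSIONS = {
    # Disable the built-in extension so the two do not both adjust the delay.
    "scrapy.extensions.throttle.AutoThrottle": None,
    # Hypothetical path to the custom subclass, registered with the same order.
    "myproject.extensions.CustomThrottle": 0,
}

# The subclass still reads the AUTOTHROTTLE_* settings, so keep the switch on.
AUTOTHROTTLE_ENABLED = True
```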
Frequently Asked Questions and Best Practices
FAQ
1. The crawler speed is still very slow, what should I do?
Check one by one from these directions:
- Is DOWNLOAD_DELAY still set? AutoThrottle never drops below it, so remove it or lower it.
- Is AUTOTHROTTLE_TARGET_CONCURRENCY set too conservatively? Try raising it to 2-5 first.
- Is the global CONCURRENT_REQUESTS the bottleneck? For example, you may still be using the default of 16. If the machine can handle it and the website can bear it, you can raise it to 32 or even 64.
2. Even though AutoThrottle is turned on, the IP is still blocked?
- Don't expect AutoThrottle to solve every anti-scraping problem on its own. Combine it with a proxy IP pool, random User-Agent rotation, a cookie pool, and similar measures.
- Check whether you have crossed a red line: watch the delay in the debug log (AUTOTHROTTLE_DEBUG). If the delay stays pinned near AUTOTHROTTLE_MAX_DELAY for a long time, the server is clearly unhappy. Pause the task for 10 to 30 minutes, change IP, and then resume.
- If the website is extremely sensitive to request frequency, switch decisively to conservative mode and lower the target concurrency to 0.5, so that requests to a single domain are strictly serialized.
3. The delay number of AutoThrottle keeps jumping, is this normal?
**Totally normal!** AutoThrottle is a dynamic algorithm driven by real-time measurements, so at the beginning there is little data and the fluctuation is more obvious. After twenty or thirty pages, the delay gradually converges to a relatively stable range.
Best Practices
- Use conservative mode and turn on debug logs early in development: First understand the response habits of the target website thoroughly, and only then consider speeding up.
- Be sure to add random jitter: Whether you use the built-in or a customized version, making the interval between requests a little "human" is the cheapest way to avoid being flagged by a fixed-frequency detection system.
- Try to keep the number of concurrent connections to a single domain within 4: Unless you are fetching from a service that is clearly defined as a "public API", high concurrency is easily blocked as scripted traffic.
- Save the crawl state regularly: Turn on Scrapy's JOBDIR, so that even if you are suddenly banned and have to pause, you can resume from the breakpoint after changing IP instead of starting over.
- Monitor response status codes closely: If you keep receiving 429 (Too Many Requests) or 403 (Forbidden), the other side has started to push back, and you should pause immediately and adjust the whole strategy.
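The status-code advice above could be automated with a small counter like the one below. It is a sketch under assumptions: BlockDetector, its threshold of 5, and the idea of counting consecutive blocked responses are illustrative choices, not part of Scrapy.

```python
class BlockDetector:
    """Hypothetical helper that tracks a run of consecutive 429/403 responses."""

    def __init__(self, threshold=5, block_codes=(429, 403)):
        self.threshold = threshold
        self.block_codes = set(block_codes)
        self.streak = 0

    def record(self, status):
        """Record one response status; return True once the streak reaches the threshold."""
        if status in self.block_codes:
            self.streak += 1
        else:
            # Any other status breaks the run of blocked responses.
            self.streak = 0
        return self.streak >= self.threshold


detector = BlockDetector(threshold=3)
results = [detector.record(code) for code in (200, 429, 403, 429, 200)]
print(results)  # [False, False, False, True, False]
```

In a spider, one option is to call record(response.status) in the callback and raise scrapy.exceptions.CloseSpider when it returns True. For the callback to see these responses at all, the spider would likely need handle_httpstatus_list = [403, 429], since non-2xx responses are filtered out by default.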
💡 Summary of core points: AutoThrottle is one of the most practical and lowest-cost aids against blocking in Scrapy. Don't invent a complex rate-limiting algorithm from the start; use and tune the built-in AutoThrottle first. Combined with proxies, UA rotation, and a cookie pool, it is enough to handle roughly 80% of the anti-scraping problems in everyday crawlers. For more unusual scenarios, the most efficient route is to add a little custom logic by subclassing it and proceed step by step.

