# Performance optimization and debugging: concurrency tuning and log analysis
📂 Stage: Stage 4 - Practical Exercise (Project Development)
Have you ever run into this: the crawler ran happily when you first wrote it, but as soon as it has to fetch 1,000 records the speed collapses, the IP gets banned at every turn, and no amount of searching through the code turns up a bug? 😤 Problems like this usually have little to do with code logic. More often, the crawler's request "rhythm" is badly tuned and you don't know where to look when something goes wrong. This note tackles the two most troublesome things about Scrapy crawlers: "cannot run fast" and "hard to troubleshoot".
1. Core tuning: control concurrency to make the crawler fast and stable
Scrapy's out-of-the-box concurrency settings are deliberately conservative, so that it does not fall over on any kind of website. If you are crawling a large public site (such as GitHub Wiki or the old version of Douban Reading), you can raise the following parameters and look for a balance between throughput, the risk of being blocked, and load on the target server.
1.1 Scrapy’s “three-layer speed limit” logic
Scrapy uses three layers of rules to limit how fast requests are sent (from loose to strict):
- Global concurrency count: No matter who the target is, the maximum number of requests allowed to be sent at the same time.
- Single domain name/single IP concurrency: The upper limit of concurrency for the same domain name (or the same IP address).
- Download delay & automatic speed limit: The minimum waiting time between adjacent requests, used to further smooth the request rhythm.
The general idea for tuning is: turn off the automatic speed limiter first, manually test the upper limit of concurrency that the target site can bear, and then turn the automatic speed limiter back on to avoid being blocked.
1.2 Key configuration code (with scene description)
Add or modify the concurrency-related configuration items in the project's `settings.py`. The values can be fine-tuned according to the actual situation.
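As a minimal sketch of what such a configuration might look like, the fragment below maps the three layers onto Scrapy settings. The setting names are real Scrapy settings; every value is an illustrative assumption, not a recommendation for any particular site.

```python
# settings.py (fragment) -- all values are illustrative assumptions

# Layer 1: global cap on requests in flight, whatever the target.
CONCURRENT_REQUESTS = 32

# Layer 2: cap per domain. Scrapy also has CONCURRENT_REQUESTS_PER_IP;
# check the documentation of your Scrapy version before relying on it.
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Layer 3a: minimum wait (seconds) between requests to the same site.
DOWNLOAD_DELAY = 0.25
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay so the rhythm is less regular

# Layer 3b: AutoThrottle adjusts the delay from observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0           # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0   # average parallel requests per site
AUTOTHROTTLE_DEBUG = False              # set True to log every throttling decision
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the lower bound of the adjusted delay, so the two settings work together rather than competing.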
💡 Tip: If the target site responds quickly, you can turn `AUTOTHROTTLE_TARGET_CONCURRENCY` up a little; if it responds slowly, turn it down a little.
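To see why this knob matters, here is a rough sketch of the delay adjustment as the Scrapy AutoThrottle documentation describes it: the target delay is latency divided by target concurrency, and the next delay is the average of the previous delay and that target. The function `next_delay` is a hypothetical helper written for illustration, not part of Scrapy's API.

```python
def next_delay(current_delay, latency, target_concurrency,
               min_delay=0.0, max_delay=60.0, status=200):
    """Approximate one AutoThrottle step (illustrative, not Scrapy code)."""
    # A higher target concurrency gives a smaller target delay.
    target_delay = latency / target_concurrency
    # Move halfway from the current delay towards the target.
    new_delay = (current_delay + target_delay) / 2.0
    # Never go below the target itself, so a latency spike is felt at once.
    new_delay = max(target_delay, new_delay)
    # Clamp between the configured minimum and maximum delay.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to decrease the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


print(next_delay(5.0, 1.0, 1.0))  # 3.0
print(next_delay(5.0, 1.0, 2.0))  # 2.75
```

With the same latency, doubling the target concurrency lets the delay shrink faster, which is why a fast site can tolerate a higher value.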
2. Quick troubleshooting: use logs to locate any problems
Many novices only stare at the few lines of errors on the console when troubleshooting. In fact, the detailed log file is the crawler's "black box" flight recorder: anti-crawling responses, request timeouts, parsing errors, repeated crawling... all the clues are in it.
2.1 Get the log output in place first
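Scrapy's documented default `LOG_LEVEL` is `DEBUG`, but many projects raise it so that the console only shows `INFO` and above. When debugging, it is recommended to set the level back to `DEBUG` and also write the log to a file.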
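A minimal sketch of such a debugging configuration is shown below. `LOG_LEVEL`, `LOG_FILE` and `LOG_ENCODING` are real Scrapy settings; the file name `crawl_debug.log` is a hypothetical choice.

```python
# settings.py (fragment) -- logging for a debugging session

LOG_LEVEL = "DEBUG"           # record every request, response and retry
LOG_FILE = "crawl_debug.log"  # hypothetical path; log to a file instead of the console
LOG_ENCODING = "utf-8"        # keeps non-ASCII page titles and URLs readable
```

The same values can be overridden for a single run with `scrapy crawl <spider> -s LOG_LEVEL=DEBUG -s LOG_FILE=crawl_debug.log`, which avoids editing the project settings.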
2.2 Use 3 commands to quickly analyze logs (common to Linux / macOS / WSL)
No complicated scripts are needed: the `grep` and `wc` commands that come with the system are enough to quickly get the key data:
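On systems without `grep` and `wc` (plain Windows, for example), the same kind of counting can be sketched in a few lines of Python. The sketch assumes Scrapy's default log format, `timestamp [logger] LEVEL: message`; the helper `summarize_log` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the level in Scrapy's default format: "... [scrapy.core.engine] DEBUG: ..."
LEVEL_RE = re.compile(r"\] (DEBUG|INFO|WARNING|ERROR|CRITICAL): ")
# Matches the status in engine lines such as "Crawled (200) <GET ...>"
STATUS_RE = re.compile(r"Crawled \((\d{3})\)")


def summarize_log(lines):
    """Count log levels and crawled HTTP status codes in an iterable of lines."""
    levels = Counter()
    statuses = Counter()
    for line in lines:
        level = LEVEL_RE.search(line)
        if level:
            levels[level.group(1)] += 1
        status = STATUS_RE.search(line)
        if status:
            statuses[int(status.group(1))] += 1
    return levels, statuses


# Hypothetical sample lines in the default Scrapy log format.
SAMPLE = [
    "2024-01-01 12:00:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/a> (referer: None)",
    "2024-01-01 12:00:01 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://example.com/b> (referer: None)",
    "2024-01-01 12:00:02 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.com/c> (referer: None)",
]

levels, statuses = summarize_log(SAMPLE)
print(dict(levels))
print(dict(statuses))
```

For a real log, pass an open file object instead of `SAMPLE`, for example `summarize_log(open("crawl_debug.log", encoding="utf-8"))`. A rising share of 403 or 429 statuses usually points at blocking rather than a parsing bug.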
Common log keywords & solution directions
3. Practical summary: Remember these 3 tuning rules
Performance optimization is never a matter of maxing out concurrency and calling it done; it is a cycle of "monitor → analyze → adjust → monitor again". Here are three rules that novices can follow directly:
- Test the limit manually first: Turn off the automatic speed limit, set the download delay to 0, and raise the per-domain concurrency from low to high until 403 / 500 responses or timeouts appear. Then take 70% to 80% of the last stable value as the final configuration.
- Use the automatic speed limit as a safety net: Keep that margin in the manual configuration, then turn the automatic speed limit on. It copes with temporary fluctuations on the site and automatically speeds up again when the pressure is low.
- Logs always come first: Whether the crawl is fast or slow, file logging must be enabled so that you can trace back what happened when something goes wrong.
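The first rule can be sketched as a small helper that takes the results of manual test runs and applies the safety margin. The function `pick_concurrency`, the 1% error threshold and the trial numbers are all assumptions for illustration; only the 70% to 80% margin comes from the rule itself.

```python
def pick_concurrency(trials, max_error_rate=0.01, safety=0.75):
    """Pick a per-domain concurrency from manual test runs.

    trials: (concurrency, error_rate) pairs, where error_rate is the share
    of 403 / 500 / timeout responses observed at that concurrency.
    """
    stable = 0
    for concurrency, error_rate in sorted(trials):
        if error_rate > max_error_rate:
            break  # first unstable level: stop raising concurrency
        stable = concurrency
    # Keep 70% to 80% of the last stable value (75% here), at least 1.
    return max(1, int(stable * safety))


# Hypothetical measurements: errors start to appear at 32 concurrent requests.
trials = [(4, 0.0), (8, 0.0), (16, 0.005), (32, 0.12)]
print(pick_concurrency(trials))  # 12
```

The result would then go into `CONCURRENT_REQUESTS_PER_DOMAIN`, with the automatic speed limit switched back on afterwards as the second rule describes.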
💡 One last thing: the highest goal of a crawler is not "fastest" but "sustainable". Don't cause unnecessary trouble for the target site or your own proxy pool for the sake of temporary speed.