# Performance optimization and debugging: concurrency tuning and log analysis

📂 Stage: Stage 4 - Practical Exercise (Project Development)

Have you run into this? The crawler ran happily when you first wrote it, but as soon as you ask it for 1,000 records it slows to a crawl, gets IP-banned at every turn, and no amount of reading the code turns up a bug. 😤 Problems like this usually have little to do with code logic. More often the crawler's "rhythm" is badly tuned, and you don't know where to look when something goes wrong. This note tackles the two most troublesome things about Scrapy crawlers: "can't run fast" and "hard to troubleshoot".


1. Core tuning: control concurrency to make the crawler fast and stable

Scrapy's out-of-the-box concurrency settings are deliberately conservative, so that a new project does not fall over on arbitrary websites. If you are crawling a large public site (such as GitHub Wiki or the old version of Douban Reading), you can adjust the following parameters more boldly to balance throughput, the risk of being blocked, and the load on the target server.

1.1 Scrapy’s “three-layer speed limit” logic

Scrapy uses three layers of rules to limit how fast requests are sent (from loosest to strictest):

  1. Global concurrency: the maximum number of requests allowed in flight at the same time, regardless of target.
  2. Per-domain / per-IP concurrency: the maximum number of simultaneous requests to the same domain (or the same IP address).
  3. Download delay and AutoThrottle: the minimum wait between consecutive requests to the same site, used to smooth the request rhythm further.

The general idea for tuning is: turn AutoThrottle off first, manually test how much concurrency the target site can bear, and then turn AutoThrottle back on as protection against bans.
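
To see how the three layers interact, here is a rough back-of-the-envelope model. It is not Scrapy's actual scheduler: it assumes a single target domain, a constant response latency, and no randomization of the delay, and the function name `estimate_max_rps` is made up for illustration.

```python
def estimate_max_rps(global_concurrency, per_domain_concurrency,
                     download_delay, latency):
    """Rough upper bound on requests/second against ONE domain.

    Simplifying assumptions (not Scrapy internals): constant latency,
    a single domain, and no randomization of the delay.
    """
    # Layers 1 and 2: the tighter of the two concurrency caps wins.
    in_flight = min(global_concurrency, per_domain_concurrency)
    by_concurrency = in_flight / latency
    # Layer 3: a positive delay spaces request starts at least that far apart.
    if download_delay > 0:
        return min(by_concurrency, 1.0 / download_delay)
    return by_concurrency


# Same concurrency caps, with and without a 0.5 s delay (latency assumed 0.2 s).
print(estimate_max_rps(32, 8, 0.5, 0.2))
print(estimate_max_rps(32, 8, 0.0, 0.2))
```

Under these assumptions, a non-zero download delay quickly becomes the binding limit, which is why the manual test phase sets it to 0 before probing the concurrency ceiling.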

1.2 Key configuration code (with usage scenarios)

In the project's settings.py, add or modify the following settings. The values can be fine-tuned to suit your situation:

# settings.py

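# ✅ Global concurrent requests:
# 16-64 is a reasonable range for large public sites; a self-hosted test environment can take 100+
# Note: requests are I/O-bound, so the ceiling is not tied to core count;
# raise the value gradually and back off once the Scrapy process becomes CPU-bound
CONCURRENT_REQUESTS = 32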

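# ✅ Per-domain concurrency: this is the setting most likely to trigger anti-crawling measures!
# Large sites like GitHub Gist can take 8-16; for typical small and medium blogs use 2-4
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# ✅ (Optional) Per-IP limit: useful with multiple proxies or when the target site sits behind several IP clusters
# When set to a non-zero value it is used instead of CONCURRENT_REQUESTS_PER_DOMAIN
# CONCURRENT_REQUESTS_PER_IP = 4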

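# ✅ Download delay: can be set to 0 temporarily during manual testing
# With AutoThrottle on, this value is not discarded: it acts as the minimum delay
DOWNLOAD_DELAY = 0.5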

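# ✅ AutoThrottle: strongly recommended once manual testing is done!
# It adjusts the delay dynamically from the target site's response latency;
# non-200 responses (403/500, etc.) are never allowed to lower the delay
AUTOTHROTTLE_ENABLED = True
# Initial delay, a small buffer for the first requests
AUTOTHROTTLE_START_DELAY = 5
# Maximum delay, to avoid stalling (too long hurts efficiency; 60-120 seconds is reasonable)
AUTOTHROTTLE_MAX_DELAY = 60
# Target concurrency for AutoThrottle; roughly 2/3 of the per-domain concurrency works well
AUTOTHROTTLE_TARGET_CONCURRENCY = 5.0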

💡 Tip: If the target site responds quickly, you can turn AUTOTHROTTLE_TARGET_CONCURRENCY up a little; if it responds slowly, turn it down.
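
To build intuition for what AutoThrottle does with these values, here is a simplified paraphrase of the update rule described in Scrapy's AutoThrottle documentation. It is a sketch, not the extension's source: the function name `next_delay` and its default arguments (taken from the settings in this note) are illustrative, and the real behavior may differ in detail between versions.

```python
def next_delay(current_delay, latency, status,
               target_concurrency=5.0, min_delay=0.5, max_delay=60.0):
    """One AutoThrottle-style update step (simplified sketch)."""
    # Delay that would keep `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the current delay toward the target...
    new_delay = (current_delay + target_delay) / 2.0
    # ...but jump straight up when the target is higher (slow down quickly).
    new_delay = max(target_delay, new_delay)
    # Clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # A non-200 response is never allowed to lower the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


delay = 5.0  # AUTOTHROTTLE_START_DELAY
for latency in (0.4, 0.4, 0.4, 0.4):  # assumed latencies, in seconds
    delay = next_delay(delay, latency, 200)
    print(round(delay, 3))
```

With a fast site the delay shrinks step by step until it reaches the DOWNLOAD_DELAY floor, while a fast 403 page leaves the delay where it was.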


2. Quick troubleshooting: use logs to locate any problems

Many beginners only stare at the few error lines on the console when troubleshooting. In fact, the detailed log file is the crawler's "flight recorder": anti-crawling responses, request timeouts, parsing errors, repeated crawling... all the clues are in there.

2.1 Set up log output first

Scrapy's default log level is DEBUG, but if your project has raised it to INFO or above, set it back and write the log to a file while debugging:

# settings.py

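# ✅ Log level: DEBUG is the most complete; INFO is enough for day-to-day runs
LOG_LEVEL = 'DEBUG'
# ✅ Write the log to a file for later searching and analysis
# (once LOG_FILE is set, log output goes to the file instead of the console)
LOG_FILE = 'scrapy_spider.log'
# ✅ (Optional) LOG_STDOUT = True also redirects the process's standard output (e.g. print) into the log
# LOG_STDOUT = False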
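
Scrapy builds on Python's standard logging module, so the effect of the level threshold can be seen with the standard library alone. The sketch below uses a made-up logger name and made-up messages; the point is that per-request "Crawled (...)" lines are emitted at DEBUG level, so they disappear when the threshold is INFO.

```python
import io
import logging


def visible_messages(level):
    """Return the log lines that survive a given level threshold."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level}")  # hypothetical logger name
    logger.propagate = False
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    # Made-up messages, one per level, in the style of a crawl log.
    logger.debug("Crawled (200) <GET https://example.com/page/1>")
    logger.info("Spider opened")
    logger.warning("Retrying request")
    logger.error("Spider error processing page")
    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


print(len(visible_messages("DEBUG")))
print(len(visible_messages("INFO")))
```

This is why per-page counts taken from the log file are only meaningful when the log was written at DEBUG level.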

2.2 Use a few commands to quickly analyze logs (Linux / macOS / WSL)

No complicated scripts are needed: the grep and wc tools that ship with the system are enough to pull out the key numbers:

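# 🟥 Start with the most serious case: count ERROR-level log lines
grep "ERROR" scrapy_spider.log | wc -l

# 🟨 Then look at warnings, e.g. DNS lookup failures, robots.txt restrictions
grep "WARNING" scrapy_spider.log | head -20   # first 20 only, to avoid flooding the terminal

# ✅ Finally, crawl efficiency:
# Count successfully crawled pages (status 200; 302 redirects appear as "Redirecting (302)" lines instead)
grep "Crawled (200)" scrapy_spider.log | wc -l
# Count duplicate requests dropped by the dupefilter
# (by default only the first duplicate is logged; set DUPEFILTER_DEBUG = True to log all of them)
grep "Filtered duplicate" scrapy_spider.log | wc -l
# Want the total run time? Subtract the timestamp of the first log line from that of the last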
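
On systems without grep, or when you want all the numbers in one pass, the same counts can be sketched in Python. This assumes Scrapy's default log line layout ("YYYY-MM-DD HH:MM:SS [name] LEVEL: message"); the function name `summarize` and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumes the default layout: "YYYY-MM-DD HH:MM:SS [name] LEVEL: message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]+\] (\w+): (.*)$")
STATUS_RE = re.compile(r"^Crawled \((\d{3})\)")


def summarize(lines):
    """Count lines per level, pages per status code, and elapsed seconds."""
    levels, statuses, stamps = Counter(), Counter(), []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # e.g. traceback continuation lines
        stamp, level, message = match.groups()
        stamps.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
        levels[level] += 1
        status = STATUS_RE.match(message)
        if status:
            statuses[status.group(1)] += 1
    # First-to-last timestamp difference, as suggested above.
    elapsed = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return levels, statuses, elapsed


sample = [  # made-up lines in the default format
    "2024-05-01 10:00:00 [scrapy.core.engine] INFO: Spider opened",
    "2024-05-01 10:00:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/a> (referer: None)",
    "2024-05-01 10:00:02 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://example.com/b> (referer: None)",
    "2024-05-01 10:00:05 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.com/a> (referer: None)",
    "Traceback (most recent call last):",
]
levels, statuses, elapsed = summarize(sample)
print(dict(levels), dict(statuses), elapsed)
```

To run it against a real log, pass an open file instead of the sample list, for example `summarize(open("scrapy_spider.log", encoding="utf-8"))`.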

Common log keywords & solution directions

| Log keyword | Possible cause | Quick fix |
| --- | --- | --- |
| 403 Forbidden | Anti-crawling triggered (UA/IP/Cookie) | Change proxy, add a User-Agent pool, add cookies |
| TimeoutError | Site responds slowly, or concurrency is too high | Turn on AutoThrottle and lower the per-domain concurrency |
| Too many "Filtered duplicate" | Duplicates in the start URLs, or a problem with the deduplication rules | Check start_urls and confirm the deduplication key is reasonable |
| ItemError | Error parsing a field | Print the page source with response.text and debug step by step |
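
The keyword table above can also be applied mechanically. The following is a minimal sketch that counts how often each keyword appears and prints the matching hint; the `HINTS` mapping simply restates the table, and the function name `triage` and the sample lines are made up for illustration.

```python
from collections import Counter

# Keyword -> hint pairs restating the table above; extend as needed.
HINTS = {
    "403 Forbidden": "change proxy, add a User-Agent pool, add cookies",
    "TimeoutError": "turn on AutoThrottle, lower per-domain concurrency",
    "Filtered duplicate": "check start_urls and the deduplication key",
    "ItemError": "print response.text and debug the parsing step by step",
}


def triage(lines):
    """Return (keyword, count, hint) tuples, most frequent keyword first."""
    counts = Counter()
    for line in lines:
        for keyword in HINTS:
            if keyword in line:
                counts[keyword] += 1
    return [(keyword, n, HINTS[keyword]) for keyword, n in counts.most_common()]


sample = [  # made-up lines for illustration, not real Scrapy output
    "... TimeoutError ...",
    "... 403 Forbidden ...",
    "... TimeoutError ...",
]
for keyword, n, hint in triage(sample):
    print(f"{n:>4}  {keyword}: {hint}")
```

Sorting by frequency puts the dominant symptom first, which is usually the one worth fixing before the others.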

3. Practical summary: Remember these 3 tuning rules

Performance optimization is never a matter of maxing out concurrency and calling it done; it is a cycle of "monitor → analyze → adjust → monitor again". Here are three rules beginners can follow directly:

  1. Test the limit manually first: Turn AutoThrottle off, set the download delay to 0, and raise the per-domain concurrency from low to high until 403 / 500 responses or timeouts appear. Then take 70% to 80% of the highest stable value as the final configuration.
  2. Use AutoThrottle as a safety net: Leave a margin in the manual configuration, then turn AutoThrottle on. It copes with temporary fluctuations on the target site and also speeds up automatically when the load is low.
  3. Logs always come first: Whether the crawl is fast or slow, enable file logging so that you can trace back what happened when something goes wrong.
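
Rule 1 can be sketched as a small helper. It assumes you have already recorded an error rate for each concurrency level you tried; the function name `pick_concurrency`, the 1% error threshold, and the 0.75 safety factor (the midpoint of 70% to 80%) are illustrative choices, and the trial data is made up.

```python
def pick_concurrency(trials, max_error_rate=0.01, safety=0.75):
    """Pick a per-domain concurrency from (concurrency, error_rate) trials.

    Walks the trials from low to high, stops at the first level whose
    error rate exceeds the threshold, and keeps a safety margin below
    the last stable level.
    """
    stable = 0
    for concurrency, error_rate in sorted(trials):
        if error_rate > max_error_rate:
            break  # first sign of 403 / 500 / timeouts: stop climbing
        stable = concurrency
    return max(1, int(stable * safety))


# Made-up measurements: (concurrency, share of failed requests)
trials = [(2, 0.0), (4, 0.0), (8, 0.005), (16, 0.08), (32, 0.30)]
print(pick_concurrency(trials))
```

The result is a starting value for CONCURRENT_REQUESTS_PER_DOMAIN, with AutoThrottle covering the fluctuations that a one-off test cannot capture.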

💡 One last thing: the highest goal of a crawler is not "fastest" but "sustainable". Don't cause unnecessary trouble for the target site or your own proxy pool for the sake of short-term speed.

