A Complete Guide to Scrapy Anti-Scraping Countermeasures: CAPTCHA Solving and Comprehensive Anti-Detection Techniques
📂 Stage: Stage 3 - Attack and Defense Drills (Middleware and Anti-Scraping) 🔗 Related chapters: Downloader Middleware · Selenium and Playwright Integration · Proxy IP Pool Integration
When your crawler runs into "403 Forbidden" or "Please enter the verification code", you have reached the core of the anti-scraping contest. This tutorial walks systematically through the key techniques in Scrapy: CAPTCHA solving, IP rotation, request header camouflage, browser fingerprint hiding, and human behavior simulation, so that your crawler can hold its own on both the attacking and the defending side.
Overview of Anti-Scraping Mechanisms
Modern websites usually build a four-layer anti-scraping system, with defenses running from shallow to deep. Countermeasures can only be targeted once these detection levels are clearly understood.
Below we break down the countermeasures at each level one by one.
Core Techniques in Practice
1. Intelligent IP rotation and ban avoidance
Pain Point: IP blocking is the most common anti-scraping response. Simple random rotation often cannot cope with fine-grained bans: you may be blocked the moment your request rate rises.
A smarter solution is to keep a score for each proxy IP, prioritize proxies automatically based on their success and failure counts and their cooldown time, and automatically release a blocked IP once its cooldown has expired.
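The scoring idea can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the ProxyPool class, the report method, the smoothed success-rate formula, and the 300-second cooldown are all assumptions; only the name get_best_proxy comes from the text.

```python
import random
import time


class ProxyPool:
    """Score proxies by success/failure and rest blocked ones for a cooldown."""

    def __init__(self, proxies, cooldown=300.0, clock=time.monotonic):
        self.cooldown = cooldown  # seconds a blocked proxy is rested (assumed value)
        self.clock = clock        # injectable so the cooldown logic can be tested
        self.stats = {p: {"ok": 0, "fail": 0, "blocked_until": 0.0} for p in proxies}

    def score(self, proxy):
        s = self.stats[proxy]
        # Smoothed success rate: a proxy with no history starts at 0.5.
        return (s["ok"] + 1) / (s["ok"] + s["fail"] + 2)

    def get_best_proxy(self):
        now = self.clock()
        # Proxies whose cooldown has expired become available again automatically.
        available = [p for p, s in self.stats.items() if s["blocked_until"] <= now]
        if not available:
            return None
        best = max(self.score(p) for p in available)
        # Break ties randomly so equally good proxies share the load.
        return random.choice([p for p in available if self.score(p) == best])

    def report(self, proxy, success, blocked=False):
        s = self.stats[proxy]
        s["ok" if success else "fail"] += 1
        if blocked:
            s["blocked_until"] = self.clock() + self.cooldown
```

In a Scrapy downloader middleware, one plausible wiring is to set request.meta["proxy"] from get_best_proxy() in process_request and to call report() from process_response and process_exception.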
With such a pool in place, calling get_best_proxy() before each request lets you avoid any IP that has just been blocked and give priority to proxies with a high success rate.
2. Request header and browser fingerprint anti-detection
1. Dynamic request header generator
A static User-Agent is easily identified as a crawler. Using the fake_useragent library together with randomized Accept, Accept-Language, and other fields can make each request look like it comes from a different real browser.
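A minimal sketch of such a generator is shown here. The function names, the fallback User-Agent strings, and the candidate header values are illustrative assumptions; the only library call used is fake_useragent's UserAgent().random, and the code falls back to the local list if the library is missing or fails.

```python
import random

# Hypothetical fallback pool, used only when fake_useragent is unavailable.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
ACCEPT_VALUES = [
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "zh-CN,zh;q=0.9,en;q=0.8"]


def pick_user_agent():
    """Return a random User-Agent, preferring fake_useragent when it works."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        return random.choice(FALLBACK_USER_AGENTS)


def build_headers():
    """Assemble one randomized header set for a single request."""
    return {
        "User-Agent": pick_user_agent(),
        "Accept": random.choice(ACCEPT_VALUES),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
    }
```

In Scrapy, the resulting dictionary could be passed as the headers argument of scrapy.Request, or applied to request.headers inside a downloader middleware.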
2. Selenium/Playwright basic anti-detection script
When a website checks navigator.webdriver and other attributes to determine whether an automation tool is in use, we need to execute JavaScript code to hide these characteristics. A generic script for this overrides key properties and protects native functions from detection.
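The original script is not reproduced here. As a rough sketch under stated assumptions, the snippet below covers only the navigator.webdriver property and shows two ways of injecting it before any page script runs: Selenium's execute_cdp_cmd with Page.addScriptToEvaluateOnNewDocument (Chromium-based drivers only) and Playwright's add_init_script. The names STEALTH_JS, inject_selenium, and inject_playwright are hypothetical.

```python
# JavaScript evaluated in every new document before the page's own scripts.
# It only masks navigator.webdriver; a fuller script would cover more properties.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""


def inject_selenium(driver):
    """Register the script on a Chromium-based Selenium driver via CDP."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": STEALTH_JS}
    )


def inject_playwright(context):
    """Register the script on a Playwright BrowserContext (sync API)."""
    context.add_init_script(STEALTH_JS)
```

Both helpers should be called right after the driver or context is created and before the first navigation, since the script only applies to documents loaded afterwards.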
Injecting the script as soon as the browser opens, before any page script runs, can circumvent many anti-scraping checks that rely on WebDriver attribute detection.
3. Quick Start with CAPTCHA Recognition
CAPTCHAs are a typical representative of the content verification layer. Different types of CAPTCHA call for different solving strategies.
1. Simple character CAPTCHA: preprocessing + OCR
For character CAPTCHAs with little background noise, pytesseract can achieve a higher recognition rate after simple preprocessing with OpenCV.
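One possible preprocessing pipeline is sketched below, assuming OpenCV (cv2), pytesseract, and the Tesseract binary are installed. The specific steps (median blur followed by Otsu thresholding), the single-line page segmentation mode, and the helper names are illustrative choices rather than the original recipe.

```python
# Treat the image as a single text line and restrict output to letters and digits.
TESSERACT_CONFIG = (
    "--psm 7 -c tessedit_char_whitelist="
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
)


def clean_text(raw):
    """Drop whitespace and stray symbols that OCR often adds."""
    return "".join(ch for ch in raw if ch.isalnum())


def recognize_captcha(image_path):
    """Grayscale -> denoise -> binarize -> OCR for a low-noise character CAPTCHA."""
    import cv2          # imported lazily so the helpers above work without OpenCV
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    denoised = cv2.medianBlur(gray, 3)  # remove salt-and-pepper noise
    # Otsu's method picks the black/white threshold automatically.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return clean_text(pytesseract.image_to_string(binary, config=TESSERACT_CONFIG))
```

The kernel size and thresholding mode usually need tuning per site, since noise patterns differ between CAPTCHA generators.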
If you encounter more complex CAPTCHAs, the ddddocr library is recommended; it gives better recognition results for Chinese-character, slider, click, and other types.
2. Slider CAPTCHA: simulating a human sliding trajectory
What matters most for a slider CAPTCHA is how human-like the trajectory is. Using Selenium's ActionChains to produce a trajectory of segmented acceleration and deceleration plus random jitter can pass the verification effectively.
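The trajectory idea can be sketched as follows. The 70% acceleration phase, the acceleration ranges, the jitter of ±2 pixels, and the function names build_track and drag_slider are all assumptions chosen for illustration; distance is assumed to be a positive integer number of pixels.

```python
import random
import time


def build_track(distance, rng=None):
    """Split a positive integer pixel distance into per-step x offsets."""
    rng = rng or random.Random()
    steps, moved, velocity, dt = [], 0, 0.0, 0.2
    while moved < distance:
        if moved < distance * 0.7:
            accel = rng.uniform(20.0, 40.0)    # speed up for roughly the first 70%
        else:
            accel = -rng.uniform(30.0, 50.0)   # then brake toward the target
        step = velocity * dt + 0.5 * accel * dt * dt
        velocity = max(velocity + accel * dt, 0.0)
        # Always advance at least 1 px and never overshoot the target.
        step = min(max(1, round(step)), distance - moved)
        steps.append(step)
        moved += step
    return steps


def drag_slider(driver, slider, distance, rng=None):
    """Drag a slider element along the generated track with vertical jitter."""
    from selenium.webdriver import ActionChains  # lazy import; requires selenium

    rng = rng or random.Random()
    ActionChains(driver).click_and_hold(slider).perform()
    for step in build_track(distance, rng):
        ActionChains(driver).move_by_offset(step, rng.randint(-2, 2)).perform()
        time.sleep(rng.uniform(0.01, 0.03))
    ActionChains(driver).release().perform()
```

Keeping the track generation separate from the browser calls makes it easy to inspect or plot the step sequence before using it.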
4. Rate Limiting and Human Behavior Simulation
Even when the IP and request headers are well disguised, an overly regular access rhythm will expose the crawler. We need to model human activity levels by time of day and add random page dwell times.
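A minimal sketch of such a delay function is given below. The name get_delay comes from the text; the time bands, base delays, jitter range, and the 10% chance of a longer dwell are assumed values for illustration only.

```python
import random
from datetime import datetime


def get_delay(now=None, rng=None):
    """Return a randomized delay in seconds that depends on the time of day."""
    now = now or datetime.now()
    rng = rng or random.Random()
    hour = now.hour
    if 9 <= hour < 18:
        base = 2.0   # daytime: users browse actively
    elif 18 <= hour < 23:
        base = 4.0   # evening: somewhat slower
    else:
        base = 8.0   # night: sparse activity
    delay = base * rng.uniform(0.5, 1.5)
    if rng.random() < 0.1:
        # Occasionally linger, as if reading the page.
        delay += rng.uniform(5.0, 15.0)
    return delay
```

Scrapy's built-in DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY, and AutoThrottle settings cover the simpler cases; a custom function like this is only needed when the delay should follow a time-of-day profile.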
Insert get_delay() between downloader requests, and your crawler's cadence will look more like that of real users.
Legal Compliance and Best Practices
Technology is a double-edged sword, and you must stay within legal and ethical limits when using anti-scraping countermeasures.
Compliance Red Line
- Respect Copyright: Only collect public data, and avoid commercial abuse or infringement of intellectual property rights.
- Abide by the Agreement: Strictly follow robots.txt, website terms of service, and developer specifications.
- Data Security: Comply with the "Personal Information Protection Law" and do not store or disseminate any sensitive personal information.
- Resource constraints: Control the number of concurrent requests to avoid excessive pressure on the target server.
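As one way to apply the robots.txt rule in code, the sketch below uses the standard library's urllib.robotparser. The helper name build_robots_checker and the sample rules are hypothetical; in a real crawler the lines would come from the target site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_lines, user_agent):
    """Return a function that reports whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of robots.txt lines
    return lambda url: parser.can_fetch(user_agent, url)


# Example rules, made up for illustration.
SAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]
allowed = build_robots_checker(SAMPLE_ROBOTS, "my-crawler")
```

Within Scrapy itself, setting ROBOTSTXT_OBEY = True enforces robots.txt automatically, and CONCURRENT_REQUESTS_PER_DOMAIN limits how many requests hit one server at a time.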
Best Practices
- Prioritize API use: If the target provides an official public API, use it in preference to crawling.
- Identify your crawler clearly: Add the crawler name and contact information to the User-Agent to maintain transparency.
- Smart Retry: Automatically extend the retry interval when encountering 429 (Too Many Requests) or 503 (Service Unavailable).
- Continuous monitoring: Record the error rate and response time, and dynamically adjust the anti-scraping strategy based on actual feedback.
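The smart retry item above can be sketched as exponential backoff with jitter. The function name retry_delay, the base of 1 second, and the 60-second cap are assumptions; the Retry-After value is assumed to be given in seconds (the header may also carry an HTTP date, which this sketch does not handle).

```python
import random


def retry_delay(attempt, status, retry_after=None, base=1.0, cap=60.0, rng=None):
    """Return seconds to wait before retry number `attempt` (0-based)."""
    if status not in (429, 503):
        return 0.0  # other statuses are not handled by this backoff sketch
    if retry_after is not None:
        # Honor the server's hint, but never wait longer than the cap.
        return min(float(retry_after), cap)
    rng = rng or random.Random()
    ceiling = min(cap, base * 2 ** attempt)  # 1, 2, 4, 8, ... up to cap
    # Jitter in the upper half keeps retries from synchronizing.
    return rng.uniform(ceiling / 2, ceiling)
```

Scrapy's RetryMiddleware retries failed requests but does not add growing delays on its own, so a function like this would typically live in a custom downloader middleware.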
Summary
Anti-scraping confrontation is essentially a game of attack and defense, and there is no one-size-fits-all solution. You need to build a layered system of countermeasures: IP rotation → request header forgery → behavioral simulation → browser fingerprint hiding, and switch strategies in real time based on monitoring results during operation. More importantly, always put legal and ethical considerations first and let technology serve legitimate, compliant data collection.
💡 Core tool recommendations: fake_useragent (request header disguise), redis (distributed crawling / IP pool), playwright-stealth (browser anti-detection), ddddocr (more powerful CAPTCHA recognition).

