A Complete Guide to Countering Anti-Crawling Measures in Scrapy: CAPTCHA Solving and Anti-Detection Techniques in Detail


When a crawler runs into "403 Forbidden" or "Please enter the verification code", it has reached the core of the anti-crawling contest. This tutorial systematically covers the key techniques in Scrapy: CAPTCHA solving, IP rotation, request header camouflage, browser fingerprint hiding, and human behavior simulation, so that your crawler can hold its own on both sides of that contest.

Overview of anti-crawling mechanisms

Modern websites usually build a four-layer anti-crawling system, with defenses running from shallow to deep. Only with a clear picture of these detection levels can countermeasures be deployed in a targeted way:

| Level | Core detection items |
| --- | --- |
| Request feature layer | User-Agent, IP frequency, request header integrity, cookie verification |
| Behavior feature layer | Access frequency, page dwell time, mouse trajectory, click pattern detection |
| Technical fingerprint layer | JS execution detection, browser fingerprint identification, device fingerprint, network stack characteristics |
| Content verification layer | Dynamic content generation, CAPTCHA challenge, human-machine verification |

Below we break down the countermeasures at each level one by one.


Core offensive and defensive techniques in practice

1. Intelligent IP rotation and ban avoidance

Pain Point: IP blocking is the most common anti-crawling trigger. Simple random rotation often cannot cope with fine-grained bans - you may be blocked the moment you visit more frequently.

A smarter solution is to keep a score for each proxy IP, prioritize proxies automatically based on their success and failure counts and their cooldown time, and automatically lift the ban on an IP some time after it has been blocked.

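import time
from collections import defaultdict

class IntelligentIPManager:
    """Intelligent proxy manager: scoring + cooldown + automatic unbanning."""
    
    def __init__(self, proxy_list=None):
        self.proxy_list = proxy_list or []
        self.ip_stats = defaultdict(lambda: {
            'success': 0, 'failure': 0, 'score': 100,
            'last_used': 0, 'banned': False, 'ban_time': 0
        })
        # Register every known proxy up front; otherwise get_best_proxy()
        # has nothing to iterate over until update_stats() is first called.
        for proxy in self.proxy_list:
            self.ip_stats[proxy]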
    
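    def get_best_proxy(self):
        """Pick the best proxy by combined score: success rate, cooldown, base score."""
        scored = []
        for proxy, stats in self.ip_stats.items():
            if stats['banned']:
                # Automatically lift the ban after 30 minutes
                if time.time() - stats['ban_time'] > 1800:
                    stats['banned'] = False
                    # Restart from a reduced score (50 is an assumed value);
                    # mark_banned() set it to 0, which would never recover here
                    stats['score'] = 50
                else:
                    continue
            # Compute the weighted score
            total = stats['success'] + stats['failure']
            # Treat an untested proxy as fully successful so it can be tried at all
            success_rate = stats['success'] / total if total else 1.0
            cool_down = 1.0 if time.time() - stats['last_used'] > 300 else 0.7
            score = stats['score'] * success_rate * cool_down
            scored.append((proxy, score))
        
        if scored:
            return max(scored, key=lambda x: x[1])[0]
        return None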
    
    def mark_banned(self, proxy):
        """Mark a proxy as banned."""
        self.ip_stats[proxy]['banned'] = True
        self.ip_stats[proxy]['ban_time'] = time.time()
        self.ip_stats[proxy]['score'] = 0
    
    def update_stats(self, proxy, success):
        """Update a proxy's success/failure statistics."""
        stats = self.ip_stats[proxy]
        stats['last_used'] = time.time()
        if success:
            stats['success'] += 1
            stats['score'] = min(100, stats['score'] + 5)
        else:
            stats['failure'] += 1
            stats['score'] = max(0, stats['score'] - 10)

Calling get_best_proxy() before each request then avoids IPs that have just been blocked and gives priority to proxies with a high success rate.
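
One way to wire the manager into Scrapy is a downloader middleware. The following is a minimal sketch, not a definitive implementation: the class name ProxyRotationMiddleware and the choice of 403/429 as ban signals are assumptions, and the usual from_crawler wiring and retry handling are omitted. It relies only on the fact that Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta['proxy'].

```python
class ProxyRotationMiddleware:
    """Downloader-middleware sketch that delegates proxy choice to a manager.

    `manager` is any object exposing get_best_proxy(), update_stats() and
    mark_banned(), such as the IntelligentIPManager above.
    """

    # Assumption: these status codes are treated as a ban signal
    BAN_STATUSES = {403, 429}

    def __init__(self, manager):
        self.manager = manager

    def process_request(self, request, spider):
        proxy = self.manager.get_best_proxy()
        if proxy:
            # Scrapy's HttpProxyMiddleware picks the proxy up from this key
            request.meta['proxy'] = proxy
        return None  # continue normal processing

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            if response.status in self.BAN_STATUSES:
                self.manager.mark_banned(proxy)
            else:
                self.manager.update_stats(proxy, success=response.status < 400)
        return response

    def process_exception(self, request, exception, spider):
        # Connection errors and timeouts count as failures for the proxy
        proxy = request.meta.get('proxy')
        if proxy:
            self.manager.update_stats(proxy, success=False)
        return None
```

Keeping the scoring logic in the manager and the Scrapy hooks in the middleware means the manager can be tested, or swapped for a Redis-backed one, without touching the crawler.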


2. Request header and browser fingerprint anti-detection

1. Dynamic request header generator

A static User-Agent is easily identified as a crawler. Using the fake_useragent library together with randomized Accept, Accept-Language, and other fields can make each request look like it comes from a different real browser.

from fake_useragent import UserAgent
import random

class DynamicHeaders:
    """Dynamically generate browser-like request headers covering mainstream Chrome/Safari/Firefox versions."""
    
    def __init__(self):
        self.ua = UserAgent()
        self.accepts = [
            'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        ]
        self.languages = ['zh-CN,zh;q=0.9,en;q=0.8', 'en-US,en;q=0.9']
    
    def generate(self, url=None):
        """Generate a complete, randomized set of request headers."""
        headers = {
            'User-Agent': self.ua.random,
            'Accept': random.choice(self.accepts),
            'Accept-Language': random.choice(self.languages),
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate'
        }
        return headers
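
Randomizing each field independently can pair a User-Agent from one browser with an Accept header typical of another. A possible refinement, sketched here under assumptions, is to choose a whole profile at once and keep it stable per session. The names BROWSER_PROFILES and pick_profile are hypothetical, and the User-Agent strings are illustrative placeholders rather than a maintained list.

```python
import random

# Illustrative profiles: each bundles headers that one browser family would
# plausibly send together. Version strings are placeholders.
BROWSER_PROFILES = [
    {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/120.0.0.0 Safari/537.36'),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    },
    {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) '
                       'Gecko/20100101 Firefox/121.0'),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    },
]

def pick_profile(session_key, profiles=BROWSER_PROFILES):
    """Return a copy of one profile; the same key always maps to the same profile."""
    rng = random.Random(session_key)  # seeded RNG makes the choice repeatable
    return dict(rng.choice(profiles))
```

A session key could be the proxy address or a cookie-jar identifier, so that one apparent visitor does not switch browsers between requests.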

2. Selenium/Playwright basic anti-detection script

When a website checks navigator.webdriver and other attributes to determine whether the visitor is an automation tool, we need to execute JavaScript to hide these characteristics. Below is a generic script that overrides key properties and protects native functions from detection.

// Generic browser anti-detection JS: hides webdriver and overrides key properties
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', {
    get: () => [
        { filename: 'internal-pdf-viewer', description: 'Portable Document Format' }
    ]
});
Object.defineProperty(navigator, 'languages', { get: () => ['zh-CN', 'zh', 'en'] });

const originalToString = Function.prototype.toString;
Function.prototype.toString = function() {
    if (this === window.cdc_adoQpoasnfa76pfcZLmcfl_Array) {
        return 'function Array() { [native code] }';
    }
    return originalToString.call(this);
};

Injecting this script before any page script runs, that is, as soon as a new document is created, can circumvent many anti-crawling checks based on WebDriver attribute detection, although more thorough fingerprinting may still detect automation.
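
With a Chromium-based Selenium driver, early injection can be done through the DevTools Protocol command Page.addScriptToEvaluateOnNewDocument. This is a minimal sketch: the helper name inject_on_new_document and the shortened STEALTH_JS string are assumptions for illustration, and in practice the full script above would be passed instead.

```python
# Shortened stand-in for the full anti-detection script shown above
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });"

def inject_on_new_document(driver, script=STEALTH_JS):
    """Register `script` to run in every new document before page scripts.

    Assumes a Chromium-based Selenium driver, which exposes execute_cdp_cmd().
    """
    return driver.execute_cdp_cmd(
        'Page.addScriptToEvaluateOnNewDocument',
        {'source': script},
    )
```

Call it once after creating the driver and before the first driver.get(url). In Playwright, the comparable hook is add_init_script on a page or browser context.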


3. Quick Start with Verification Code Recognition

The CAPTCHA is the typical representative of the content verification layer. Different types of CAPTCHA call for different solving strategies.

1. Simple character verification code: preprocessing + OCR

For character CAPTCHAs with little background noise, pytesseract can achieve a reasonably high recognition rate after simple preprocessing with OpenCV.

import cv2
import numpy as np
import pytesseract

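def preprocess_captcha(img_path):
    """Grayscale -> denoise -> binarize (Otsu) -> morphological closing."""
    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None instead of raising on an unreadable path
        raise FileNotFoundError(f'Cannot read image: {img_path}')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((2, 2), np.uint8)
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)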

def ocr_captcha(img_path):
    processed = preprocess_captcha(img_path)
    # psm 8 treats the image as a single word, which usually works better here
    return pytesseract.image_to_string(processed, config='--psm 8 --oem 3').strip()

For more complex CAPTCHAs, the ddddocr library is recommended; it gives better recognition results for Chinese-character, slider, click, and other types.
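
Whichever recognizer is used, the raw text often contains stray whitespace or punctuation. The sketch below is a small post-processing step under stated assumptions: the helper name clean_ocr_text is hypothetical, and it assumes the CAPTCHA consists of a known number of ASCII letters and digits (four by default).

```python
import re

def clean_ocr_text(raw, expected_length=4):
    """Keep only ASCII letters and digits; return None if the length is off.

    A None result signals that the caller should refresh the CAPTCHA and
    try again rather than submit an answer that is probably wrong.
    """
    cleaned = re.sub(r'[^0-9A-Za-z]', '', raw)
    return cleaned if len(cleaned) == expected_length else None
```

Rejecting results of the wrong length avoids spending login attempts on guesses that cannot be right.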

2. Slider verification code: simulate human sliding trajectory

The core of a slider CAPTCHA is how human-like the drag trajectory is. Using Selenium's ActionChains with a trajectory of segmented acceleration and deceleration plus random jitter can often pass the verification.

from selenium.webdriver.common.action_chains import ActionChains
import time
import random

def generate_track(distance):
    """Simulate a human drag: accelerate first, then decelerate, with a randomized time step."""
    track = []
    current = 0
    mid = distance * 4 / 5
    t = random.uniform(0.2, 0.3)
    v = 0
    while current < distance:
        a = 2 if current < mid else -3
        v0 = v
        v = v0 + a * t
        x = v0 * t + 0.5 * a * t * t
        current += x
        track.append(round(x))
    # Make up the remaining distance (or correct any overshoot)
    track.append(round(distance - sum(track)))
    return track

def slide_captcha(driver, slider, track):
    """Perform the drag along the given track."""
    ActionChains(driver).click_and_hold(slider).perform()
    for x in track:
        ActionChains(driver).move_by_offset(xoffset=x, yoffset=random.randint(-1, 1)).perform()
        time.sleep(random.uniform(0.01, 0.02))
    time.sleep(0.5)
    ActionChains(driver).release().perform()

4. Frequency Limitation and Human Behavior Simulation

Even if the IP and request headers are well disguised, an overly regular access frequency will still expose the crawler. We need to simulate human activity levels by time of day and add random page dwell times.

import time
import random
from datetime import datetime

class HumanSimulator:
    """Simulate human browsing: time-of-day dependent delays plus page dwell time."""
    
    # (start hour, end hour): (min delay multiplier, max delay multiplier)
    activity_patterns = {
        (6, 9):   (0.3, 1.2),   # early morning
        (9, 18):  (0.8, 0.9),   # daytime
        (18, 22): (0.6, 1.1),   # evening
        (22, 6):  (0.2, 1.5)    # late night (wraps past midnight)
    }
    
    @classmethod
    def get_delay(cls, base=1):
        """Return a delay in seconds between requests for the current hour."""
        hour = datetime.now().hour
        for (start, end), (low, high) in cls.activity_patterns.items():
            # Also handles ranges that wrap past midnight (e.g. 22:00 to 06:00)
            if start <= hour < end or (start > end and (hour >= start or hour < end)):
                # Sample within the period's range; the original used only the upper bound
                adjusted = base * random.uniform(low, high)
                return max(0.1, adjusted + random.uniform(-0.3, 0.3))
        return base
    
    @classmethod
    def simulate_stay(cls):
        """Simulate page dwell time (10 to 60 seconds)."""
        time.sleep(random.uniform(10, 60))

Applying get_delay() between downloader requests gives your crawler a cadence closer to that of real users.
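
Note that calling time.sleep() inside a Scrapy callback or middleware blocks the whole Twisted reactor, so in Scrapy itself the pacing is usually better expressed through settings. The fragment below is a sketch of a settings.py using Scrapy's built-in delay and AutoThrottle options; the specific values are illustrative assumptions to be tuned per site.

```python
# settings.py (values are illustrative assumptions, tune them per target site)

DOWNLOAD_DELAY = 2                     # base delay in seconds between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True        # wait 0.5x to 1.5x DOWNLOAD_DELAY instead of a fixed gap
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # keep per-site parallelism low

AUTOTHROTTLE_ENABLED = True            # adapt the delay to observed response latency
AUTOTHROTTLE_START_DELAY = 2           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
```

A time-of-day factor such as the one produced by get_delay() could then be applied by adjusting DOWNLOAD_DELAY per run rather than by sleeping in request handlers.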


Technology is a double-edged sword, and you must stay within legal and ethical limits when using anti-crawling countermeasures.

Compliance Red Line

  1. Respect Copyright: Only capture public data to avoid commercial abuse or infringement of intellectual property rights.
  2. Abide by the agreements: Strictly follow robots.txt, the website's terms of service, and developer specifications.
  3. Data Security: Comply with the "Personal Information Protection Law" and do not store or disseminate any sensitive personal information.
  4. Resource constraints: Control the number of concurrent requests to avoid excessive pressure on the target server.
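
The robots.txt rule can be checked programmatically with the standard library's urllib.robotparser. This is a minimal offline sketch: the helper name is_allowed, the sample rules, and the user agent "mybot" are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content used only for demonstration
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

blocked = is_allowed(SAMPLE_ROBOTS, 'mybot', 'https://example.com/private/data')
allowed = is_allowed(SAMPLE_ROBOTS, 'mybot', 'https://example.com/public/page')
```

Within Scrapy itself, setting ROBOTSTXT_OBEY = True makes the framework perform this check before each request.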

Best Practices

  1. Prioritize API use: If the target provides an official public API, call it first instead of the crawler.
  2. Identify your crawler clearly: Add the crawler name and contact information to the User-Agent to maintain transparency.
  3. Smart Retry: Automatically extend the retry interval when encountering 429 (speed limit) or 503 (service unavailable).
  4. Continuous monitoring: Record the error rate and response time, and dynamically adjust the anti-crawling strategy based on actual feedback.
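
The smart-retry practice above can be sketched as exponential backoff with jitter that honors a numeric Retry-After header when the server sends one. The function name retry_delay and the default base and cap values are assumptions, not part of any library.

```python
import random

def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value, if any. A numeric
    value is honored (clamped to `cap`); anything else, such as an HTTP
    date, falls back to exponential backoff.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except (TypeError, ValueError):
            pass  # not a plain number of seconds; use backoff instead
    ceiling = min(cap, base * (2 ** attempt))
    # Random wait in [0, ceiling] so many clients do not retry in lockstep
    return random.uniform(0, ceiling)
```

For a 429 or 503 response, the caller would wait retry_delay(attempt, response_headers.get('Retry-After')) seconds before rescheduling the request, where response_headers stands for whatever header mapping the HTTP client provides.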

Summary

Countering anti-crawling measures is essentially an offensive and defensive game, and there is no one-size-fits-all solution. You need to build a layered system: IP rotation → request header forgery → behavioral simulation → browser fingerprint hiding, and switch strategies in real time based on monitoring results during operation. More importantly, always put legal and ethical considerations first and let the technology serve legitimate, compliant data collection.

💡 Core tool recommendations: fake_useragent (request header disguise), redis (distributed crawling / IP pool), playwright-stealth (browser anti-detection), ddddocr (more powerful CAPTCHA recognition).
