Tutorial: Recognizing Slider CAPTCHA Gaps with Deep Learning

1. Introduction

Anyone who writes crawlers or automated tests has probably struggled with slider CAPTCHAs such as Geetest. Traditional approaches locate the gap with edge detection and contour matching, but they either depend on fixed lighting and border styles or have to be rebuilt whenever the site changes slightly. **Why not use deep learning?** An object detection model learns the visual features of the gap itself, so it generalizes far better than traditional methods, even with complex backgrounds or unusual gap shapes.

This article uses the classic YOLOv3 to walk through the complete gap recognition pipeline step by step: data collection, annotation, model training, and deployment, ending with a gap-locating API that can be called directly.

2. Preparation

2.1 Clone the code repository

We use a YOLOv3-based repository maintained by the open-source community, which saves the trouble of building everything from scratch:

git clone https://github.com/Python3WebSpider/DeepLearningSlideCaptcha2.git
cd DeepLearningSlideCaptcha2

2.2 Install dependencies

The core dependencies are already listed in the repository's `requirements.txt`. To avoid network problems and version conflicts, however, complete the following two steps before installing them:

  1. Python environment: use conda to create a virtual environment with Python 3.8-3.11. PyTorch 2.x also works with this repository, but the PyTorch-related code in the training/detection scripts may need adjusting; beginners are advised to use Python 3.8 with PyTorch 1.8.x-1.13.x.
  2. PyPI mirror: switch the package source permanently, or temporarily append `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command.

Then install the dependencies:

pip install -r requirements.txt

3. Object detection basics: why choose YOLOv3?

Slider CAPTCHA gap recognition is essentially a single-object detection task: find the one "gap bounding box" in an image.

Mainstream object detection algorithms fall into two broad categories:

  • Two-stage: the R-CNN family, for example, first "casts a net" to propose thousands of candidate boxes, then classifies and refines them one by one. The advantage is accuracy, but inference is slow, which makes it a poor fit for the high request rates of automation scenarios.
  • One-stage: YOLO and SSD, for example, divide the image into a grid. Each grid cell predicts several bounding boxes with confidence scores in a single pass, and the best results are then filtered out. The advantage is speed. Accuracy is slightly lower, but it is more than sufficient for tasks such as gap recognition.
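To make the grid idea concrete, here is a minimal sketch of how a box center maps to a grid cell at the three output strides YOLOv3 predicts on (32, 16, and 8). The function name `responsible_cell` and the sample numbers are illustrative assumptions, not code from the repository.

```python
# Illustrative sketch: which grid cell "owns" a box center at each YOLOv3 scale.
# YOLOv3 predicts on three feature maps with strides 32, 16 and 8.
STRIDES = (32, 16, 8)

def responsible_cell(x_center, y_center, img_size=416):
    """Return {stride: (grid_size, col, row)} for a box center given in pixels."""
    cells = {}
    for stride in STRIDES:
        grid_size = img_size // stride  # e.g. 416 // 32 = 13
        col = min(int(x_center // stride), grid_size - 1)
        row = min(int(y_center // stride), grid_size - 1)
        cells[stride] = (grid_size, col, row)
    return cells

cells = responsible_cell(200, 100)
for stride, (grid_size, col, row) in cells.items():
    print(f"stride {stride}: {grid_size}x{grid_size} grid, cell (col={col}, row={row})")
```

With a 416x416 input, the same center falls into one cell of the 13x13, 26x26, and 52x52 grids; the finer grids are the ones that help with small objects such as a CAPTCHA gap.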

This tutorial uses YOLOv3, the all-rounder of the YOLO series: noticeably more accurate than v1/v2 while still fast enough for real-time use, with mature deployment tooling and plenty of documentation.

4. Data preparation: the “fuel” of deep learning

4.1 Automatically collect CAPTCHA images

To locate gaps, you first need a sufficiently large and varied set of CAPTCHA images. The repository provides a script, `collect.py`, that collects them from a public demo site (https://captcha1.scrape.center/):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os

# Configuration: number of images, demo site URL, output directory
COUNT = 100  # collecting at least 200-500 images is recommended for good accuracy
URL = 'https://captcha1.scrape.center/'
OUTPUT_DIR = 'data/captcha/images'

os.makedirs(OUTPUT_DIR, exist_ok=True)

for i in range(COUNT):
    # Open and close the browser on every iteration so that a stale session
    # cannot keep serving the same CAPTCHA
    driver = webdriver.Chrome()
    try:
        driver.get(URL)
        wait = WebDriverWait(driver, 10)
        # Click the login button to trigger the CAPTCHA
        login_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-primary')))
        login_btn.click()
        # Wait until the background image element is present
        captcha_bg = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.geetest_slicebg')))
        time.sleep(2)  # Geetest sometimes loads late; waiting 1-2 more seconds is safer
        # Screenshot only the background element to reduce irrelevant content
        captcha_bg.screenshot(f'{OUTPUT_DIR}/captcha_{i}.png')
    finally:
        driver.quit()

💡 Tip: If you are working with your own site, replace the CSS selectors with those of the corresponding background image element. The more images you collect, and the more varied the gaps and backgrounds (lighting, colors, gap shapes), the better the model generalizes.

4.2 Manually mark the gap position

YOLOv3 trains on annotated images, that is, images for which the model is told which rectangular region is the gap. The lightweight open-source tool labelImg is recommended:

pip install labelImg
labelImg

Follow the same labeling procedure for every image; inconsistent labels easily cause problems during training:

  1. After opening the tool, switch the save format to PascalVOC (saved as XML). We convert it to YOLO format later, which keeps the steps clearer.
  2. Set the default label: in "Edit → Predefined Classes", add a single line, `target` (all gaps use this label).
  3. Open the `data/captcha/images` directory you collected, and use "Change Save Dir" to point to a new `data/captcha/annotations` directory.
  4. For each image, use the "Create RectBox" tool to draw a tight box around the gap, choose the default label `target`, save, and press "D" to jump to the next image.
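Before moving on, it can help to confirm that every collected image actually received an annotation. The following is a minimal sketch under the directory layout described above; the helper name `find_unlabeled` is hypothetical and not part of the repository.

```python
from pathlib import Path

def find_unlabeled(image_dir, annotation_dir):
    """Return sorted image file names that have no matching .xml annotation."""
    annotated = {p.stem for p in Path(annotation_dir).glob("*.xml")}
    return sorted(
        p.name for p in Path(image_dir).glob("*.png") if p.stem not in annotated
    )

if __name__ == "__main__":
    # Paths follow the tutorial's layout; adjust them if yours differ.
    missing = find_unlabeled("data/captcha/images", "data/captcha/annotations")
    print(f"{len(missing)} image(s) still need labeling")
```

Matching is done on the file stem, so `captcha_7.png` is considered labeled only if `captcha_7.xml` exists.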

4.3 Convert to the format required by YOLOv3

A PascalVOC XML file stores each box as absolute pixel coordinates: the top-left corner (xmin, ymin) and the bottom-right corner (xmax, ymax). YOLOv3 instead expects:

  • Class ID (there is only one class here, so the ID is 0)
  • Center point (x_center, y_center), normalized by the image width and height
  • Box size (box_width, box_height), normalized by the image width and height

The `convert.py` script provided by the repository performs the whole conversion:

import xmltodict
import os

# Directory configuration
XML_DIR = 'data/captcha/annotations'
LABEL_DIR = 'data/captcha/labels'
os.makedirs(LABEL_DIR, exist_ok=True)

def parse_single_xml(xml_path):
    with open(xml_path, 'r', encoding='utf-8') as f:
        xml_dict = xmltodict.parse(f.read())

    # Image size (used for normalization)
    img_w = int(xml_dict['annotation']['size']['width'])
    img_h = int(xml_dict['annotation']['size']['height'])

    # Absolute coordinates of the gap. xmltodict returns a list when an image
    # has several <object> entries, so take the first one in that case
    obj = xml_dict['annotation']['object']
    if isinstance(obj, list):
        obj = obj[0]
    bndbox = obj['bndbox']
    xmin = int(bndbox['xmin'])
    ymin = int(bndbox['ymin'])
    xmax = int(bndbox['xmax'])
    ymax = int(bndbox['ymax'])

    # Convert to the normalized YOLO format
    x_center = ((xmin + xmax) / 2) / img_w
    y_center = ((ymin + ymax) / 2) / img_h
    box_w = (xmax - xmin) / img_w
    box_h = (ymax - ymin) / img_h

    return [x_center, y_center, box_w, box_h]

# Convert every XML file in the annotation directory
for filename in os.listdir(XML_DIR):
    if not filename.endswith('.xml'):
        continue
    yolo_data = parse_single_xml(os.path.join(XML_DIR, filename))
    with open(os.path.join(LABEL_DIR, filename.replace('.xml', '.txt')), 'w', encoding='utf-8') as f:
        f.write(f'0 {" ".join(map(str, yolo_data))}')
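A quick way to sanity-check the generated label files is to convert one line back to pixel coordinates and compare it with the box drawn in labelImg. The sketch below assumes the "class x_center y_center width height" layout written by the script above; `yolo_to_pixels` and the sample image size are illustrative, not part of the repository.

```python
def yolo_to_pixels(label_line, img_w, img_h):
    """Convert 'class x_center y_center width height' back to (xmin, ymin, xmax, ymax)."""
    _, xc, yc, w, h = label_line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return (round(xc - w / 2), round(yc - h / 2), round(xc + w / 2), round(yc + h / 2))

# A 260x160 image with a gap annotated at (150, 50)-(200, 100):
# x_center = 175/260, y_center = 75/160, width = 50/260, height = 50/160
line = f"0 {175 / 260} {75 / 160} {50 / 260} {50 / 160}"
print(yolo_to_pixels(line, 260, 160))  # (150, 50, 200, 100)
```

If the round trip does not reproduce the original box, the usual culprits are swapped width/height or an XML file whose recorded image size does not match the actual image.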

5. Model training

5.1 Download pre-trained weights

Training YOLOv3 from scratch takes a very large dataset and a great deal of GPU time. Fortunately, we can start from weights pre-trained on the COCO dataset (the model has already learned the basic features of common objects) and fine-tune them on our own CAPTCHA data. Good results typically appear after anywhere from tens of minutes to a few hours.

The repository provides a download script: on Linux/Mac, run `bash scripts/download_pretrained.sh`; Windows users can download the weights from the official YOLOv3 weights download page and place them in the `weights` directory.

5.2 Start fine-tuning training

Before training, confirm that the number of classes (nc), image path, label path, and class name in `data/captcha.yaml` are all correct (by default there is one class and the paths are already configured).

Then run the fine-tuning script directly:

bash scripts/train.sh

💡 Training parameters (adjustable in `scripts/train.sh`):

  • --img-size: input image size (default 640; reduce it to 416 if GPU memory is insufficient).
  • --batch-size: number of images per training step (default 8; the more GPU memory you have, the larger this can be).
  • --epochs: number of training epochs (default 100; you can stop early once the validation loss stops decreasing).
  • --data: path to the `captcha.yaml` file confirmed above.
  • --weights: path to the pre-trained weights.
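The "stop early once the validation loss stops decreasing" rule can be written down as a small helper. This is a generic sketch, not something `scripts/train.sh` is known to provide; the class name `EarlyStopping`, the `patience` value, and the sample losses are all assumptions for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [0.30, 0.20, 0.15, 0.16, 0.15, 0.17, 0.14]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 5
```

In this run the loss bottoms out at 0.15 in epoch 2, the next three epochs bring no improvement, and training stops at epoch 5 without ever seeing the final 0.14. A larger `patience` trades extra training time for a lower risk of stopping too soon.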

5.3 Use TensorBoard to view training results

During training, the script automatically saves metrics such as loss and accuracy to the `logs` directory, which TensorBoard can visualize conveniently:

tensorboard --logdir=logs

Open http://localhost:6006 in a browser and focus on two curves:

  • val/box_loss: bounding-box loss on the validation set (lower is better; a value of roughly 0.05-0.1 is generally sufficient).
  • val/mAP_0.5: mean average precision on the validation set at an IoU threshold of 0.5 (higher is better; for single-object detection, 0.9 or above indicates a stable model).
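To see what "IoU = 0.5" means in practice, the sketch below computes the Intersection over Union of two boxes and the fraction of images whose prediction clears the threshold. For a task with exactly one box per image this hit rate is a simplified stand-in for mAP, not the full precision-recall computation the training script reports; the function names and sample boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hit_rate(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box overlaps the label with IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [(150, 50, 200, 100), (100, 40, 150, 90)]
truth = [(155, 50, 205, 100), (150, 40, 200, 90)]
print(round(iou(preds[0], truth[0]), 3), hit_rate(preds, truth))  # 0.818 0.5
```

The first prediction is off by 5 pixels and still scores an IoU of about 0.82; the second merely touches the label's edge and scores 0, so it counts as a miss.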

6. Model testing

6.1 Prepare test data

Put CAPTCHA images that were not used in training (for example, the last 20% of the collected images) into the `data/captcha/test` directory.
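One way to choose the held-out images is a reproducible shuffle, which avoids any ordering bias from the collection run. Note that this sketch uses a random split rather than literally "the last 20%"; the function name `split_dataset`, the seed, and the file names are illustrative assumptions.

```python
import random

def split_dataset(names, test_ratio=0.2, seed=42):
    """Shuffle file names reproducibly and split them into (train, test) lists."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_test = max(1, int(len(names) * test_ratio))
    return names[n_test:], names[:n_test]

all_images = [f"captcha_{i}.png" for i in range(100)]
train, test = split_dataset(all_images)
print(len(train), len(test))  # 80 20
```

The files in `test` can then be moved to `data/captcha/test`, keeping them out of the training images so that the test measures generalization rather than memorization.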

6.2 Run detection script

The best model saved during training is written to `checkpoints/best.pt`. Test it directly with the repository's detection script:

bash scripts/detect.sh

The detection results (images with the box drawn, plus txt files containing the gap coordinates) are written automatically to the `data/captcha/result` directory.

7. Model deployment: Make a usable API

A detection script alone is not enough. In a real crawler or automated test, we need an HTTP API that accepts an image and returns the x coordinate of the gap's top-left corner (the slide distance depends mainly on x). Here we implement one quickly with the lightweight FastAPI framework:

from fastapi import FastAPI, UploadFile, File, HTTPException
from PIL import Image
import io
import torch
from models.yolo import Model
from utils.general import non_max_suppression, scale_coords
from utils.augmentations import letterbox
import numpy as np

app = FastAPI(title="Slider CAPTCHA Gap Recognition API", version="1.0")

# ------- Load the model once at startup instead of on every request -------
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 1. Build the model structure
model = Model(cfg='models/yolov3.yaml', ch=3, nc=1).to(DEVICE)
# 2. Load the best weights (assumes the checkpoint stores a state dict under
#    'model'; adapt this line if it stores the full model object instead)
checkpoint = torch.load('checkpoints/best.pt', map_location=DEVICE)
model.load_state_dict(checkpoint['model'])
# 3. Switch to evaluation mode (disables dropout / batch-norm training behavior)
model.eval()

def preprocess(image: Image.Image, img_size: int = 640):
    """Standard YOLOv3 preprocessing: letterbox resize, tensor conversion, normalization."""
    # 1. Convert to RGB (the upload may be grayscale or RGBA) and to a numpy
    #    array, since letterbox works on (H, W, C) arrays, not PIL images
    img = np.array(image.convert('RGB'))
    orig_shape = img.shape[:2]  # (height, width), the order scale_coords expects
    # 2. Letterbox: resize while keeping the aspect ratio, then pad to img_size
    img = letterbox(img, new_shape=img_size, auto=False)[0]
    # 3. (H, W, C) -> (C, H, W), then to a torch tensor
    img = np.ascontiguousarray(img.transpose(2, 0, 1))
    img = torch.from_numpy(img).float().to(DEVICE)
    # 4. Normalize to 0-1
    img /= 255.0
    # 5. Add the batch dimension (the model expects batched input)
    if img.ndimension() == 3:
        img = img.unsqueeze(0)
    return img, orig_shape

@app.post("/predict", summary="Locate the gap in a slider CAPTCHA")
async def predict(file: UploadFile = File(...)):
    # 1. Read the uploaded image
    try:
        image_bytes = await file.read()
        image = Image.open(io.BytesIO(image_bytes))
    except Exception:
        raise HTTPException(status_code=400, detail="The uploaded file is not a valid image")

    # 2. Preprocess
    img_tensor, original_shape = preprocess(image)

    # 3. Predict (no gradient computation, which speeds up inference)
    with torch.no_grad():
        pred = model(img_tensor)[0]  # first output holds the detections
        # 4. Non-maximum suppression (NMS): drop overlapping boxes, keep the most confident
        pred = non_max_suppression(pred, conf_thres=0.5, iou_thres=0.5)[0]

    # 5. Post-process: map box coordinates from the letterboxed input back to the original image
    if pred is not None and len(pred):
        pred = scale_coords(img_tensor.shape[2:], pred[:, :4], original_shape).round()
        # Take the first (and normally only) box
        x1, y1, x2, y2 = pred[0].tolist()
        return {
            "success": True,
            "data": {
                "x1": int(x1),  # x of the gap's top-left corner; the slide distance is derived from it
                "y1": int(y1),
                "x2": int(x2),
                "y2": int(y2),
                "width": int(x2 - x1),
                "height": int(y2 - y1)
            }
        }
    else:
        raise HTTPException(status_code=404, detail="No gap detected")
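Step 5 of the handler above is the part that most often goes wrong, so here is the underlying arithmetic in plain Python. It mirrors the idea behind letterbox scaling (uniform scale plus centered padding), not the repository's exact implementation; both function names and the sample sizes are illustrative assumptions.

```python
def letterbox_params(orig_w, orig_h, target=640):
    """Scale factor and per-side padding that fit an image into a target square."""
    scale = min(target / orig_w, target / orig_h)
    new_w, new_h = round(orig_w * scale), round(orig_h * scale)
    pad_x = (target - new_w) / 2
    pad_y = (target - new_h) / 2
    return scale, pad_x, pad_y

def unletterbox_box(box, orig_w, orig_h, target=640):
    """Map (x1, y1, x2, y2) from letterboxed coordinates back to the original image."""
    scale, pad_x, pad_y = letterbox_params(orig_w, orig_h, target)
    x1, y1, x2, y2 = box
    restored = ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
                (x2 - pad_x) / scale, (y2 - pad_y) / scale)
    # Clip to the image bounds and round to whole pixels.
    limits = (orig_w, orig_h, orig_w, orig_h)
    return tuple(round(min(max(v, 0), lim)) for v, lim in zip(restored, limits))

# A 320x160 captcha is scaled by 2 to 640x320 and padded by 160 px top and bottom.
print(letterbox_params(320, 160))                       # (2.0, 0.0, 160.0)
print(unletterbox_box((300, 260, 400, 360), 320, 160))  # (150, 50, 200, 100)
```

Subtracting the padding before dividing by the scale is what matters: skipping the padding term shifts every box by a constant offset, which for a slider CAPTCHA translates directly into a wrong drag distance.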

Save the code above as deploy.py and start the API with uvicorn, the ASGI server recommended in the FastAPI documentation:

uvicorn deploy:app --reload --host 0.0.0.0 --port 8000

After startup, visit http://localhost:8000/docs to open FastAPI's built-in interactive documentation, where you can upload an image and test the endpoint directly.
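The `x1` value returned by the API is measured in pixels of the uploaded image, which may differ from the size at which the CAPTCHA is rendered on the page (for example on high-DPI screens). The sketch below shows one way to turn it into a drag distance; the function name, the parameters, and especially the `slider_start` offset are hypothetical and must be measured on the actual page.

```python
def slide_distance(gap_x1, image_width, rendered_width, slider_start=0):
    """Convert the detected gap x (image pixels) into an on-page drag distance.

    gap_x1         -- left edge of the gap in the uploaded image, in pixels
    image_width    -- width of the uploaded image, in pixels
    rendered_width -- width of the captcha as displayed on the page, in CSS pixels
    slider_start   -- initial x offset of the puzzle piece on the page (hypothetical)
    """
    ratio = rendered_width / image_width
    return round(gap_x1 * ratio - slider_start)

# A 520 px wide screenshot of a captcha rendered at 260 CSS pixels:
print(slide_distance(gap_x1=300, image_width=520, rendered_width=260, slider_start=6))  # 144
```

If the screenshot and the rendered element have the same width, the ratio is 1 and the distance is simply `x1` minus the slider's starting offset.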

8. Optimization suggestions

If the model does not perform as expected on your target site, try the following:

  1. Data augmentation: enable or add augmentations in `data/captcha.yaml` (random flipping, cropping, brightness/contrast adjustment), or write your own script to generate more varied samples.
  2. Switch to a newer YOLO version such as YOLOv5 or YOLOv8. They generally offer better accuracy and speed than YOLOv3, and their codebases are friendlier and easier to deploy.
  3. Deployment acceleration: convert the model to ONNX or TensorRT format; depending on the hardware, this can speed up inference several-fold or more.
  4. Active learning: find samples where the model's prediction confidence is low (for example, between 0.5 and 0.7), label them manually, and add them to the training set for retraining. The model improves the more it is used.
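The active-learning step boils down to filtering predictions by confidence. The sketch below assumes you have logged the best detection confidence per image somewhere (the dictionary here is made-up sample data); `select_for_relabeling` is a hypothetical helper, not part of the repository.

```python
def select_for_relabeling(detections, low=0.5, high=0.7):
    """Return image names whose best detection confidence lies in [low, high)."""
    return sorted(name for name, conf in detections.items() if low <= conf < high)

# Hypothetical log of image name -> highest detection confidence
detections = {
    "captcha_3.png": 0.97,
    "captcha_7.png": 0.62,
    "captcha_9.png": 0.55,
    "captcha_12.png": 0.41,
}
print(select_for_relabeling(detections))  # ['captcha_7.png', 'captcha_9.png']
```

Images that produced no detection at all (such as `captcha_12.png` here, which falls below the API's 0.5 threshold) are also worth labeling, since they represent cases the model currently misses entirely.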

9. Summary

This article used YOLOv3 to walk through the complete slider CAPTCHA gap recognition pipeline: data collection, annotation, model training, and deployment, ending with an API that can be called directly. Compared with traditional image processing, the deep learning approach generalizes much better: when the target site changes, you only need to collect and label a small amount of new data and fine-tune to adapt quickly.

The complete code is available in the original repository: DeepLearningSlideCaptcha2