title: Recognizing click CAPTCHAs with deep learning
description: As awareness of network security has grown, websites have adopted increasingly complex anti-crawler measures, and CAPTCHAs are among the most common. CAPTCHA technology has undergone the following evolutions:

🎯 Deep learning in action: a full-process tutorial on click CAPTCHA recognition

Project address: https://github.com/zxinED/Text_select_captcha

The difficulty with click CAPTCHAs (text-selection CAPTCHAs) is that the characters carry no fixed semantics, and the font, position, and background change constantly. Running ordinary OCR (such as PaddleOCR) directly on the background image often yields accuracy below 70%, and even slight interference can make it fail completely.

This open source project takes a clever approach: it uses divide and conquer to split the complex problem into two independent, classic computer vision subtasks, bypassing OCR entirely, and ultimately reaches a stable accuracy of more than 96%.


1. Core principle: No semantic recognition, only "positioning + matching"

Why not OCR directly?

The logic of ordinary OCR is "the model must have seen a character before it can recognize it." In a click CAPTCHA, however, the characters may be stretched, rotated, bolded, or otherwise deformed, and they sit on top of interference lines and complex backgrounds. When OCR meets an uncommon character, or a familiar character drawn in a different handwriting style, the recognition result is often simply wrong.

Core steps of the divide and conquer method

  1. Object detection: regardless of which characters appear in the image, first find all the clickable character boxes in the background image and the prompt character boxes in the top/bottom prompt bar, obtaining precise coordinates and a cropped thumbnail for each.

  2. Image matching: use a Siamese (twin) network to score the similarity between each character crop from the prompt bar and each character crop from the background image. The higher the similarity, the more likely the two are the same character. Finally, output the matching background coordinates in the order given by the prompt.
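
The two steps above can be sketched as follows. This is a minimal illustration and not the project's actual code: the `Box` type, the `order_clicks` function, and the precomputed similarity matrix `sim` are hypothetical, and the sketch assumes the prompt characters are read left to right and uses a simple greedy assignment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def order_clicks(targets: List[Box], chars: List[Box],
                 sim: List[List[float]]) -> List[Tuple[float, float]]:
    """Return click points in prompt order.

    sim[i][j] is the similarity between prompt crop i and background crop j,
    as produced by some matching model (assumed to be given here).
    """
    # Visit prompt boxes in reading order (assumed left to right).
    order = sorted(range(len(targets)), key=lambda i: targets[i].x1)
    used = set()
    clicks = []
    for i in order:
        # Greedily pick the most similar background character not yet taken.
        candidates = [j for j in range(len(chars)) if j not in used]
        if not candidates:
            break
        best = max(candidates, key=lambda j: sim[i][j])
        used.add(best)
        clicks.append(chars[best].center())
    return clicks
```

Greedy matching is the simplest choice; if two prompt characters compete for the same background box, a global assignment over the whole matrix would be more robust.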


2. Environment setup: minimal configuration, runs on CPU

The project supports Python 3.6 / 3.8 / 3.10. It does not need a high-end GPU and runs stably on an ordinary CPU.

Quick Start Steps

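# 1. Clone or download the project locally
git clone https://github.com/zxinED/Text_select_captcha.git
cd Text_select_captcha

# 2. Install the dependencies in one step
pip install -r requirements.txt

# 3. Verify the basic environment with the bundled test image
python demo.py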

If you want to test your own images, put them in the project directory and change the image path in demo.py.


3. Object detection: locating characters with YOLOv5s6

Task breakdown: turn "finding the characters" into a standard two-class object detection task.

Why choose YOLOv5s6?

  • Lightweight: the model is small and inference is fast. Even a low-spec server with 1 CPU core and 2 GB of RAM can detect a single image within 300 ms;
  • Easy to use: mature labeling tools exist (LabelImg, Label Studio), and official pretrained weights are provided;
  • Balanced accuracy and speed: it is more accurate than the tiny version and faster than the x version, which suits CAPTCHA scenarios with many small targets per image and strict real-time requirements.

Labeling and training specifications (advanced user reference)

If you need to train the model yourself, you only need to distinguish 2 categories when labeling:

  • char: every clickable character region in the background image (it is recommended to follow the same annotation convention as the pretrained model shipped with the project and to label all valid candidate boxes);
  • target: the character blocks in the top/bottom prompt bar that make up the question sequence.
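
For reference, YOLOv5 reads one text file per image, with one row per box in the form `class x_center y_center width height`, all normalized to the 0 to 1 range. The sketch below converts a pixel box into such a row. The class index order (`char` = 0, `target` = 1) and the function name `to_yolo_line` are assumptions for illustration; check the dataset configuration of the model you are training against.

```python
from typing import Tuple

# Assumed class order; it must match the `names` list in your dataset YAML.
CLASS_IDS = {"char": 0, "target": 1}


def to_yolo_line(label: str, box: Tuple[float, float, float, float],
                 img_w: int, img_h: int) -> str:
    """Convert a pixel box (x1, y1, x2, y2) into one YOLO label row."""
    x1, y1, x2, y2 = box
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{CLASS_IDS[label]} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 40x40 character box in a 400x200 background image.
print(to_yolo_line("char", (100, 50, 140, 90), 400, 200))
```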

After training, it is recommended to export the model to ONNX or to the encrypted .bin format, which eases cross-platform deployment and makes it harder to copy the model directly.


4. Image matching: comparing characters with a Siamese network

Task breakdown: turn "deciding whether two crops show the same character" into "computing the similarity of two feature vectors" (for example cosine similarity; you do not need the mathematical details, only the logic).
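
As a concrete picture of that similarity step, here is a minimal pure-Python sketch of cosine similarity between two feature vectors. The function name and the short example vectors are illustrative only; in practice the vectors would come from the matching model and be much longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because only the direction of the vectors matters, two crops of the same character can score highly even when their overall brightness or contrast differs.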

Why use a Siamese (twin) network?

An ordinary classification model has to treat every possible Chinese character as a class label. For 6,000 commonly used characters, for example, the classification layer needs 6,000 outputs, and training it well requires a large number of samples per character. As soon as a rare character outside the label set appears, the model cannot handle it.

The advantage of the Siamese network is that it is few-shot friendly and generalizes well:

  • Training logic: the input is two images, and the model directly outputs a binary result (0 / 1) for "are these the same character";
  • No need for massive data: there is no need to prepare thousands of samples for each character, as long as there are enough "same character / different character" image pairs;
  • Robust to changing conditions: whether the text is handwritten, printed, rotated, or overlaid with interference, the model learns to compare the essential shape features of the characters, so it generalizes well.
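
To make the "image pairs" idea concrete, the sketch below builds labeled training pairs from character crops grouped by character. It is an assumption about how such a dataset could be prepared, not the project's own data pipeline; `build_pairs` and the file names are hypothetical.

```python
import itertools
import random
from typing import Dict, List, Tuple

# (image_a, image_b, label), where label 1 means "same character".
Pair = Tuple[str, str, int]


def build_pairs(crops_by_char: Dict[str, List[str]],
                negatives_per_positive: int = 1,
                seed: int = 0) -> List[Pair]:
    """Build same/different pairs from crops grouped by character."""
    rng = random.Random(seed)
    chars = sorted(crops_by_char)
    pairs: List[Pair] = []
    for ch in chars:
        # Characters other than `ch` that have at least one crop.
        others = [c for c in chars if c != ch and crops_by_char[c]]
        for a, b in itertools.combinations(crops_by_char[ch], 2):
            pairs.append((a, b, 1))  # positive pair: same character
            for _ in range(negatives_per_positive):
                if not others:
                    break
                other = rng.choice(others)
                # Negative pair: one crop of `ch`, one of a different character.
                pairs.append((a, rng.choice(crops_by_char[other]), 0))
    rng.shuffle(pairs)
    return pairs


pairs = build_pairs({"天": ["t1.png", "t2.png", "t3.png"],
                     "地": ["d1.png", "d2.png"]})
print(len(pairs))
```

Keeping positives and negatives roughly balanced, as `negatives_per_positive=1` does here, is a common starting point; the right ratio depends on the data.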

5. Deployment and invocation: native Python and a FastAPI service

The project provides two ways to call it: integrate it directly into your own crawler script, or run it as a standalone API service that a web platform can call.

1. Quick integration in native Python

Suitable when you are writing the crawler script directly in Python:

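from src.captcha import TextSelectCaptcha

# Initialize the models
# per_path: path to the Siamese network model (ONNX or encrypted .bin)
# yolo_path: path to the YOLOv5 detection model (same formats)
# sign=True: use the encrypted .bin models; sign=False: use custom ONNX models
cap = TextSelectCaptcha(
    per_path='pre_model_v2.bin',
    yolo_path='best_v2.bin',
    sign=True
)

# Run recognition (the input can be a local image path or a PIL.Image object)
result = cap.run("docs/res.jpg")

# Output: click coordinates in prompt order, in the form [(x1, y1), (x2, y2), ...]
print(f"Recommended click order and coordinates: {result}")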

2. FastAPI service deployment

Suitable for scenarios that require cross-language calls (Java/Go crawlers, front-end display, etc.):

# Start the FastAPI service (default port 8000)
python service.py

After the service starts, visit http://127.0.0.1:8000/docs to see the automatically generated interactive API documentation, where you can test the service online by uploading an image directly.


6. Advanced: getting past risk control with human-like click simulation

After obtaining the coordinates, never click directly on the center point. Most CAPTCHA risk control systems inspect mouse trajectories and click behavior:

  1. Randomly offset the click position: within the recognized rectangle (leaving a safety margin of 1~2 pixels on each side is recommended), generate a random point as the actual click position.

  2. Move the mouse along a Bezier curve: use DrissionPage or Pyppeteer to move smoothly along a curve from the current mouse position to the target point. The speed should imitate a real person: fast at first, then slower, with a little random jitter along the way.

  3. Randomize the click interval: if there are several click points, pick the interval between clicks at random from 0.3~1.2 seconds, so the timing is not mechanically uniform like a robot's.
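
The three points above can be sketched in plain Python as follows. This only generates the click point, the mouse path, and the delays; it does not drive a browser, and feeding the result into DrissionPage or Pyppeteer is left to your own integration code. All function names, the control-point `spread`, and the ease-out timing formula are illustrative assumptions.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def random_point_in_box(box: Tuple[float, float, float, float],
                        margin: float = 2.0, rng=random) -> Point:
    """Pick a random click point inside the box, keeping a safety margin."""
    x1, y1, x2, y2 = box
    if x2 - x1 <= 2 * margin or y2 - y1 <= 2 * margin:
        # Box too small for the margin: fall back to the center.
        return ((x1 + x2) / 2, (y1 + y2) / 2)
    return (rng.uniform(x1 + margin, x2 - margin),
            rng.uniform(y1 + margin, y2 - margin))


def bezier_path(start: Point, end: Point, steps: int = 30,
                spread: float = 80.0, rng=random) -> List[Point]:
    """Cubic Bezier path from start to end with randomized control points."""
    (x0, y0), (x3, y3) = start, end
    # Control points near 1/3 and 2/3 of the way, pushed off the straight line.
    c1 = (x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread),
          y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread))
    c2 = (x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread),
          y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread))
    path = []
    for i in range(steps + 1):
        s = i / steps
        t = 1 - (1 - s) ** 2  # ease-out: big steps first, small steps near the end
        u = 1 - t
        x = u ** 3 * x0 + 3 * u * u * t * c1[0] + 3 * u * t * t * c2[0] + t ** 3 * x3
        y = u ** 3 * y0 + 3 * u * u * t * c1[1] + 3 * u * t * t * c2[1] + t ** 3 * y3
        path.append((x, y))
    return path


def click_delays(n_clicks: int, low: float = 0.3, high: float = 1.2,
                 rng=random) -> List[float]:
    """Random pauses (in seconds) between consecutive clicks."""
    return [rng.uniform(low, high) for _ in range(max(n_clicks - 1, 0))]
```

A caller would pick a point with `random_point_in_box`, replay the points from `bezier_path` as mouse-move events, click, and then sleep for the next value from `click_delays` before moving on.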


💡 Tutorial Summary

The core value of this open source project is that it steps outside the OCR mindset and uses divide and conquer to solve a complex problem. Instead of recognizing specific Chinese characters, it replaces recognition with two independent tasks, precise positioning and similarity matching, which both lowers the training difficulty and greatly improves accuracy and generalization.

If you are a CV beginner, this project is an excellent hands-on case for learning "object detection + Siamese network"; if you are a crawler developer, it can help you quickly handle more than 90% of text-click CAPTCHAs.