title: Recognizing Image CAPTCHAs with OCR
description: As network security awareness grows, websites are adopting increasingly complex anti-crawler measures, and CAPTCHAs are among the most common. CAPTCHA technology has evolved through several stages.

Recognizing Image CAPTCHAs with OCR

From basic preprocessing to a PaddleOCR + Selenium 4 automation workflow (2024 edition)


1. Overview of CAPTCHA technology

As network security awareness grows, websites are iterating on their anti-crawler systems faster and faster. The CAPTCHA is the first line of defense for telling humans from machines, and it has long since moved beyond plain character strings:

  1. First-generation basic type: digits and letters only, mixed upper and lower case, no obvious interference
  2. Lightweight enhanced type: adds tilt, rotation, interference dots and lines, and random backgrounds
  3. Chinese localized type: uses common or rare Chinese characters, sometimes with idioms and semantic distractors
  4. Complex behavioral type: for example, 12306's "click the similar objects" challenge, or clicking characters to complete a line of poetry
  5. AI-assisted interactive type: slider puzzles (including gap detection) and trajectory-based verification

This tutorial focuses on general recognition approaches for the first three types, which are purely static image CAPTCHAs. Interactive CAPTCHAs such as sliders and click challenges will be covered separately.


2. Basics of image CAPTCHA recognition

2.1 Choosing an OCR technology

The core of OCR (Optical Character Recognition) is converting text that appears in an image into editable text. For a lightweight crawler in 2024 there is no need to train a model from scratch. The options below are compared by ease of use, accuracy, and cost:

Solution type | Representative tools/APIs | Applicable scenarios | Status in 2024
--- | --- | --- | ---
Open-source traditional engine | Tesseract 5.x+ | Basic CAPTCHAs with minimal interference | Suitable for practice; accuracy fluctuates widely
Open-source deep learning library | PaddleOCR 2.x+ | Lightweight enhanced and Chinese localized CAPTCHAs | Preferred by users in China; high accuracy
Commercial API | Baidu / Tencent / Alibaba Cloud OCR | Batch processing of low- to medium-difficulty CAPTCHAs | Highest accuracy; free quota available

2.2 Preparing a lightweight development environment

Python 3.9+ is currently the most stable base for combining a crawler with OCR. Install the dependencies you need:

Common dependencies (required)

pip install selenium pillow numpy opencv-python

Option 1: Tesseract dependencies (for practice)

# Python wrapper library
pip install pytesseract

# Install the Tesseract engine (the command differs by operating system)
# Windows (requires Chocolatey)
choco install tesseract
# macOS (requires Homebrew)
brew install tesseract
# Linux (Debian/Ubuntu)
sudo apt install tesseract-ocr

Option 2: PaddleOCR dependencies (recommended for real use)

# CPU build (suitable for personal computers)
pip install paddleocr paddlepaddle

# GPU build (must match your CUDA version; recommended for large batches)
# See: https://www.paddlepaddle.org.cn/install/quick
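
Before moving on, it can help to confirm that the packages above are actually visible to the interpreter. The following is a minimal sketch using only the standard library; the helper names `find_missing` and `tesseract_on_path` are made up for this illustration, and the lists use import names (for example `PIL` and `cv2`), which differ from the pip package names.

```python
import importlib.util
import shutil

# Import names (not pip names) of the packages installed above.
REQUIRED_MODULES = ["selenium", "PIL", "numpy", "cv2"]
OPTIONAL_MODULES = ["pytesseract", "paddleocr", "paddle"]


def find_missing(modules: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def tesseract_on_path() -> bool:
    """Return True if the tesseract executable is visible on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    print("missing required:", find_missing(REQUIRED_MODULES))
    print("missing optional:", find_missing(OPTIONAL_MODULES))
    print("tesseract on PATH:", tesseract_on_path())
```

An empty "missing required" list means the common dependencies are importable. The Tesseract check only matters if you follow Option 1.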

3. Recognizing CAPTCHAs in practice: the full workflow

3.1 Warm-up: recognizing basic CAPTCHAs directly with Tesseract

Basic CAPTCHAs have almost no interference and can be recognized without preprocessing, but Tesseract tolerates tilt, color variation, and similar distortions poorly:

import pytesseract
from PIL import Image

def tesseract_basic(image_path: str) -> str:
    """Recognize a basic CAPTCHA directly with Tesseract."""
    img = Image.open(image_path)
    # Config: treat the image as a single text line and whitelist letters and digits
    config = (
        r"--psm 7 --oem 3 "
        r"-c tessedit_char_whitelist="
        r"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    )
    return pytesseract.image_to_string(img, config=config).strip()

# Test
if __name__ == "__main__":
    print(tesseract_basic("simple_captcha.png"))

Tip: --psm 7 treats the image as a single line of text, and tessedit_char_whitelist restricts the character set Tesseract may output, which effectively reduces misrecognition.
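
Even with a whitelist, the raw OCR string can contain stray spaces or have the wrong length. A small post-processing step can reject such results before they are submitted. The sketch below is illustrative only: `clean_captcha_text` is a hypothetical helper, and the default length of 4 is an assumption that should be changed to match the target site.

```python
import string

# Same character set as the Tesseract whitelist above.
ALLOWED = set(string.ascii_letters + string.digits)


def clean_captcha_text(raw: str, expected_len: int = 4) -> str:
    """Keep only allowed characters; return "" if the length is wrong."""
    text = "".join(ch for ch in raw if ch in ALLOWED)
    return text if len(text) == expected_len else ""


if __name__ == "__main__":
    print(clean_captcha_text(" a8 Kd\n"))  # whitespace is dropped
    print(repr(clean_captcha_text("a8K")))  # too short, so rejected
```

Returning an empty string for an implausible result lets the caller refresh the CAPTCHA instead of spending a login attempt on a guess that cannot be right.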

3.2 Key step: preprocessing CAPTCHAs with OpenCV

The main interference in lightweight enhanced CAPTCHAs comes from background noise, interference lines, and confusing colors. The goal of preprocessing is to separate the text from the background as cleanly as possible, leaving only a black-and-white outline of the characters.

import cv2
import numpy as np

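def opencv_preprocess(image_path: str) -> np.ndarray:
    """General preprocessing pipeline for lightweight enhanced CAPTCHAs."""
    # 1. Read the original image (cv2.imread returns None if the file cannot be read)
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")

    # 2. Convert to grayscale: less computation, stronger light/dark contrast
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 3. Adaptive threshold binarization: copes with gradient backgrounds
    binary = cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV,  # invert so the text becomes white (255)
        11,   # neighborhood size (must be odd; 11 is a common empirical value)
        2     # offset subtracted from the local mean; controls sensitivity
    )

    # 4. Median blur: removes salt-and-pepper noise and broken interference lines
    denoised = cv2.medianBlur(binary, 3)

    # 5. Morphological closing: fills small gaps inside the characters
    kernel = np.ones((2, 2), np.uint8)
    processed = cv2.morphologyEx(denoised, cv2.MORPH_CLOSE, kernel)

    # Save the result when debugging to inspect it
    # cv2.imwrite("processed.png", processed)
    return processed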
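
To see what the neighborhood size and offset in step 3 actually do, here is a simplified pure-Python illustration of inverted adaptive thresholding. It is not OpenCV's implementation: it uses a plain local mean rather than Gaussian weighting, and `adaptive_threshold_inv` is a hypothetical name chosen for this sketch.

```python
def adaptive_threshold_inv(gray, block_size=3, c=2, max_value=255):
    """Mean-based adaptive threshold with inverted output.

    gray is a list of rows of 0-255 integers. A pixel becomes max_value
    when it is no brighter than (local mean - c), otherwise 0.
    """
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates so border pixels reuse the nearest edge value.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            threshold = total / (block_size * block_size) - c
            out[y][x] = max_value if gray[y][x] <= threshold else 0
    return out


# A light background (200) with one dark "ink" pixel (50) in the middle.
image = [[200] * 5 for _ in range(5)]
image[2][2] = 50
result = adaptive_threshold_inv(image)

if __name__ == "__main__":
    for row in result:
        print(row)
```

With the offset set to 2, only the dark pixel turns white. With an offset of 0, flat background regions sit exactly on their own local mean and turn white as well, which is why a small positive offset is normally used.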

3.3 The practical first choice in 2024: PaddleOCR on preprocessed CAPTCHAs

PaddleOCR ships with pre-trained Chinese and English models. Combined with preprocessing, it is robust against common interference, and accuracy can exceed 90%.

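from paddleocr import PaddleOCR
import cv2
import numpy as np

# Global initialization (the lightweight model, about 10 MB, is downloaded on first run).
# Written against the PaddleOCR 2.x API (show_log, cls); later major versions may differ.
ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)

def paddle_ocr_processed(processed_img: np.ndarray) -> str:
    """Recognize a preprocessed CAPTCHA image."""
    result = ocr.ocr(processed_img, cls=True)
    if result and result[0]:
        # result[0][0][1] is (text, confidence) for the first detected line
        text, confidence = result[0][0][1]
        # Optional: drop low-confidence results
        # if confidence < 0.8:
        #     return ""
        return text.strip()
    return ""

def full_paddle_ocr(image_path: str) -> str:
    """Combine preprocessing and recognition."""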
    processed = opencv_preprocess(image_path)
    return paddle_ocr_processed(processed)

if __name__ == "__main__":
    print(full_paddle_ocr("enhanced_captcha.png"))

About the lang parameter: "en" prioritizes English recognition, while "ch" recognizes mixed Chinese and English and is better suited to Chinese CAPTCHAs.
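
The function above takes the first detected line. When noise causes several lines to be detected, one alternative is to pick the line with the highest confidence. The sketch below assumes the PaddleOCR 2.x result layout used in this section, that is, one list per image containing [box, (text, confidence)] entries; `best_line` and the sample data are hypothetical and exist only for illustration.

```python
def best_line(result, min_confidence=0.8):
    """Return the text of the highest-confidence line, or "" if none qualifies."""
    if not result or not result[0]:
        return ""
    lines = [(text, conf) for _box, (text, conf) in result[0]]
    text, conf = max(lines, key=lambda item: item[1])
    return text.strip() if conf >= min_confidence else ""


# Hand-built sample mimicking the assumed result layout (not real OCR output).
box = [[0, 0], [10, 0], [10, 10], [0, 10]]
sample = [[[box, ("ab", 0.55)], [box, ("x7Kp ", 0.93)]]]

if __name__ == "__main__":
    print(best_line(sample))
```

An empty return value can be treated the same way as a failed recognition: refresh the CAPTCHA and try again rather than submitting a doubtful guess.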


4. Automated login in practice (Selenium 4 + PaddleOCR)

Using the CAPTCHA login practice site on scrape.center as an example, the steps above are chained into one complete automation script.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
import cv2
import numpy as np
from paddleocr import PaddleOCR

# ---------- Reuse the preprocessing and OCR functions from above ----------
ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)

def opencv_preprocess(image_path: str) -> np.ndarray:
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 11, 2
    )
    denoised = cv2.medianBlur(binary, 3)
    kernel = np.ones((2, 2), np.uint8)
    return cv2.morphologyEx(denoised, cv2.MORPH_CLOSE, kernel)

def paddle_ocr_processed(processed_img: np.ndarray) -> str:
    result = ocr.ocr(processed_img, cls=True)
    if result and result[0]:
        text, _ = result[0][0][1]
        return text.strip()
    return ""

def full_paddle_ocr(image_path: str) -> str:
    return paddle_ocr_processed(opencv_preprocess(image_path))

# ---------- Automated login ----------
def auto_login_scrape_center():
    # 1. Start the browser and mask basic automation fingerprints
    options = webdriver.ChromeOptions()
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
    )
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    service = webdriver.ChromeService()
    driver = webdriver.Chrome(service=service, options=options)

    try:
        # 2. Open the target site
        driver.get("https://captcha7.scrape.center/")
        time.sleep(random.uniform(1, 2))

        # 3. Enter the username and password
        username = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "username"))
        )
        username.send_keys("admin")
        time.sleep(random.uniform(0.5, 1))

        password = driver.find_element(By.NAME, "password")
        password.send_keys("admin")
        time.sleep(random.uniform(0.5, 1))

        # 4. Screenshot the CAPTCHA image and recognize it
        captcha_img = driver.find_element(By.CSS_SELECTOR, ".captcha img")
        captcha_img.screenshot("current_captcha.png")  # element-level screenshot
        code = full_paddle_ocr("current_captcha.png")
        print(f"Recognized CAPTCHA: {code}")

        # 5. Fill in the CAPTCHA and submit
        captcha_input = driver.find_element(By.NAME, "captcha")
        captcha_input.send_keys(code)
        time.sleep(random.uniform(0.5, 1))

        submit_btn = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
        submit_btn.click()

        # 6. Verify the login result by waiting for the success text "登录成功" ("login succeeded")
        WebDriverWait(driver, 10).until(
            EC.text_to_be_present_in_element((By.TAG_NAME, "h2"), "登录成功")
        )
        print("Login succeeded!")

    finally:
        time.sleep(2)
        driver.quit()

if __name__ == "__main__":
    auto_login_scrape_center()

Key tip: use element.screenshot() to capture the CAPTCHA element directly. This avoids cropping a full-page screenshot and is more stable.


5. Lightweight techniques for avoiding anti-crawler detection (2024 update)

Recognizing the CAPTCHA is only the first step. Combine it with the following strategies to lower the chance of triggering risk control:

  1. Random delays: add a random wait with random.uniform(a, b) before and after every operation so that actions do not fire at millisecond intervals.

  2. IP rotation: use a proxy pool when requests are frequent (paid proxies are recommended; free ones are rarely usable):

    options.add_argument('--proxy-server=http://your_proxy_ip:port')

  3. Retry on CAPTCHA errors: PaddleOCR is not 100% accurate, so wrap recognition and submission in a loop of 3 to 5 attempts. After a failure, refresh the CAPTCHA and try again.

  4. Request rate control: when crawling in batches, keep the rate to 10 to 20 requests per minute to avoid triggering risk control.
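
The retry and random-delay strategies can be combined in one small loop. The following is a sketch under stated assumptions rather than a definitive implementation: `login_with_retry` is a hypothetical helper, and the three callables (`recognize`, `submit`, `refresh`) are placeholders for your own Selenium and OCR code. The `sleep` parameter is injectable only so the loop can be exercised without real waiting.

```python
import random
import time
from typing import Callable


def login_with_retry(
    recognize: Callable[[], str],
    submit: Callable[[str], bool],
    refresh: Callable[[], None],
    max_attempts: int = 3,
    delay_range: tuple[float, float] = (0.5, 1.0),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run recognize -> submit up to max_attempts times, refreshing on failure."""
    for attempt in range(1, max_attempts + 1):
        code = recognize()
        # Skip submission when OCR returned nothing; an empty guess wastes a request.
        if code and submit(code):
            return True
        if attempt < max_attempts:
            refresh()
            # Random wait so retries do not arrive at a fixed rhythm.
            sleep(random.uniform(*delay_range))
    return False


if __name__ == "__main__":
    # Stand-in callables that simulate two failed recognitions, then a success.
    guesses = iter(["", "bad1", "ok42"])
    ok = login_with_retry(
        recognize=lambda: next(guesses),
        submit=lambda code: code == "ok42",
        refresh=lambda: None,
        sleep=lambda seconds: None,
    )
    print(ok)
```

For batch crawling, passing delay_range=(3.0, 6.0) spaces attempts 3 to 6 seconds apart, which corresponds to the 10 to 20 requests per minute suggested above.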


6. Summary and outlook

This tutorial walked through the full workflow a lightweight crawler can use in 2024 to recognize static image CAPTCHAs:

  • OpenCV preprocessing separates text from noise
  • A pre-trained PaddleOCR model performs high-accuracy recognition
  • Selenium 4 automates the login

This combination is enough to handle the lightweight enhanced and Chinese CAPTCHAs used by most websites.

For harder CAPTCHAs, such as heavily distorted images or dense rare characters, you can try the following:

  • Call a commercial API such as Baidu or Tencent for higher accuracy
  • Collect CAPTCHA samples from the target site and train a dedicated model on that small dataset
