Recognizing Graphic Verification Codes with OCR
From basic preprocessing to an automated PaddleOCR / Selenium 4 workflow (2024 edition)
1. Overview of verification code technology
As network security awareness grows, websites are iterating on their anti-crawler systems faster and faster. The verification code (CAPTCHA) is the first line of defense for telling humans from machines, and its technical form has long since moved beyond plain character combinations:
- First-generation basic type: digits and letters only, mixed upper and lower case, no obvious interference
- Lightweight enhanced type: adds tilt, rotation, interference dots and lines, and random backgrounds
- Chinese localized type: uses common or rare Chinese characters, sometimes with idioms and semantic interference
- Complex behavioral type: for example, 12306's click-the-similar-objects challenge, or clicking characters to complete a line of poetry
- AI-assisted interactive type: slider puzzles (including gap detection) and trajectory-based verification
This tutorial focuses on general recognition solutions for the first three types, which are purely static graphic verification codes. Interactive verification codes such as slider and click challenges will be covered separately.
2. Basics of graphic verification code recognition technology
2.1 Choosing an OCR solution
The core of OCR (Optical Character Recognition) is converting text that appears in an image into editable text. For a lightweight crawler in 2024 there is no need to train a model from scratch. Candidate solutions can be compared by ease of use, accuracy, and cost. This tutorial works through Tesseract (section 3.1) and PaddleOCR (section 3.3), and section 6 mentions commercial APIs as a fallback for harder cases.
2.2 Preparation of lightweight development environment
Python 3.9+ is currently the most stable base for combining crawler and OCR libraries. Install only the dependencies you need:
Common dependencies (required)
Option 1: dependencies for the Tesseract walkthrough
Option 2: dependencies for the PaddleOCR walkthrough (recommended)
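A quick way to see which of these groups still needs installing is to probe for the modules from Python. The sketch below is only an illustration: the grouping mirrors the three headings above, and the import names and pip package names (opencv-python, numpy, pillow, selenium, pytesseract, paddlepaddle, paddleocr) are assumptions about what each option needs, not an authoritative list.

```python
import importlib.util

# Import name -> pip package name. This mapping is an assumption made for
# illustration; adjust it to the libraries you actually use.
COMMON = {"cv2": "opencv-python", "numpy": "numpy", "PIL": "pillow", "selenium": "selenium"}
TESSERACT = {"pytesseract": "pytesseract"}  # the Tesseract binary must be installed separately
PADDLE = {"paddle": "paddlepaddle", "paddleocr": "paddleocr"}


def missing_packages(modules):
    """Return the pip names of modules that cannot be found by this interpreter."""
    return [pip_name for module, pip_name in modules.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    groups = (("common", COMMON), ("tesseract", TESSERACT), ("paddleocr", PADDLE))
    for label, group in groups:
        missing = missing_packages(group)
        if missing:
            print(f"{label}: pip install {' '.join(missing)}")
        else:
            print(f"{label}: ok")
```

Running the script prints one line per group, either "ok" or a pip command covering only the packages that are absent.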
3. The full verification code recognition process in practice
3.1 Warm-up: recognizing the basic type directly with Tesseract
The basic type has almost no interference, so it can be recognized without preprocessing. The trade-off is low tolerance for tilt, color variation, and similar distortions.
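A minimal sketch of this direct approach, using pytesseract, might look like the following. The function names and the file name captcha.png are hypothetical. The config string combines Tesseract's single-line page segmentation mode with its tessedit_char_whitelist variable; the default character set (digits plus ASCII letters) is an assumption that fits the basic type.

```python
import string


def build_tesseract_config(whitelist=string.digits + string.ascii_letters, psm=7):
    """Build a Tesseract config: page segmentation mode plus a character whitelist."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def recognize_basic(image_path):
    """Run Tesseract on an unprocessed CAPTCHA image and return the cleaned text."""
    # Imported lazily so the config helper stays usable without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        raw = pytesseract.image_to_string(img, config=build_tesseract_config())
    # Tesseract output may contain spaces and a trailing newline; keep alphanumerics only.
    return "".join(ch for ch in raw if ch.isalnum())
```

Calling recognize_basic("captcha.png") returns the recognized string, assuming both the Tesseract binary and pytesseract are installed.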
Tip: --psm 7 treats the image as a single line of text, and a character whitelist limits the recognized character set, which effectively reduces misrecognition.
3.2 Key step: OpenCV preprocessing of verification codes
For the lightweight enhanced type, the main interference comes from background noise, interference lines, and colors that blend text into the background. The goal of preprocessing is to separate the text from the background as completely as possible, leaving only clean black-and-white character outlines.
3.3 The practical first choice for 2024: PaddleOCR with preprocessing
PaddleOCR ships with pre-trained Chinese and English models. Combined with preprocessing, it is robust to common interference and reaches an accuracy of over 90% on this kind of verification code.
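The sketch below shows one way to wrap PaddleOCR for this purpose. It assumes the PaddleOCR 2.x Python API, where ocr.ocr(image, cls=False) returns one list per image and each entry has the form [box, (text, confidence)]; newer major versions changed this interface. The function names and the 0.5 confidence cutoff are hypothetical choices.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def get_engine(lang):
    """Create the OCR engine once per language; loading the models is slow."""
    from paddleocr import PaddleOCR  # lazy import, assumes PaddleOCR 2.x
    return PaddleOCR(lang=lang)


def extract_text(ocr_result, min_confidence=0.5):
    """Join fragments from a 2.x-style result, keeping letters, digits and CJK."""
    fragments = []
    for page in ocr_result or []:
        # A page is None when nothing was detected in that image.
        for _box, (text, confidence) in page or []:
            if confidence >= min_confidence:
                fragments.append(text)
    # Drop spaces and punctuation that OCR tends to insert between characters.
    return re.sub(r"[^0-9A-Za-z\u4e00-\u9fff]", "", "".join(fragments))


def recognize_captcha(image, lang="en"):
    """Recognize a CAPTCHA given as a file path or an image array."""
    result = get_engine(lang).ocr(image, cls=False)
    return extract_text(result)
```

In practice the image passed to recognize_captcha would be the preprocessed black-and-white picture rather than the raw download.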
About the lang parameter: "en" prioritizes English recognition; "ch" recognizes mixed Chinese and English and is better suited to Chinese verification codes.
4. Automated login in practice (Selenium 4 + PaddleOCR)
Taking the simulated login page with a verification code on "scrape.center" as an example, this section connects the whole process into one complete automated script.
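A minimal sketch of a single login attempt with Selenium 4 is shown below. The URL and every CSS selector in LoginPage are hypothetical placeholders, not the real values of the example site, and must be replaced after inspecting the target page. The recognize argument is any function that maps an image file path to the recognized text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginPage:
    """Hypothetical URL and CSS selectors; replace them with the real page's values."""
    url: str = "https://example.com/login"
    username: str = "input[name='username']"
    password: str = "input[name='password']"
    captcha_input: str = "input[name='captcha']"
    captcha_image: str = "#captcha"
    submit: str = "button[type='submit']"


def login_once(driver, page, username, password, recognize, shot_path="captcha.png"):
    """Open the page, recognize the CAPTCHA, fill the form and submit once.

    Returns the recognized code so the caller can log it or retry.
    """
    # Imported lazily so LoginPage can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(page.url)
    # Wait until the CAPTCHA image is rendered before capturing it.
    image = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, page.captcha_image))
    )
    image.screenshot(shot_path)  # captures only this element, no cropping needed
    code = recognize(shot_path)

    driver.find_element(By.CSS_SELECTOR, page.username).send_keys(username)
    driver.find_element(By.CSS_SELECTOR, page.password).send_keys(password)
    driver.find_element(By.CSS_SELECTOR, page.captcha_input).send_keys(code)
    driver.find_element(By.CSS_SELECTOR, page.submit).click()
    return code
```

To run it, create a driver with webdriver.Chrome(), call login_once with your own credentials and recognizer, and call driver.quit() in a finally block so the browser always closes.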
Key tip: use element.screenshot() to capture the verification code element directly instead of cropping a full-page screenshot. This improves stability.
5. Lightweight techniques for coping with anti-crawling (2024 update)
Recognizing the verification code is only the first step. The following strategies also help lower the chance of triggering risk control:
- Random delay: add a random wait with random.uniform(a, b) before and after every operation, so that actions are not executed at millisecond intervals.
- IP rotation: use a proxy pool when requests are frequent (paid proxies are recommended; free ones have very low availability).
- Retry on verification code errors: PaddleOCR is not 100% accurate, so allow 3 to 5 recognize-and-submit cycles. After each failure, refresh the verification code and try again.
- Control request frequency: when crawling in batches, keep the rate within 10 to 20 requests per minute to avoid triggering risk control.
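The four strategies above can be sketched with the standard library alone. All helper names here are hypothetical, and the defaults (0.5 to 2 seconds of delay, 3 tries, 15 requests per minute) are assumptions picked from within the ranges given above. The attempt and refresh arguments are callables you supply: attempt performs one recognize-and-submit cycle and returns True on success, and refresh reloads the verification code.

```python
import itertools
import random
import time


def human_pause(low=0.5, high=2.0, sleep=time.sleep):
    """Sleep for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def proxy_cycle(proxies):
    """Rotate endlessly through a list of proxy addresses."""
    return itertools.cycle(proxies)


class Throttle:
    """Space calls evenly so they stay at or below max_per_minute."""

    def __init__(self, max_per_minute=15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def submit_with_retry(attempt, refresh, max_tries=3, pause=human_pause):
    """Try up to max_tries times; return the successful try number or None."""
    for n in range(1, max_tries + 1):
        pause()  # random delay before every operation
        if attempt():
            return n
        if n < max_tries:
            refresh()  # get a new verification code before the next try
    return None
```

With Selenium and Chrome, the next address from proxy_cycle can be passed to the browser through the --proxy-server command-line argument when the driver is created, and Throttle.wait() can be called before each page request in a batch.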
6. Summary and Outlook
This tutorial walked through the full process a lightweight crawler can use to recognize static graphic verification codes in 2024:
- OpenCV preprocessing separates text from noise
- A pre-trained PaddleOCR model performs high-accuracy recognition
- Selenium 4 automates the login
This combination is enough to handle the lightweight enhanced and Chinese verification codes on most websites.
For complex verification codes, such as heavily distorted ones or ones densely packed with rare characters, you can try:
- Calling commercial APIs such as Baidu's or Tencent's for higher accuracy
- Collecting verification code samples from the target site and training a dedicated model on that small dataset