title: Detailed explanation of CRNN: end-to-end variable-length text recognition model | Daoman PythonAI
description: In-depth analysis of the CRNN (Convolutional Recurrent Neural Network) model and introduction to its application in the OCR field, including detailed architecture analysis, PyTorch implementation and practical application scenarios.
keywords: [CRNN, OCR, optical character recognition, text recognition, deep learning, computer vision, PyTorch, sequence recognition]
Detailed explanation of CRNN: End-to-end variable-length text recognition model
When you scan a courier waybill with your phone to auto-fill an address, when a parking-lot camera reads a license plate in under a second, or when a PDF-to-Word converter extracts plain text, there is very likely an efficient variable-length text recognition engine working behind the scenes. CRNN is the pioneer of this class of engine and one of the cornerstones of industrial OCR.
Introduction
In the early days of optical character recognition (OCR), the mainstream approach was to segment individual characters first and then classify each one. This approach had serious flaws:
- It relies on complex character segmentation algorithms and cannot handle touching characters, blurred or deformed characters, or tilted text in natural scenes.
- Annotation is extremely expensive, because every character has to be boxed by hand.
- It cannot handle text sequences of varying length or fragments whose character boundaries are ambiguous.
In 2015, Baoguang Shi et al. proposed CRNN (Convolutional Recurrent Neural Network), which broke with this pattern. Its three-stage architecture, "CNN feature extraction → BiLSTM sequence modeling → CTC alignment decoding", was one of the first to recognize variable-length text sequences fully end to end, with no character-level segmentation or annotation.
1. Overview of CRNN model
1.1 Core three-stage logic
The design philosophy of CRNN is clear: treat the image as a left-to-right "time sequence" in which each column of the CNN feature map, covering a narrow vertical strip of the image, is one time step. The process has three steps: the CNN turns the image into a sequence of feature vectors, the BiLSTM models the context between them, and CTC aligns the per-step predictions with the label.
One-sentence summary: **CRNN turns the CNN from an "image classification/detection" tool into a feature producer that feeds visual time steps to a sequence model, uses a BiLSTM to capture the linguistic and structural dependencies between characters, and relies on CTC to resolve the length mismatch between the output and the label.**
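The "image as a time sequence" idea comes down to a reshape between the CNN and the recurrent layers. The sketch below is illustrative only: the feature-map shape (2 images, 512 channels, height collapsed to 1, 40 columns) and the helper name `map_to_sequence` are assumptions, not values taken from this article.

```python
import torch

def map_to_sequence(feature_map: torch.Tensor) -> torch.Tensor:
    """Turn a CNN feature map (N, C, 1, W) into a sequence (W, N, C).

    Each of the W feature columns becomes one time step for the sequence model.
    """
    n, c, h, w = feature_map.shape
    assert h == 1, "the CNN is expected to collapse the height to 1"
    columns = feature_map.squeeze(2)   # (N, C, W)
    return columns.permute(2, 0, 1)    # (W, N, C): time-major layout

# Hypothetical CNN output for a batch of 2 images.
feature_map = torch.randn(2, 512, 1, 40)
sequence = map_to_sequence(feature_map)
print(sequence.shape)  # torch.Size([40, 2, 512])
```

The time-major layout `(W, N, C)` is what `nn.LSTM` expects by default and also matches the input layout of `nn.CTCLoss`.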
1.2 Core industrial-grade advantages
- ✅ End-to-end training: only paired "image → text" data is required
- ✅ Handles any aspect ratio: the height is fixed at 32 and the width can vary freely (as long as the number of time steps produced from the image is at least the target text length)
- ✅ Lightweight and efficient: inference is roughly 3-10 times faster than Transformer-based models, which makes it suitable for edge device deployment
- ✅ Few dependencies: no dictionary is required (a dictionary can improve accuracy, but it is optional)
- ✅ Strong interpretability: each time step corresponds to a narrow vertical strip of the image, which makes errors easy to debug
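The width condition in the list above can be checked before training. CTC also needs a blank between two identical adjacent characters, so the real minimum is the text length plus the number of adjacent repeats. The helper names `min_ctc_steps` and `fits` below are hypothetical and serve only as a small sketch of that check.

```python
def min_ctc_steps(target: str) -> int:
    """Smallest number of time steps with which CTC can emit `target`.

    One step per character, plus one blank between each pair of identical
    adjacent characters (otherwise the repeat would be merged away).
    """
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats

def fits(num_steps: int, target: str) -> bool:
    """True if a sequence of `num_steps` time steps can represent `target`."""
    return num_steps >= min_ctc_steps(target)

print(min_ctc_steps("cat"))    # 3
print(min_ctc_steps("hello"))  # 6: the double "l" needs a blank in between
print(fits(5, "hello"))        # False
```

Samples that fail this check produce an infinite CTC loss, so it is usually simpler to filter them out or widen the image.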
2. PyTorch minimalist implementation
To help you get started quickly, this section uses a trimmed-down VGG-style CNN, a two-layer BiLSTM, and an output layout compatible with PyTorch's standard CTC loss. The full implementation is only about 200 lines and can be trained and used for inference as is.
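As a condensed illustration of that structure (not the article's original listing), the sketch below wires a small VGG-style backbone to a two-layer BiLSTM and a linear classifier. The channel widths, the pooling layout, the hidden size of 256, and the 37-class alphabet (blank plus 26 letters and 10 digits) are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, pool=None):
    """3x3 conv + BatchNorm + ReLU, optionally followed by max pooling."""
    layers = [nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    if pool is not None:
        layers.append(nn.MaxPool2d(kernel_size=pool, stride=pool))
    return layers

class CRNN(nn.Module):
    def __init__(self, num_classes: int, in_channels: int = 1, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            *conv_block(in_channels, 64, (2, 2)),   # H 32 -> 16, W -> W/2
            *conv_block(64, 128, (2, 2)),           # H 16 -> 8,  W -> W/4
            *conv_block(128, 256),
            *conv_block(256, 256, (2, 1)),          # H 8 -> 4, width unchanged
            *conv_block(256, 512),
            *conv_block(512, 512, (2, 1)),          # H 4 -> 2, width unchanged
            nn.Conv2d(512, 512, kernel_size=(2, 1)),  # H 2 -> 1
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
        )
        self.rnn = nn.LSTM(512, hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.cnn(x)                       # (N, 512, 1, W/4)
        seq = feat.squeeze(2).permute(2, 0, 1)   # (T, N, 512), T = W/4
        out, _ = self.rnn(seq)                   # (T, N, 2 * hidden)
        return self.fc(out).log_softmax(dim=2)   # (T, N, num_classes)

model = CRNN(num_classes=37).eval()
with torch.no_grad():
    out = model(torch.randn(2, 1, 32, 128))      # two grayscale 32x128 images
print(out.shape)  # torch.Size([32, 2, 37])
```

In this sketch the width is downsampled by 4 overall, so a 128-pixel-wide image yields 32 time steps, and wider images simply yield longer sequences.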
3. Quick Guide to Training and Inference
3.1 Training (CTC Loss usage details)
PyTorch's built-in `nn.CTCLoss` is fully compatible with the CRNN output, but pay attention to its parameters, in particular the blank index and the input and target lengths.
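A minimal sketch of one training-style call is shown below. Random logits stand in for the model output, and the sizes (32 time steps, batch of 2, 37 classes with index 0 reserved for the blank) are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, N, C = 32, 2, 37   # time steps, batch size, classes (index 0 = blank)

# Stand-in for the CRNN output: CTCLoss expects log-probabilities of shape (T, N, C).
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Targets for both samples concatenated into one 1-D tensor; labels never use the blank index.
targets = torch.tensor([8, 5, 12, 12, 15, 3, 1, 20])   # "hello" + "cat" with a=1 ... z=26
target_lengths = torch.tensor([5, 3])                   # length of each label
input_lengths = torch.full((N,), T, dtype=torch.long)   # time steps available per sample

# zero_infinity=True zeroes the loss of samples that are too short for their label
# instead of letting an infinite value break the gradient.
criterion = nn.CTCLoss(blank=0, reduction="mean", zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```

The most common mistakes are passing raw logits instead of `log_softmax` output, using a batch-first layout, and letting a real character share the blank index.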
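3.2 Inference (greedy decoding implementation)
Greedy decoding is the simplest decoding method: it needs no dictionary and is well suited to quick verification. At each time step it takes the most probable class, then merges consecutive repeats and removes blanks.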
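A short sketch of greedy decoding follows. It assumes the blank sits at index 0 and that class `k` maps to `CHARSET[k - 1]`; the lowercase-only `CHARSET` and the function name `greedy_decode` are illustrative choices, not part of the original code.

```python
from itertools import groupby

import torch

BLANK = 0                                  # assumed blank index
CHARSET = "abcdefghijklmnopqrstuvwxyz"     # class k -> CHARSET[k - 1]

def greedy_decode(log_probs: torch.Tensor, charset: str = CHARSET) -> list:
    """Decode (T, N, C) scores by best-path search: argmax, merge repeats, drop blanks."""
    best_paths = log_probs.argmax(dim=2).t().tolist()    # N lists of T class indices
    texts = []
    for path in best_paths:
        collapsed = [k for k, _ in groupby(path)]        # merge consecutive repeats
        texts.append("".join(charset[k - 1] for k in collapsed if k != BLANK))
    return texts

# Hand-built scores whose best path is "hh-e-ll-lo" ("-" is the blank).
path = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
scores = torch.full((len(path), 1, len(CHARSET) + 1), -10.0)
for t, k in enumerate(path):
    scores[t, 0, k] = 0.0

print(greedy_decode(scores))  # ['hello']
```

The blank between the two runs of `l` is what keeps the double letter; without it the repeats would collapse into a single `l`.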
4. Practical suggestions
4.1 Data processing
- The input image height **must be fixed at 32**; the width is scaled to preserve the original aspect ratio and capped at 256 or 512, depending on available GPU memory
- Grayscale images usually work better than RGB (unless color is what distinguishes the characters from the background)
- Data augmentation: random slight rotation (-15° to 15°), random horizontal stretching (width × 0.9-1.1), Gaussian noise or blur, and contrast adjustment. These four types bring the largest gains for CRNN
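The sizing rule and two of the augmentations above can be sketched with NumPy alone. The function names, the default cap of 256, and the noise level `sigma=8.0` are assumptions; the actual resize and rotation would be done with an image library of your choice.

```python
import numpy as np

def target_size(width: int, height: int, target_height: int = 32, max_width: int = 256):
    """Return (new_width, new_height): fixed height, aspect ratio kept, width capped."""
    scale = target_height / height
    new_width = max(1, min(max_width, round(width * scale)))
    return new_width, target_height

def add_gaussian_noise(img: np.ndarray, sigma: float = 8.0, rng=None) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 grayscale image."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

def adjust_contrast(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel deviations from the mean; factor > 1 raises contrast, < 1 lowers it."""
    pixels = img.astype(np.float32)
    mean = pixels.mean()
    out = (pixels - mean) * factor + mean
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

print(target_size(400, 64))    # (200, 32)
print(target_size(2000, 64))   # (256, 32): capped at max_width

image = np.full((32, 200), 128, dtype=np.uint8)   # placeholder gray image
augmented = adjust_contrast(add_gaussian_noise(image, rng=np.random.default_rng(0)), 1.2)
print(augmented.shape, augmented.dtype)
```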
4.2 Model deployment
- Edge devices (mobile phones/cameras): use `torch.onnx` to convert the model to ONNX, then accelerate it with ONNX Runtime, TensorRT, NCNN, or TNN; inference speed can reach 100+ fps
- Cloud/server: use PyTorch inference directly, or accelerate with TensorRT
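A hedged sketch of the ONNX export step is shown below. The helper name `export_crnn_to_onnx`, the tensor names `image` and `log_probs`, and the 32×128 dummy input are assumptions; the model passed in is expected to take `(N, 1, 32, W)` images and return `(T, N, C)` scores, and exporter behavior can differ between PyTorch versions.

```python
import torch

def export_crnn_to_onnx(model: torch.nn.Module, path: str,
                        height: int = 32, width: int = 128) -> None:
    """Export a CRNN-style model to ONNX with variable batch size and image width."""
    model.eval()                                  # freeze BatchNorm/dropout behavior
    dummy = torch.randn(1, 1, height, width)      # example grayscale input for tracing
    torch.onnx.export(
        model,
        (dummy,),
        path,
        input_names=["image"],
        output_names=["log_probs"],
        # Mark the axes that change at run time so the exported graph accepts them.
        dynamic_axes={
            "image": {0: "batch", 3: "width"},
            "log_probs": {0: "time", 1: "batch"},
        },
    )

# Usage (assuming `model` is a trained CRNN instance):
# export_crnn_to_onnx(model, "crnn.onnx")
```

It is worth comparing the ONNX output with the PyTorch output on a few images of different widths before deploying, since recurrent layers with dynamic axes are a common source of export issues.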
Summary
CRNN is a milestone model that moved OCR from "traditional segmentation" to "end-to-end recognition". Although Transformer-based models (such as CRNN-Transformer, PARSeq, and MASTER) currently lead in accuracy, CRNN is lightweight, efficient, and has few dependencies, so it remains a first choice for standardized scenarios such as license plate recognition, bill recognition, and document line recognition.
It is recommended to master the implementation in this article first, then train on SynthText, IIIT5K, or your own dataset, and finally compare the results of different models!