title: Detailed explanation of convolutional neural network (CNN): from basic principles to PyTorch implementation | Daoman PythonAI
description: In-depth analysis of the basic principles, core components, PyTorch implementation and practical application scenarios of convolutional neural networks (CNN), including detailed code examples and mathematical principles.
keywords: [Convolutional neural network, CNN, deep learning, computer vision, PyTorch, convolutional layer, pooling layer, image recognition]
Detailed explanation of convolutional neural network (CNN): from basic principles to PyTorch implementation
Introduction
The Convolutional Neural Network (CNN) is one of the most mature and widely used vision-oriented architectures in deep learning. In 2012, AlexNet won the ImageNet competition (ILSVRC) with a top-5 error rate of about 15%, nearly 11 percentage points below the runner-up's 26%. That result kicked off the golden age of deep learning in computer vision.
To this day, CNNs remain a preferred solution for tasks such as image recognition, lightweight object detection, and medical image analysis. This article starts from the core idea, breaks down the key components, and finally uses PyTorch to implement ready-to-use models to help you truly understand CNNs.
1. The core idea of CNN
1.1 Fatal flaws of traditional fully connected networks (MLP)
Before the advent of CNNs, the usual way to process an image was to flatten its two-dimensional pixel grid into a one-dimensional vector and feed it to a multi-layer perceptron (MLP). This approach has two shortcomings that cannot be ignored:
- Parameter explosion: A 1024×1024 RGB image flattens to a 3,145,728-dimensional vector. Even if the first layer has only 1,000 neurons, the weight matrix W alone contains more than 3 billion parameters, which quickly exhausts GPU memory and compute.
- Loss of spatial information: In a 28×28 handwritten digit "7", the horizontal stroke at the top and the slanted stroke below it have a fixed spatial relationship. Once the pixels are flattened, the MLP has no built-in notion of which pixels were neighbors, so it has to relearn these relationships from data and struggles to tell a plain slash from the digit 7.
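The parameter-explosion figure above can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python; the helper name `mlp_first_layer_params` and the float32 storage assumption are illustrative and not part of the original article.

```python
def mlp_first_layer_params(height, width, channels, hidden_units):
    """Count the parameters of the first fully connected layer of an MLP
    that receives a flattened image (hypothetical helper for illustration)."""
    inputs = height * width * channels   # length of the flattened vector
    weights = inputs * hidden_units      # one weight per (input, neuron) pair
    biases = hidden_units                # one bias per neuron
    return inputs, weights, biases


inputs, weights, biases = mlp_first_layer_params(1024, 1024, 3, 1000)

# Assumption: parameters are stored as 32-bit floats (4 bytes each).
bytes_fp32 = (weights + biases) * 4

print(f"flattened input length: {inputs:,}")
print(f"weights in first layer: {weights:,}")
print(f"float32 storage: {bytes_fp32 / 1024**3:.2f} GiB")
```

Under these assumptions a single layer already needs more than 10 GiB for its weights alone, before gradients and optimizer state are counted.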
1.2 Two core innovations of CNN
The design of CNNs fits two properties of images: nearby pixels are strongly correlated, and a feature means the same thing wherever it appears. It mimics the hierarchical feature extraction of the human visual system: local details first (edges, textures) → mid-level combinations (eyes, nose) → high-level judgment (human face or cat face).
Supporting this logic are two core mechanisms:
- Local receptive field: Each neuron connects only to a small region of the input image rather than the entire image, which reduces the number of parameters and forces the network to learn local features.
- Weight sharing: The same convolution kernel (that is, the same set of weights) slides across the entire image. Whether a feature appears in the upper left corner or the lower right corner, the same kernel detects it. This further reduces the number of parameters significantly and makes the layer's response shift along with the input, which is the basis of the model's translation invariance.
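Both mechanisms can be observed on a single convolution layer. The sketch below assumes PyTorch is installed and uses circular padding, a choice made here only so that shifting the input shifts the output exactly (with zero padding the match holds away from the borders). The shift amounts and tensor sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# One 3x3 kernel shared across every position of the image.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)

# Weight sharing: the layer has 9 parameters regardless of image size.
num_params = sum(p.numel() for p in conv.parameters())

x = torch.randn(1, 1, 16, 16)
shift = (2, 3)  # move the image 2 rows down and 3 columns right

with torch.no_grad():
    y = conv(x)
    y_of_shifted_input = conv(torch.roll(x, shifts=shift, dims=(2, 3)))

# Shifting the input and then convolving gives the same result as
# convolving and then shifting the output.
same = torch.allclose(torch.roll(y, shifts=shift, dims=(2, 3)),
                      y_of_shifted_input, atol=1e-6)

print(num_params, same)
```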
2. Breaking down the core components of CNN
A standard CNN is built by stacking multiple convolution blocks (convolution → activation → pooling) and ends with a fully connected classifier. Let’s break it down one by one.
2.1 Convolutional Layer
The convolutional layer is responsible for extracting local features and is the "eyes" of the entire network.
Core parameters and operation principles
- Number of input/output channels: The number of input channels is 3 for RGB images and 1 for grayscale images. The number of output channels equals the number of convolution kernels, and each kernel learns one feature (such as an edge or a texture).
- Convolution kernel (kernel): The most commonly used size is 3×3, which balances computational efficiency and effective receptive field.
- Stride: The number of pixels the convolution kernel moves at each step. With a stride of 1 the feature map size stays almost unchanged; with a stride of 2 the layer downsamples.
- Padding: Adding zeros around the edges of the input keeps the output from shrinking too quickly and keeps edge information from being underused.
With these parameters, the rule for the output feature map size is straightforward. Assuming the input size is W, the convolution kernel size is F, the padding is P, and the stride is S, the output size O is (W - F + 2P) / S, rounded down, plus 1:
$$O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1$$
For example, with a 32×32 input, a 3×3 convolution kernel, padding=1 and stride=1, the output remains 32×32.
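The size rule translates directly into a small helper. This is a plain-Python sketch; the function name `conv_output_size` is chosen here for illustration, and it assumes square inputs and kernels as in the text.

```python
def conv_output_size(w, f, p=0, s=1):
    """Output size along one spatial dimension:
    floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1


# The example from the text: 32x32 input, 3x3 kernel, padding 1, stride 1.
same_size = conv_output_size(32, 3, p=1, s=1)

# The same layer with stride 2 downsamples.
halved = conv_output_size(32, 3, p=1, s=2)

# A 5x5 kernel without padding shrinks a 28x28 input.
shrunk = conv_output_size(28, 5)

print(same_size, halved, shrunk)
```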
The number of parameters in a convolutional layer is also easy to estimate:
(kernel height × kernel width × number of input channels + 1) × number of output channels
where the +1 is the bias term that comes with each convolution kernel.
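As a quick check of the parameter formula, here is a minimal plain-Python sketch. The helper name `conv2d_param_count` and the two example layer shapes are illustrative choices, not taken from the original article.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """(kernel_h * kernel_w * in_channels + 1) * out_channels,
    where the +1 is the per-kernel bias (dropped if bias=False)."""
    per_kernel = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_kernel * out_channels


# 3x3 kernels on an RGB input, 16 output channels.
first_layer = conv2d_param_count(3, 3, 3, 16)

# A deeper layer: 3x3 kernels, 64 input channels, 128 output channels.
deeper_layer = conv2d_param_count(3, 3, 64, 128)

print(first_layer, deeper_layer)
```

Note that the count does not depend on the image size, which is the practical effect of weight sharing.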
PyTorch Code Example
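The original code for this section is not available here, so the following is a minimal sketch of how such a layer might look with `torch.nn.Conv2d`. The batch size, channel counts and variable names are assumptions picked to match the 32×32 example above.

```python
import torch
from torch import nn

# 3 input channels (RGB), 16 kernels of size 3x3, stride 1, zero padding 1.
conv = nn.Conv2d(in_channels=3, out_channels=16,
                 kernel_size=3, stride=1, padding=1)

x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB images, 32x32 each
y = conv(x)
print(y.shape)                  # torch.Size([8, 16, 32, 32])

# Parameter count matches (3 * 3 * 3 + 1) * 16.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                 # 448

# The same layer with stride 2 halves the height and width.
down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
y_down = down(x)
print(y_down.shape)             # torch.Size([8, 16, 16, 16])
```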
2.2 Activation function and pooling layer
- Activation function: Introduces nonlinearity into the network. Without it, a stack of convolutional layers is equivalent to a single linear transformation and cannot learn complex feature combinations. The first choice is ReLU, which sets all negative inputs to 0 and leaves positive values unchanged. It is very fast to compute and effectively alleviates the vanishing gradient problem in deep networks. In PyTorch it can be called directly as F.relu(x).
- Pooling layer: Downsamples the feature map, reducing the computation (and, for a following fully connected layer, the parameters) in subsequent layers while retaining key features. For example, max pooling keeps the maximum value in each window, which is equivalent to extracting the "most significant edge" or "brightest texture". The most commonly used configuration is max pooling with a 2×2 window and a stride of 2, which halves the height and width of the feature map.
PyTorch Code Example
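The original code for this section is not available here. As a stand-in, the sketch below applies `F.relu` and a 2×2 max pooling layer to a small hand-written tensor so the effect of each step can be followed by eye. The input values are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

# One image, one channel, 4x4 pixels (shape: batch, channel, height, width).
x = torch.tensor([[[[ 1., -2.,  3., -4.],
                    [-5.,  6., -7.,  8.],
                    [ 9., -1.,  2., -3.],
                    [-6.,  4., -8.,  5.]]]])

# ReLU: negative values become 0, positive values pass through.
activated = F.relu(x)

# Max pooling with a 2x2 window and stride 2: keep the largest value
# in each non-overlapping 2x2 block, halving height and width.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(activated)

print(activated)
print(pooled)        # tensor([[[[6., 8.], [9., 5.]]]])
```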
3. Classic & modern CNN architecture implementation
Below, PyTorch is used to implement two practical models: LeNet-5 (entry-level MNIST handwritten digit recognition) and ModernCNN (lightweight CIFAR-10 classification). You can run and train them directly.
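The original model code is not available here, so the following is a sketch of what the two models might look like. The LeNet-5 variant uses ReLU and max pooling instead of the tanh and average pooling of the 1998 design, with padding=2 in the first layer so it accepts 28×28 MNIST images. The layer widths, dropout rate and global average pooling in `ModernCNN`, along with the helper `conv_block`, are assumptions made for illustration, not the article's exact architecture.

```python
import torch
from torch import nn


class LeNet5(nn.Module):
    """LeNet-5-style network for 1x28x28 MNIST digits (ReLU/max-pool variant)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


def conv_block(in_ch, out_ch):
    """convolution -> batch norm -> ReLU -> 2x2 max pooling (hypothetical helper)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class ModernCNN(nn.Module):
    """Small CNN for 3x32x32 CIFAR-10 images (assumed architecture)."""

    def __init__(self, num_classes=10, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32),     # 32x32 -> 16x16
            conv_block(32, 64),    # -> 8x8
            conv_block(64, 128),   # -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # global average pooling -> 128x1x1
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


# Shape check with random inputs (no training involved).
lenet = LeNet5().eval()
modern = ModernCNN().eval()
with torch.no_grad():
    mnist_logits = lenet(torch.randn(4, 1, 28, 28))
    cifar_logits = modern(torch.randn(4, 3, 32, 32))
print(mnist_logits.shape, cifar_logits.shape)
```

Both models return raw logits, so they would typically be trained with `nn.CrossEntropyLoss`, which applies the softmax internally.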
4. Key tuning and best practices
4.1 Data preprocessing (taking CIFAR-10 as an example)
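Good data preprocessing is a baseline requirement for model performance. A standard process for CIFAR-10 is to convert images to tensors, normalize each channel, and apply light augmentation such as random cropping and horizontal flipping to the training set only.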
4.2 Regularization techniques that cannot be ignored
- Dropout: Randomly drops a portion of neurons (commonly 20% to 50%) after fully connected layers or at the end of a convolution block, forcing the network to learn more robust features.
- Batch Normalization: Normalizes the outputs of intermediate layers, which speeds up training, reduces sensitivity to initialization, and has a slight regularization effect.
- Data augmentation: Random cropping, flipping, rotation and similar methods effectively enlarge the training set many times over without any extra annotation, which significantly suppresses overfitting.
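Dropout and batch normalization both behave differently in training and evaluation mode, which is easy to overlook. The sketch below assumes PyTorch and uses arbitrary tensor sizes and a dropout rate of 0.5 to make the effect easy to see.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)

# Batch normalization: in training mode each channel is normalized with
# the statistics of the current batch (mean near 0, std near 1).
bn_block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
)
bn_block.train()
with torch.no_grad():
    normalized = bn_block(x)
channel_mean = normalized.mean(dim=(0, 2, 3))
channel_std = normalized.std(dim=(0, 2, 3), unbiased=False)

# Dropout: in training mode roughly a fraction p of the values is zeroed
# and the survivors are scaled by 1 / (1 - p); in eval mode it is a no-op.
dropout = nn.Dropout(p=0.5)
ones = torch.ones(10_000)

dropout.train()
train_out = dropout(ones)
zero_fraction = (train_out == 0).float().mean().item()

dropout.eval()
eval_out = dropout(ones)

print(channel_mean.abs().max().item(), channel_std.mean().item())
print(zero_fraction, torch.equal(eval_out, ones))
```

In practice this is why `model.train()` is called before training and `model.eval()` before validation or inference.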
5. Summary
With three major advantages (local receptive fields, weight sharing, and hierarchical features), CNNs remain an efficient and reliable choice for visual tasks. Although new architectures such as the Vision Transformer continue to emerge, CNNs' small footprint, interpretability, and strong hardware acceleration support make them hard to replace in scenarios such as mobile and embedded devices.