title: Detailed explanation of convolutional neural network (CNN): from basic principles to PyTorch implementation | Daoman PythonAI
description: In-depth analysis of the basic principles, core components, PyTorch implementation and practical application scenarios of convolutional neural networks (CNN), including detailed code examples and mathematical principles.
keywords: [Convolutional neural network, CNN, deep learning, computer vision, PyTorch, convolutional layer, pooling layer, image recognition]

Detailed explanation of convolutional neural network (CNN): from basic principles to PyTorch implementation

Introduction

The Convolutional Neural Network (CNN) is one of the most mature and widely used vision-specific architectures in deep learning. In the 2012 ImageNet competition, AlexNet used a CNN to cut the top-5 error rate to about 15%, nearly 11 percentage points below the second-place entry's roughly 26%. That result set off the golden age of deep learning in computer vision.

To this day, CNNs remain the preferred solution for tasks such as image recognition, lightweight object detection, and medical image analysis. This article starts from the core idea, breaks down the key components, and finally implements ready-to-use models in PyTorch to help you truly understand CNNs.


1. The core idea of CNN

1.1 Fatal flaws of traditional fully connected networks (MLP)

Before CNNs, the only way to process an image was to forcibly flatten its two-dimensional pixel grid into a one-dimensional vector and feed that vector to a multi-layer perceptron (MLP). This approach has two shortcomings that cannot be ignored:

  1. Parameter explosion: A 1024×1024 RGB image becomes a 3,145,728-dimensional vector after flattening. Even if the first layer has only 1,000 neurons, its weight matrix W alone contains more than 3 billion parameters, which overwhelms GPU memory and compute.
  2. Loss of spatial information: In a 28×28 handwritten digit "7", for example, the horizontal stroke with its hook at the top and the downward stroke below it keep a fixed spatial relationship. Once the pixels are flattened into a vector, that two-dimensional neighborhood structure is no longer explicit, so the MLP has to relearn it from scratch and struggles to tell a plain slash from the digit 7.
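The arithmetic behind the parameter explosion can be checked in a few lines. This is a minimal sketch in plain Python; the 3×3 kernel with 32 output channels is an illustrative choice for comparison, not a figure from the text above.

```python
# First-layer parameter counts for a 1024x1024 RGB image.
height, width, channels = 1024, 1024, 3
hidden_units = 1000  # width of the first fully connected layer

flat_dim = height * width * channels                  # inputs after flattening
mlp_params = flat_dim * hidden_units + hidden_units   # weights + biases

# Illustrative comparison (assumed sizes): a 3x3 convolution with 32 kernels.
kernel, out_channels = 3, 32
conv_params = (kernel * kernel * channels + 1) * out_channels

print(f"flattened input size : {flat_dim:,}")     # 3,145,728
print(f"MLP first layer      : {mlp_params:,}")   # 3,145,729,000
print(f"3x3 conv, 32 kernels : {conv_params:,}")  # 896
```

The convolution's parameter count does not depend on the image size at all, which is the point the next section develops.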

1.2 Two core innovations of CNN

The design of a CNN is a natural fit for two properties of images: nearby pixels are strongly correlated, and a feature means the same thing wherever it appears. It mimics the hierarchical feature extraction of the human visual system: first look at local details (edges, texture) → combine them into mid-level parts (eyes, nose) → make a high-level judgment (human face or cat face).

Supporting this logic are two core mechanisms:

  1. Local receptive field: Each neuron connects only to a small region of the input image rather than to the entire image, which reduces the number of parameters and forces the network to learn local features.
  2. Weight sharing: The same convolution kernel (that is, the same set of weights) slides across the entire image. A feature is therefore detected by the same kernel whether it appears in the upper left corner or the lower right corner. This further reduces the number of parameters significantly, and it means a shifted input produces a correspondingly shifted response, which (together with pooling) strengthens the model's translation invariance.
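The effect of weight sharing can be seen directly with `torch.nn.functional.conv2d`. The sketch below is only an illustration: the 8×8 image, the random 3×3 kernel, and the 3-pixel shift are arbitrary choices. It checks that moving a small pattern moves the kernel's response by the same amount.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

kernel = torch.randn(1, 1, 3, 3)             # one shared 3x3 kernel (9 weights)
image = torch.zeros(1, 1, 8, 8)
image[0, 0, 1:4, 1:4] = torch.randn(3, 3)    # small pattern near the top-left

# Move the pattern 3 pixels down and 3 pixels right.
# The rows/columns that wrap around are all zero, so this is a clean shift.
shifted = torch.roll(image, shifts=(3, 3), dims=(2, 3))

out = F.conv2d(image, kernel)                # shape (1, 1, 6, 6)
out_shifted = F.conv2d(shifted, kernel)

# The response to the shifted pattern is the original response, shifted by (3, 3).
same_response = torch.allclose(out_shifted[..., 3:, 3:], out[..., :3, :3], atol=1e-6)
print(same_response)  # True
```

The same nine weights detect the pattern at both positions; a fully connected layer would need separate weights for each location.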

2. Disassembly of the core components of CNN

A standard CNN is built by stacking multiple convolution blocks (convolution → activation → pooling) and ending with a fully connected classifier. Let's break it down piece by piece.

2.1 Convolutional Layer

The convolutional layer is responsible for extracting local features and is the "eyes" of the entire network.

Core parameters and operation principles

  • Number of input/output channels: The input channel corresponds to 3 channels of RGB images or 1 channel of grayscale images; the number of output channels is equal to the number of convolution kernels, and each convolution kernel learns a feature (such as edge, texture, etc.).
  • Convolution kernel (kernel): The most commonly used size is 3×3, taking into account both computational efficiency and effective receptive field.
  • Stride: The number of pixels the convolution kernel moves at each step. With a stride of 1 and matching padding, the feature map size is unchanged; with a stride of 2, the feature map is downsampled to roughly half its height and width.
  • Padding: Padding zeros at the edges of the input image can prevent the output size from shrinking too quickly and protect edge information from being ignored.

Given these parameters, the output feature map size follows a simple rule. With input size W, convolution kernel size F, padding P, and stride S, the output size is $\lfloor (W - F + 2P) / S \rfloor + 1$: divide, round down, then add 1. For example, a 32×32 input with a 3×3 convolution kernel, padding=1, and stride=1 gives (32 - 3 + 2) / 1 + 1 = 32, so the output remains 32×32.
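The size rule translates into a one-line helper. This is a small sketch; the function name `conv_output_size` is made up for illustration, and the same formula also covers pooling windows.

```python
def conv_output_size(w, f, p=0, s=1):
    """Output size along one dimension: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 3, p=1, s=1))  # 32: 3x3 kernel, padding 1 keeps the size
print(conv_output_size(32, 3, p=1, s=2))  # 16: stride 2 halves the size
print(conv_output_size(28, 5))            # 24: 5x5 kernel without padding
print(conv_output_size(28, 2, s=2))       # 14: a 2x2 window with stride 2
```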

The parameter count of a convolutional layer is also easy to estimate: (kernel height × kernel width × input channels + 1) × output channels,
where the +1 is the bias term that comes with each convolution kernel.
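The estimate can be checked against what PyTorch actually allocates. In this sketch the helper name `conv_param_count` is hypothetical; `nn.Conv2d` and `parameters()` are standard PyTorch.

```python
import torch.nn as nn

def conv_param_count(kh, kw, in_ch, out_ch):
    """(kernel height x kernel width x input channels + 1) x output channels."""
    return (kh * kw * in_ch + 1) * out_ch

layer = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
estimated = conv_param_count(3, 3, 3, 32)
actual = sum(p.numel() for p in layer.parameters())  # weights + biases

print(estimated, actual)  # 896 896
```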

PyTorch Code Example

import torch
import torch.nn as nn
import torch.nn.functional as F

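# Define a basic convolutional layer for RGB images
conv_basic = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=32,    # learn 32 different features
    kernel_size=3,      # 3×3 convolution kernel
    stride=1,           # stride of 1
    padding=1           # zero-pad the edges so the output size is unchanged ("same" padding)
)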
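To see stride and padding at work on real tensors, the sketch below passes a dummy batch through two layers. The batch size of 4 and the stride-2 variant are assumptions added for comparison.

```python
import torch
import torch.nn as nn

conv_same = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)  # keeps 32x32
conv_down = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)  # downsamples

batch = torch.randn(4, 3, 32, 32)  # (batch, channels, height, width)

same_shape = tuple(conv_same(batch).shape)
down_shape = tuple(conv_down(batch).shape)

print(same_shape)  # (4, 32, 32, 32)
print(down_shape)  # (4, 32, 16, 16)
```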

2.2 Activation function and pooling layer

  • Activation function: Introduces nonlinearity into the network. Without it, a stack of convolutions collapses into a single linear transformation and cannot learn complex feature combinations. The first choice is ReLU, which sets all negative input values to 0 and leaves positive values unchanged. It is very cheap to compute and effectively alleviates the vanishing gradient problem in deep networks. In PyTorch it can be called directly as F.relu(x).
  • Pooling layer: Downsamples the feature map, reducing computation (and the parameter count of later fully connected layers) while retaining key features. Max pooling, for example, keeps the maximum value in each window, which amounts to extracting the "most significant edge" or "brightest texture". The most common configuration is max pooling with a 2×2 window and a stride of 2, which halves the height and width of the feature map.

PyTorch Code Example

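# Example feature map: batch of 1, 32 channels, 32×32
x = torch.randn(1, 32, 32, 32)

# Activation function
x_relu = F.relu(x)

# Pooling layer: 2×2 window, stride 2 -> height and width are each halved
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
x_pooled = max_pool(x_relu)   # shape: (1, 32, 16, 16)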
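A small hand-checkable example makes both operations concrete. The 4×4 values below are arbitrary, chosen so the result is easy to verify by eye.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One image, one channel, 4x4 pixels: shape (1, 1, 4, 4)
x = torch.tensor([[[[ 1., -2.,  3.,  0.],
                    [-1.,  5., -3.,  2.],
                    [ 0., -4.,  2., -1.],
                    [ 6.,  1., -2., -5.]]]])

activated = F.relu(x)                    # negatives become 0
pooled = nn.MaxPool2d(2, 2)(activated)   # max of each 2x2 block

print(activated[0, 0])
# tensor([[1., 0., 3., 0.],
#         [0., 5., 0., 2.],
#         [0., 0., 2., 0.],
#         [6., 1., 0., 0.]])
print(pooled[0, 0])
# tensor([[5., 3.],
#         [6., 2.]])
```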

3. Classic & modern CNN architecture implementation

Below, PyTorch is used to implement two practical models: LeNet-5 (entry-level MNIST handwritten digit recognition) and ModernCNN (lightweight CIFAR-10 classification). You can run and train them directly.

# -------------- LeNet-5 (for 28×28 grayscale MNIST) --------------
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Conv block 1: 1 -> 6 channels, 5×5 kernel, padding=2 keeps the size at 28×28
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)
        self.pool1 = nn.MaxPool2d(2, 2)               # 28 -> 14
        # Conv block 2: 6 -> 16 channels, 5×5 kernel (no padding), size 14 -> 10
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.pool2 = nn.MaxPool2d(2, 2)               # 10 -> 5
        # Fully connected classifier: 16 channels × 5×5 feature map
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))   # conv1 -> activation -> pooling
        x = self.pool2(F.relu(self.conv2(x)))   # conv2 -> activation -> pooling
        x = torch.flatten(x, 1)                 # flatten to (batch, 400)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


# -------------- ModernCNN (for 32×32 RGB CIFAR-10) --------------
class ModernCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor (BatchNorm and Dropout improve generalization)
        self.features = nn.Sequential(
            # Block 1: 32 channels, two convolutions + max pooling
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout(0.25),

            # Block 2: 64 channels, two convolutions + max pooling
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout(0.25),

            # Block 3: 128 channels, then adaptive average pooling squeezes the map to 1×1
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        # Classifier
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


# -------------- Quick shape test --------------
if __name__ == "__main__":
    # Test LeNet-5
    lenet = LeNet5()
    dummy_mnist = torch.randn(1, 1, 28, 28)
    print(f"LeNet-5 input: {dummy_mnist.shape} → output: {lenet(dummy_mnist).shape}")

    # Test ModernCNN
    modern = ModernCNN()
    dummy_cifar = torch.randn(1, 3, 32, 32)
    print(f"ModernCNN input: {dummy_cifar.shape} → output: {modern(dummy_cifar).shape}")
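The models above only define the forward pass. A training loop could look roughly like the sketch below, which is an illustration rather than the article's own training code: the function name `train_one_epoch`, the random stand-in data, the small stand-in model, and the Adam learning rate are all assumptions. In practice you would pass in `LeNet5()` or `ModernCNN()` and a real `DataLoader`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, criterion, device):
    """Run one pass over the data; return (average loss, accuracy)."""
    model.train()  # enable Dropout / BatchNorm training behavior
    total_loss, total_correct, total_seen = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        total_correct += (logits.argmax(dim=1) == labels).sum().item()
        total_seen += images.size(0)
    return total_loss / total_seen, total_correct / total_seen

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data with MNIST-like shapes (random, so accuracy is meaningless here).
fake_images = torch.randn(64, 1, 28, 28)
fake_labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=16, shuffle=True)

# Stand-in model so the block runs on its own; swap in LeNet5() in practice.
model = nn.Sequential(
    nn.Conv2d(1, 6, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(), nn.Linear(6 * 14 * 14, 10),
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

avg_loss, accuracy = train_one_epoch(model, loader, optimizer, criterion, device)
print(f"loss={avg_loss:.4f}  accuracy={accuracy:.2%}")
```

For evaluation, the usual pattern is to call `model.eval()` and wrap the loop in `torch.no_grad()`, so that Dropout is disabled and BatchNorm uses its running statistics.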

4. Key tuning and best practices

4.1 Data preprocessing (taking CIFAR-10 as an example)

Sound data preprocessing is a basic prerequisite for good model performance. Here is a standard pipeline:

from torchvision import transforms

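# Training set: add data augmentation to improve generalization
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crop after 4-pixel padding, simulating small shifts in framing
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.ToTensor(),
    # ImageNet statistics; statistics computed on the CIFAR-10 training set are a common alternative
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Validation/test set: normalization only, to keep evaluation consistent
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])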
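The normalization constants above are ImageNet statistics. If you would rather normalize with statistics from your own training set, they can be computed per channel. This is a minimal sketch: `channel_stats` is a hypothetical helper, and the random batch merely stands in for real images scaled to [0, 1].

```python
import torch

def channel_stats(images):
    """Per-channel mean and std of a float tensor shaped (N, C, H, W)."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std

torch.manual_seed(0)
fake_batch = torch.rand(256, 3, 32, 32)   # stand-in for images in [0, 1]
mean, std = channel_stats(fake_batch)

print(mean)  # one value per channel
print(std)
```

For a real dataset, the same function could be applied to the stacked training images after `ToTensor()`, and the results passed to `transforms.Normalize`.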

4.2 Regularization techniques that cannot be ignored

  • Dropout: Randomly drops a portion of neurons (commonly 20% to 50%) after fully connected layers or at the end of a convolution block, forcing the network to learn more robust features.
  • Batch Normalization: Normalizes the outputs of intermediate layers, which speeds up training, reduces sensitivity to initialization, and has a slight regularization effect.
  • Data augmentation: Random cropping, flipping, rotation, and similar transforms effectively enlarge the training set many times over without any extra annotation, significantly reducing overfitting.
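Dropout and Batch Normalization behave differently in training and evaluation mode, which is easy to overlook. The sketch below uses arbitrary tensor sizes to illustrate this with standard `nn.Dropout` and `nn.BatchNorm2d` layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout: active in train mode, a no-op in eval mode.
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
train_out = drop(x)   # each entry is either 0 or 1 / (1 - p) = 2
drop.eval()
eval_out = drop(x)    # unchanged input

# BatchNorm: in train mode, each channel is normalized with batch statistics.
bn = nn.BatchNorm2d(4)
bn.train()
features = torch.randn(16, 4, 8, 8) * 3 + 5     # mean about 5, std about 3
normed = bn(features)

channel_mean = normed.mean(dim=(0, 2, 3))                 # close to 0
channel_std = normed.std(dim=(0, 2, 3), unbiased=False)   # close to 1

print(train_out)
print(eval_out)
print(channel_mean, channel_std)
```

This is why `model.train()` and `model.eval()` need to be called at the right moments: forgetting `model.eval()` leaves Dropout active during evaluation.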

5. Summary

With its three major strengths (local receptive fields, weight sharing, and hierarchical features), the CNN remains an efficient and reliable choice for visual tasks. Although newer architectures such as the Vision Transformer continue to emerge, CNNs' advantages in model size, interpretability, and hardware acceleration keep them hard to replace in scenarios such as mobile and embedded devices.

It is recommended that you first run the LeNet-5 code above and complete a full training run on the MNIST dataset. Then try using `matplotlib` to visualize the edge and texture features learned by the first convolutional layer - this is the fastest way to build intuition for CNNs!
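One possible way to do that visualization is sketched below. The helper `plot_first_layer_kernels` is hypothetical, and the layer here is freshly initialized (so the kernels look like noise); pass the `conv1` layer of your trained LeNet-5 to see learned features.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import torch.nn as nn

def plot_first_layer_kernels(conv, cols=6):
    """Draw each output kernel of a conv layer (first input channel) as a grayscale image."""
    weights = conv.weight.detach().cpu()   # shape (out_ch, in_ch, kH, kW)
    n = weights.shape[0]
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 1.5, rows * 1.5), squeeze=False)
    for idx, ax in enumerate(axes.flat):
        ax.axis("off")
        if idx < n:
            ax.imshow(weights[idx, 0].numpy(), cmap="gray")
            ax.set_title(f"kernel {idx}", fontsize=8)
    return fig

# Same shape as LeNet5.conv1, but untrained.
conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)
fig = plot_first_layer_kernels(conv1)
```

Calling `fig.savefig("kernels.png")` writes the grid to disk; in a notebook, `plt.show()` displays it instead.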
