Classic CNN architectures: milestones and core innovations from LeNet to DenseNet
Introduction
The history of convolutional neural networks (CNNs) mirrors the evolution of deep learning itself. From LeNet in 1998 to today's Vision Transformers, each architectural innovation has extended what computer vision can do. This article walks through the classic CNN architectures LeNet, AlexNet, VGG, ResNet and DenseNet, shows how each one overcame the training problems of its time and improved performance, and offers reproducible code and design ideas along the way.
1. LeNet (1998) - the foundational work of deep learning
1.1 Historical background and significance
LeNet-5 was published by Yann LeCun and colleagues in 1998 and is one of the earliest convolutional neural networks to be applied successfully in practice. It was designed for handwritten digit recognition, achieved breakthrough results on the MNIST dataset, and established the basic paradigm that later CNN architectures follow.
LeNet-5 architecture structure
Input layer (32×32) → C1 convolution layer (6 kernels of 5×5) → S2 pooling layer → C3 convolution layer (16 kernels of 5×5) → S4 pooling layer → C5 convolution layer (120 kernels of 5×5, equivalent to a fully connected layer because its input is 5×5) → F6 fully connected layer → output layer
Core innovations
- Combined convolutional layers and pooling (subsampling) layers in one network trained end to end
- Shared the same kernel weights across all spatial positions (parameter sharing)
- Used partial (sparse) connections between the S2 and C3 feature maps
LeNet-5 parameter analysis
- Input: 32×32 grayscale image
- C1: 6×(5×5)+6 = 156 parameters
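- S2: 2×2 average pooling, no parameters (a simplification: the original subsampling layers carry one trainable coefficient and one bias per feature map)
- C3: 16×6×(5×5)+16 = 2,416 parameters (assuming every C3 map connects to all 6 S2 maps; the sparse connection table of the original paper needs only 1,516)
- S4: 2×2 average pooling, no parameters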
- C5: 120×16×(5×5)+120 = 48,120 parameters
- F6: 120×84+84 = 10,164 parameters
- Output: 84×10+10 = 850 parameters
- Total number of parameters: 156 + 2,416 + 48,120 + 10,164 + 850 = 61,706
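The layer-by-layer counts can be checked with a few lines of plain Python. This is a minimal sketch, not the original implementation: the helper names `conv_params` and `fc_params` are made up for illustration, C3 is assumed to be fully connected to all six S2 maps, and the pooling layers are assumed to have no parameters.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a fully connected kernel x kernel convolution."""
    return out_channels * in_channels * kernel * kernel + out_channels


def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Assumption: C3 sees all 6 S2 maps, and pooling layers are parameter-free.
layers = {
    "C1": conv_params(1, 6, 5),
    "C3": conv_params(6, 16, 5),
    "C5": conv_params(16, 120, 5),
    "F6": fc_params(120, 84),
    "Output": fc_params(84, 10),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"Total: {total:,}")
```

Under these assumptions the total comes to 61,706. Swapping in the sparse C3 connection table and the trainable pooling coefficients of the original paper lowers it to roughly 60 thousand.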
1.2 LeNet’s innovation and modern impact
Core innovation logic
- Convolution layer:
  - The same convolution kernel slides across the entire image, so its weights are shared and the number of parameters drops sharply
  - Each output neuron is connected only to a local region of the input, its local receptive field
  - Wherever a digit appears in the image, the same kernel responds to the same feature; the response simply moves with the input (translation equivariance)
- Pooling layer:
  - Shrinks the spatial size of the feature maps and reduces computation
  - Makes the output robust to small shifts, turning the equivariance of convolution into approximate translation invariance
- Hierarchical feature extraction:
  - Shallow layers (C1/S2) extract basic features such as edges and textures
  - Deeper layers (C3/S4/C5) progressively combine them into more abstract, more semantic features (such as parts of a digit)
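The interplay between a shared kernel and pooling can be seen on a toy example. The sketch below is an illustration in plain Python, not LeNet itself: the 6×6 image, the diagonal 2×2 pattern, the bias of -1 and the helper names are all assumptions chosen to keep the numbers small, and max pooling is used where LeNet-5 used average pooling.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu_with_bias(fmap, bias):
    """Add a bias, then clip negatives to zero (keeps only strong matches)."""
    return [[max(0, v + bias) for v in row] for row in fmap]


def max_pool(fmap, size=2, stride=2):
    """Keep the maximum of each size x size window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def image_with_pattern(col, size=6):
    """A blank image with a 2x2 diagonal pattern whose top-left is (0, col)."""
    img = [[0] * size for _ in range(size)]
    img[0][col] = 1
    img[1][col + 1] = 1
    return img


kernel = [[1, 0], [0, 1]]  # matched to the diagonal pattern

a = conv2d_valid(image_with_pattern(0), kernel)
b = conv2d_valid(image_with_pattern(1), kernel)  # same pattern, one pixel right

# Equivariance: shifting the input shifts the feature map by the same amount.
shifted_ok = all(
    b[i][j + 1] == a[i][j]
    for i in range(len(a)) for j in range(len(a[0]) - 1)
)

# Pooling absorbs the one-pixel shift, so both detections look identical.
pooled_a = max_pool(relu_with_bias(a, bias=-1))
pooled_b = max_pool(relu_with_bias(b, bias=-1))

print(shifted_ok, pooled_a, pooled_b)
```

The feature maps `a` and `b` differ (the peak moves with the pattern), while the pooled maps agree. The invariance is only approximate: a shift that crosses a pooling window boundary would still change the pooled output.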
Impact on modern CNN
- Established the basic architecture of "convolution + pooling stacking to extract features, fully connected layer classification"
- Parameter sharing and local connections become the underlying design principles of all CNNs
- The idea of hierarchical feature extraction is still the core of the visual model today
2. AlexNet (2012) - a milestone in the renaissance of deep learning
2.1 Historical significance and breakthroughs
In 2012, AlexNet, proposed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton, won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC 2012) with a Top-5 error rate of 15.3%, far ahead of the 26.2% of the runner-up. The result shocked the computer vision community. It is widely regarded as the start of the deep learning era and brought GPU training into the mainstream.
AlexNet architecture structure
Input: 224×224 RGB image
Feature extraction part
- Conv1: 96 kernels of 11×11, stride 4
- MaxPool1: 3×3 window, stride 2
- Conv2: 256 kernels of 5×5, grouped convolution (adapted to the dual-GPU parallelism of the time)
- MaxPool2: 3×3 window, stride 2
- Conv3: 384 kernels of 3×3
- Conv4: 384 kernels of 3×3, grouped convolution
- Conv5: 256 kernels of 3×3, grouped convolution
- MaxPool3: 3×3 window, stride 2
Classification section
- FC1: 4096 neurons
- FC2: 4096 neurons
- FC3: 1000 neurons (one per ImageNet category)
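To see how the spatial size shrinks through the feature extractor, the standard output-size formula floor((n + 2p - k) / s) + 1 can be applied stage by stage. The sketch below makes two assumptions that are not stated in the layer list: the input is treated as 227×227 (with the nominal 224×224 and no padding, Conv1 would not give the commonly quoted 55×55 map), and the padding values (2 for Conv2, 1 for Conv3 to Conv5) follow common re-implementations.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding); padding values are assumptions.
stages = [
    ("Conv1", 11, 4, 0),
    ("MaxPool1", 3, 2, 0),
    ("Conv2", 5, 1, 2),
    ("MaxPool2", 3, 2, 0),
    ("Conv3", 3, 1, 1),
    ("Conv4", 3, 1, 1),
    ("Conv5", 3, 1, 1),
    ("MaxPool3", 3, 2, 0),
]

size = 227  # assumed effective input size
sizes = {}
for name, kernel, stride, padding in stages:
    size = out_size(size, kernel, stride, padding)
    sizes[name] = size
    print(f"{name}: {size}x{size}")

flattened = 256 * size * size  # Conv5 has 256 output channels
print(f"Features entering FC1: {flattened}")
```

Under these assumptions the map goes 55 → 27 → 27 → 13 → 13 → 13 → 13 → 6, so FC1 receives 256 × 6 × 6 = 9216 features.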
2.2 AlexNet’s technological innovation
Six key breakthroughs
- ReLU activation function
  - Definition: ReLU(x) = max(0, x). Positive inputs pass through unchanged; inputs less than or equal to 0 produce 0.
  - The gradient is exactly 1 for positive inputs, so ReLU does not saturate there and greatly alleviates the vanishing gradient problem caused by saturating activations such as Sigmoid and tanh
  - It is cheap to compute, and the AlexNet paper reports reaching a given training error several times faster than an equivalent tanh network
- Dropout regularization
  - During training, each neuron output in the first two fully connected layers is set to zero with probability 0.5
  - No neuron can rely on the presence of specific other neurons, which breaks "co-adaptation" and substantially reduces overfitting
- Data augmentation
  - Randomly crop a 224×224 region from the 256×256 image
  - Flip horizontally with 50% probability
  - Apply PCA-based color perturbation to the RGB pixel values to simulate lighting changes
  - These transformations greatly enlarge the effective training set (the paper cites a factor of 2048 for crops and flips alone) and effectively reduce overfitting
- Overlapping pooling
  - The pooling window is 3×3 with stride 2, so adjacent pooling windows overlap
  - Compared with traditional non-overlapping pooling (window size equal to the stride), the paper reports slightly lower error and slightly less overfitting
- Local Response Normalization (LRN)
  - Inspired by "lateral inhibition" between biological neurons, LRN makes neighboring channels compete locally to improve generalization
  - LRN did not last: VGG reported no measurable gain from it, and later models such as ResNet rely on the more effective Batch Normalization (BN) instead
- GPU parallel computing
  - Trained for roughly five to six days on two GTX 580 graphics cards
  - Grouped convolution split the model into two halves, one per GPU, which worked around the limited memory of a single card at the time
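ReLU and Dropout are simple enough to write out in a few lines. The sketch below is illustrative rather than AlexNet's actual code: it uses "inverted" dropout, which rescales the surviving activations during training, whereas the original paper kept training outputs unscaled and multiplied outputs by 0.5 at test time. The function names and the `rng` parameter are assumptions made for this example.

```python
import random


def relu(x):
    """ReLU(x) = max(0, x)."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest.

    Rescaling by 1 / (1 - p) keeps the expected activation unchanged,
    so nothing needs to be adjusted at inference time.
    """
    if not training:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [relu(x) for x in [-2.0, -0.5, 0.0, 1.5, 3.0]]
dropped = dropout(activations, p=0.5, rng=rng)

print(activations)
print(dropped)
print(dropout(activations, training=False))
```

Each surviving activation is doubled (1 / 0.5) and each dropped one becomes 0, while the inference call returns the activations untouched.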
AlexNet parameter scale
The original network has about 60 million parameters. The vast majority (well over 90%) sit in the fully connected layers, mainly FC1 and FC2, which is why later architectures focused on slimming down this part.
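A rough count shows where the parameters sit. This is a sketch under stated assumptions: Conv2, Conv4 and Conv5 are treated as grouped convolutions with 2 groups, FC1 is assumed to receive a 256×6×6 feature map, and LRN is parameter-free. The helper names are made up for the example.

```python
def conv_params(in_channels, out_channels, kernel, groups=1):
    """Weights plus biases; each group only sees in_channels / groups inputs."""
    return out_channels * (in_channels // groups) * kernel * kernel + out_channels


def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


conv_layers = {
    "Conv1": conv_params(3, 96, 11),
    "Conv2": conv_params(96, 256, 5, groups=2),
    "Conv3": conv_params(256, 384, 3),
    "Conv4": conv_params(384, 384, 3, groups=2),
    "Conv5": conv_params(384, 256, 3, groups=2),
}
fc_layers = {
    "FC1": fc_params(256 * 6 * 6, 4096),  # assumes a 6x6x256 input
    "FC2": fc_params(4096, 4096),
    "FC3": fc_params(4096, 1000),
}

conv_total = sum(conv_layers.values())
fc_total = sum(fc_layers.values())
total = conv_total + fc_total
fc_share = fc_total / total

print(f"Convolutional layers: {conv_total:,}")
print(f"Fully connected layers: {fc_total:,}")
print(f"Total: {total:,} (FC share: {fc_share:.1%})")
```

With these assumptions the total is just under 61 million, of which the three fully connected layers hold about 96%. FC1 alone accounts for roughly 37.8 million.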
3. VGGNet (2014) - a model of depth and unity
3.1 VGGNet design concept
In 2014, the Visual Geometry Group at the University of Oxford proposed VGGNet, known for its minimalist, uniform structure and its systematic exploration of depth. By stacking small convolution kernels layer after layer, VGGNet showed that network depth is a key factor in performance and established the design paradigm of "building deep networks with small convolution kernels".
VGGNet core design rules
- Unified convolution kernel: The convolutional layers of VGG-16 and VGG-19 use only small 3×3 kernels
- Unified pooling: All pooling layers use a 2×2 window with stride 2
- Channel doubling: After each pooling step halves the spatial size, the number of channels doubles (64 → 128 → 256 → 512, then stays at 512), which keeps the computation per layer roughly balanced
- Fully connected ending: Three consecutive fully connected layers complete the classification
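Mainstream VGG versions
- VGG-16: 13 convolutional layers + 3 fully connected layers, about 138 million parameters
- VGG-19: 16 convolutional layers + 3 fully connected layers, about 144 million parameters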
3.2 Architectural advantages of VGGNet
Two core advantages of small convolution kernel stacking
- Equivalent receptive field, richer nonlinearity
  - Two consecutive 3×3 convolutions (stride 1) have the same theoretical receptive field as one 5×5 convolution
  - Three consecutive 3×3 convolutions have the same receptive field as one 7×7 convolution
  - The stack applies a ReLU after every convolution, so the network has more nonlinearity, more expressive power, and can learn more complex decision boundaries
- Higher parameter efficiency (comparing layers with C input channels and C output channels)
  - One 5×5 convolutional layer needs C × (5×5) × C + C = 25C² + C parameters (including bias)
  - Two 3×3 convolutional layers need 2 × [C × (3×3) × C + C] = 18C² + 2C parameters (including bias)
  - The saving is about 28%: exactly 1 − 18/25 = 28% if the bias terms are ignored, and marginally less with them. For three 3×3 layers versus one 7×7 layer the saving rises to about 45% (27C² versus 49C²)
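Both claims can be verified numerically. The sketch below assumes stride-1 convolutions and equal input and output channel counts C. The function names and the choice C = 64 are arbitrary, made only for illustration.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the field by kernel - 1
    return rf


def conv_params(channels, kernel, layers=1, bias=True):
    """Parameters of `layers` stacked convs with C input and C output channels."""
    per_layer = kernel * kernel * channels * channels
    if bias:
        per_layer += channels
    return layers * per_layer


C = 64  # arbitrary channel count for the comparison

one_5x5 = conv_params(C, 5)
two_3x3 = conv_params(C, 3, layers=2)
saving_5 = 1 - two_3x3 / one_5x5

one_7x7 = conv_params(C, 7)
three_3x3 = conv_params(C, 3, layers=3)
saving_7 = 1 - three_3x3 / one_7x7

print(stacked_receptive_field(2), stacked_receptive_field(3))
print(f"5x5: {one_5x5:,} vs two 3x3: {two_3x3:,} (saving {saving_5:.1%})")
print(f"7x7: {one_7x7:,} vs three 3x3: {three_3x3:,} (saving {saving_7:.1%})")
```

The parameter count grows with C², so the relative saving barely depends on C: about 28% against a 5×5 layer and about 45% against a 7×7 layer.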
Thanks to this simple and regular design, VGG-16 and VGG-19 are still widely used as backbone networks for feature extraction, even though their three fully connected layers make the full models heavy.
4. ResNet (2015) - solving the deep network training problem
4.1 Proposal of Residual Learning
In 2015, Kaiming He and colleagues at Microsoft Research proposed ResNet and introduced the residual connection. It addressed the degradation problem that had long troubled deep network training: once a network is made deeper, its training error goes up instead of down. ResNet made it possible to train networks with more than a hundred, and even more than a thousand, layers in a stable way, and it won ILSVRC 2015 with a Top-5 error rate of 3.57%.
The essence of network degradation problem
In theory, a deeper network should perform at least as well as a shallower one, because the extra layers only need to learn an identity mapping (output equal to input). In practice, it is very hard for a stack of nonlinear layers to fit the identity mapping, so plain networks become harder to optimize as they get deeper. Since the training error itself rises, overfitting is not the cause. Gradient decay during backpropagation adds to the difficulty, although the ResNet paper observes degradation even in networks trained with Batch Normalization.
The core idea of residual learning
ResNet changes the learning target from the desired mapping H(x) itself to the residual mapping F(x) = H(x) - x. The output of a residual block is therefore:
H(x) = F(x) + x, that is, Output = Residual + Input
If the desired mapping is the identity, the block only needs to drive the residual F(x) to 0, which is much easier than fitting the identity directly (in principle, setting all convolution weights to zero is enough). More importantly, the skip connection gives gradients a direct "highway" during backpropagation: the derivative of the output with respect to the input always contains an identity term, which substantially alleviates the vanishing gradient problem.
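The "zero weights give the identity" argument can be checked on a toy block. The sketch below is a deliberately simplified stand-in, not ResNet's actual block: it uses small fully connected layers in plain Python instead of convolutions, omits Batch Normalization and the ReLU that ResNet applies after the addition, and all function names are made up for illustration.

```python
def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def plain_block(x, w1, b1, w2, b2):
    """Plain stack: the block must produce H(x) entirely by itself."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


def residual_block(x, w1, b1, w2, b2):
    """Residual block: the layers learn F(x), the skip connection adds x."""
    residual = plain_block(x, w1, b1, w2, b2)
    return [r + xi for r, xi in zip(residual, x)]


dim = 3
zero_w = [[0.0] * dim for _ in range(dim)]
zero_b = [0.0] * dim
x = [1.0, -2.0, 3.0]

plain_out = plain_block(x, zero_w, zero_b, zero_w, zero_b)
residual_out = residual_block(x, zero_w, zero_b, zero_w, zero_b)

print(plain_out)     # all zeros: the input is lost
print(residual_out)  # equal to x: the block is an identity mapping
```

With all weights at zero, the plain stack wipes out its input while the residual block passes it through unchanged. The same addition is what gives the gradient its direct path: the derivative of F(x) + x with respect to x is the derivative of F plus the identity.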
5. DenseNet (2016) - dense connections taken to the extreme
5.1 Innovation in dense connections
In 2016, researchers from Cornell University and Tsinghua University proposed DenseNet, which pushes feature reuse to the extreme through dense connections. Within a dense block, each layer receives the feature maps of all preceding layers as input and passes its own output to all subsequent layers.
The core idea of dense connection
Suppose a dense block has L layers. The input of layer l does not come only from layer l-1: the feature maps from layer 0 (the block input) through layer l-1 are all concatenated along the channel dimension, and the concatenated tensor is fed to a composite function H_l (usually BN → ReLU → Conv), so that x_l = H_l([x_0, x_1, ..., x_(l-1)]).
Core components of DenseNet
- Dense Block: Implements the dense connections internally; the feature map keeps growing along the channel dimension
- Transition Layer: Sits between Dense Blocks; a 1×1 convolution compresses the number of channels (usually halved) and 2×2 average pooling halves the spatial size
- Growth Rate: The number of channels each layer in a Dense Block adds, written k (common values are 12 or 32); it controls the width of the model
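How these three components interact is easiest to see by tracking the channel count. The sketch below is bookkeeping only, with no actual tensors. The configuration (64 initial channels, growth rate 32, blocks of 6, 12, 24 and 16 layers, compression 0.5) is an assumption taken from the commonly cited DenseNet-121 setup, and the function names are made up for the example.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Each layer concatenates growth_rate new channels onto its input."""
    return in_channels + num_layers * growth_rate


def transition_channels(in_channels, compression=0.5):
    """The 1x1 convolution of a transition layer compresses the channels."""
    return int(in_channels * compression)


def trace_channels(init_channels, block_layers, growth_rate, compression=0.5):
    """Channel count after every dense block and transition layer."""
    channels = init_channels
    history = []
    for index, num_layers in enumerate(block_layers, start=1):
        channels = dense_block_channels(channels, num_layers, growth_rate)
        history.append((f"dense_block_{index}", channels))
        if index < len(block_layers):  # no transition after the last block
            channels = transition_channels(channels, compression)
            history.append((f"transition_{index}", channels))
    return history


# Assumed DenseNet-121-style configuration.
history = trace_channels(
    init_channels=64, block_layers=(6, 12, 24, 16), growth_rate=32
)
for name, channels in history:
    print(f"{name}: {channels} channels")
```

Inside a block the width grows linearly (the l-th layer sees the block input plus (l - 1) × k channels), and each transition layer halves it again. This is why a small growth rate is enough to keep the network narrow.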
Core advantages of DenseNet
- Maximum feature reuse: Each layer can directly access the features produced by all preceding layers, which reduces redundant learning
- High parameter efficiency: At comparable accuracy, DenseNet usually needs far fewer parameters than ResNet (the DenseNet paper reports roughly half on ImageNet)
- Less gradient vanishing: Dense connections give shallow layers multiple short paths for the gradient to flow back
- Smoother information flow: Feature concatenation lets information pass between layers with almost no bottleneck
6. Comparison and evolution summary of classic CNN architecture
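6.1 Comparison of architecture core indicators
- LeNet-5 (1998): 5 learnable layers, about 60 thousand parameters; convolution + pooling, parameter sharing
- AlexNet (2012): 8 learnable layers, about 60 million parameters; ReLU, Dropout, GPU training; ILSVRC 2012 Top-5 error 15.3%
- VGGNet (2014): 16 or 19 learnable layers; stacked 3×3 convolutions
- ResNet (2015): more than a hundred layers; residual connections; ILSVRC 2015 Top-5 error 3.57%
- DenseNet (2016): up to 200+ layers; dense connections via channel concatenation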
6.2 Evolution of architectural design concepts
- From shallow to deep
  - From the 5 learnable layers of LeNet-5 to the 200+ layers of DenseNet
  - Core obstacles: vanishing gradients and training degradation → solutions: ReLU, Batch Normalization, residual connections
- From large kernels to stacked small kernels
  - LeNet uses 5×5 kernels; AlexNet uses kernels as large as 11×11 in its first layer and 5×5 in its second
  - From VGGNet onward, stacked 3×3 kernels become the norm → equivalent receptive field, more nonlinearity, fewer parameters
- From plain connections to skip connections
  - Traditional networks: each layer connects only to the next
  - ResNet: residual skip connections implemented by addition
  - DenseNet: dense skip connections implemented by channel concatenation → more paths for gradients and information to propagate
- From single components to standardized composite components
  - Basic components: convolution, pooling, activation
  - Modern components: fixed combinations such as BN → ReLU → Conv, or the more refined bottleneck BN → ReLU → 1×1 Conv → BN → ReLU → 3×3 Conv in DenseNet
7. Summary
The history of classic CNN architectures is an evolution from "feasible" to "extremely optimized". Each breakthrough removed the main obstacle that deep learning faced at the time:
Core Milestones
- LeNet: Laid out the basic structure of CNNs and introduced parameter sharing and local connections
- AlexNet: Introduced key techniques such as ReLU and Dropout and opened the deep learning era with the help of GPUs
- VGGNet: Used uniform 3×3 convolutions to push depth further and established how much depth matters for performance
- ResNet: Residual connections largely overcame the training degradation of deep networks and made networks with hundreds of layers trainable
- DenseNet: Dense connections maximize feature reuse and reach high accuracy with fewer parameters
Core technology heritage
- Parameter Sharing and Local Connection: the essential properties of CNN
- ReLU: preferred activation function for deep networks
- Residual connection: standard feature of modern deep networks
- Hierarchical feature extraction: the core idea of all visual models
💡 Important reminder: Prioritize implementing and thoroughly understanding ResNet-18 and ResNet-50. They remain among the most widely used backbone networks and underpin many cutting-edge vision architectures.