Analysis of classic CNN architecture: Milestone evolution and core innovation from LeNet to DenseNet

Introduction

The history of convolutional neural networks (CNNs) is, in large part, the history of deep learning for vision. From LeNet in 1998 to today's Vision Transformers, each architectural innovation has extended what computer vision can do. This article walks through the classic CNN architectures LeNet, AlexNet, VGG, ResNet and DenseNet, shows how each one overcame the training problems of its time and improved performance, and provides reproducible code along with the design reasoning behind it.


1. LeNet (1998) - the foundational work of deep learning

1.1 Historical background and significance

LeNet-5 was published by Yann LeCun and colleagues in 1998 and is widely regarded as the first practical convolutional neural network. It was designed for handwritten digit recognition, achieved strong results on the MNIST dataset, and established the basic paradigm that later CNN architectures follow.

LeNet-5 architecture structure

Input layer (32×32) → C1 convolution layer (six 5×5 kernels) → S2 pooling layer → C3 convolution layer (sixteen 5×5 kernels) → S4 pooling layer → C5 convolution layer (120 5×5 kernels; equivalent to a fully connected layer because its input is 5×5) → F6 fully connected layer → output layer

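Core innovations

Combines convolutional layers and pooling (subsampling) layers in one end-to-end trainable network
Uses parameter sharing: the same kernel weights are applied at every image position
Uses local (partial) connectivity: each unit is connected only to a small neighborhood of its input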

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    """
    LeNet-5 implementation.
    """
    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()
        
        # C1: convolution layer - six 5×5 kernels
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        # S2: average pooling layer - 2×2 window
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)
        # C3: convolution layer - sixteen 5×5 kernels
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        # S4: average pooling layer - 2×2 window
        self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)
        # C5: convolution layer acting as a fully connected layer - 120 5×5 kernels
        self.conv3 = nn.Conv2d(16, 120, kernel_size=5)
        # F6: fully connected layer - 84 neurons
        self.fc1 = nn.Linear(120, 84)
        # Output layer - 10 neurons (one per digit class)
        self.fc2 = nn.Linear(84, num_classes)
        
        # Activation function
        self.tanh = nn.Tanh()
    
    def forward(self, x):
        # C1: convolution + activation (32×32 → 28×28)
        x = self.tanh(self.conv1(x))
        # S2: pooling (28×28 → 14×14)
        x = self.pool1(x)
        # C3: convolution + activation (14×14 → 10×10)
        x = self.tanh(self.conv2(x))
        # S4: pooling (10×10 → 5×5)
        x = self.pool2(x)
        # C5: convolution + activation (5×5 → 1×1)
        x = self.tanh(self.conv3(x))
        # Flatten to (batch, 120)
        x = x.view(x.size(0), -1)
        # F6: fully connected + activation
        x = self.tanh(self.fc1(x))
        # Output layer (raw class scores)
        x = self.fc2(x)
        return x

LeNet-5 parameter analysis

Input: 32×32 grayscale image
C1: 6×(5×5)+6 = 156 parameters
S2: 2×2 average pooling, no parameters
C3: 16×6×(5×5)+16 = 2,416 parameters
S4: 2×2 average pooling, no parameters
C5: 120×16×(5×5)+120 = 48,120 parameters
F6: 120×84+84 = 10,164 parameters
Output: 84×10+10 = 850 parameters
Total number of parameters: ~61,700
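
The counts above can be checked with a few lines of plain Python. This is a minimal sketch that assumes the fully connected C3 used in the PyTorch code of this section (the original paper's sparse C3 connection table yields a smaller C3 count); the helper names `conv_params` and `fc_params` are illustrative, not part of any library.

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus one bias per output channel for a k x k convolution."""
    return out_ch * in_ch * k * k + out_ch

def fc_params(in_features, out_features):
    """Weights plus one bias per output unit for a fully connected layer."""
    return in_features * out_features + out_features

lenet5 = {
    "C1": conv_params(1, 6, 5),
    "C3": conv_params(6, 16, 5),     # assumes every C3 map sees all 6 S2 maps
    "C5": conv_params(16, 120, 5),
    "F6": fc_params(120, 84),
    "Output": fc_params(84, 10),
}
total = sum(lenet5.values())

for name, count in lenet5.items():
    print(f"{name}: {count:,}")
print(f"Total: {total:,}")  # 61,706
```

The pooling layers contribute nothing here because `nn.AvgPool2d` has no trainable weights.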

1.2 LeNet’s innovation and modern impact

Core innovation logic

Convolution layer:

The same convolution kernel slides across the entire image, so its weights are shared across positions and the parameter count drops sharply.
Each output neuron is connected only to a local region of the input, its local receptive field.
Because the kernel is the same everywhere, a stroke produces the same response wherever it appears in the image; the response simply moves with it (translation equivariance).

Pooling layer:

Compresses the spatial size of the feature maps and reduces computation
Makes the output more robust to small translations, which gives the network approximate translation invariance
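
A toy 1-D example may make the two properties concrete. It is only a sketch: the kernel, the signal and the helper `correlate1d` are made up for illustration, and global max pooling stands in for the 2×2 pooling used in LeNet.

```python
def correlate1d(signal, kernel):
    """Valid 1-D cross-correlation: slide the same kernel over every position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1, 0, -1]                  # a toy edge detector (shared weights)
signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + signal[:-1]          # same pattern, moved one step to the right

response = correlate1d(signal, kernel)
response_shifted = correlate1d(shifted, kernel)

# Convolution: the response pattern moves together with the input.
print(response)          # [-1, -2, 0, 2, 1, 0]
print(response_shifted)  # [0, -1, -2, 0, 2, 1]

# Global max pooling discards the position, so the pooled value is unchanged.
print(max(response), max(response_shifted))  # 2 2
```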

Hierarchical feature extraction:

The shallow layer (C1/S2) is responsible for extracting basic features such as edges and textures.
Deep layers (C3/S4/C5) progressively combine these into more abstract, semantically richer features (such as parts of digits)

Impact on modern CNN

Established the basic architecture of "convolution + pooling stacking to extract features, fully connected layer classification"
Parameter sharing and local connections become the underlying design principles of all CNNs
The idea of hierarchical feature extraction is still the core of the visual model today

2. AlexNet (2012) - a milestone in the renaissance of deep learning

2.1 Historical significance and breakthroughs

In 2012, AlexNet, proposed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton, won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC 2012) with a Top-5 error rate of 15.3%, far ahead of the runner-up at 26.2%. The result shocked the computer vision community, is commonly taken as the start of the deep learning era, and brought GPU training to wide attention.

AlexNet architecture structure

Input: 224×224 RGB image

Feature extraction part

Conv1: 96 11×11 convolution kernels, stride 4
MaxPool1: 3×3 window, stride 2
Conv2: 256 5×5 convolution kernels, grouped convolution (to fit the dual-GPU setup of the time)
MaxPool2: 3×3 window, stride 2
Conv3: 384 3×3 convolution kernels
Conv4: 384 3×3 convolution kernels, grouped convolution
Conv5: 256 3×3 convolution kernels, grouped convolution
MaxPool3: 3×3 window, stride 2

Classification section

FC1: 4096 neurons
FC2: 4096 neurons
FC3: 1000 neurons (one per ImageNet category)
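
To see how a 224×224 input shrinks to the 6×6 map that feeds FC1, the spatial sizes can be traced with the standard output-size formula. The sketch below assumes the padding values used in the PyTorch implementation in this section (for example `padding=2` on Conv1); the helper `out_size` is illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
# (name, kernel, stride, padding); paddings are assumptions taken from the code
layers = [
    ("Conv1", 11, 4, 2),
    ("MaxPool1", 3, 2, 0),
    ("Conv2", 5, 1, 2),
    ("MaxPool2", 3, 2, 0),
    ("Conv3", 3, 1, 1),
    ("Conv4", 3, 1, 1),
    ("Conv5", 3, 1, 1),
    ("MaxPool3", 3, 2, 0),
]

trace = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, size))
    print(f"{name}: {size}x{size}")
# Sizes in order: 55, 27, 27, 13, 13, 13, 13, 6
```

With 256 channels after Conv5, the flattened input to FC1 therefore has 256 × 6 × 6 = 9216 features.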

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """
    AlexNet implementation (grouped convolutions dropped and channel counts
    simplified so the model fits on a single GPU).
    """
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        
        # Feature extractor
        self.features = nn.Sequential(
            # Conv1: 11×11 kernels, stride 4 (96 kernels in the paper, 64 here);
            # padding=2 gives a 55×55 output for a 224×224 input
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            # MaxPool1: 3×3 window, stride 2
            nn.MaxPool2d(kernel_size=3, stride=2),
            
            # Conv2: 256 grouped kernels in the paper, simplified to 192
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            # MaxPool2: 3×3 window, stride 2
            nn.MaxPool2d(kernel_size=3, stride=2),
            
            # Conv3: 384 kernels, as in the paper
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            
            # Conv4: 384 kernels in the paper, simplified to 256
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            
            # Conv5: 256 kernels, as in the paper
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # MaxPool3: 3×3 window, stride 2
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        
        # Adaptive average pooling so that any input size yields a 6×6 map
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        
        # Classifier
        self.classifier = nn.Sequential(
            # Dropout with drop probability 0.5
            nn.Dropout(p=0.5),
            # FC1: 4096 neurons
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            # Dropout with drop probability 0.5
            nn.Dropout(p=0.5),
            # FC2: 4096 neurons
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            # FC3: one output per class (1000 for ImageNet)
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

2.2 AlexNet’s technological innovation

Six key breakthroughs

ReLU activation function

ReLU is defined as f(x) = max(0, x): positive inputs pass through unchanged, and inputs less than or equal to 0 produce 0.

The gradient is exactly 1 for positive inputs, so ReLU does not saturate there, which greatly reduces the vanishing-gradient problem caused by saturating activations such as sigmoid and tanh.
It is cheap to compute; the AlexNet paper reports reaching the same training error on CIFAR-10 about six times faster than an equivalent tanh network.
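
The saturation argument can be illustrated numerically. The sketch below compares the derivatives of ReLU and tanh and then multiplies them along a toy chain of layers; the chain setup (every layer seeing the same pre-activation of 3.0) is an artificial assumption chosen only to make the effect visible.

```python
import math

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which shrinks as |x| grows."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.5, 3.0, 6.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")

# Toy chain of 10 layers that all see pre-activation 3.0 (artificial setup):
# backpropagation multiplies the per-layer derivatives together.
depth = 10
relu_chain = relu_grad(3.0) ** depth
tanh_chain = tanh_grad(3.0) ** depth
print(relu_chain, tanh_chain)
```

The ReLU product stays at 1, while the tanh product collapses toward zero; for negative inputs ReLU passes no gradient at all, which is the trade-off behind variants such as Leaky ReLU.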

Dropout regularization

During training, the output of each neuron in the first two fully connected layers is set to zero with probability 0.5.
This prevents a neuron from relying on the presence of specific other neurons, breaking up "co-adaptation" and substantially reducing overfitting
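
As a rough illustration, dropout on a list of activations might look like the sketch below. It uses the "inverted" form found in modern frameworks (survivors are rescaled during training), which is an assumption on my part; the original AlexNet paper instead multiplied the outputs by 0.5 at test time. The function name and the `rng` parameter are illustrative.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)          # evaluation mode: identity
    scale = 1.0 / (1.0 - p)          # keeps the expected activation unchanged
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)               # fixed seed so the run is repeatable
activations = [1.0] * 10
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)

print(train_out)  # a mix of 0.0 and 2.0
print(eval_out)   # unchanged copy of the input
```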

Data Augmentation

Randomly crop 224×224 patches from the 256×256 training images
Horizontal flip with 50% probability
Apply PCA-based color perturbation to the RGB pixel values to simulate lighting changes
According to the paper, cropping and flipping alone enlarge the effective training set by a factor of 2048, which substantially reduces overfitting.

Overlap Pooling

The pooling window is 3×3 with stride 2, so adjacent pooling windows overlap
Compared with traditional non-overlapping pooling (window size equal to the stride), the paper reports slightly lower error rates and a model that is slightly harder to overfit

Local Response Normalization (LRN)

Inspired by the "lateral inhibition" observed between biological neurons, LRN makes neighboring channels compete locally, which the authors found improved generalization.
Later work largely abandoned it: the VGG paper found no benefit from LRN, and networks such as ResNet rely on batch normalization (BN) instead

GPU parallel computing

Trained for five to six days on two GTX 580 GPUs with 3 GB of memory each
The network was split into two halves that ran on the two GPUs (hence the grouped convolutions), which worked around the limited memory of a single card at the time.

AlexNet parameter scale

The original network has about 60 million parameters. Roughly 96% of them sit in the three fully connected layers, and FC1 and FC2 alone account for almost 90%, which is why later architectures targeted these layers when slimming models down.
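
The parameter budget can be reproduced from the layer list given earlier. The sketch below assumes the original two-group layout (Conv2, Conv4 and Conv5 grouped) and a 6×6×256 input to FC1; the helper names are illustrative.

```python
def conv_params(in_ch, out_ch, k, groups=1):
    """k x k convolution; with groups, each kernel sees in_ch / groups channels."""
    return out_ch * (in_ch // groups) * k * k + out_ch

def fc_params(n_in, n_out):
    """Fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

conv = [
    conv_params(3, 96, 11),               # Conv1
    conv_params(96, 256, 5, groups=2),    # Conv2 (grouped)
    conv_params(256, 384, 3),             # Conv3
    conv_params(384, 384, 3, groups=2),   # Conv4 (grouped)
    conv_params(384, 256, 3, groups=2),   # Conv5 (grouped)
]
fc = [
    fc_params(256 * 6 * 6, 4096),         # FC1
    fc_params(4096, 4096),                # FC2
    fc_params(4096, 1000),                # FC3
]

total = sum(conv) + sum(fc)
fc_share = sum(fc) / total
print(f"conv: {sum(conv):,}  fc: {sum(fc):,}  total: {total:,}")
print(f"share of parameters in FC layers: {fc_share:.1%}")
```

The convolutional layers do most of the computation but hold only a few percent of the weights, which is the imbalance that global average pooling in later networks removes.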

3. VGGNet (2014) - a model of depth and unity

3.1 VGGNet design concept

In 2014, the Visual Geometry Group at the University of Oxford proposed VGGNet, known for its simple, uniform structure and its systematic exploration of depth. By stacking small convolution kernels, VGGNet showed that network depth is a key factor in performance and established the design paradigm of "building deep networks with small convolution kernels".

VGGNet core design rules

Uniform convolution kernels: the convolutional layers use small 3×3 kernels throughout
Uniform pooling: all pooling layers use a 2×2 window with stride 2
Channel doubling: after each pooling step halves the spatial size, the number of channels doubles (64 → 128 → 256 → 512, then stays at 512), which keeps the cost per stage roughly balanced
Fully connected head: three consecutive fully connected layers perform the final classification

Mainstream VGG version

Version	Number of convolutional layers	Number of fully connected layers	Total number of parameters (ImageNet)
VGG-11	8	3	~133M
VGG-13	10	3	~133M
VGG-16	13	3	~138M
VGG-19	16	3	~144M
import torch
import torch.nn as nn

class VGG(nn.Module):
    """
    VGG implementation.
    """
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

def make_layers(cfg, batch_norm=False):
    """
    Build the VGG feature layers from a configuration list.
    """
    layers = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

# VGG configurations (numbers are output channels, 'M' marks a max-pooling layer)
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],  # VGG-11
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],  # VGG-13
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],  # VGG-16
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],  # VGG-19
}

def vgg16(pretrained=False, **kwargs):
    """
    VGG-16 (the most commonly used variant).
    """
    if pretrained:
        kwargs['init_weights'] = False
    model = VGG(make_layers(cfgs['D']), **kwargs)
    return model

3.2 Architectural advantages of VGGNet

Two core advantages of small convolution kernel stacking

Equivalent receptive field, richer nonlinearity

Using two 3×3 convolutions in succession, the theoretical receptive field size is equivalent to a 5×5 convolution
Using three 3×3 convolutions in succession, the receptive field is equivalent to a 7×7 convolution
However, there will be more ReLU activations during the stacking of small convolution kernels, so the network has stronger expressive ability and can learn more complex decision boundaries.

Higher parameter efficiency

Take C input channels and C output channels as an example:

A single 5×5 convolutional layer needs C × (5×5) × C + C = 25C² + C parameters (including bias)
Two stacked 3×3 convolutional layers need 2 × [C × (3×3) × C + C] = 18C² + 2C parameters (including bias)
The saving is about 28% (exactly 28% if the bias terms are ignored). For three 3×3 layers versus one 7×7 layer, 27C² versus 49C², the saving is about 45%.
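
Both claims, the equivalent receptive field and the parameter saving, are easy to check numerically. This is a small sketch with C = 64 chosen arbitrarily; the helper names are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convolutions of the same size."""
    return 1 + num_layers * (kernel - 1)

def conv_params(channels, kernel, bias=True):
    """One conv layer with `channels` input and output channels."""
    return channels * channels * kernel * kernel + (channels if bias else 0)

print(stacked_receptive_field(2))  # 5 -> two 3x3 layers see a 5x5 window
print(stacked_receptive_field(3))  # 7 -> three 3x3 layers see a 7x7 window

C = 64  # example channel count; the ratio barely depends on C
single_5x5 = conv_params(C, 5)
two_3x3 = 2 * conv_params(C, 3)
saving = 1 - two_3x3 / single_5x5
print(f"5x5: {single_5x5:,}  two 3x3: {two_3x3:,}  saving: {saving:.1%}")

single_7x7 = conv_params(C, 7, bias=False)
three_3x3 = 3 * conv_params(C, 3, bias=False)
saving_7 = 1 - three_3x3 / single_7x7
print(f"7x7 vs three 3x3 (no bias): saving {saving_7:.1%}")
```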

Thanks to this simple, regular design, VGG-16 and VGG-19 are still widely used as feature-extraction backbones.

4. ResNet (2015) - Solve the problem of deep network training

4.1 Proposal of Residual Learning

In 2015, Kaiming He and colleagues at Microsoft Research proposed ResNet. Its residual (skip) connections addressed the degradation problem of deep networks, in which adding more layers makes the training error go up rather than down. ResNet made it practical to train networks with more than a hundred layers, and experimentally more than a thousand, and it won ILSVRC 2015 with a Top-5 error rate of 3.57%.

The essence of network degradation problem

In principle, a deeper network should be at least as good as a shallower one: the extra layers only need to learn the identity mapping (output equal to input). In practice, it is hard for a stack of nonlinear layers to fit the identity mapping, so plain deep networks end up with higher training error. Because the error rises on the training set, this is not overfitting; the ResNet authors also argue that it is not simply vanishing gradients, since their plain networks were trained with batch normalization, but an optimization difficulty.

The core idea of residual learning

ResNet changes the learning target from the desired mapping H(x) to the residual mapping F(x) = H(x) - x. The output of a residual block is then:

H(x) = F(x) + x

If the desired mapping is the identity, the block only needs to drive F(x) to 0, which is much easier than fitting the identity with nonlinear layers (in principle, it is enough to push the convolution weights toward zero). Just as importantly, the skip connection gives gradients a direct "highway" during backpropagation, which greatly alleviates the vanishing-gradient problem.
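
A scalar toy model shows why the skip path helps gradients. For y = x + F(x), the derivative is 1 + F'(x), so the identity term survives even when F'(x) is tiny. The sketch below assumes every layer has the same small slope, which is a deliberate simplification and not how a real network behaves.

```python
def plain_chain_grad(layer_slope, depth):
    """d(output)/d(input) for y = f(f(...f(x))) when each f has this slope."""
    return layer_slope ** depth

def residual_chain_grad(layer_slope, depth):
    """Same chain with y = x + f(x) per layer: each factor becomes 1 + slope."""
    return (1.0 + layer_slope) ** depth

depth = 50
slope = 0.01  # hypothetical small slope of each residual branch F

plain = plain_chain_grad(slope, depth)
residual = residual_chain_grad(slope, depth)
print(f"plain:    {plain:.3e}")     # vanishes
print(f"residual: {residual:.3f}")  # stays of order 1

# If F is exactly zero, the residual chain is the identity with gradient 1.
print(residual_chain_grad(0.0, depth))  # 1.0
```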

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """
    Basic residual block (used in ResNet-18 and ResNet-34).
    """
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        # Residual connection: add the skip path (identity) to the conv output
        out += identity
        out = self.relu(out)

        return out

class Bottleneck(nn.Module):
    """
    Bottleneck residual block (used in ResNet-50, ResNet-101 and ResNet-152).
    expansion=4: channels are reduced first and expanded afterwards, which cuts computation.
    """
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        # 1×1 convolution: reduce the channel count
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        # 3×3 convolution: extract features (carries the stride)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        # 1×1 convolution: expand the channel count
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

class ResNet(nn.Module):
    """
    ResNet network.
    """
    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.inplanes = 64
        # Stem: 7×7 convolution + 3×3 max pooling to reduce the spatial size quickly
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        # Four stages of residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        # Global average pooling + fully connected classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

def resnet18(**kwargs):
    """ResNet-18 (lightweight backbone)."""
    return ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)

def resnet50(**kwargs):
    """ResNet-50 (the most widely used standard backbone)."""
    return ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
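
The numbers in names such as ResNet-18 and ResNet-50 count the weighted layers, and they follow directly from the block lists passed to the constructor. The sketch below assumes the usual convention (stem convolution + convolutions inside the blocks + final fully connected layer, ignoring the 1×1 downsample convolutions); the configurations for ResNet-34, ResNet-101 and ResNet-152 are taken from the ResNet paper rather than from the code above.

```python
def resnet_depth(layers, convs_per_block):
    """Stem conv + all convs inside residual blocks + final FC layer."""
    return 1 + convs_per_block * sum(layers) + 1

configs = {
    # name: (blocks per stage, convs per block)
    "ResNet-18": ([2, 2, 2, 2], 2),    # BasicBlock: two 3x3 convs
    "ResNet-34": ([3, 4, 6, 3], 2),
    "ResNet-50": ([3, 4, 6, 3], 3),    # Bottleneck: 1x1, 3x3, 1x1
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}

depths = {name: resnet_depth(layers, n) for name, (layers, n) in configs.items()}
for name, depth in depths.items():
    print(f"{name}: {depth} weighted layers")
```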

5. DenseNet (2016) - The ultimate in dense connections

5.1 Innovation in dense connections

In 2016, researchers from Cornell University and Tsinghua University proposed DenseNet, which pushes feature reuse to its limit through dense connections. Within a dense block, each layer receives the feature maps of all preceding layers as input and passes its own output to all subsequent layers.

The core idea of dense connection

Suppose a dense block has L layers. The input of layer l is not just the output of the previous layer: the feature maps of all preceding layers, from layer 0 (the block input) to layer l-1, are concatenated along the channel dimension. This concatenated tensor is fed to a composite function H_l (typically BN → ReLU → Conv).

Core components of DenseNet

Dense Block: layers inside a block are densely connected, so the feature map keeps growing along the channel dimension.
Transition Layer: sits between Dense Blocks, compresses the number of channels (usually by half) and halves the spatial size
Growth Rate: the number of new feature-map channels each layer in a Dense Block produces, written k (common values are 12 or 32); it controls the width of the model
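
The interplay of growth rate and transition layers is easiest to see by tracking the channel count through the network. The sketch below assumes the DenseNet-121 settings (k = 32, blocks of 6, 12, 24 and 16 layers, 64 initial channels, transitions that halve the channels); `densenet_channels` is an illustrative helper.

```python
def densenet_channels(growth_rate=32, block_config=(6, 12, 24, 16),
                      num_init_features=64):
    """Track the channel count after each dense block and transition layer."""
    channels = num_init_features
    trace = []
    for i, num_layers in enumerate(block_config):
        # Every layer appends growth_rate new channels to the running stack.
        channels += num_layers * growth_rate
        trace.append((f"denseblock{i + 1}", channels))
        if i != len(block_config) - 1:
            channels //= 2  # transition layer halves the channels
            trace.append((f"transition{i + 1}", channels))
    return trace

trace = densenet_channels()
for name, channels in trace:
    print(f"{name}: {channels} channels")
# Channel counts in order: 256, 128, 512, 256, 1024, 512, 1024
```

Inside a block, layer l therefore receives `block_input + (l - 1) * k` channels, which is why each layer starts with a 1×1 bottleneck convolution to keep the 3×3 convolution cheap.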

import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import OrderedDict

class _DenseLayer(nn.Sequential):
    """
    DenseNet layer: BN → ReLU → 1×1 Conv → BN → ReLU → 3×3 Conv.
    The 1×1 convolution first reduces the channel count to cut computation.
    """
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module('norm1', nn.BatchNorm2d(num_input_features))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size *
                                           growth_rate, kernel_size=1, stride=1,
                                           bias=False))
        self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate,
                                           kernel_size=3, stride=1, padding=1,
                                           bias=False))
        self.drop_rate = drop_rate

    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate,
                                   training=self.training)
        # Dense connection: concatenate the new features with all earlier features
        return torch.cat([x, new_features], 1)

class _DenseBlock(nn.Sequential):
    """Dense block: a sequence of _DenseLayer modules."""
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i * growth_rate,
                               growth_rate, bn_size, drop_rate)
            self.add_module('denselayer%d' % (i + 1), layer)

class _Transition(nn.Sequential):
    """Transition layer: reduces the channel count (halved by the caller) and halves the spatial size."""
    def __init__(self, num_input_features, num_output_features):
        super(_Transition, self).__init__()
        self.add_module('norm', nn.BatchNorm2d(num_input_features))
        self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('conv', nn.Conv2d(num_input_features, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))

class DenseNet(nn.Module):
    """DenseNet network (the default arguments correspond to DenseNet-121)."""
    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16),
                 num_init_features=64, bn_size=4, drop_rate=0, num_classes=1000):
        super(DenseNet, self).__init__()

        self.features = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2,
                                padding=3, bias=False)),
            ('norm0', nn.BatchNorm2d(num_init_features)),
            ('relu0', nn.ReLU(inplace=True)),
            ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
        ]))

        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(num_layers=num_layers,
                               num_input_features=num_features,
                               bn_size=bn_size,
                               growth_rate=growth_rate,
                               drop_rate=drop_rate)
            self.features.add_module('denseblock%d' % (i + 1), block)
            num_features = num_features + num_layers * growth_rate
            if i != len(block_config) - 1:
                trans = _Transition(num_input_features=num_features,
                                   num_output_features=num_features // 2)
                self.features.add_module('transition%d' % (i + 1), trans)
                num_features = num_features // 2

        self.features.add_module('norm5', nn.BatchNorm2d(num_features))
        self.classifier = nn.Linear(num_features, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out

Core advantages of DenseNet

Maximal feature reuse: each layer can directly access the features produced by all preceding layers in its block, which reduces redundant learning.
High parameter efficiency: at comparable accuracy, DenseNet needs substantially fewer parameters than ResNet (the paper reports roughly half on ImageNet).
Mitigates vanishing gradients: dense connections give shallow layers multiple short gradient paths
Smoother information flow: concatenation passes features between layers with almost no bottleneck.

6. Comparison and evolution summary of classic CNN architecture

6.1 Comparison of architecture core indicators

Architecture	Year	Core innovation	Common depth	Total parameters	Core problem solved
LeNet	1998	Basic CNN structure, parameter sharing	5	~60K	Proved the feasibility of CNNs for visual tasks
AlexNet	2012	ReLU, Dropout, GPU parallelism	8	~60M	Opened the deep learning era and broke through the performance bottleneck
VGGNet	2014	Uniform small convolution kernels, depth exploration	16-19	~138M (VGG-16)	Showed that depth is key to performance
ResNet	2015	Residual connections (combined with batch normalization)	50-152	~25M (ResNet-50)	Addressed the training degradation of deep networks
DenseNet	2016	Dense connections, feature reuse	121-201	~8M (DenseNet-121)	Maximized feature reuse and further improved parameter efficiency

6.2 Evolution of architectural design concepts

From shallow to deep

From the 5 layers of LeNet to DenseNet variants with 200+ layers
Core obstacles: vanishing gradients and training degradation → Tools that addressed them: ReLU, batch normalization, residual connections

From large kernels to stacked small kernels

LeNet and AlexNet rely on large convolution kernels such as 5×5 and 11×11
From VGGNet onward, stacked 3×3 kernels became the norm → equivalent receptive field, more nonlinearity, fewer parameters

From plain stacking to skip connections

Traditional networks: each layer feeds only the next one
ResNet: residual skip connections implemented by element-wise addition
DenseNet: dense skip connections implemented by channel concatenation → more paths for gradients and information to propagate

From single components to standardized composite components

Basic components: convolution, pooling, activation
Modern components: fixed combinations such as BN → ReLU → Conv, or the more refined BN → ReLU → 1×1 Conv → BN → ReLU → 3×3 Conv in DenseNet

Understanding how the classic CNN architectures evolved is a shortcut to mastering deep learning. Focus on the core pain point each architecture addressed and the design logic behind its innovations. In particular, ResNet's residual connection is not only standard in modern deep networks; the same idea underlies the residual connections in every Vision Transformer block.

7. Summary

The history of the classic CNN architectures runs from "feasible" to "highly optimized". Each breakthrough removed a core obstacle that deep learning faced at the time:

Core Milestones

LeNet: laid out the basic structure of CNNs, with parameter sharing and local connections.
AlexNet: introduced key techniques such as ReLU and Dropout and, with the help of GPUs, opened the deep learning era.
VGGNet: used uniform 3×3 convolutions to explore depth and established how much depth matters for performance.
ResNet: residual connections overcame the training degradation of deep networks, making it possible to train networks with hundreds of layers.
DenseNet: dense connections pushed feature reuse to its limit, reaching high accuracy with fewer parameters.

Core technology heritage

Parameter Sharing and Local Connection: the essential properties of CNN
ReLU: preferred activation function for deep networks
Residual connection: standard feature of modern deep networks
Hierarchical feature extraction: the core idea of all visual models

💡 Important reminder: It is recommended that you give priority to implementing and thoroughly understanding ResNet-18 and ResNet-50. They are currently the most widely used backbone networks and are the cornerstones of many cutting-edge visual architectures.
