Convolution Kernels, Stride, and Pooling: A Complete Guide to Receptive Fields, Parameter Sharing, and Convolution Operations

Introduction

Kernel size, stride, padding, and pooling are the four core elements of an efficient CNN architecture: together they determine the feature map size, parameter count, computational cost, and receptive field. This article explains these concepts concisely, weaving in the principle of parameter sharing, to help you quickly grasp the basic design logic of CNNs.


1. Detailed explanation of convolution kernel size

1.1 Comparison of core concepts and parameters

A convolution kernel is a small matrix of learnable weights. It slides across the input feature map, moving by the stride at each step, and at every position multiplies the overlapping values element by element and sums the results. In essence, it extracts local features.

Parameter sharing is one of the biggest strengths of the convolutional layer: the same kernel applies the same set of weights at every position in the image. No matter how large the input is, each kernel learns its weights only once, which greatly reduces the number of parameters and gives the network translation equivariance. For example, whether a face appears on the left or the right of the image, the kernel responds to it with the same pattern.
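A rough count makes the saving from parameter sharing concrete. The sketch below assumes a 3-channel input and a 64-channel output of the same spatial size, and compares a 3×3 convolution with a hypothetical fully connected layer that maps every input value to every output value. The helper names and sizes are illustrative assumptions, and only weights are counted.

```python
def conv_weights(c_in, c_out, k):
    """Weights in a k×k convolution: one k×k×c_in filter per output channel."""
    return k * k * c_in * c_out


def dense_weights(c_in, c_out, h, w):
    """Weights in a fully connected layer mapping c_in×h×w inputs to c_out×h×w outputs."""
    return (c_in * h * w) * (c_out * h * w)


for size in (32, 224):  # assumed image sizes
    conv = conv_weights(3, 64, 3)
    dense = dense_weights(3, 64, size, size)
    print(f"{size}x{size} input: conv = {conv:,} weights, dense = {dense:,} weights")
```

The convolution needs 1,728 weights for both image sizes, while the dense layer's weight count grows with the fourth power of the image side length.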

The following table uses the scenario "3-channel RGB input → 64-channel output" to show how the parameter count differs across common kernel sizes:

Kernel size	Parameters (weights only)	Main use
1×1	192	Channel fusion, channel reduction/expansion
3×3	1,728	General feature extraction (the most common choice in industry and academia)
5×5	4,800	Large receptive field in early CNNs; now usually replaced by two consecutive 3×3 convolutions
7×7	9,408	First layer of ResNet and similar models, paired with a large stride to reduce resolution quickly

💡 Why the replacement works: with equal input and output channel counts, two consecutive 3×3 convolutions (stride 1) cover exactly the same receptive field as a single 5×5 (1 + 2 + 2 = 5), while using about 28% fewer parameters (18C² versus 25C² weights for C channels). In addition, a nonlinear activation such as ReLU can be inserted between them, which makes the network more expressive.

The following code checks these parameter counts by comparing a single 5×5 convolution with two consecutive 3×3 convolutions (both with 64 input and 64 output channels):

import torch
import torch.nn as nn

def count_conv_params(module):
    # Count convolution weights only (no biases)
    return sum(p.numel() for n, p in module.named_parameters() if 'weight' in n)

# Scenario: 64 input channels and 64 output channels
single_5x5 = nn.Conv2d(64, 64, kernel_size=5, bias=False)               # one 5×5 convolution
two_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, bias=False),                       # first 3×3
    nn.Conv2d(64, 64, kernel_size=3, bias=False)                        # second 3×3
)

print(f"Single 5×5 conv parameters: {count_conv_params(single_5x5):,}")   # 25×64×64 = 102,400
print(f"Two 3×3 conv parameters: {count_conv_params(two_3x3):,}")         # 2×9×64×64 = 73,728
print(f"Parameter reduction: {100 - count_conv_params(two_3x3)/count_conv_params(single_5x5)*100:.0f}%")

With equal input and output channels, replacing one 5×5 with two 3×3 convolutions cuts the parameter count from 102,400 to 73,728 while the receptive field stays the same. Real architectures often combine this with a bottleneck design (1×1 channel reduction) to compress the parameter count further, as Section 1.2 shows.

1.2 The “magic” of 1×1 convolution

The 1×1 convolution is often called the “Swiss Army knife” of CNN architectures: small, but very versatile. It has three core functions:

Channel fusion: mixes information from all channels at the same spatial location, for example combining RGB color features;
Channel reduction/expansion: flexibly adjusts the number of channels, which cuts the computation of subsequent convolutions (the ResNet bottleneck layer and MobileNet both rely on it);
Added nonlinearity: paired with an activation function such as ReLU, it increases the model's nonlinear expressive power without changing the spatial size.

A minimal implementation of the ResNet bottleneck block (core logic only):

import torch.nn as nn

class ResNetBottleneck(nn.Module):
    def __init__(self, in_channels, out_channels, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        bottleneck_channels = out_channels // 4   # standard bottleneck width: 1/4 of the output channels

        # 1×1 reduction → 3×3 feature extraction (optionally downsampling) → 1×1 expansion
        self.main_path = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        )

        # Residual connection: when the spatial size or channel count differs, adjust with a 1×1 convolution
        self.shortcut = nn.Sequential()
        if downsample or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
        self.final_relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.final_relu(self.main_path(x) + self.shortcut(x))
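To see why the 1×1 reduction pays off, it helps to count weights. The sketch below assumes a block with 256 input and 256 output channels (an illustrative width) and compares a plain 3×3 convolution at full width with the 1×1 → 3×3 → 1×1 main path of the bottleneck block above. BatchNorm parameters and the shortcut are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Weights in a k×k convolution without bias."""
    return k * k * c_in * c_out


channels = 256           # assumed block input/output width
mid = channels // 4      # bottleneck width, same rule as in the block above

plain = conv_weights(channels, channels, 3)
bottleneck = (conv_weights(channels, mid, 1)     # 1×1 reduction
              + conv_weights(mid, mid, 3)        # 3×3 at reduced width
              + conv_weights(mid, channels, 1))  # 1×1 expansion

print(f"plain 3x3:  {plain:,}")        # 589,824
print(f"bottleneck: {bottleneck:,}")   # 69,632
print(f"ratio:      {plain / bottleneck:.1f}x")
```

Under these assumptions the bottleneck path uses roughly one eighth of the weights, because the expensive 3×3 convolution runs on a quarter of the channels.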

2. Stride and Padding

2.1 Core role and common combinations

Stride and padding are the "switches" that control how the spatial size of the feature map changes:

Stride: the distance the kernel moves at each step. The larger the stride, the faster the feature map shrinks and the faster the receptive field grows; computation drops as well, although the layer's own parameter count does not depend on the stride;
Padding: adds zeros (or other values) around the edges of the input feature map. Its main purposes are to preserve edge information and to control the ratio between output size and input size.

Three classic combinations used in industry and academia:

Combination	Setting	Feature map size change	Typical use
Same padding	padding = kernel size // 2 (odd kernel), stride = 1	Spatial size unchanged	Intermediate feature extraction layers
Downsampling	padding = kernel size // 2, stride = 2	Spatial size halved	Reducing computation and growing the receptive field quickly
Valid padding	padding = 0	Spatial size shrinks by kernel size - 1 (at stride 1)	Early CNNs; rarely used on its own today
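The three combinations in the table follow from a single output-size formula, the one the PyTorch documentation gives for `Conv2d`. The helper below is a minimal sketch for one spatial axis; the function name and the example sizes are assumptions chosen for illustration.

```python
def conv_out_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a convolution."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


n = 32  # assumed input size
print("same:      ", conv_out_size(n, kernel=3, stride=1, padding=1))     # 32
print("downsample:", conv_out_size(n, kernel=3, stride=2, padding=1))     # 16
print("valid:     ", conv_out_size(n, kernel=3, stride=1, padding=0))     # 30
print("7x7, s=2:  ", conv_out_size(224, kernel=7, stride=2, padding=3))   # 112
```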

2.2 Two special convolutions

Beyond ordinary convolution, two important variants are worth knowing: one enlarges the receptive field without adding parameters, and the other upsamples the feature map:

Dilated convolution (also called atrous convolution): inserts "holes" between the kernel weights, which greatly expands the receptive field without adding parameters or reducing resolution. It is well suited to tasks such as semantic segmentation and object detection that need large context while keeping high-resolution features;
Transposed convolution: reverses the spatial size change of an ordinary convolution (it is not a true mathematical inverse, since the original values are not recovered) and is used to upsample feature maps, for example restoring low-resolution features to the input size in segmentation, or generating images in generative adversarial networks.

A minimal runnable demonstration:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)           # one 3-channel 32×32 image

# Dilated convolution (dilation=2: the 3×3 kernel spans a 5×5 area)
dilated_conv = nn.Conv2d(3, 64, 3, padding=2, dilation=2, bias=False)
# Transposed convolution (2× upsampling: output size = 32×2 = 64)
transpose_conv = nn.ConvTranspose2d(64, 3, 3, stride=2, padding=1, output_padding=1, bias=False)

print(f"Dilated conv output shape: {dilated_conv(x).shape}")                     # (1, 64, 32, 32)
print(f"Transposed conv output shape: {transpose_conv(dilated_conv(x)).shape}")  # (1, 3, 64, 64)
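The shapes above can be checked by hand. A dilated kernel spans `dilation × (kernel - 1) + 1` positions, and the transposed-convolution output size follows the formula given in the PyTorch `ConvTranspose2d` documentation. The helper names below are assumptions introduced for this sketch.

```python
def effective_kernel(kernel, dilation):
    """Span of a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def conv_transpose_out_size(n, kernel, stride=1, padding=0, output_padding=0, dilation=1):
    """Output length along one spatial axis of a transposed convolution."""
    return (n - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


print(effective_kernel(3, 2))   # 5: a 3×3 kernel with dilation 2 spans 5×5
print(effective_kernel(3, 3))   # 7: dilation 3 is needed to span 7×7
print(conv_transpose_out_size(32, kernel=3, stride=2, padding=1, output_padding=1))  # 64
```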

3. Detailed explanation of pooling operation

3.1 Core role and common methods

The main goal of pooling is to shrink the spatial size of the feature map, which reduces the computation (and, where fully connected layers follow, the parameters) of later layers, while making the model more robust to small translations (approximate translation invariance). Pooling layers are typically inserted between feature extraction layers, where they act like a form of compression that discards redundant information.
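The robustness to small translations can be illustrated without any framework. The sketch below implements 2×2 max pooling with stride 2 on plain Python lists (a simplified stand-in, not a library routine) and moves a single activation around a 4×4 map; the grid and values are made up for illustration.

```python
def max_pool_2x2(grid):
    """2×2 max pooling with stride 2 on a list-of-lists feature map."""
    rows, cols = len(grid), len(grid[0])
    return [[max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]


def single_activation(row, col, size=4):
    """A size×size map of zeros with one strong response at (row, col)."""
    grid = [[0] * size for _ in range(size)]
    grid[row][col] = 9
    return grid


original = max_pool_2x2(single_activation(0, 0))
shifted_inside = max_pool_2x2(single_activation(1, 1))   # still inside the same 2×2 window
shifted_across = max_pool_2x2(single_activation(1, 2))   # moves into the neighboring window

print(original)        # [[9, 0], [0, 0]]
print(shifted_inside)  # [[9, 0], [0, 0]]
print(shifted_across)  # [[0, 9], [0, 0]]
```

A shift that stays inside one pooling window leaves the output unchanged, while a shift across a window boundary still changes it, which is why the invariance is only approximate.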

Comparison of three common pooling methods:

Pooling method	Core logic	Pros and cons	Typical use
Max pooling	Takes the maximum value within the sliding window	Keeps the strongest responses and is fairly insensitive to small perturbations, but discards fine detail	Downsampling in most CNN feature extractors
Average pooling	Takes the mean value within the sliding window	Smooths the overall information but weakens strong local features	Early CNNs, some noise-reduction scenarios
Global average pooling (GAP)	Takes one mean value over each channel's entire feature map	Can replace the flattened fully connected layer, greatly reducing parameters and helping to limit overfitting	Final layer of classification networks
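The parameter saving from GAP can be made concrete with a quick count. The sketch below assumes a final feature map of 512 channels at 7×7 and 1,000 output classes, and assumes GAP is followed by a single linear classifier, as is common. These sizes and the helper name are illustrative assumptions.

```python
def dense_weights(n_in, n_out):
    """Weights in a fully connected layer (biases ignored)."""
    return n_in * n_out


channels, height, width, classes = 512, 7, 7, 1000   # assumed sizes

flatten_head = dense_weights(channels * height * width, classes)  # flatten, then FC
gap_head = dense_weights(channels, classes)                       # GAP, then FC

print(f"flatten + FC: {flatten_head:,}")              # 25,088,000
print(f"GAP + FC:     {gap_head:,}")                  # 512,000
print(f"reduction:    {flatten_head // gap_head}x")   # 49x
```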

3.2 Intuitive code demonstration

Here, a toy 4×4 single-channel feature map shows the concrete effect of the three pooling methods:

import torch
import torch.nn as nn

# Build a toy 4×4 feature map (batch=1, channel=1, 4×4)
virtual_feature = torch.tensor([[[[1, 2, 3, 4],
                                   [5, 6, 7, 8],
                                   [9, 10, 11, 12],
                                   [13, 14, 15, 16]]]], dtype=torch.float32)

max_pool = nn.MaxPool2d(2, 2)     # 2×2 window, stride 2
avg_pool = nn.AvgPool2d(2, 2)
gap = nn.AdaptiveAvgPool2d((1, 1)) # output is always 1×1

print(f"Max pooling output:\n{max_pool(virtual_feature)[0, 0]}")         # [[6, 8], [14, 16]]
print(f"Average pooling output:\n{avg_pool(virtual_feature)[0, 0]}")     # [[3.5, 5.5], [11.5, 13.5]]
print(f"Global average pooling output: {gap(virtual_feature).item():.1f}")  # 8.5

4. Detailed explanation of Receptive Field

4.1 Core concepts and intuitive calculations

The receptive field is the region of the original input image that a single pixel of the output feature map can "see". The larger the receptive field, the more context the model can use; enlarging it, however, usually means deeper networks or larger kernels, and therefore more computation. You can think of the receptive field as the model's "field of view": the wider it is, the better the model captures global relationships, but a field of view that is too wide may dilute local details.

A simple layer-by-layer accumulation rule is enough to estimate the receptive field:

The receptive field of the input image starts at 1 (a pixel sees only itself);
After each convolution or pooling layer, the receptive field grows by (kernel size of this layer - 1) × cumulative stride of all previous layers;
Then update the cumulative stride: cumulative stride = cumulative stride × stride of this layer.
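Written as a recurrence, the rule above reads as follows. The symbols are notation introduced here for convenience: r_l is the receptive field after layer l, j_l the cumulative stride after layer l, and k_l and s_l the kernel size and stride of layer l.

```latex
\begin{aligned}
r_0 &= 1, \qquad j_0 = 1, \\
r_l &= r_{l-1} + (k_l - 1)\, j_{l-1}, \\
j_l &= j_{l-1}\, s_l, \\
r_L &= 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i .
\end{aligned}
```

The last line is simply the recurrence unrolled over L layers, with the empty product for l = 1 equal to 1.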

Taking the first four layers of ResNet-18 as an example, the following code walks through this accumulation layer by layer:

def calc_rf_demo(layers_info):
    rf = 1                 # initial receptive field
    effective_stride = 1   # cumulative stride
    print(f"{'Layer':<6} {'Kernel':<8} {'Stride':<10} {'Cum. stride':<14} {'Receptive field':<12}")
    print("-" * 60)
    for idx, (k, s) in enumerate(layers_info, 1):
        rf += (k - 1) * effective_stride
        effective_stride *= s
        print(f"{idx:<6} {k:<8} {s:<10} {effective_stride:<14} {rf:<12}")
    return rf

# First 4 layers of ResNet-18: (kernel size, stride)
# conv1 (7×7, s=2) → maxpool (3×3, s=2) → first 3×3 of conv2_x (s=1) → second 3×3 (s=1)
resnet18_layers = [(7, 2), (3, 2), (3, 1), (3, 1)]
final_rf = calc_rf_demo(resnet18_layers)
print(f"\nFinal receptive field: {final_rf}×{final_rf}, cumulative stride: 4")

Running the code shows that after these four layers, a single output pixel covers a 27×27 region of the original input.

4.2 Practical tips for optimizing receptive fields

Prefer dilated (atrous) convolution over large kernels: when a large receptive field is needed but the resolution must not be reduced (as in image segmentation), dilated convolution is the first choice;
Use multi-scale parallel modules: as in the Inception network, apply kernels of different sizes in parallel within the same layer to capture objects of different scales at once;
Grow the receptive field gradually: avoid very large kernels or strides at the start; stacking 3×3 convolutions and occasionally downsampling with a stride of 2 tends to make training more stable.
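The first tip can be quantified with the layer-by-layer rule from Section 4.1, extended so that a dilated kernel counts with its effective span `dilation × (kernel - 1) + 1`. The function below is a minimal sketch; the dilation schedule 1, 2, 4 is an assumed example, not a recommendation from a specific architecture.

```python
def receptive_field(layers):
    """layers: iterable of (kernel, stride, dilation) tuples, ordered input to output."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        k_eff = dilation * (kernel - 1) + 1   # span of the dilated kernel
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf


plain_stack = [(3, 1, 1)] * 3                        # three ordinary 3×3 convolutions
dilated_stack = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]    # same weights, assumed dilations 1, 2, 4
single_7x7 = [(7, 1, 1)]                             # one large kernel

print(receptive_field(plain_stack))    # 7
print(receptive_field(dilated_stack))  # 15
print(receptive_field(single_7x7))     # 7
```

With the same number of 3×3 weights and no downsampling, the dilated stack roughly doubles the receptive field of the plain stack.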

5. Hands-on: building a general-purpose, efficient convolution module

Combining the points above, we can write a general-purpose convolution module that suits most vision tasks:

import torch.nn as nn

class UniversalEfficientConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1,
                 dilation=1, use_act=True):
        """
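        Computes same padding automatically; supports dilated convolution and an optional activation.
        """
        super().__init__()
        # padding = dilation × (kernel_size - 1) // 2 keeps the output size equal to the input size
        # (for stride=1 and an odd kernel size)
        padding = dilation * (kernel_size - 1) // 2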

        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size, stride=stride,
                      padding=padding, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True) if use_act else nn.Identity()
        )

    def forward(self, x):
        return self.layers(x)
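One caveat of the padding expression used in the module above: it preserves the spatial size at stride 1 only when `dilation × (kernel_size - 1)` is even, which holds for the usual odd kernel sizes. The check below is a framework-free sketch; the helper names and the feature map size of 56 are assumptions for illustration.

```python
def conv_out_size(n, kernel, stride, padding, dilation):
    """Output length along one spatial axis of a convolution."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def same_padding(kernel, dilation=1):
    """Same expression as in the module above."""
    return dilation * (kernel - 1) // 2


n = 56  # assumed feature map size
for kernel, dilation in [(3, 1), (5, 1), (3, 2), (7, 1), (4, 1), (2, 1)]:
    pad = same_padding(kernel, dilation)
    out = conv_out_size(n, kernel, 1, pad, dilation)
    note = "size preserved" if out == n else "size changes"
    print(f"k={kernel}, d={dilation}: padding={pad}, {n} -> {out}  ({note})")
```

For the even kernel sizes 4 and 2 the output shrinks by one pixel, so a module like this is best used with odd kernels unless asymmetric padding is added.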

CNN Design Tips

✓ Prefer stacks of 3×3 convolutions over large kernels;
✓ Introduce 1×1 convolutions where appropriate to reduce channels and cut the computation of subsequent layers;
✓ Downsample with stride 2 plus same-style padding (padding = kernel size // 2), which shrinks the feature map while keeping edge information;
✓ At the end of a classification network, use global average pooling (GAP) in place of the fully connected layer to cut parameters significantly and help limit overfitting;
✓ Plan the receptive field size around the task (small-object detection, for instance, does not need an excessively large receptive field).

6. Summary

Kernel size, stride, padding, and pooling are the basic building blocks of CNNs. Once you understand their design logic, classic CNN architectures become easy to read and even to design yourself:

Convolution kernel: determines the range of local feature extraction and the parameter count. Parameter sharing keeps the network lightweight and gives it translation equivariance;
Stride + padding: control how the spatial size changes and how much edge information is retained;
Pooling: reduces spatial dimensions and strengthens the model's robustness to small translations;
Receptive field: measures how much context the model can take in and is an important basis for network design.

As an exercise, use PyTorch to build a simple CNN for MNIST handwritten digit classification, vary the kernel size, stride, and pooling method, and observe how training speed and accuracy change. Hands-on experimentation is the fastest way to get a feel for these parameters.

🔗 Extended reading

A guide to receptive field arithmetic for Convolutional Neural Networks
