Convolution kernel, stride and pooling: A complete guide to receptive field, parameter sharing and feature extraction
Introduction
Kernel size, stride, padding and pooling are the four core elements of an efficient CNN architecture: together they determine the feature map size, the parameter count, the computational cost and the receptive field. This article explains each concept concisely, covers the principle of parameter sharing along the way, and aims to help you quickly master the basic design logic of CNNs.
1. Detailed explanation of convolution kernel size
1.1 Comparison of core concepts and parameters
The convolution kernel is a small matrix of learnable weights. It slides across the input feature map, moving by the stride at each step, and at every position it multiplies its weights element-wise with the underlying patch and sums the result. In essence, it extracts local features.
Parameter sharing is one of the main strengths of the convolutional layer: the same kernel applies the same set of weights at every position of the image. However large the input image is, each kernel's weights are learned only once, which greatly reduces the number of parameters and gives the network translation equivariance. For example, whether a face appears on the left or on the right, the kernel responds to it with the same pattern, only at a shifted position in the output.
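To make the effect of parameter sharing concrete, here is a small sketch that compares the weight count of a hypothetical fully connected mapping with that of a 3×3 convolution for several input sizes. The function names and the chosen input sizes are illustrative assumptions, not part of the original article.

```python
def dense_weight_count(height, width, in_channels, out_channels):
    # Fully connected: every input value connects to every output value,
    # so the count grows with the square of the image area.
    return (height * width * in_channels) * (height * width * out_channels)


def conv_weight_count(kernel_size, in_channels, out_channels):
    # Convolution: one shared k x k x C_in kernel per output channel,
    # independent of the image size.
    return kernel_size * kernel_size * in_channels * out_channels


for side in (32, 64, 224):  # illustrative input sizes
    dense = dense_weight_count(side, side, 3, 64)
    conv = conv_weight_count(3, 3, 64)
    print(f"{side}x{side} input: dense {dense:,} weights, 3x3 conv {conv:,} weights")
```

The convolution's weight count stays fixed while the dense count explodes as the image grows, which is the practical meaning of "learned only once".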
The scenario below, a 3-channel RGB input mapped to 64 output channels, gives an intuitive feel for how the parameter count grows with kernel size:
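As a rough sketch of that scenario, the snippet below computes the parameter count of a single convolutional layer for a few kernel sizes. The kernel sizes listed (1, 3, 5, 7) and the helper name `conv_params` are assumptions chosen for illustration; the count includes one bias per output channel.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    # weights = k * k * C_in * C_out, plus one bias per output channel
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Illustrative kernel sizes for a 3-channel input and 64 output channels.
param_table = {k: conv_params(k, in_channels=3, out_channels=64) for k in (1, 3, 5, 7)}

for k, n in param_table.items():
    print(f"{k}x{k} kernel, 3 -> 64 channels: {n:,} parameters")
```

The count grows with the square of the kernel size, which is why large kernels become expensive quickly.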
💡 Substitution note: two stacked 3×3 convolutions with the same number of input and output channels cover exactly the same receptive field as one 5×5 convolution, yet need about 28% fewer parameters (2 × 9 = 18 weights per channel pair instead of 25). In addition, a nonlinear activation such as ReLU can be inserted between them, which makes the network more expressive.
The following code verifies the parameter counts and compares a single 5×5 convolution with two stacked 3×3 convolutions (assuming 64 input channels and 64 output channels):
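A minimal sketch of that comparison, in plain Python so that the arithmetic is visible. The helper names are hypothetical, and biases are left out so the totals reflect weights only.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    # Weights only (bias omitted): k * k * C_in * C_out
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    # For stride-1 convolutions, each layer adds (k - 1) to the receptive field.
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


C = 64  # input and output channels
single_5x5 = conv_weights(5, C, C)
double_3x3 = 2 * conv_weights(3, C, C)
saving = 1 - double_3x3 / single_5x5

print(f"one 5x5: {single_5x5:,} weights, receptive field {stacked_receptive_field([5])}")
print(f"two 3x3: {double_3x3:,} weights, receptive field {stacked_receptive_field([3, 3])}")
print(f"saving:  {saving:.0%}")
```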
With 64 input and 64 output channels, replacing one 5×5 convolution with two 3×3 convolutions reduces the weight count from 102,400 to 73,728 (biases excluded), while the receptive field stays the same. Real architectures additionally use a bottleneck design (1×1 dimensionality reduction) to compress the parameter count further, as explained later.
1.2 The “magic” of 1×1 convolution
The 1×1 convolution is known as the “Swiss Army knife” of CNN architectures: small, but very versatile. It has three core functions:
- Channel fusion: mixes information from all channels at the same spatial location, for example combining the R, G and B color features;
- Dimensionality reduction or expansion: adjusts the number of channels flexibly, which reduces the computation of the convolutions that follow (the ResNet bottleneck layer and MobileNet both rely on it);
- Added nonlinearity: followed by an activation such as ReLU, it increases the model's nonlinear expressive power without changing the spatial size.
Minimalist implementation of ResNet bottleneck block (retaining only core logic):
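The block below is a minimal PyTorch sketch of the bottleneck idea, not the reference ResNet implementation. It assumes equal input and output channels, an identity shortcut only (no stride or projection shortcut), and a reduction ratio of 4; the class name `Bottleneck` and the `reduction` argument are chosen here for illustration.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, added to an identity shortcut."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction  # assumed reduction ratio
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),  # reduce channels
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),  # spatial features
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),  # restore channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection: add the input back before the final activation.
        return self.relu(self.body(x) + x)


block = Bottleneck(256)
x = torch.randn(1, 256, 56, 56)
out = block(x)
conv_weights = sum(m.weight.numel() for m in block.modules() if isinstance(m, nn.Conv2d))
print(out.shape, conv_weights)
```

Under these assumptions the three convolutions hold 69,632 weights in total, compared with 589,824 for a single plain 3×3 convolution at 256 channels, which is the saving the 1×1 layers buy.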
2. Stride and Padding
2.1 Core role and high-frequency combination
Stride and padding are the "switches" that control how the spatial size of the feature map changes:
- Stride: the distance the kernel moves at each step. The larger the stride, the faster the feature map shrinks, the less computation the following layers need, and the faster the receptive field grows (the layer's own parameter count does not depend on the stride);
- Padding: adds zeros (or other values) around the edges of the input feature map. Its main purpose is to retain edge information and to control the ratio between the output size and the input size.
Three classic combinations used in both industry and academia:
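The output size of a convolution follows the standard formula floor((n + 2p - k) / s) + 1. The sketch below applies it to three combinations that are commonly seen in practice; which three the original article listed is not recoverable, so the combinations, the 224-pixel input and the helper name are assumptions.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    # Standard formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * padding - kernel_size) // stride + 1


# (kernel_size, stride, padding): illustrative combinations
combos = {
    "3x3, stride 1, padding 1 (size preserved)": (3, 1, 1),
    "3x3, stride 2, padding 1 (size halved)": (3, 2, 1),
    "7x7, stride 2, padding 3 (large-kernel stem)": (7, 2, 3),
}

sizes = {name: conv_output_size(224, *cfg) for name, cfg in combos.items()}
for name, size in sizes.items():
    print(f"{name}: 224 -> {size}")
```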
2.2 Two special convolutions
In addition to ordinary convolutions, two important variants are worth knowing: one enlarges the receptive field without adding parameters, the other upsamples the feature map:
- Dilated convolution (also called atrous convolution): inserts gaps between the kernel weights, which greatly enlarges the receptive field without adding parameters or reducing resolution. It suits tasks such as semantic segmentation and object detection that need broad context while keeping high-resolution features;
- Transposed convolution: reverses the spatial shape change of an ordinary convolution (it is not a true mathematical inverse, since the original values are not recovered) and is used to upsample feature maps, for example restoring low-resolution features to the input size in segmentation tasks, or generating images in generative adversarial networks.
Minimalist runnable code demonstration:
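A minimal PyTorch sketch of both variants, using `nn.Conv2d` with the `dilation` argument and `nn.ConvTranspose2d`. The channel counts and the 32×32 input are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# Dilated convolution: a 3x3 kernel with dilation=2 spans a 5x5 window
# (effective size = k + (k - 1) * (d - 1)); padding=2 keeps the 32x32 resolution.
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)
standard = nn.Conv2d(16, 16, kernel_size=3, padding=1)
y_dilated = dilated(x)

# Same number of parameters as the ordinary 3x3 convolution.
same_param_count = (
    sum(p.numel() for p in dilated.parameters())
    == sum(p.numel() for p in standard.parameters())
)

# Transposed convolution: kernel 2 with stride 2 doubles height and width,
# undoing the size change of the stride-2 convolution before it.
down = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
y_down = down(x)
y_up = up(y_down)

print(y_dilated.shape, same_param_count)
print(y_down.shape, y_up.shape)
```

Note that `y_up` has the same shape as `x` but not the same values: the transposed convolution restores the size, not the content.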
3. Detailed explanation of pooling operation
3.1 Core role and high-frequency methods
The main goal of pooling is to reduce the spatial size of the feature map, which cuts the computation of subsequent layers (and the parameter count of any fully connected layers that follow), while making the model more robust to small translations, that is, approximately translation invariant. Pooling is typically inserted between feature extraction layers, where it acts like compression and discards redundant information.
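Three pooling methods are used most often: max pooling keeps the strongest response in each window, average pooling keeps the mean of each window, and global average pooling collapses each channel to a single value.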
3.2 Intuitive code demonstration
Here, a 4×4 toy single-channel feature map is used to show the concrete effect of the three pooling methods:
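The sketch below implements the pooling operations in plain Python so that every number can be checked by hand. The three methods are assumed to be max pooling, average pooling and global average pooling; the feature map values and the `pool2d` helper are made up for illustration.

```python
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
]


def mean(values):
    return sum(values) / len(values)


def pool2d(fmap, size, stride, reduce_fn):
    # Slide a size x size window over the map and reduce each window to one value.
    rows = range(0, len(fmap) - size + 1, stride)
    cols = range(0, len(fmap[0]) - size + 1, stride)
    return [
        [
            reduce_fn([fmap[r + i][c + j] for i in range(size) for j in range(size)])
            for c in cols
        ]
        for r in rows
    ]


max_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=mean)
global_avg = mean([v for row in feature_map for v in row])

print(max_pooled)   # [[6, 4], [7, 9]]
print(avg_pooled)   # [[3.75, 2.25], [4.0, 4.5]]
print(global_avg)   # 3.625
```

Max pooling keeps the peak of each 2×2 window, average pooling smooths it, and global average pooling reduces the whole map to one number.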
4. Detailed explanation of Receptive Field
4.1 Core concepts and intuitive calculations
The receptive field is the region of the original input image that a single pixel of the output feature map can "see". The larger the receptive field, the more contextual information the model can use, although obtaining it usually costs more layers or more computation. You can think of the receptive field as the model's "field of view": the wider it is, the better the model captures global relationships, but a field of view that is too wide may also dilute local details.
No complicated formula is needed. The receptive field can be estimated with a simple layer-by-layer accumulation method:
- The initial receptive field of the input image is 1 (a pixel sees only itself), and the cumulative stride starts at 1;
- After each convolution or pooling layer, the receptive field grows by (kernel size of this layer - 1) × cumulative stride of all previous layers;
- Then update the cumulative stride: cumulative stride = cumulative stride × stride of this layer.
Taking the first 4 layers of ResNet-18 as an example, the process of layer-by-layer stacking is visually demonstrated:
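A short sketch of the accumulation method in plain Python. The layer list assumes the standard ResNet-18 layout (a 7×7 stride-2 convolution, a 3×3 stride-2 max pooling, then two 3×3 stride-1 convolutions); the function and variable names are illustrative.

```python
# (name, kernel_size, stride), assuming the standard ResNet-18 layout
layers = [
    ("conv1 7x7", 7, 2),
    ("maxpool 3x3", 3, 2),
    ("layer1 conv 3x3 (a)", 3, 1),
    ("layer1 conv 3x3 (b)", 3, 1),
]


def receptive_field_trace(layers):
    rf, cumulative_stride = 1, 1
    trace = []
    for name, kernel_size, stride in layers:
        # Grow by (k - 1) times the cumulative stride of all previous layers...
        rf += (kernel_size - 1) * cumulative_stride
        # ...then fold this layer's stride into the cumulative stride.
        cumulative_stride *= stride
        trace.append((name, rf, cumulative_stride))
    return trace


trace = receptive_field_trace(layers)
for name, rf, cumulative_stride in trace:
    print(f"{name:<22} receptive field {rf:>2}, cumulative stride {cumulative_stride}")
```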
Working through these 4 layers gives receptive fields of 7, 11, 19 and finally 27, so a single output pixel covers a 27×27 area of the original input.
4.2 Practical tips for optimizing receptive fields
- Prefer dilated (atrous) convolution to large kernels: when a large receptive field is needed but resolution must not be sacrificed (as in image segmentation), dilated convolution is the first choice;
- Use multi-scale parallel modules: as in the Inception network, kernels of different sizes run in parallel on the same layer to capture targets of different sizes at the same time;
- Increase the receptive field progressively: avoid very large kernels or large strides at the start. Stack 3×3 convolutions gradually and add a stride-2 downsampling step from time to time, which makes training more stable.
5. Hands-on practice: Building a general and efficient convolution module
Combining all the previous knowledge points, we can write a general convolution module suitable for most visual tasks:
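One possible form of such a module is the familiar convolution + batch normalization + ReLU block sketched below in PyTorch. The class name `ConvBlock`, its arguments and the small example stack are assumptions made for illustration, not the article's original code. Padding is derived from the kernel size and dilation so that stride 1 preserves the spatial size.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU with 'same'-style padding for odd kernels."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        # Keeps the spatial size at stride 1 and halves it at stride 2 (even inputs).
        padding = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv2d(
            in_channels, out_channels, kernel_size,
            stride=stride, padding=padding, dilation=dilation,
            bias=False,  # the bias is redundant before batch normalization
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


# Example stack: 3x3 convolutions, stride-2 downsampling, a 1x1 reduction,
# and global average pooling at the end.
model = nn.Sequential(
    ConvBlock(3, 32, stride=2),         # 64 -> 32
    ConvBlock(32, 64),                  # 32 -> 32
    ConvBlock(64, 128, stride=2),       # 32 -> 16
    ConvBlock(128, 64, kernel_size=1),  # 1x1 channel reduction
    nn.AdaptiveAvgPool2d(1),            # global average pooling
    nn.Flatten(),
)

features = model(torch.randn(2, 3, 64, 64))
print(features.shape)
```

The stack ends in a 64-dimensional feature vector per image, on which a task-specific head can be placed.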
CNN Design Tips
✓ Prefer stacks of 3×3 convolutions to large kernels;
✓ Introduce 1×1 convolutions where appropriate to reduce dimensionality and cut the computation of subsequent layers;
✓ Downsample with a stride of 2 combined with Same padding, which compresses the size while retaining information;
✓ At the end of a classification network, use Global Average Pooling (GAP) in place of the fully connected layer to limit overfitting and significantly reduce parameters;
✓ Plan the receptive field size according to the task (small object detection does not need an excessively large receptive field).
6. Summary
Convolution kernel, stride, padding and pooling are the basic building blocks of a CNN. Once you understand their design logic, you can read, and even design, the classic CNN architectures:
- Convolution kernel: determines the extent of local feature extraction and the parameter count. Parameter sharing keeps the network lightweight and gives it translation equivariance;
- Stride + padding: control how the spatial size changes and how much edge information is retained;
- Pooling: reduces the spatial dimensions and strengthens the model's translation invariance;
- Receptive field: measures how much context the model can take in and is an important basis for network structure design.

