From fully connected to convolution: Why does computer vision need convolutional layers?

Introduction

Imagine you take a cute photo of your orange cat, then shift the cat from the upper-left corner of the frame to the lower-right corner. The AI model now tells you: "These are two completely different pictures!" It sounds ridiculous, but a fully connected neural network can really make this mistake with images.

Convolutional neural networks (CNNs) were designed to address this kind of problem. They rely on two core mechanisms, parameter sharing and local connectivity, to cut a layer's parameter count from millions down to thousands or even hundreds, while still recognizing the shapes, edges, and textures of objects wherever they appear in the image.


1. Why is the fully connected layer “unable to handle” the image?

1.1 Basic review: How fully connected layers work on images

A fully connected layer is the simplest way to wire neurons together, but it is also the most brute-force way to process an image: every input pixel gets its own independent connection to every output neuron.

In concrete terms, the computation goes as follows:

Suppose the input is a color image with height H, width W, and C channels, so its shape is (H, W, C).
First, the image is flattened into a one-dimensional vector x_flat of length n = H × W × C, which discards all spatial structure.
The weight matrix W (not to be confused with the image width) has size m × n, where m is the number of output neurons.
The bias b has m values.
Finally, the output is obtained by matrix multiplication: y = W @ x_flat + b, with shape (m,).

"""
Fully connected layer: the computation at a glance

Input shape: (H, W, C)  →  flattened into a vector x_flat of length n = H*W*C
Weight W shape: (m, n)  bias b shape: (m,)
Output y = W @ x_flat + b, with shape (m,)
"""

As you can see, nothing in this process takes into account what the image looks like; each pixel is treated as an isolated value.
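To make the shapes concrete, here is a minimal sketch of the flatten-then-multiply computation described above. The toy sizes (a 4×4×3 image and 5 outputs) and the variable names are assumptions chosen purely for illustration.

```python
import torch

H, W, C = 4, 4, 3        # toy image size (illustrative assumption)
m = 5                    # number of output neurons (illustrative assumption)

x = torch.rand(H, W, C)          # stand-in for an image
x_flat = x.reshape(-1)           # flatten: the 2D layout is discarded here
n = x_flat.numel()               # n = H * W * C

weight = torch.randn(m, n)       # one weight per (output neuron, input value) pair
bias = torch.zeros(m)

y = weight @ x_flat + bias       # output has shape (m,)
print(x_flat.shape, weight.shape, y.shape)
```

Even at this tiny size the weight matrix already holds m × n = 240 values, one for every pixel-output pair.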

1.2 Three fatal flaws

① Parameter explosion: An ordinary photo of 224×224 requires 150 million parameters!

We use a simple Python function to get an intuition for how many parameters are needed for images of different sizes (assuming a fully connected layer maps pixels to 1000 hidden neurons):

import torch
import torch.nn as nn

def count_fc_params(image_h, image_w, image_c, hidden_dims=1000):
    """Count the parameters of a single fully connected hidden layer."""
    flat_dim = image_h * image_w * image_c
    # number of weights + number of biases
    return flat_dim * hidden_dims + hidden_dims

# Compare common image sizes (one hidden layer with 1000 neurons)
image_configs = [
    ("MNIST (28×28×1)", 28, 28, 1),
    ("CIFAR-10 (32×32×3)", 32, 32, 3),
    ("ImageNet (224×224×3)", 224, 224, 3)
]

print("Fully connected layer parameter explosion:")
print("-" * 60)
for name, h, w, c in image_configs:
    params = count_fc_params(h, w, c)
    print(f"{name:<25} → params: {params:>13,} ({params/1e6:.2f}M)")

Running this code prints some striking numbers:

Fully connected layer parameter explosion:
------------------------------------------------------------
MNIST (28×28×1)           → params:       785,000 (0.79M)
CIFAR-10 (32×32×3)        → params:     3,073,000 (3.07M)
ImageNet (224×224×3)      → params:   150,529,000 (150.53M)

A single fully connected layer on an ImageNet-sized image (which is small by photographic standards) already consumes 150 million parameters, and that is only the first layer. Stack a few more layers on top and the model becomes very expensive to store and train.

② Lack of spatial perception: good pictures are "forcibly dismantled"

The first step of a fully connected layer is to flatten the image. Pixels that were adjacent in the image (such as the pixels around the cat's eyes) can end up far apart in the flattened vector: two vertically neighboring pixels are separated by W × C positions. The layer has no built-in notion of which inputs were neighbors. If you shuffled the pixels of every image with the same fixed permutation, a fully connected layer could learn exactly as well as before, because its weights would simply be permuted to match, even though the shuffled images no longer look like cats to us.
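The claim that a fully connected layer is indifferent to pixel order can be checked directly. The sketch below is only an illustration with made-up sizes and random weights: it shuffles the input with a fixed permutation, shuffles the weight columns the same way, and compares the outputs.

```python
import torch

torch.manual_seed(0)

n, m = 3 * 8 * 8, 16                      # flattened 8×8×3 image, 16 outputs (illustrative)
x = torch.rand(n, dtype=torch.float64)    # flattened "image"
weight = torch.randn(m, n, dtype=torch.float64)
bias = torch.randn(m, dtype=torch.float64)

perm = torch.randperm(n)                  # one fixed shuffle of pixel positions

y_original = weight @ x + bias
# Shuffle the pixels AND the matching weight columns with the same permutation
y_shuffled = weight[:, perm] @ x[perm] + bias

same_output = torch.allclose(y_original, y_shuffled)
print(same_output)
```

The outputs match because the layer computes a sum over all inputs, and a sum does not care about the order of its terms. Spatial neighborhood plays no role in the computation.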

③ The risk of over-fitting is extremely high: the model can only “memorize answers” and cannot “understand” images.

The MNIST dataset has only 60,000 training images, but the single-hidden-layer fully connected network above has 785,000 parameters, about 13 times the number of training samples. With that much capacity, the model can simply memorize the answer for each training sample instead of learning features shared across images, and its accuracy on images it has never seen can then fall sharply.
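The ratio quoted above is simple arithmetic; a quick check, using the same 1000-neuron hidden layer assumed earlier in this section:

```python
mnist_train_images = 60_000
fc_params = 28 * 28 * 1 * 1000 + 1000     # weights + biases for 1000 hidden neurons

ratio = fc_params / mnist_train_images
print(f"{fc_params:,} parameters / {mnist_train_images:,} images = {ratio:.1f} parameters per image")
```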

2. Three "secret weapons" of convolutional layers

Core mechanism ①: Local Connectivity

Local connectivity mirrors an intuition from biological vision: each photoreceptor cell in the retina is sensitive to only a small region of the visual field, not the whole field. CNNs borrow this idea: each output neuron is no longer connected to every input, only to a small window of the input image (the area covered by the convolution kernel).

"""
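Local connectivity: parameter count example
For a 224×224×3 image and 1000 outputs:
- Fully connected: 150.53M parameters
- Locally connected (assuming each output connects only to a 3×3×3 window): 3×3×3×1000 = 27,000 parameters
A reduction of more than 5,000×!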
"""

This design not only greatly compresses the parameters, but also naturally preserves the spatial structure of the image - what each output neuron "sees" is the characteristics of a small patch of the image.
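The comparison above can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and the locally connected count includes weights only, matching the 27,000 figure.

```python
def fc_param_count(h, w, c, outputs):
    """Weights + biases when every output sees every input value."""
    return h * w * c * outputs + outputs

def local_param_count(kh, kw, c, outputs):
    """Weights only, when every output sees its own kh×kw×c window (no sharing)."""
    return kh * kw * c * outputs

fc = fc_param_count(224, 224, 3, 1000)
local = local_param_count(3, 3, 3, 1000)
reduction = fc / local

print(f"fully connected:   {fc:,}")
print(f"locally connected: {local:,}")
print(f"reduction factor:  {reduction:,.0f}x")
```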

Core mechanism ②: Parameter Sharing

Local connectivity already removes most of the parameters, but the convolutional layer goes one step further: the same convolution kernel (feature detector) is reused across the entire image, so its parameters are fully shared.

Think of it this way: a convolution kernel that specializes in detecting vertical edges should fire on a vertical edge whether it appears in the upper-left or the lower-right corner of the picture. We do not need to learn a separate "upper-left vertical edge detector" and "lower-right vertical edge detector" for each position; one detector is enough. As a result, the number of parameters no longer depends on the image size.

import torch
import torch.nn as nn

def count_conv_params(in_channels, out_channels, kernel_size=3):
    """Count the parameters of a convolutional layer."""
    # kernel weights: out_channels × in_channels × kernel_size × kernel_size
    # biases: out_channels
    return out_channels * in_channels * kernel_size**2 + out_channels

# Compare fully connected vs. convolution (using CIFAR-10-sized input as an example)
# count_fc_params is the function defined in the earlier snippet
fc_params = count_fc_params(32, 32, 3, 64)
conv_params = count_conv_params(3, 64, 3)

print("Fully connected vs. convolution parameters (CIFAR-10 → 64 features):")
print("-" * 70)
print(f"FC hidden layer: {fc_params:>13,} ({fc_params/1e6:.2f}M)")
print(f"3×3 conv layer:  {conv_params:>13,} ({conv_params/1e6:.2f}M)")
print(f"Reduction:       {(1 - conv_params/fc_params)*100:.1f}%")

Even when only 64 features are extracted, the convolutional layer needs far fewer parameters than the fully connected layer: 1,792 versus 196,672, a reduction of about 99.1%. The gap widens as the image gets larger, because the convolutional layer's parameter count does not depend on the image size at all.
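The independence from image size can also be observed on a real layer. The sketch below applies one `nn.Conv2d` (the same 3 → 64 channel, 3×3 configuration as above) to two differently sized random inputs; the input sizes are illustrative.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
n_params = sum(p.numel() for p in conv.parameters())

# The same layer, with the same weights, handles both image sizes
small_out = conv(torch.rand(1, 3, 32, 32))
large_out = conv(torch.rand(1, 3, 224, 224))

print(f"parameters: {n_params:,}")
print("32×32 input   →", tuple(small_out.shape))
print("224×224 input →", tuple(large_out.shape))
```

Only the output feature map grows with the input; the layer itself stays at 64 × 3 × 3 × 3 weights plus 64 biases.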

Core mechanism ③: Translation Invariance

Because the same feature detector scans the entire image, when an object shifts within the image the feature response shifts with it rather than disappearing (strictly speaking, this property of convolution is called translation equivariance). Adding a pooling layer afterwards lets the model further ignore the object's exact position and care only about whether a given feature appears at all, which yields approximate translation invariance. This is a key reason why convolutional neural networks are robust in image recognition.
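A small experiment can illustrate both halves of this argument. The sketch below is an illustration, not part of the original lesson: it places a 3×3 square in an 8×8 image, shifts it by 3 pixels in each direction, and applies the same kernel to both images with `torch.nn.functional.conv2d`. A global maximum stands in for the pooling step.

```python
import torch
import torch.nn.functional as F

# One fixed feature detector (a vertical-edge kernel), shape (out_ch, in_ch, kH, kW)
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

img = torch.zeros(1, 1, 8, 8)
img[0, 0, 1:4, 1:4] = 1.0                               # square near the top-left
shifted = torch.roll(img, shifts=(3, 3), dims=(2, 3))   # same square, moved down-right by 3

resp = F.conv2d(img, kernel)                # 6×6 response map
resp_shifted = F.conv2d(shifted, kernel)

# The response pattern moves by the same offset as the object
moved_with_object = torch.equal(resp_shifted[..., 3:, 3:], resp[..., :3, :3])

# Global max pooling discards position: the strongest response is unchanged
peak, peak_shifted = resp.amax().item(), resp_shifted.amax().item()

print(moved_with_object, peak, peak_shifted)
```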

3. Intuitive principle of convolution (simplified version)

In deep learning, the operation we actually use is cross-correlation rather than convolution in the strict mathematical sense (the latter first flips the kernel by 180°). Because the kernel weights are learned, the distinction makes no practical difference: the network can just as well learn the flipped kernel. Cross-correlation is also the more intuitive of the two: treat the kernel as a "template", slide it over the input image, and compute the inner product of the template with each window it covers.
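The difference between the two operations is easy to see with an asymmetric kernel. In the sketch below (toy values chosen for illustration), `F.conv2d` is applied once with the kernel as given, which is cross-correlation, and once with the kernel flipped by 180° via `torch.flip`, which gives convolution in the strict mathematical sense.

```python
import torch
import torch.nn.functional as F

x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)   # values 0..24
k = torch.tensor([[1., 2.],
                  [3., 4.]])                                    # asymmetric on purpose

# What deep learning frameworks compute: no kernel flip
cross_corr = F.conv2d(x, k.reshape(1, 1, 2, 2))

# Strict mathematical convolution: flip the kernel in both spatial dimensions first
k_flipped = torch.flip(k, dims=(0, 1))
true_conv = F.conv2d(x, k_flipped.reshape(1, 1, 2, 2))

# Top-left window of x is [[0, 1], [5, 6]]
# cross-correlation: 0*1 + 1*2 + 5*3 + 6*4 = 41
# true convolution:  0*4 + 1*3 + 5*2 + 6*1 = 19
print(cross_corr[0, 0, 0, 0].item(), true_conv[0, 0, 0, 0].item())
```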

Intuitive implementation of two-dimensional cross-correlation

import torch

def simple_cross_corr(input_img, kernel):
    """Simplified 2D cross-correlation (for demonstration only)."""
    h, w = input_img.shape
    kh, kw = kernel.shape
    oh, ow = h - kh + 1, w - kw + 1  # output size (no padding, stride 1)
    output = torch.zeros(oh, ow)

    for i in range(oh):
        for j in range(ow):
            # extract the local window of the input
            window = input_img[i:i+kh, j:j+kw]
            # element-wise multiply, then sum (inner product)
            output[i, j] = (window * kernel).sum()
    return output

# Example: detect vertical edges with a Sobel kernel
input_img = torch.tensor([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
], dtype=torch.float32)

sobel_vertical = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32)

output = simple_cross_corr(input_img, sobel_vertical)
print("Sobel vertical edge detection result:")
print(output)

This code simulates the core operation of a convolutional layer. We define a simple 5×5 image (with a 3×3 white square in the middle) and slide a classic Sobel vertical-edge kernel over it. The nonzero values in the result mark the vertical edges of the square: positive responses in the left column (dark to bright), negative responses in the right column (bright to dark), and zeros in the middle column, where the window sees no horizontal change:

Sobel vertical edge detection result:
tensor([[ 3.,  0., -3.],
        [ 4.,  0., -4.],
        [ 3.,  0., -3.]])

The convolutional layer completes the tasks of "local detection" and "feature extraction" in one step.
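PyTorch's built-in `torch.nn.functional.conv2d` performs the same sliding inner product, so it can serve as a cross-check for a loop-based version. The sketch below rebuilds the same 5×5 input and Sobel kernel; the only extra step is reshaping to the 4D layouts that `conv2d` expects.

```python
import torch
import torch.nn.functional as F

input_img = torch.zeros(5, 5)
input_img[1:4, 1:4] = 1.0                     # 3×3 white square in the middle

sobel_vertical = torch.tensor([[-1., 0., 1.],
                               [-2., 0., 2.],
                               [-1., 0., 1.]])

# conv2d expects input as (batch, channels, H, W)
# and the kernel as (out_channels, in_channels, kH, kW)
result = F.conv2d(input_img.reshape(1, 1, 5, 5),
                  sobel_vertical.reshape(1, 1, 3, 3))
result = result.reshape(3, 3)
print(result)
```

The built-in version runs the same arithmetic without Python loops, which is what makes convolutional layers practical on real images.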

4. Minimalist practice: Use PyTorch to build a basic CNN

Next, we use the dimensions of CIFAR-10 to compare a minimalist fully connected network and a minimalist convolutional network to see how much the parameters can differ.

import torch
import torch.nn as nn

# Minimal fully connected network
class SimpleFC(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3*32*32, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        return self.layers(x)

# Minimal convolutional network
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((1,1))  # global pooling down to 1×1
        )
        self.classifier = nn.Linear(64, 10)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Parameter count comparison
fc_net = SimpleFC()
cnn_net = SimpleCNN()
fc_params = sum(p.numel() for p in fc_net.parameters())
cnn_params = sum(p.numel() for p in cnn_net.parameters())

print("Minimal network parameter comparison (CIFAR-10 → 10 classes):")
print("-" * 60)
print(f"Minimal FC:  {fc_params:>13,} ({fc_params/1e6:.2f}M)")
print(f"Minimal CNN: {cnn_params:>13,} ({cnn_params/1e6:.2f}M)")
print(f"Reduction:   {(1 - cnn_params/fc_params)*100:.1f}%")

The output once again shows how parameter-efficient convolution is:

Minimal network parameter comparison (CIFAR-10 → 10 classes):
------------------------------------------------------------
Minimal FC:      1,578,506 (1.58M)
Minimal CNN:        20,042 (0.02M)
Reduction:   98.7%

Note that even though this minimal CNN has only about 20,000 parameters, its structure is better suited to capturing local features in an image than the fully connected network with more than 1.5 million parameters. It also has far fewer weights to store and update, and is typically less prone to overfitting.
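To see where those parameters sit and how the feature map shrinks along the way, the following sketch rebuilds the same feature extractor as a standalone `nn.Sequential` and traces one random CIFAR-10-sized input through it. The tracing loop is an illustration added here, not part of the original model code.

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d((1, 1)),
)

x = torch.rand(1, 3, 32, 32)      # one random image, (batch, channels, H, W)
shapes, param_counts = [], []
for layer in features:
    x = layer(x)
    n_params = sum(p.numel() for p in layer.parameters())
    shapes.append(tuple(x.shape))
    param_counts.append(n_params)
    print(f"{layer.__class__.__name__:<18} out={tuple(x.shape)}  params={n_params:,}")
```

Only the two `Conv2d` layers carry parameters here; ReLU and the pooling layers have none. With `padding=1` the 3×3 convolutions keep the spatial size, and each `MaxPool2d(2)` halves it.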

5. Summary

The move from fully connected layers to convolution was a turning point for computer vision. The table below compares the two kinds of layer:

Dimension	Fully connected layer	Convolutional layer
Number of parameters	Grows with image size: O(n·m) for n inputs and m outputs	Independent of image size: O(oc × ic × k²) for oc output channels, ic input channels, and a k×k kernel
Spatial structure	Lost (the input must be flattened)	Preserved (local connectivity)
Feature generalization	Prone to rote memorization	Translation invariance + parameter sharing → stronger generalization
Computational efficiency	Lower on large images (one very large matrix multiplication)	Generally higher (local window operations + efficient implementations)

Review of core concepts

Local connection: Each output neuron only looks at a small window of the input image, preserving spatial relationships.
Parameter Sharing: The same convolution kernel slides across the entire image, greatly reducing the number of parameters and making feature detection independent of position.
Translation invariance: After the object is translated in the image, the feature response will also move accordingly. With operations such as pooling, the model can ignore small changes in position.

💡 Study suggestion: Understanding these three core concepts is the key to getting started with CNNs! In the next section, we will look in depth at the hyperparameters of convolution (kernel size, stride, padding) and the details of the pooling layer, to help you further understand the design logic of CNNs.
