From fully connected to convolution: Why does computer vision need convolutional layers?
Introduction
Imagine you take a photo of your orange cat, and then shift the cat from the upper-left corner of the frame to the lower-right corner. The AI model now tells you: "These are two completely different pictures!" It sounds ridiculous, but a fully connected neural network can really make this mistake when faced with images.
Convolutional neural networks (CNNs) largely address this type of problem. They rely on two core mechanisms - Parameter Sharing and Local Connection - to compress millions of parameters down to the order of thousands or even hundreds, while still recognizing the shapes, edges and textures of objects in the image no matter where they appear.
📂 Stage: Stage 2 - Deep Learning Vision Basics (CNN) 🔗 Related chapters: Convolution Kernels, Stride and Pooling · Classic CNN Architectures Dissected
1. Why is the fully connected layer “unable to handle” the image?
1.1 Basic review: How fully connected layers work on images
A Fully Connected Layer is the simplest way to combine neurons, but it is also the most brute-force way to process images: every input pixel must have its own independent connection to every output neuron.
In more concrete terms, the computation goes like this:
- Suppose the input is a color image with height H, width W and C channels, so its shape is (H, W, C).
- In the first step, this image must be flattened into a one-dimensional vector of length n = H × W × C; all spatial structure is broken up.
- The weight matrix W has size m × n, where m is the number of output neurons.
- The bias b has m values.
- Finally, the output is obtained through matrix multiplication: y = W @ x_flat + b, with shape (m,).
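The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a real layer: the image is random data, and the sizes (a 28×28 single-channel input, 10 output neurons) and the name `W_img` for the image width are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W_img, C = 28, 28, 1    # image height, width, channels (illustrative values)
m = 10                     # number of output neurons (illustrative value)

image = rng.random((H, W_img, C))        # random stand-in for a real image
x_flat = image.reshape(-1)               # flatten: spatial layout is discarded
n = x_flat.shape[0]                      # n = H * W * C

W = rng.standard_normal((m, n)) * 0.01   # weight matrix, shape (m, n)
b = np.zeros(m)                          # bias, shape (m,)

y = W @ x_flat + b                       # output, shape (m,)

print(n, W.shape, y.shape)  # 784 (10, 784) (10,)
```

Every one of the m × n weights is a separate learnable number, which is where the parameter counts in the next section come from.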
As you can see, nothing in this process takes the two-dimensional layout of the image into account; each pixel is treated as an isolated value.
1.2 Three fatal flaws
① Parameter explosion: An ordinary photo of 224×224 requires 150 million parameters!
We use a simple Python function to get an intuition for how many parameters are needed for images of different sizes (assuming a fully connected layer maps pixels to 1000 hidden neurons):
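A minimal sketch of such a function might look like the following. The name `fc_params` and the particular image sizes are illustrative assumptions; the count is simply weights plus biases for one layer with 1000 hidden neurons.

```python
def fc_params(height, width, channels, hidden=1000):
    """Weights plus biases of one fully connected layer on a flattened image."""
    n_inputs = height * width * channels
    return n_inputs * hidden + hidden

# Image sizes chosen for illustration.
sizes = {
    "MNIST 28x28x1": (28, 28, 1),
    "CIFAR-10 32x32x3": (32, 32, 3),
    "ImageNet-style 224x224x3": (224, 224, 3),
}

for name, (h, w, c) in sizes.items():
    print(f"{name:<26} {fc_params(h, w, c):>13,}")

# MNIST 28x28x1                    785,000
# CIFAR-10 32x32x3               3,073,000
# ImageNet-style 224x224x3     150,529,000
```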
Run this code and you will see shocking numbers:
A single fully connected layer on a 224×224 RGB image, the input size commonly used for ImageNet models, already consumes about 150 million parameters - and this is only the first layer. Every additional layer adds to a model that is already very expensive to store and train.
② Lack of spatial perception: good pictures are "forcibly dismantled"
The first step of the fully connected layer is to flatten the image. Pixels that were vertical neighbors (such as the pixels around the cat's eyes) end up a full row width apart in the flattened vector, and the layer has no built-in notion of which inputs were adjacent. If you apply the same fixed random shuffle to the pixels of every photo, a fully connected layer can fit the shuffled data just as well as the originals, because only the order of its weight columns changes - yet to a human the shuffled image no longer looks like a cat.
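The shuffle argument can be checked numerically. The sketch below uses random data in place of a real image and random weights in place of a trained layer (both assumptions for illustration); it shows that shuffling the pixels and applying the same shuffle to the weight columns gives exactly the same output, so the layer cannot tell the two arrangements apart.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(28 * 28)                  # a flattened "image" (random stand-in)
W = rng.standard_normal((10, x.size))    # weights of a fully connected layer
b = rng.standard_normal(10)

perm = rng.permutation(x.size)           # one fixed shuffle of pixel positions
x_shuffled = x[perm]                     # scrambled image
W_shuffled = W[:, perm]                  # same shuffle applied to weight columns

y_original = W @ x + b
y_shuffled = W_shuffled @ x_shuffled + b

print(np.allclose(y_original, y_shuffled))  # True
```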
③ The risk of over-fitting is extremely high: the model can only “memorize answers” and cannot “understand” images.
The MNIST data set has only 60,000 training images, but a fully connected layer that maps its 28×28 pixels to 1000 hidden neurons has 785,000 parameters - about 13 times the number of training samples. A model with this much capacity can simply "memorize" the answer for each training sample instead of learning features common to the whole class, and without regularization its accuracy on images it has never seen can be far worse than on the training set.
2. Three "secret weapons" of convolutional layers
Core mechanism ①: Local Connectivity
Local connectivity is consistent with the intuition of biological vision - each photoreceptor cell in our retina is only sensitive to a small area of the field of view, rather than the entire field. CNNs draw on this idea: each output neuron is no longer connected to all inputs, but only to a small window of the input image (the area covered by the convolution kernel).
This design not only greatly compresses the parameters, but also naturally preserves the spatial structure of the image - what each output neuron "sees" is the characteristics of a small patch of the image.
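To make the idea concrete, here is a minimal sketch of a single locally connected output neuron. The 3×3 window size, the window position and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 224, 224, 3
k = 3                                    # window size (illustrative)
image = rng.random((H, W, C))            # random stand-in for a real image

# Weights and bias for ONE output neuron that looks at a k x k window.
w_local = rng.standard_normal((k, k, C))
b_local = 0.0

i, j = 100, 50                           # top-left corner of the window (arbitrary)
patch = image[i:i + k, j:j + k, :]       # the only pixels this neuron sees
y_ij = np.sum(patch * w_local) + b_local

print(patch.shape)     # (3, 3, 3)
print(w_local.size)    # 27 weights for the local neuron
print(H * W * C)       # 150528 weights for one fully connected neuron
```

The local neuron needs 27 weights where a fully connected neuron on the same image needs 150,528, and its output is tied to a specific location in the image.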
Core mechanism ②: Parameter Sharing
Local connections already reduce the parameter count considerably, but the convolutional layer goes one step further: the same convolution kernel (feature detector) is reused across the entire image, so its parameters are fully shared between positions.
It can be understood this way: a convolution kernel that specializes in detecting "vertical edges" should be able to detect whether a vertical edge appears in the upper left corner or lower right corner of the picture. We do not need to learn a separate "upper left corner vertical edge detector" and "lower right corner vertical edge detector" for each position, only one is enough. This completely frees the number of parameters from the constraints of image size.
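A short calculation illustrates how parameter sharing decouples the count from the image size. The helper names and the choices of a 3×3 kernel, 64 outputs and the three image sizes are assumptions made for this sketch. Note that the two layers are not equivalent in what they compute: the fully connected layer outputs 64 numbers, while the convolution outputs 64 feature maps.

```python
def fc_params(h, w, c, out_units):
    """Fully connected: every input value connects to every output unit."""
    return h * w * c * out_units + out_units

def conv_params(c_in, c_out, k):
    """Convolution: one k x k x c_in kernel plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

for h, w in [(32, 32), (224, 224), (512, 512)]:
    fc = fc_params(h, w, 3, out_units=64)
    conv = conv_params(c_in=3, c_out=64, k=3)
    print(f"{h}x{w}: fully connected {fc:>12,} | conv {conv:>6,}")

# 32x32: fully connected      196,672 | conv  1,792
# 224x224: fully connected    9,633,856 | conv  1,792
# 512x512: fully connected   50,331,712 | conv  1,792
```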
Counting parameters makes this concrete. For example, a 3×3 convolution that extracts 64 feature maps from an RGB image needs only 3 × 3 × 3 × 64 + 64 = 1,792 parameters regardless of image size, far fewer than a fully connected layer, and the gap widens as the image becomes larger.
Core mechanism ③: Translation Invariance
Since the same feature detector scans the entire image, when the object translates in the image, the detected feature response moves with it but does not disappear. Strictly speaking, this property of the convolution itself is called translation equivariance. By adding a pooling layer afterwards, the model can further ignore the precise location of the object and only care about whether a certain feature appears at all, which gives approximate invariance. This is a key reason why convolutional neural networks are robust to object position in image recognition.
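The behavior described above can be sketched with NumPy. The example assumes a toy 8×8 image containing a 2×2 bright blob as the "object", a matching 2×2 kernel, and global max pooling as the pooling step; the helper `cross_correlate2d` is a hypothetical name, implemented here with `sliding_window_view` (available in NumPy 1.20 and later).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation (no padding, stride 1)."""
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def peak_position(response):
    """Row and column of the strongest response, as plain Python ints."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(v) for v in idx)

kernel = np.ones((2, 2))                 # template matching a 2x2 bright blob

img_a = np.zeros((8, 8))
img_a[1:3, 1:3] = 1.0                    # object near the upper left

img_b = np.zeros((8, 8))
img_b[5:7, 4:6] = 1.0                    # same object, shifted down 4 and right 3

resp_a = cross_correlate2d(img_a, kernel)
resp_b = cross_correlate2d(img_b, kernel)

print(peak_position(resp_a), peak_position(resp_b))  # (1, 1) (5, 4)
print(resp_a.max(), resp_b.max())                    # 4.0 4.0
```

The peak of the feature map shifts by exactly the same offset as the object (equivariance), while the globally max-pooled value is identical for both images (invariance).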
3. Intuitive principle of convolution (simplified version)
In deep learning, the operation we actually use is usually Cross-Correlation, rather than convolution in the strict mathematical sense (the latter requires flipping the convolution kernel by 180° first). Because the kernel weights are learned, the network can absorb the flip, so the two are equivalent in practice, and cross-correlation is more intuitive: treat the convolution kernel as a "template", slide it over the input image, and compute the inner product of the template and the corresponding window at every position.
Intuitive implementation of two-dimensional cross-correlation
A small example illustrates the core operation of the convolutional layer: take a simple 5×5 image (with a 3×3 white square in the middle) and slide a classic Sobel vertical edge detection kernel over it. The large-magnitude values in the resulting matrix correspond to the vertical edges on the left and right sides of the square, with opposite signs because one edge goes from dark to bright and the other from bright to dark.
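A minimal sketch of this experiment is shown below, written with explicit loops so that the sliding-window logic is visible. The function name `cross_correlate2d` is an assumption, and the sketch uses no padding and a stride of 1, so the 5×5 input yields a 3×3 output.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and take the inner product per window."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# 5x5 image with a 3x3 white square in the middle.
image = np.zeros((5, 5))
image[1:4, 1:4] = 1.0

# Sobel kernel that responds to vertical edges (horizontal intensity changes).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

edges = cross_correlate2d(image, sobel_x)
print(edges)
# [[ 3.  0. -3.]
#  [ 4.  0. -4.]
#  [ 3.  0. -3.]]
```

The left column is positive (dark to bright), the right column is negative (bright to dark), and the middle column is zero because the window there sees no horizontal change.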
The convolutional layer completes the tasks of "local detection" and "feature extraction" in one step.
4. Minimalist practice: Use PyTorch to build a basic CNN
Next, we use the dimensions of CIFAR-10 to compare a minimalist fully connected network and a minimalist convolutional network to see how much the parameters can differ.
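One way to set up such a comparison is sketched below. The layer sizes are assumptions chosen to land near the totals quoted in this section (a 3072 → 512 → 10 fully connected network, and a two-convolution network with global average pooling); they are not the only reasonable choice, and `count_params` is a hypothetical helper.

```python
import torch
from torch import nn

def count_params(model):
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Minimal fully connected network for 3x32x32 inputs and 10 classes.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Minimal convolutional network for the same inputs.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 32x32 -> 32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 16x16 -> 16x16
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                       # 16x16 -> 1x1 per channel
    nn.Flatten(),
    nn.Linear(64, 10),
)

x = torch.randn(4, 3, 32, 32)                      # a dummy batch of 4 images
print(mlp(x).shape, cnn(x).shape)                  # both produce (4, 10) logits

print(f"MLP parameters: {count_params(mlp):,}")    # 1,578,506
print(f"CNN parameters: {count_params(cnn):,}")    # 20,042
```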
The parameter counts once again show how efficient convolution is.
Note that even though this minimalist CNN has only about 20,000 parameters, its structure is better suited to capturing local features in the image than the fully connected network with more than 1.5 million parameters, and it is typically faster to train and less prone to overfitting.
5. Summary
The transition from fully connected to convolution is a revolution in the field of computer vision. It reshapes the way image processing is done with three core ideas:
Review of core concepts
- Local connection: Each output neuron only looks at a small window of the input image, preserving spatial relationships.
- Parameter Sharing: The same convolution kernel slides across the entire image, greatly reducing the number of parameters and making feature detection independent of position.
- Translation invariance: After the object is translated in the image, the feature response will also move accordingly. With operations such as pooling, the model can ignore small changes in position.
💡 Study suggestion: Understanding these three core concepts is the key to getting started with CNNs! In the next section, we will explain in depth the hyperparameters of convolution (kernel size, stride, padding) and the details of the pooling layer to help you further understand the design logic of CNNs.