title: Common Deep Learning Models Explained: From AlexNet to Modern Architectures | Daoman PythonAI
description: An in-depth look at the deep learning models most widely used in computer vision, tracing the path from AlexNet to Vision Transformer and covering each model's architecture, application scenarios, and working principles.
keywords: [Deep learning model, CNN, AlexNet, VGG, ResNet, YOLO, Vision Transformer, model architecture, computer vision]

Common Deep Learning Models Explained: From AlexNet to Modern Architectures

Introduction

At the 2012 ImageNet competition, one landmark result reawakened convolutional neural networks after years of dormancy. Geoffrey Hinton and his two students, Alex Krizhevsky and Ilya Sutskever, trained AlexNet on two GTX 580 GPUs and cut the Top-5 error rate from the roughly 26% of traditional SIFT + SVM pipelines to 15.3% in one stroke, stunning the computer vision community. Since then, vision models have iterated at a breakneck pace: from classification alone to detection, segmentation, and generation; from cloud-scale behemoths to lightweight mobile models; from pure CNNs to Transformers taking the lead.

This article divides the evolution of CV models into four eras according to their core technical turning points, and breaks down the key innovation and practical value of each milestone model one by one. It also includes minimal code sketches and model selection tables to help you quickly build an intuition for "which model to use, and why."


1. The First Era: CNN Enlightenment and Standardization (2012 - 2014)

Era characteristics: CNNs replaced hand-crafted features such as SIFT and HOG, came to dominate ImageNet classification, and established the basic architectural paradigm of CV.

[2012 · Pioneering Work] AlexNet

Core task: Large-scale image classification
One-sentence breakthrough:

  • Demonstrated the power of deep CNNs for the first time on the roughly 1.2 million training images of ImageNet.
  • Introduced the ReLU activation function to mitigate vanishing gradients, Dropout to curb overfitting, and data augmentation (cropping, flipping, brightness adjustment, and so on) to expand the training set.
  • Split training across two GPUs in parallel so that the model fit within the memory limits of the hardware of the day.

Implementation value: Almost no one uses AlexNet directly today, but it is the "introductory textbook" for every CNN that followed.
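
To make the ReLU point concrete, here is a deliberately simplified sketch. It ignores weights entirely and only multiplies activation derivatives across layers, and the depth (10) and pre-activation value (0.5) are made-up numbers chosen for illustration.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid function at pre-activation z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU at pre-activation z (taken as 0 when z <= 0)."""
    return 1.0 if z > 0 else 0.0

def chained_grad(grad_fn, z, depth):
    """Product of `depth` identical activation derivatives (weights ignored)."""
    return grad_fn(z) ** depth

# Hypothetical setting: 10 layers, every unit at pre-activation z = 0.5.
sigmoid_chain = chained_grad(sigmoid_grad, 0.5, 10)
relu_chain = chained_grad(relu_grad, 0.5, 10)
print(sigmoid_chain)  # very small: each sigmoid factor is at most 0.25
print(relu_chain)     # active ReLU units pass the gradient through unchanged
```

The sigmoid derivative never exceeds 0.25, so the product shrinks geometrically with depth, whereas an active ReLU contributes a factor of exactly 1.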


[2013 · Visualization Tool] ZFNet

Core task: Large-scale image classification
One-sentence breakthrough:

  • Used a deconvolutional network (DeconvNet) to visualize what each CNN layer learns: shallow layers capture edges and color blobs, middle layers capture textures and shapes, and deep layers capture object parts and overall contours.

Implementation value: Laid the groundwork for model interpretability research and for the "anchor-aware" thinking in later object detection.


[2014 · Standardized Template] VGGNet

Core task: General feature extraction / image classification
One-sentence breakthrough:

  • Showed that "small 3×3 convolution kernels + 2×2 pooling + repeated stacking for depth" is a simple, effective, and easily reproduced paradigm for feature extraction.
  • Although the parameter count is huge (about 138 million for VGG16), the structure is uniform and regular, which makes it very easy to adapt as a vision backbone.

Implementation value: Still performs well in scenarios such as small-dataset classification and transfer learning from pre-trained weights.
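
One commonly cited reason the stacked 3×3 design works is that a stack of small kernels covers the same receptive field as one large kernel with fewer weights. The sketch below checks that arithmetic; the channel width of 64 is a hypothetical value and biases are ignored.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and outputs (bias ignored)."""
    return kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # hypothetical channel width
for layers, big_kernel in [(2, 5), (3, 7)]:
    # The stack of 3x3 layers sees the same window as the single large kernel.
    assert stacked_receptive_field(3, layers) == big_kernel
    stacked = layers * conv_weights(3, channels)
    single = conv_weights(big_kernel, channels)
    print(f"{layers} x (3x3): {stacked} weights vs one {big_kernel}x{big_kernel}: {single}")
```

The stack also inserts a nonlinearity after each 3×3 layer, which the single large kernel does not have.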


[2014 · Parallel Multi-Scale] GoogLeNet (Inception v1)

Core task: Image classification
One-sentence breakthrough:

  • Proposed the Inception module: within a single layer, 1×1, 3×3, and 5×5 convolutions and 3×3 pooling run in parallel to mix different receptive fields, while 1×1 convolutions reduce (or restore) the channel dimension and sharply cut the parameter count.
  • Replaced the final fully connected layers with Global Average Pooling (GAP), bringing the model down to only about 6 million parameters.

Implementation value: The idea of multi-scale feature fusion has influenced almost every subsequent detection and segmentation model.
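
The saving from a 1×1 reduction can be estimated by counting multiply-accumulates. This is a rough sketch with illustrative sizes (a 28×28 feature map, 192 input channels reduced to 16 before a 5×5 convolution that outputs 32 channels); the helper name is an assumption made for this example.

```python
def conv_macs(height, width, in_ch, out_ch, kernel_size):
    """Multiply-accumulates for a stride-1, same-padded convolution."""
    return height * width * in_ch * out_ch * kernel_size * kernel_size

# Illustrative branch: 28x28 feature map, 192 input channels, 32 output channels.
h, w, c_in, c_out = 28, 28, 192, 32
direct = conv_macs(h, w, c_in, c_out, 5)

reduced_ch = 16  # channels kept by the 1x1 reduction before the 5x5 conv
bottleneck = conv_macs(h, w, c_in, reduced_ch, 1) + conv_macs(h, w, reduced_ch, c_out, 5)

print(direct, bottleneck, round(direct / bottleneck, 1))
```

With these numbers the bottlenecked branch needs roughly a tenth of the computation of the direct 5×5 convolution.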


2. The Second Era: Residual Breakthrough and Multi-task Implementation (2015 - 2017)

Era characteristics: Residual connections solved the degradation problem in training deep networks, and models expanded from classification alone to detection, segmentation, and lightweight deployment.

[2015 · Milestone] ResNet

Core task: General-purpose vision backbone
One-sentence breakthrough:

  • Proposed the residual block: the network learns the residual (the difference between the desired output and the input) rather than the direct mapping. Deeper networks no longer suffer the degradation seen when plain networks grow beyond about 20 layers, and training at 1000+ layers becomes feasible.
  • Within the ResNet18/34/50/101/152 family, ResNet50 remains one of the most stable and versatile pre-trained backbones in industry.
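
In symbols, writing the stacked layers as a function F (notation assumed here for illustration), a residual block and its gradient with respect to the input can be sketched as:

```latex
y = \mathcal{F}(x) + x,
\qquad
\frac{\partial y}{\partial x} = \frac{\partial \mathcal{F}(x)}{\partial x} + I
```

The identity term I gives the gradient a direct path back to earlier layers even when the contribution from F is small, which is one common explanation for why very deep residual networks remain trainable.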

A minimal PyTorch sketch of the residual block:

import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Main path: 3×3 conv -> BN -> ReLU -> 3×3 conv -> BN
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels)
        )
        # Shortcut path: if stride != 1 or the channel count changes, a 1×1 conv adjusts the spatial size and channels
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.main(x) + self.shortcut(x)   # Core idea: the residual connection
        out = self.relu(out)
        return out

[2015 · Real-Time Detection] YOLO v1

Core task: End-to-end real-time object detection
One-sentence breakthrough:

  • Treats object detection as a regression problem: the image is divided into an S×S grid, each grid cell directly predicts B boxes with confidence scores plus C class probabilities, and everything is output end to end in a single stage.
  • Runs at 45 fps on a Titan X, opening the era of industrial-grade real-time detection.

Implementation value: Laid the foundation for the single-stage detectors that followed (the YOLO series, SSD, and others).
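
The grid formulation fixes the size of the prediction tensor. A small sketch, using the S=7, B=2, C=20 setting reported for PASCAL VOC in the YOLO v1 paper (the function name is an assumption for this example):

```python
def yolo_v1_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the YOLO v1 prediction tensor: S x S x (B * 5 + C).

    Each box carries 5 numbers (x, y, w, h, confidence); class probabilities
    are predicted once per grid cell, not once per box.
    """
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)

# PASCAL VOC setting: S=7, B=2, C=20.
shape = yolo_v1_output_shape(7, 2, 20)
print(shape)  # (7, 7, 30)
```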


[2015 · Accuracy Benchmark] Faster R-CNN

Core task: High-accuracy object detection
One-sentence breakthrough:

  • Proposed the Region Proposal Network (RPN) to replace traditional Selective Search, greatly shortening the time needed to generate candidate regions.
  • Two-stage detection: the RPN generates candidate boxes, which are then classified and regressed more precisely. It is still used where accuracy requirements are high, such as security and medical imaging.

Implementation value: Represents the high point of two-stage detector design.
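
The RPN scores a fixed set of reference boxes (anchors) at every feature-map location. The sketch below generates anchor shapes only; the default scales and aspect ratios follow the values described in the Faster R-CNN paper, while the function name and the (width, height) representation are assumptions for this example.

```python
import math

def anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs, one per scale / aspect-ratio combination.

    Each anchor keeps an area of scale**2 while its height/width ratio varies.
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes

anchors = anchor_shapes()
print(len(anchors))  # 9 anchors per feature-map location
for width, height in anchors:
    print(round(width), round(height))
```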


[2017 · Compute Revolution] MobileNet v1

Core task: Lightweight classification and detection for mobile and embedded devices
One-sentence breakthrough:

  • Introduced depthwise separable convolution: a standard 3×3 convolution is split into a per-channel 3×3 convolution (depthwise convolution) followed by a 1×1 convolution that mixes all channels (pointwise convolution), cutting computation and parameters to roughly one eighth to one ninth of the standard layer (close to a 90% reduction).
  • Provides a width multiplier α that controls model width, so the network can be adapted to devices with different compute budgets.

Implementation value: Freed AI from expensive GPUs and enabled wide use on devices such as mobile apps, drones, and security cameras.
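
The size of the saving follows from counting multiply-accumulates: the ratio of separable to standard cost is 1/N + 1/K², where N is the number of output channels and K the kernel size. A quick check with hypothetical layer sizes:

```python
def standard_conv_macs(h, w, in_ch, out_ch, k):
    """Multiply-accumulates of a standard k x k convolution."""
    return h * w * in_ch * out_ch * k * k

def depthwise_separable_macs(h, w, in_ch, out_ch, k):
    """Multiply-accumulates of a depthwise k x k conv plus a pointwise 1x1 conv."""
    depthwise = h * w * in_ch * k * k    # one k x k filter per input channel
    pointwise = h * w * in_ch * out_ch   # 1x1 conv mixes information across channels
    return depthwise + pointwise

# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel.
h, w, c_in, c_out, k = 56, 56, 128, 128, 3
ratio = depthwise_separable_macs(h, w, c_in, c_out, k) / standard_conv_macs(h, w, c_in, c_out, k)
print(round(ratio, 4))  # equals 1/c_out + 1/k**2
```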

A minimal PyTorch sketch of depthwise separable convolution:

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # 1. Depthwise convolution: each channel gets its own 3×3 convolution
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride, padding=1, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True)
        )
        # 2. Pointwise convolution: a 1×1 convolution fuses features across all channels
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

[2017 · Pixel-Level Segmentation] Mask R-CNN

Core task: Instance segmentation
One-sentence breakthrough:

  • Adds a fully convolutional segmentation branch on top of Faster R-CNN, outputting detection boxes, classes, and instance masks at the same time.
  • Replaces RoIPooling with RoIAlign, which removes the pixel misalignment caused by coordinate quantization during pooling and substantially improves segmentation accuracy.

Implementation value: A powerful tool for fine-grained pixel-level tasks such as autonomous driving, medical image annotation, and e-commerce image cutouts.
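
The difference between the two pooling schemes comes down to how a fractional coordinate is read from the feature map. This toy sketch contrasts snapping to a cell (the quantization RoIPooling relies on, shown here as simple rounding) with bilinear interpolation (the sampling RoIAlign uses); the 2×2 feature map and the sample point are made-up values.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D grid (list of rows) at fractional (y, x) by bilinear interpolation."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 10.0],
           [20.0, 30.0]]

rounded = feature[round(0.4)][round(0.4)]     # quantized read: snaps to cell (0, 0)
aligned = bilinear_sample(feature, 0.4, 0.4)  # fractional read: blends the 4 neighbors
print(rounded, aligned)
```

The quantized read throws away the sub-cell offset, which matters little for box classification but visibly shifts pixel-level masks.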


3. The Third Era: Attention Breakthrough and Efficient Evolution (2018 - 2020)

Era characteristics: The Transformer's attention mechanism entered CV, and models learned to "focus" while continuing to evolve toward smaller, faster, and stronger designs.

[2018 · Multi-Scale Detail] DeepLabV3+

Core task: Semantic segmentation
One-sentence breakthrough:

  • Uses atrous (dilated) convolution to enlarge the receptive field without adding parameters, handling objects of different sizes flexibly.
  • Adopts an encoder-decoder structure: the encoder extracts high-level semantics, and the decoder recovers fine spatial detail, such as object boundaries, with the help of low-level features.

Implementation value: Performs very well in scenarios such as road and lane recognition for autonomous driving and remote-sensing segmentation for urban planning.
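
The "larger receptive field, same parameters" property of atrous convolution follows from a simple formula: a kernel of size k with dilation rate r spans k + (k − 1)(r − 1) pixels. A short sketch, using dilation rates of the kind used in DeepLab's atrous spatial pyramid pooling as example inputs:

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def kernel_weights(kernel_size):
    """Number of weights per input/output channel pair; independent of dilation."""
    return kernel_size * kernel_size

for rate in (1, 6, 12, 18):
    print(rate, effective_kernel_size(3, rate), kernel_weights(3))
```

Every row reports 9 weights, while the covered extent grows from 3 to 37 pixels.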


[2019 · Leaderboard Topper] EfficientNet

Core task: General-purpose, efficient backbone for classification and detection
One-sentence breakthrough:

  • Used neural architecture search (NAS) to find an efficient baseline model, EfficientNet-B0.
  • Proposed compound scaling: depth (number of layers), width (number of channels), and resolution (input size) are scaled together. EfficientNet-B7 reached 84.3% ImageNet Top-1 accuracy with about 66 million parameters, roughly 8.4× fewer than the previous best ConvNet, while B0 roughly matches ResNet50 accuracy with about one fifth of its parameters.

Implementation value: Still one of the general-purpose pre-trained classification models with the best balance between accuracy and efficiency.
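
Compound scaling ties the three dimensions to a single coefficient φ. The sketch below uses the base values reported in the EfficientNet paper (α=1.2 for depth, β=1.1 for width, γ=1.15 for resolution, chosen so that α·β²·γ² is close to 2); the function names are assumptions, and the released B1 to B7 models do not follow integer values of φ exactly.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the paper

def compound_scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2), round(flops_multiplier(phi), 2))
```

Each unit increase in φ therefore costs roughly twice the computation, which makes the accuracy/compute trade-off easy to budget.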


[2020 · Engineering Favorite] YOLOv5

Core task: End-to-end real-time object detection
One-sentence breakthrough:

  • Although it is not a release from the original YOLO authors, its code is cleanly structured, its community is active, and its ecosystem is complete (pre-trained models, quantization, pruning, and multi-framework deployment support).
  • Incorporates engineering techniques such as Mosaic data augmentation, adaptive anchor boxes, and a CSPDarknet backbone, delivering strong accuracy and speed.

Implementation value: Arguably the most widely deployed object detection model among small and medium-sized teams in China, and almost ready to use out of the box.


[2020 · Paradigm Shift] Vision Transformer (ViT)

Core task: General visual feature extraction / image classification
One-sentence breakthrough:

  • Showed for the first time that a pure Transformer architecture can surpass CNNs when pre-trained on very large image datasets (such as JFT-300M).
  • Cuts the image into patches of 16×16 pixels and feeds them to the Transformer as a sequence, opening a new era of CV in which "everything is a sequence."

Implementation value: Paved the way for later vision Transformers such as Swin Transformer and SegFormer.
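
The "image as a sequence" idea is easy to quantify. Assuming a square RGB input (224×224 is used here as a typical example) and 16×16 patches, the sequence length and per-patch vector size are:

```python
def patch_sequence_shape(image_size, patch_size, channels=3):
    """Return (num_patches, values_per_patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2, patch_size * patch_size * channels

print(patch_sequence_shape(224, 16))  # (196, 768)
```

Each flattened patch is then linearly projected to the model dimension before entering the Transformer, much like a token embedding in NLP.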


4. The Fourth Era: Large Models and General Intelligence (2021 - Present)

Era characteristics: Multimodal fusion is on the rise, and models show strong generalization along with zero-shot and few-shot recognition capabilities.

[2021 · Image-Text Alignment] CLIP

Core task: Cross-modal understanding / zero-shot classification
One-sentence breakthrough:

  • Uses contrastive learning to pre-train on 400 million image-text pairs, so the model learns the semantic correspondence between text and images.
  • Achieves zero-shot transfer: without retraining, images can be classified into arbitrary categories given only text descriptions of those categories.

Implementation value: A core building block for multimodal search and for text-to-image systems; Stable Diffusion, for example, uses a CLIP text encoder.
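
Zero-shot classification reduces to comparing one image embedding with one text embedding per candidate label. The sketch below uses made-up 3-dimensional vectors in place of real encoder outputs, and the temperature value is an assumption for illustration; only the cosine-similarity-plus-softmax logic is the point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Map each text prompt to a probability given one image embedding."""
    labels = list(text_embs)
    sims = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    return dict(zip(labels, softmax(sims)))

# Made-up embeddings standing in for the outputs of the text and image encoders.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Adding a new category only requires writing a new prompt and embedding it; no classifier weights are retrained.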


[2021 · Local Attention] Swin Transformer

Core task: General-purpose vision backbone / dense prediction (detection, segmentation)
One-sentence breakthrough:

  • Proposed hierarchical shifted-window attention: the image is divided into non-overlapping windows, attention is computed only within each window, and the windows are then shifted so that information flows between them. This reduces the attention cost from O(n²) to O(n) in the number of patches n.
  • Performs very well on dense prediction tasks such as detection and segmentation, surpassing the strongest CNN backbones of the time.

Implementation value: Still one of the preferred general-purpose backbones for dense prediction tasks in industry.
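
The complexity claim can be checked with the cost expressions given in the Swin Transformer paper for global multi-head self-attention and window attention on an h×w feature map with C channels and window size M. The feature-map sizes below are illustrative, and the function names are assumptions for this sketch.

```python
def global_attention_cost(h, w, c):
    """Cost of global self-attention: 4*h*w*C^2 + 2*(h*w)^2*C."""
    n = h * w
    return 4 * n * c * c + 2 * n * n * c

def window_attention_cost(h, w, c, window=7):
    """Cost of window attention: 4*h*w*C^2 + 2*M^2*h*w*C."""
    n = h * w
    return 4 * n * c * c + 2 * window * window * n * c

for side in (56, 112):
    g = global_attention_cost(side, side, 96)
    win = window_attention_cost(side, side, 96)
    print(side, g, win, round(g / win, 1))
```

Doubling the side length quadruples the number of patches; the window cost grows by exactly that factor, while the global cost grows much faster.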


[2023 · Universal Segmentation] SAM (Segment Anything)

Core task: Universal zero-shot / few-shot image segmentation
One-sentence breakthrough:

  • Pre-trained on 11 million images and about 1 billion masks, establishing the new paradigm of a "universal segmentation model."
  • Needs no retraining: given a point or box prompt (text prompts were explored in the paper as a proof of concept), it can segment an arbitrary target, which sharply reduces the cost of data annotation.

Implementation value: Widely used in e-commerce image cutouts, medical image annotation, video editing, and similar scenarios.


[2024 to Present · End-to-End Intelligence] YOLOv10 / InternVL, etc.

Core task: Real-time detection / multimodal large models
One-sentence breakthrough:

  • YOLOv10: Removes the need for NMS post-processing, reducing end-to-end latency while keeping accuracy competitive.
  • Multimodal large models such as InternVL: Combine a ViT with an LLM, letting machines describe images in natural language and act on visual input.

Implementation value: Represents the most cutting-edge direction of development in the CV field today.
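
For context on what YOLOv10 does away with, here is a minimal sketch of the classic greedy NMS step that earlier detectors run after the network. The boxes, scores, and threshold are made-up values, and this is the textbook procedure rather than any particular library's implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Because this loop runs outside the network and its cost depends on how many boxes survive, removing it makes latency more predictable.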


Supplement: Generative Models (2014 - Present)

Although generative models are not, strictly speaking, discriminative vision models, they are an important part of modern CV:

  • GAN (2014): The origin of adversarial training, enabling image generation and style transfer.
  • CycleGAN (2017): Image-to-image translation without paired data (e.g., cat to dog, day to night).
  • Diffusion models (2020): Generative models based on gradually adding noise and learning to remove it; currently the mainstream technology behind AI image generation.

5. Model Selection Guide

Here are two concise tables to help you quickly locate the appropriate model:

Table 1: Selection by task type

| Task type | Recommended model (accuracy first) | Recommended model (speed first) |
| --- | --- | --- |
| Image Classification | Swin Transformer, ViT | MobileNetv3, EfficientNet-Lite |
| Object Detection | Faster R-CNN, DETR | YOLOv8/10, SSD |
| Semantic Segmentation | SegFormer, DeepLabV3+ | MobileNetv3-DeepLabV3+ |
| Instance Segmentation | Mask R-CNN, SOLOv2 | YOLACT, YOLOv8-Seg |
| Universal Segmentation | SAM | SAM-Lite |

Table 2: Select by deployment environment

| Deployment environment | Computing power requirements | Recommended models |
| --- | --- | --- |
| Cloud high-performance GPU | Extremely high | Swin Transformer-Large, ViT |
| Ordinary GPU in the cloud | Medium to high | ResNet50/101, EfficientNet-B3/B4 |
| Mobile/Embedded | Low/Very Low | MobileNetv3, EfficientNet-Lite, YOLOv8n |

6. Trends and Future Directions

Current Trends

  1. Multimodal fusion: Combining modalities such as vision, text, audio, and point clouds to build general-purpose large models.
  2. Self-supervised learning: Methods such as SimCLR and MAE reduce dependence on annotated data.
  3. Model compression: Quantization, pruning, knowledge distillation, and related techniques allow large models to run on edge devices.
  4. Neural architecture search: Automatically designs model structures, reducing manual trial and error.

Future Directions

  1. Universal vision model: One model handles classification, detection, segmentation, generation, and other tasks.
  2. Continual learning: The model keeps learning new tasks without forgetting old knowledge.
  3. Explainability: Improving the transparency of model decisions so that AI can explain its reasoning clearly.
  4. Energy efficiency: Reducing energy consumption while maintaining performance, to suit scenarios such as robots and drones.

7. Summary

Deep learning models have evolved from simple to complex, from specialized to general, and from heavyweight to lightweight. Each milestone addressed a key pain point:

  • AlexNet proved the power of deep CNNs.
  • ResNet solved the degradation problem in training deep networks.
  • ViT opened the era of "everything is a sequence."
  • CLIP and SAM brought zero-shot and universal capabilities.

Understanding this line of evolution gives you solid footing when discussing the technology, and, more importantly, the ability to choose the right tool when working on real projects.


It is best to study these models in timeline order, focusing on the **pain points** each model addresses and its **core innovations**, while also paying attention to how models inherit from and build on one another (for example, ResNet inherits the small-kernel paradigm of VGG, and Swin inherits the sequence idea of ViT).

🔗 Extended reading