Semantic Segmentation: Pixel-Level Image Understanding and the U-Net Architecture
Introduction
Semantic segmentation is one of the core tasks in computer vision. Rather than only identifying what is in an image, it assigns a semantic class label to every pixel, which delineates each object's boundary and spatial extent. From organ segmentation in medical imaging to road perception in autonomous driving, it is a foundational technique.
📂 Stage: Stage 2 - Deep Learning Vision Basics (CNN) 🔗 Related chapters: YOLO Family in Practice · Keypoint Detection (Keypoints)
1. Basic concepts of semantic segmentation
1.1 Task definition and comparison with related vision tasks
The input to semantic segmentation is an image of height H, width W, and C channels. The output is a score map of height H, width W, and depth N, where N is the number of predefined classes; taking the highest-scoring channel at each pixel position (i,j) assigns that pixel one class label from the set {1,2,...,N}.
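The mapping from score map to label map can be sketched in a few lines of NumPy. This is only an illustration: the sizes and the random scores are made up, and the indices produced by `argmax` are 0-based (0 to N-1) rather than the 1 to N used in the text.

```python
import numpy as np

H, W, N = 4, 6, 3                     # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
scores = rng.normal(size=(H, W, N))   # stand-in for a network's H x W x N output

# The predicted class of pixel (i, j) is the channel with the highest score.
# argmax returns 0-based class indices in the range 0..N-1.
label_map = scores.argmax(axis=-1)

print(scores.shape, label_map.shape)
```

The label map has the same height and width as the input image, with one integer class index per pixel.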
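Semantic segmentation differs from related vision tasks mainly in the granularity of its output: image classification assigns one label to the whole image, object detection predicts a bounding box and class for each object, semantic segmentation labels every pixel but does not separate individual objects of the same class, and instance segmentation additionally distinguishes those individual objects.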
1.2 Core application scenarios
Semantic segmentation is applied across a wide range of domains:
- Medical imaging: organ and tumor segmentation, pathology slide analysis
- Autonomous driving: road, lane, and obstacle segmentation
- Remote sensing: land-use classification, urban planning, environmental monitoring
- Smart agriculture: crop, pest, and disease monitoring, yield estimation
- Robotics: scene understanding, grasp localization
- Fashion and entertainment: clothing segmentation, virtual try-on, matting for film and TV post-production
2. Classic semantic segmentation architectures
2.1 FCN: the pioneering fully convolutional network
FCN (Fully Convolutional Networks, 2015) is a milestone in semantic segmentation: it was among the first networks trained end to end to produce pixel-level predictions.
Core Contribution
- Fully convolutional design: the fully connected layers of the classification network are replaced with convolutional layers, so the network accepts inputs of any size;
- Transposed-convolution upsampling: transposed convolutions (often called "deconvolutions") gradually restore spatial resolution;
- Skip connections: fine, low-level features from shallow layers are fused by element-wise addition with coarse, high-level semantic predictions from deep layers, recovering detail that is lost during downsampling.
PyTorch implementation (FCN-8s)
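A minimal sketch of an FCN-8s-style network is given here. It assumes a toy five-stage backbone (`TinyFCN8s` and `conv_block` are hypothetical names, and the channel widths are arbitrary) instead of the VGG-16 backbone used in the original FCN, and it expects input sizes divisible by 32. What it does keep is the FCN-8s fusion pattern: score maps at 1/32, 1/16, and 1/8 resolution are combined by addition and upsampled with transposed convolutions.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """3x3 conv + ReLU followed by 2x2 max pooling (halves H and W)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class TinyFCN8s(nn.Module):
    """FCN-8s-style fusion on a toy backbone (illustrative, not VGG-16)."""

    def __init__(self, in_channels=3, num_classes=21):
        super().__init__()
        self.stage1 = conv_block(in_channels, 16)   # 1/2 resolution
        self.stage2 = conv_block(16, 32)            # 1/4
        self.stage3 = conv_block(32, 64)            # 1/8  (plays the role of pool3)
        self.stage4 = conv_block(64, 128)           # 1/16 (pool4)
        self.stage5 = conv_block(128, 256)          # 1/32 (pool5)
        # 1x1 convolutions turn feature maps into per-class score maps.
        self.score3 = nn.Conv2d(64, num_classes, 1)
        self.score4 = nn.Conv2d(128, num_classes, 1)
        self.score5 = nn.Conv2d(256, num_classes, 1)
        # kernel=2s, stride=s, padding=s/2 upsamples by exactly a factor of s.
        self.up2_a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2_b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, x):
        f3 = self.stage3(self.stage2(self.stage1(x)))
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        s = self.up2_a(self.score5(f5))   # 1/32 -> 1/16
        s = s + self.score4(f4)           # skip connection: element-wise addition
        s = self.up2_b(s)                 # 1/16 -> 1/8
        s = s + self.score3(f3)           # second skip connection
        return self.up8(s)                # 1/8 -> full resolution


model = TinyFCN8s(num_classes=5)
logits = model(torch.randn(2, 3, 64, 64))
print(logits.shape)
```

The "8s" refers to the final upsampling stride of 8: the last transposed convolution only has to bridge a factor of 8 because the two skip connections have already restored part of the resolution.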
2.2 U-Net: The “gold standard” for medical segmentation
U-Net (2015) was originally designed for biomedical image segmentation. Its symmetric U-shaped structure and effective skip connections have made it one of the most widely used baseline architectures in segmentation.
Core Features
- Symmetric encoder-decoder: the encoder downsamples to extract semantics, and the decoder upsamples to restore spatial resolution;
- Concatenation skip connections: encoder features are concatenated with decoder features along the channel dimension, rather than added element-wise as in FCN, which retains more detail;
- Small-dataset friendly: it can achieve good results even with a small amount of labeled data.
PyTorch implementation
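A compact sketch of a U-Net-style network follows. Several simplifications are assumptions of this example rather than features of the original U-Net: only two downsampling levels, padded 3x3 convolutions (so no cropping is needed before concatenation), batch normalization, and small channel widths. `DoubleConv` and `SmallUNet` are illustrative names.

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    """Two 3x3 conv + BatchNorm + ReLU layers; spatial size is preserved."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class SmallUNet(nn.Module):
    """Two-level U-Net sketch; input H and W must be divisible by 4."""

    def __init__(self, in_channels=1, num_classes=2, base=16):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        # Encoder: channels grow while resolution shrinks.
        self.enc1 = DoubleConv(in_channels, base)
        self.enc2 = DoubleConv(base, base * 2)
        self.bottleneck = DoubleConv(base * 2, base * 4)
        # Decoder: each step upsamples 2x, then fuses the matching encoder map.
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = DoubleConv(base * 4, base * 2)   # input = upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = DoubleConv(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                      # full resolution
        e2 = self.enc2(self.pool(e1))          # 1/2
        b = self.bottleneck(self.pool(e2))     # 1/4
        # Skip connections concatenate along the channel dimension (dim=1).
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                   # (B, num_classes, H, W) logits


model = SmallUNet(in_channels=1, num_classes=3)
logits = model(torch.randn(2, 1, 32, 32))
print(logits.shape)
```

Because the skip features are concatenated rather than added, each decoder block receives twice as many input channels as the upsampling layer produces, which is why `dec2` and `dec1` take `base * 4` and `base * 2` channels.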
2.3 DeepLab: Atrous convolution and multi-scale modeling
The DeepLab series (2016-2018) is built around atrous (dilated) convolution, which enlarges the receptive field without reducing feature-map resolution or adding parameters, and ASPP (Atrous Spatial Pyramid Pooling), which applies atrous convolutions at several dilation rates in parallel to capture multi-scale context.
Core components: ASPP
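The following is a simplified sketch of an ASPP module in the style of DeepLabv3: a 1x1 branch, several 3x3 atrous branches, and an image-level pooling branch, concatenated and projected back to a fixed width. The dilation rates (6, 12, 18) and the channel widths are assumptions for illustration, and the batch normalization and ReLU layers used in practice are omitted to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPP(nn.Module):
    """Simplified Atrous Spatial Pyramid Pooling (no BatchNorm/ReLU)."""

    def __init__(self, in_ch, out_ch=64, rates=(6, 12, 18)):
        super().__init__()
        # Branch 0: plain 1x1 convolution.
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, kernel_size=1)])
        for r in rates:
            # A 3x3 kernel with dilation r spans (2r + 1) pixels per side;
            # padding == dilation keeps H and W unchanged.
            self.branches.append(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            )
        # Image-level branch: global average pool, then 1x1 convolution.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        # 1x1 projection over the concatenation of all branches.
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = self.image_pool(x)                       # (B, out_ch, 1, 1)
        pooled = F.interpolate(pooled, size=(h, w), mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))


aspp = ASPP(in_ch=32)
out = aspp(torch.randn(1, 32, 20, 20))
print(out.shape)
```

Every branch sees the same input at the same resolution; only the dilation rate, and therefore the amount of context each output pixel aggregates, differs between branches.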
3. Semantic segmentation loss functions
Segmentation tasks often suffer from class imbalance (for example, tumor pixels make up a very small fraction of a medical image), so standard cross-entropy is often complemented or replaced by losses designed for imbalance, such as Dice loss and focal loss.
Common loss code implementation
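Two losses commonly used for imbalanced segmentation, soft Dice loss and focal loss, can be sketched as below. The function names, the smoothing constant `eps`, and the default `gamma=2.0` are choices made for this example; the per-class weighting factor that focal loss often carries is left out for brevity.

```python
import torch
import torch.nn.functional as F


def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes.

    logits: (B, N, H, W) raw scores; target: (B, H, W) int64 labels in [0, N).
    """
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                   # sum over batch and space
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()


def focal_loss(logits, target, gamma=2.0):
    """Focal loss: per-pixel cross-entropy scaled by (1 - p_t) ** gamma."""
    ce = F.cross_entropy(logits, target, reduction="none")   # equals -log(p_t)
    p_t = torch.exp(-ce)                                     # prob. of the true class
    return ((1 - p_t) ** gamma * ce).mean()


torch.manual_seed(0)
logits = torch.randn(2, 3, 8, 8)
target = torch.randint(0, 3, (2, 8, 8))
print(dice_loss(logits, target).item(), focal_loss(logits, target).item())
```

Dice loss measures region overlap, so a rare class contributes as much as a frequent one; focal loss keeps the cross-entropy form but shrinks the contribution of pixels that are already classified confidently. With `gamma=0`, focal loss reduces to plain cross-entropy.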
4. Data preprocessing and augmentation
The key requirement in segmentation is that the image and its mask receive the same geometric transform, so that every pixel stays aligned with its label. The Albumentations library is recommended because it applies transforms to the image and the mask together.
Dedicated data augmentation strategy
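The principle of synchronized augmentation can be illustrated in plain NumPy. This sketch is not the Albumentations API: `random_flip_and_crop` and `jitter_brightness` are hypothetical helpers written for this example. Geometric changes (flip, crop) use the same random parameters for image and mask, while photometric changes (brightness) touch the image only.

```python
import numpy as np


def random_flip_and_crop(image, mask, crop_size, rng):
    """Apply one random horizontal flip and one random crop to both arrays.

    image: (H, W, C) array; mask: (H, W) array of class indices.
    """
    assert image.shape[:2] == mask.shape[:2], "image and mask must be aligned"
    if rng.random() < 0.5:
        # Flip the width axis of both arrays with the same decision.
        image, mask = image[:, ::-1], mask[:, ::-1]
    h, w = mask.shape
    ch, cw = crop_size
    top = rng.integers(0, h - ch + 1)     # same crop window for image and mask
    left = rng.integers(0, w - cw + 1)
    image = image[top:top + ch, left:left + cw]
    mask = mask[top:top + ch, left:left + cw]
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)


def jitter_brightness(image, rng, max_delta=0.2):
    """Photometric change: applied to the image only, never to the mask."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
mask = rng.integers(0, 3, size=(32, 32)).astype(np.uint8)
image = np.stack([mask * 100] * 3, axis=-1)   # toy image that encodes its own labels
aug_image, aug_mask = random_flip_and_crop(image, mask, (24, 24), rng)
bright_image = jitter_brightness(aug_image, rng)
print(aug_image.shape, aug_mask.shape)
```

Because the toy image encodes its own labels, alignment is easy to check: after the geometric transform, `aug_image[..., 0]` still equals `aug_mask * 100` at every pixel.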
Custom dataset class
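A minimal dataset class might look like the sketch below. To stay self-contained it assumes the images and masks are already loaded as NumPy arrays in memory (a real dataset would typically read files from disk), and it assumes `transform` is any callable that takes and returns an `(image, mask)` pair. `ArraySegmentationDataset` is an illustrative name.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArraySegmentationDataset(Dataset):
    """Pairs of (H, W, 3) uint8 images and (H, W) integer masks held in memory."""

    def __init__(self, images, masks, transform=None):
        assert len(images) == len(masks), "each image needs exactly one mask"
        self.images = images
        self.masks = masks
        self.transform = transform   # callable: (image, mask) -> (image, mask)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image, mask = self.images[idx], self.masks[idx]
        if self.transform is not None:
            # The image and mask go through the transform together.
            image, mask = self.transform(image, mask)
        # Image: HWC uint8 -> CHW float in [0, 1]. Mask: int64 class indices.
        image = torch.from_numpy(np.ascontiguousarray(image)).permute(2, 0, 1).float() / 255.0
        mask = torch.from_numpy(np.ascontiguousarray(mask)).long()
        return image, mask


rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8) for _ in range(4)]
masks = [rng.integers(0, 3, size=(16, 16)).astype(np.uint8) for _ in range(4)]

dataset = ArraySegmentationDataset(images, masks)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_images, batch_masks = next(iter(loader))
print(batch_images.shape, batch_masks.shape, batch_masks.dtype)
```

The mask is returned as an int64 tensor without a channel dimension because that is the target format `torch.nn.CrossEntropyLoss` expects for per-pixel classification.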
5. Model training and evaluation
Core training process
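One epoch of training can be sketched as follows. The model here is a single convolution and the data is random, both placeholders so that the example runs on its own; `train_one_epoch` is an illustrative helper, and `ignore_index=255` assumes the common convention of marking unlabeled pixels with 255.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, criterion, device="cpu"):
    """Run one pass over `loader` and return the mean per-sample loss."""
    model.train()
    total_loss, num_samples = 0.0, 0
    for images, masks in loader:
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        logits = model(images)             # (B, N, H, W)
        loss = criterion(logits, masks)    # masks: (B, H, W) int64 class indices
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        num_samples += images.size(0)
    return total_loss / num_samples


torch.manual_seed(0)
num_classes = 4
model = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)   # placeholder model
# Placeholder "loader": a list of (images, masks) batches of random data.
loader = [
    (torch.randn(2, 3, 16, 16), torch.randint(0, num_classes, (2, 16, 16)))
    for _ in range(3)
]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 = unlabeled (assumed convention)

losses = [train_one_epoch(model, loader, optimizer, criterion) for _ in range(10)]
print(losses[0], losses[-1])
```

Any iterable of `(images, masks)` batches works as the loader, so the same function applies unchanged to a `torch.utils.data.DataLoader` and to a real segmentation network.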
Core evaluation metric: mIoU
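For each class c, IoU is TP_c / (TP_c + FP_c + FN_c), and mIoU is the mean of the per-class IoU values. A minimal NumPy sketch computes it from a confusion matrix; the helper names are illustrative, and classes that appear in neither the prediction nor the ground truth are excluded from the mean, which is one common convention among several.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Rows index the true class, columns the predicted class."""
    valid = (target >= 0) & (target < num_classes)    # drop ignore labels such as 255
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Return (mIoU, per-class IoU); classes with an empty union become NaN."""
    tp = np.diag(conf).astype(np.float64)
    # union = TP + FP + FN = column sum + row sum - TP
    union = (conf.sum(axis=0) + conf.sum(axis=1)).astype(np.float64) - tp
    iou = np.full(conf.shape[0], np.nan)
    np.divide(tp, union, out=iou, where=union > 0)
    return np.nanmean(iou), iou


target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])

conf = confusion_matrix(pred, target, num_classes=3)
miou, per_class = mean_iou(conf)
print(per_class, miou)   # per-class IoU = [1/3, 2/3, 1/2], mIoU = 0.5
```

When evaluating a whole dataset, the confusion matrices of all images are summed first and `mean_iou` is applied once to the total, which weights every pixel equally regardless of which image it came from.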
6. Trends in modern segmentation architectures
- Transformer-based models: hybrid and pure Transformer architectures such as SegFormer, Swin-Unet, and TransUNet are better at modeling long-range dependencies;
- Real-time segmentation: lightweight architectures such as BiSeNet, DFANet, and Fast-SCNN balance speed and accuracy for mobile and autonomous-driving scenarios;
- Unified models: architectures such as Mask2Former handle semantic, instance, and panoptic segmentation within a single framework.
7. Summary
At its core, semantic segmentation is pixel-level classification. The classic architectures evolved around two problems, restoring spatial resolution and fusing multi-scale information:
- FCN pioneered fully convolutional prediction and skip connections;
- U-Net established the symmetric U-shaped encoder-decoder as a general-purpose baseline;
- DeepLab introduced atrous convolution and ASPP to handle multi-scale context.