Model lightweighting: A detailed guide to MobileNet, quantization, pruning, and edge deployment
Introduction
Model lightweighting is the last mile in putting deep learning into production. It greatly reduces a model's parameter count, computation, and memory footprint with almost no loss of accuracy, so that powerful AI models can run smoothly on resource-constrained edge devices such as mobile phones, smart cameras, and IoT chips. This tutorial systematically covers the lightweight architecture of the MobileNet series, model quantization, network pruning, and deployment practice, from principles through to code.
1. Core evaluation criteria for lightweight models
1.1 Why lightweight?
1.2 Key evaluation metrics
- Parameter count (Params): the total number of model weights, which directly determines the storage size of the model (Float32 = 4 bytes per parameter)
- Computation (FLOPs): the number of floating-point operations in one inference pass, which determines the theoretical compute cost on a given piece of hardware
- Inference latency: the time taken for one complete inference (millisecond-level latency is the baseline requirement for edge applications)
- Peak memory: the maximum memory occupied during inference (GPU memory or system RAM)
- Accuracy retention rate: the ratio of the Top-1/Top-5 accuracy after lightweighting to that of the original model
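The metrics above can be measured with a few lines of PyTorch. The following is a minimal sketch, not a benchmarking tool: the helper names (`count_params`, `size_mb`, `measure_latency_ms`, `accuracy_retention`) and the toy model are assumptions made for illustration, and CPU wall-clock timing is only a rough proxy for latency on a real edge device.

```python
import time

import torch
import torch.nn as nn


def count_params(model: nn.Module) -> int:
    """Total number of weights in the model."""
    return sum(p.numel() for p in model.parameters())


def size_mb(model: nn.Module, bytes_per_param: int = 4) -> float:
    """Estimated storage size, assuming Float32 (4 bytes per parameter)."""
    return count_params(model) * bytes_per_param / 1024 ** 2


@torch.no_grad()
def measure_latency_ms(model: nn.Module, x: torch.Tensor,
                       warmup: int = 5, runs: int = 20) -> float:
    """Average wall-clock time of one forward pass, in milliseconds."""
    model.eval()
    for _ in range(warmup):          # warm-up runs are not timed
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1000


def accuracy_retention(acc_light: float, acc_base: float) -> float:
    """Ratio of the lightweight model's accuracy to the original model's."""
    return acc_light / acc_base


# Hypothetical toy model, used only to exercise the helpers.
toy = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
print(count_params(toy), f"{size_mb(toy):.4f} MB")
print(f"{measure_latency_ms(toy, torch.randn(1, 3, 64, 64)):.3f} ms")
```

Peak memory is not covered here because how it is measured depends on the target device and runtime.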
2. Lightweight network architecture: optimizing at the design stage
2.1 Core unit: Depthwise separable convolution
Depthwise separable convolution is the cornerstone of the MobileNet family. It splits an ordinary convolution into two steps:
- Depthwise convolution: each input channel is processed separately by its own 3×3 convolution; channels are not mixed
- Pointwise convolution: a 1×1 convolution mixes the outputs of all channels and adjusts the number of channels
💡 Computation comparison: assume the input feature map has size H × W, with C_in input channels and C_out output channels.
The FLOPs of a regular 3×3 convolution are approximately 9 × C_in × C_out × H × W.
The FLOPs of a depthwise separable convolution are approximately (9 × C_in + C_in × C_out) × H × W.
The ratio of the second to the first is 1/C_out + 1/9, so when the number of channels is large the computation drops by roughly 8 to 9 times.
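To make the comparison concrete, the two formulas can be evaluated for one layer. This is a small arithmetic sketch in plain Python; the function names and the 56 × 56 × 128 layer size are hypothetical values chosen for illustration.

```python
def regular_conv_flops(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """Approximate cost of a regular k x k convolution."""
    return k * k * c_in * c_out * h * w


def separable_conv_flops(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """Approximate cost of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 feature map, 128 channels in and out.
h, w, c_in, c_out = 56, 56, 128, 128
regular = regular_conv_flops(h, w, c_in, c_out)
separable = separable_conv_flops(h, w, c_in, c_out)

print(f"regular:   {regular:,}")                 # regular:   462,422,016
print(f"separable: {separable:,}")               # separable: 54,992,896
print(f"reduction: {regular / separable:.2f}x")  # reduction: 8.41x
```

With 128 output channels the reduction is about 8.4 times, which matches 1 / (1/128 + 1/9).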
2.2 MobileNetV1/V2 core implementation
MobileNetV1
MobileNetV1 uses a width multiplier (width_multiplier) to scale the number of channels in every layer proportionally, so the network can be made wider or narrower as needed to trade accuracy against speed.
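A V1 network with an adjustable width can be sketched as follows. This is a minimal sketch and not necessarily the tutorial's original listing: it assumes the standard 13-block layer layout from the MobileNetV1 paper, and the class names, the `scaled` helper, and the lower bound of 8 channels are illustrative choices.

```python
import torch
import torch.nn as nn


def conv_bn_relu(c_in, c_out, kernel_size, stride, groups=1):
    """Conv -> BatchNorm -> ReLU; groups=c_in turns it into a depthwise conv."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size, stride, kernel_size // 2,
                  groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class DepthwiseSeparableConv(nn.Sequential):
    def __init__(self, c_in, c_out, stride):
        super().__init__(
            conv_bn_relu(c_in, c_in, 3, stride, groups=c_in),  # depthwise 3x3
            conv_bn_relu(c_in, c_out, 1, 1),                    # pointwise 1x1
        )


class MobileNetV1(nn.Module):
    # (output channels, stride) of each depthwise separable block
    CFG = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
           (512, 1), (512, 1), (512, 1), (512, 1), (512, 1),
           (1024, 2), (1024, 1)]

    def __init__(self, num_classes=1000, width_multiplier=1.0):
        super().__init__()

        def scaled(channels):
            # Assumption: never go below 8 channels when shrinking.
            return max(8, int(channels * width_multiplier))

        c_in = scaled(32)
        layers = [conv_bn_relu(3, c_in, 3, 2)]       # stem: regular 3x3 conv
        for channels, stride in self.CFG:
            c_out = scaled(channels)
            layers.append(DepthwiseSeparableConv(c_in, c_out, stride))
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(c_in, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))


# Usage: a half-width model for a hypothetical 10-class task.
model = MobileNetV1(num_classes=10, width_multiplier=0.5).eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)
```

Because the pointwise convolutions dominate the parameter count, halving the width cuts the parameters of the convolutional part to roughly a quarter.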
Improvements in MobileNetV2: Inverted Residual + Linear Bottleneck
MobileNetV2 makes two key improvements to address the shortcomings of V1:
- Inverted residual structure: a 1×1 convolution first expands the channels (so that the depthwise convolution can extract rich features in a high-dimensional space), the depthwise convolution follows, and a final 1×1 convolution compresses the channels again. This "expand → convolve → compress" sequence gives a narrow-wide-narrow shape, the reverse of a classic residual bottleneck.
- Linear bottleneck: no ReLU activation is applied after the compression layer, which avoids destroying information in the low-dimensional space.
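The two ideas above fit into one block. The following is a minimal sketch under stated assumptions: ReLU6 as the activation and an `expand_ratio` argument follow the convention of the MobileNetV2 paper, and the class and attribute names are illustrative rather than taken from this tutorial.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride, expand_ratio):
        super().__init__()
        hidden = c_in * expand_ratio
        # The skip connection is only valid when input and output shapes match.
        self.use_residual = stride == 1 and c_in == c_out

        layers = []
        if expand_ratio != 1:
            # 1x1 expansion: lift features into a higher-dimensional space.
            layers += [nn.Conv2d(c_in, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden),
                       nn.ReLU6(inplace=True)]
        layers += [
            # 3x3 depthwise convolution in the expanded space.
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 projection back down; linear bottleneck, so no activation.
            nn.Conv2d(hidden, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


# Usage: one shape-preserving block and one downsampling block.
x = torch.randn(1, 16, 32, 32)
same = InvertedResidual(16, 16, stride=1, expand_ratio=6).eval()
down = InvertedResidual(16, 24, stride=2, expand_ratio=6).eval()
with torch.no_grad():
    y_same, y_down = same(x), down(x)
print(y_same.shape, y_down.shape)
```

The residual connection joins the two narrow ends of the block, which is why the structure is called "inverted".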
3. Model quantization: reducing numerical precision
Quantization converts the floating-point weights and activations of a model into low-precision integers (such as Int8). Going from Float32 to Int8 shrinks the model to about one quarter of its size and typically speeds up inference by 2 to 3 times; the gain is most pronounced on hardware with native Int8 acceleration.
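The float-to-integer mapping itself is simple arithmetic. Below is a minimal sketch of per-tensor affine quantization to unsigned 8-bit values, written with plain tensor operations; the function names are hypothetical, and real frameworks add refinements (per-channel scales, range calibration) that are left out here.

```python
import torch


def quantize_affine(x: torch.Tensor, num_bits: int = 8):
    """Map a float tensor to unsigned integers with a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0 so that zero maps exactly.
    x_min = min(x.min().item(), 0.0)
    x_max = max(x.max().item(), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale, zero_point


def dequantize_affine(q: torch.Tensor, scale: float, zero_point: int):
    """Recover an approximation of the original floats."""
    return (q.to(torch.float32) - zero_point) * scale


torch.manual_seed(0)
weights = torch.randn(1000)                      # stand-in for Float32 weights
q, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(q, scale, zero_point)

print("bytes per value:", weights.element_size(), "->", q.element_size())
print("max abs error:  ", (weights - restored).abs().max().item())
print("scale:          ", scale)
```

Each value drops from 4 bytes to 1 byte, and the rounding error per value stays within about half of `scale`.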
3.1 PyTorch static quantization (most commonly used for deployment)
The standard static quantization workflow is: train the Float32 model → fuse Conv + BN → calibrate the quantization parameters → convert to an Int8 model.
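The workflow can be sketched with PyTorch's eager-mode quantization API. This is a sketch under several assumptions: the `torch.ao.quantization` eager-mode API is available in your PyTorch version, `TinyNet` is a hypothetical stand-in for a trained model (it is untrained here), and random tensors stand in for real calibration data.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, fuse_modules, get_default_qconfig, prepare,
)


class TinyNet(nn.Module):
    """Hypothetical stand-in for a trained Float32 model."""

    def __init__(self):
        super().__init__()
        self.quant = QuantStub()        # float -> int8 at the input
        self.conv = nn.Conv2d(3, 8, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 4)
        self.dequant = DeQuantStub()    # int8 -> float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        x = torch.flatten(self.pool(x), 1)
        return self.dequant(self.fc(x))


# Pick an Int8 backend that this build of PyTorch supports.
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyNet().eval()                                      # 1. Float32 model
fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)   # 2. fuse Conv+BN+ReLU
model.qconfig = get_default_qconfig(backend)
prepare(model, inplace=True)                                  # insert observers
with torch.no_grad():                                         # 3. calibrate
    for _ in range(8):
        model(torch.randn(4, 3, 32, 32))   # assumption: use real data here
convert(model, inplace=True)                                  # 4. convert to Int8

out = model(torch.randn(2, 3, 32, 32))
print(type(model.fc).__module__, out.shape)
```

In practice the calibration loop should iterate over a few hundred representative samples from the training or validation set, since the observed activation ranges determine the quantization parameters.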
4. Model pruning: remove redundant connections
Pruning significantly reduces model size by removing unimportant weights or channels. PyTorch provides the torch.nn.utils.prune module out of the box.
4.1 Commonly used pruning methods
- Unstructured pruning: removes the individual weights with the smallest absolute values (L1 norm). It makes the weights sparse without changing the network structure, so a sparse computing library is needed to obtain a real speedup.
- Structured pruning: removes entire channels (for example, ranked by L2 norm) and changes the shape of the network, so general-purpose hardware benefits directly.
⚠️ A small amount of fine-tuning is usually required after pruning to restore accuracy. Structured pruning is the better fit for general scenarios without customized hardware.
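Both methods are available in torch.nn.utils.prune. The sketch below applies each to a freshly initialized layer; the layer sizes and pruning amounts are hypothetical. Note that `ln_structured` only zeroes the selected channels through a mask; physically shrinking the layer, as structured pruning ultimately requires, means rebuilding it with fewer channels, which is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
conv_a = nn.Conv2d(8, 16, 3)   # target for unstructured pruning
conv_b = nn.Conv2d(8, 16, 3)   # target for structured pruning

# Unstructured: zero the 30% of weights with the smallest absolute value.
prune.l1_unstructured(conv_a, name="weight", amount=0.3)

# Structured: zero the 25% of output channels (dim=0) with the smallest L2 norm.
prune.ln_structured(conv_b, name="weight", amount=0.25, n=2, dim=0)

sparsity = (conv_a.weight == 0).float().mean().item()
zero_channels = int((conv_b.weight.abs().sum(dim=(1, 2, 3)) == 0).sum())
print(f"unstructured sparsity: {sparsity:.3f}")
print(f"zeroed output channels: {zero_channels} of {conv_b.out_channels}")

# Make the pruning permanent: drop the mask and keep the pruned weights.
prune.remove(conv_a, "weight")
prune.remove(conv_b, "weight")
```

Fine-tuning would normally happen before calling `prune.remove`, while the masks still keep the pruned weights at zero during training.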
5. Deployment recommendations
5.1 Strategy for combining lightweighting techniques
The industry often follows the sequence "architecture design → pruning → quantization → fine-tuning" to compress a model step by step:
- Architecture selection: use MobileNetV2, V3-Large, V3-Small, or a similar network as the lightweight baseline
- Pruning: use structured pruning (such as channel pruning) to remove redundancy
- Quantization: apply static quantization or quantization-aware training to further reduce numerical precision
- Fine-tuning: retrain on a small amount of data to recover the small loss in accuracy
5.2 Common deployment frameworks
Summary
Model lightweighting is a key step in putting deep learning into production: it lets AI move from the cloud onto edge devices such as mobile phones, cameras, and drones. By mastering the MobileNet architecture family, model quantization, network pruning, and related techniques, you can build inference models that are efficient, accurate, and lightweight. I hope this tutorial serves as a reliable reference on your way to lighter models. Keep weighing trade-offs and iterating in practice to find the deployment solution that best fits your business scenario.

