Swin Transformer: A detailed explanation of the shifted window mechanism and hierarchical feature extraction
Swin Transformer is a vision Transformer backbone proposed by Microsoft Research Asia in 2021. By computing attention within local windows and organizing the network hierarchically, it addresses two major weaknesses of the original ViT (Vision Transformer): computational cost that grows quadratically with image size, and the lack of multi-scale features. It has become a widely used backbone for vision tasks such as image classification, object detection, and semantic segmentation.
📂 Stage: Stage 2 - Deep Learning Visual Basics (Visual Transformer Supplement) 🔗 Pre-reading: Vision Transformer (ViT) Explained · Attention Mechanism
1. Core innovation: solving ViT’s two major pain points
The original ViT treats an image as a flat sequence of fixed-size patches. Its global attention makes the computational cost grow with the square of the number of tokens, and its single-resolution structure **lacks a CNN-style multi-scale inductive bias**.
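To make the quadratic growth concrete, the sketch below uses a common estimate for one global multi-head self-attention layer over N tokens of dimension C: roughly 4·N·C² operations for the Q/K/V/output projections plus 2·N²·C for the attention itself (softmax and biases ignored). The function name `msa_cost` and the channel width of 96 are illustrative assumptions, not values taken from this article.

```python
def msa_cost(num_tokens: int, dim: int) -> tuple[int, int]:
    """Rough operation counts for one global self-attention layer.

    Returns (projection_cost, attention_cost):
      projection_cost = 4 * N * C^2   (Q, K, V and output projections)
      attention_cost  = 2 * N^2 * C   (Q @ K^T and the weighted sum over V)
    """
    projection_cost = 4 * num_tokens * dim ** 2
    attention_cost = 2 * num_tokens ** 2 * dim
    return projection_cost, attention_cost


if __name__ == "__main__":
    dim = 96  # illustrative channel width (assumption)
    for side in (14, 28, 56):  # tokens per image side
        proj, attn = msa_cost(side * side, dim)
        print(f"{side}x{side} tokens: projections={proj:,} attention={attn:,}")
```

Doubling the number of tokens per side quadruples N, so the projection term grows 4× while the attention term grows 16×. At high resolutions the attention term dominates.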
1.1 Improvements compared to ViT
2. Detailed explanation of the core mechanisms
2.1 Window Attention (W-MSA)
Global attention is replaced by attention computed within non-overlapping local windows of M×M tokens. Because the window size is fixed, the cost of attention grows linearly with the number of tokens instead of quadratically. A relative position bias is added to the attention scores inside each window to preserve local spatial relationships.
2.2 Shifted Window (SW-MSA)
If only W-MSA is used, no information is exchanged between windows, so the receptive field never extends beyond a single window. Swin creates cross-window connections by cyclically shifting the feature map between consecutive blocks, which displaces the window grid, and uses an attention mask so that tokens that were not adjacent before the shift do not attend to each other.
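The window split can be sketched with NumPy as a reshape and a transpose. This is a minimal illustration, assuming a single feature map of shape (H, W, C) whose sides are divisible by the window size; the helper names `window_partition` and `window_attention_cost` are chosen here for illustration. The cost function uses the estimate 4·hw·C² + 2·M²·hw·C for window attention with window size M.

```python
import numpy as np


def window_partition(x: np.ndarray, window_size: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C),
    with windows ordered row by row.
    """
    height, width, channels = x.shape
    assert height % window_size == 0 and width % window_size == 0
    x = x.reshape(height // window_size, window_size,
                  width // window_size, window_size, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, M, M, C)
    return x.reshape(-1, window_size * window_size, channels)


def window_attention_cost(height: int, width: int, dim: int, window_size: int) -> int:
    """Rough operation count for window attention: 4*hw*C^2 + 2*M^2*hw*C."""
    num_tokens = height * width
    return 4 * num_tokens * dim ** 2 + 2 * window_size ** 2 * num_tokens * dim


if __name__ == "__main__":
    feature_map = np.zeros((56, 56, 96), dtype=np.float32)
    windows = window_partition(feature_map, window_size=7)
    print(windows.shape)  # (64, 49, 96): 8 x 8 windows of 7 x 7 tokens each
```

Attention is then computed independently over the 49 tokens of each window, so the attention matrix per window has a fixed size no matter how large the image is.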
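The shift and the mask can be sketched in NumPy as follows. This is a simplified illustration rather than a full attention layer. It assumes a shift of half the window size, labels the regions that end up sharing a window after the shift, and writes a large negative value (here -100, an illustrative choice) wherever two tokens of a window come from different regions. The names `cyclic_shift` and `shifted_window_mask` are hypothetical helpers.

```python
import numpy as np


def cyclic_shift(x: np.ndarray, shift: int) -> np.ndarray:
    """Roll the first two (spatial) axes up and to the left by `shift`."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))


def shifted_window_mask(height: int, width: int,
                        window_size: int, shift: int) -> np.ndarray:
    """Additive attention mask of shape (num_windows, M*M, M*M).

    Entries are 0 where two tokens belong to the same original region
    and -100 where they were brought together only by the cyclic shift.
    """
    assert 0 < shift < window_size
    region_ids = np.zeros((height, width), dtype=np.int64)
    bands = (slice(0, -window_size),
             slice(-window_size, -shift),
             slice(-shift, None))
    region = 0
    for rows in bands:
        for cols in bands:
            region_ids[rows, cols] = region
            region += 1
    m = window_size
    # Partition the label map into windows: (num_windows, M*M)
    windows = (region_ids.reshape(height // m, m, width // m, m)
               .transpose(0, 2, 1, 3)
               .reshape(-1, m * m))
    diff = windows[:, None, :] - windows[:, :, None]
    return np.where(diff != 0, -100.0, 0.0)


if __name__ == "__main__":
    size, window_size = 8, 4
    shift = window_size // 2
    x = np.arange(size * size).reshape(size, size)
    shifted = cyclic_shift(x, shift)        # apply before window attention
    restored = cyclic_shift(shifted, -shift)  # undo after window attention
    print(np.array_equal(restored, x))
    print(shifted_window_mask(size, size, window_size, shift).shape)
```

The mask is added to the attention scores before the softmax, so the masked pairs receive a weight close to zero. Windows that contain only one region get an all-zero mask and behave like ordinary W-MSA windows.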
2.3 Patch Merging (core of hierarchical structure)
Similar to pooling combined with channel fusion in a CNN, patch merging halves the resolution in each spatial dimension and doubles the number of channels, which builds a multi-scale feature pyramid:
- Take each group of adjacent 2×2 patches
- Concatenate their features along the channel dimension (C → 4C)
- Apply a linear projection to reduce the dimension from 4C to 2C
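The three steps above can be sketched in NumPy as follows. This is a minimal illustration on a single (H, W, C) feature map. The projection matrix is random here, whereas in a trained model it is learned, and any normalization applied before the projection is left out for brevity. The function name `patch_merging` is chosen for illustration.

```python
import numpy as np


def patch_merging(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Merge each 2x2 group of patches and project 4C -> 2C.

    x:      feature map of shape (H, W, C), with H and W even
    weight: projection matrix of shape (4C, 2C)
    returns an array of shape (H/2, W/2, 2C)
    """
    height, width, channels = x.shape
    assert height % 2 == 0 and width % 2 == 0
    assert weight.shape == (4 * channels, 2 * channels)
    top_left = x[0::2, 0::2]
    bottom_left = x[1::2, 0::2]
    top_right = x[0::2, 1::2]
    bottom_right = x[1::2, 1::2]
    # Concatenate the four neighbours along the channel axis: (H/2, W/2, 4C)
    merged = np.concatenate(
        [top_left, bottom_left, top_right, bottom_right], axis=-1)
    # Linear projection down to 2C channels
    return merged @ weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((56, 56, 96))
    weight = rng.standard_normal((4 * 96, 2 * 96))
    print(patch_merging(x, weight).shape)  # (28, 28, 192)
```

Applying this step between stages yields feature maps at successively coarser resolutions with more channels, which is the pyramid that detection and segmentation heads expect.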
3. Getting started quickly with pre-trained models
The two most common ways to load pre-trained Swin models are the timm library (simple and efficient) and Hugging Face Transformers (more versatile).
3.1 Using timm library
3.2 Using Hugging Face
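A comparable sketch with Hugging Face Transformers is shown below. It assumes that `transformers` and `torch` are installed and uses the checkpoint `microsoft/swin-tiny-patch4-window7-224`. The helper names are illustrative, and the `pretrained=False` branch builds a randomly initialized model from a default `SwinConfig`, which is only useful for checking shapes without downloading weights.

```python
import torch
from transformers import AutoImageProcessor, SwinConfig, SwinForImageClassification

CHECKPOINT = "microsoft/swin-tiny-patch4-window7-224"


def load_swin_hf(pretrained: bool = True) -> SwinForImageClassification:
    """Load Swin-Tiny; pretrained=False gives random weights for shape checks."""
    if pretrained:
        model = SwinForImageClassification.from_pretrained(CHECKPOINT)
    else:
        model = SwinForImageClassification(SwinConfig())
    return model.eval()


def preprocess(image) -> torch.Tensor:
    """Turn a PIL image into a (1, 3, H, W) tensor using the checkpoint's processor."""
    processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
    return processor(images=image, return_tensors="pt")["pixel_values"]


def classify(model: SwinForImageClassification, pixel_values: torch.Tensor) -> torch.Tensor:
    """Return the index of the highest-scoring class for each image."""
    with torch.no_grad():
        logits = model(pixel_values=pixel_values).logits
    return logits.argmax(dim=-1)
```

With the pre-trained checkpoint, a predicted index can be mapped to a label name through `model.config.id2label`.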
4. Summary and learning suggestions
Three core ideas of Swin Transformer:
- ✅ Local window attention: makes the computational cost scale linearly with the number of tokens
- ✅ Shifted windows + mask: let information flow across windows
- ✅ Patch Merging: builds a multi-scale feature pyramid
Learning suggestions:
- First run a pre-trained model end to end with timm or Hugging Face
- Focus on understanding the mask generation and cyclic shift of SW-MSA
- Try a downstream task, for example in combination with UperNet/DINOv2