title: Vision Transformer (ViT) Detailed Explanation: A Complete Guide from Theory to Practice | Daoman PythonAI
description: A complete Vision Transformer (ViT) tutorial, with an in-depth analysis of ViT architecture, self-attention mechanism, PyTorch implementation, and comparative analysis with CNN, including code examples and practical application scenarios.
keywords: [Vision Transformer, ViT, computer vision, Transformer, self-attention mechanism, PyTorch, image recognition, deep learning]
Vision Transformer (ViT) Explained: A complete guide from theory to practice
Introduction
If you had asked computer vision researchers in 2019, "Can a pure Transformer replace CNNs for image classification?", most would have answered no. The lack of inductive bias and the computational cost of self-attention over long sequences looked like unavoidable flaws. But in 2020, the Google Brain team overturned this view with the paper "An Image is Worth 16x16 Words": with large-scale pre-training (14M+ annotated images), a pure attention model matched or surpassed the CNN state of the art of the time on image classification, and kicked off the "Attention Era" in the field of CV.
This article starts from scratch and takes apart the design of ViT step by step, writes a ViT-B/16 in PyTorch, and finishes with training suggestions and instructions for using ready-made pre-trained models. There are no complicated mathematical formulas along the way. Basic knowledge of convolutional neural networks and Python is enough to get started.
1. Background and motivation of ViT
1.1 Three core limitations of traditional CNN
CNNs have dominated computer vision for almost ten years, but they have several inherent limitations:
- Local receptive field, hard to model long-range dependencies directly: Convolution only looks at local neighborhoods. To relate a cat's eyes to its tail, you need to stack dozens or even hundreds of layers so that information propagates gradually. This "local-first" inductive bias is robust on small datasets, but it also limits the model's grasp of global context.
- Computational efficiency is constrained by depth: To cover the global information of a 224×224 image, a ResNet needs 50 layers or more. Very deep networks are harder to train and optimize; residual connections mitigate vanishing gradients but do not remove the cost of depth.
- Cost grows quickly with input resolution: A CNN's receptive field is fixed by its depth, kernel sizes, and strides, not by the input size. When the image resolution increases from 224×224 to 448×448, the amount of computation rises about 4 times, while each layer still sees the same number of pixels, so covering the same global context requires an even deeper network.
1.2 Why can the Transformer cross domains?
The Transformer showed in NLP that "treating everything as a sequence and letting the model learn the relationships by itself" is a broadly applicable recipe. Bringing it to CV offers three natural advantages:
- Global receptive field, effective from the first layer: Self-attention lets any two positions interact directly, with no need to pass information along layer by layer. The patch in the upper left corner of the image can talk directly to the patch in the lower right corner.
- Flexible architecture, easy scaling: Want a larger model? Just adjust the number of layers, the embedding dimension, and the number of attention heads. With slight modifications, the same Transformer skeleton can cover different tasks such as classification, detection, and segmentation.
- Parallel-computing friendly: Unlike the sequential processing of an RNN, self-attention is computed for all positions at once, which greatly improves training efficiency.
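To make the "global receptive field" point concrete, here is a minimal sketch of single-head scaled dot-product self-attention. The function name `self_attention`, the random weights, and the sizes (196 tokens, 64 dimensions) are illustrative assumptions, not part of any library. The thing to notice is the N×N attention matrix: every token gets a weight for every other token in a single layer, and all of them are computed with a few matrix multiplications.

```python
import torch
import torch.nn.functional as F


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for x of shape (N, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # (N, N): every token vs. every token
    attn = F.softmax(scores, dim=-1)          # each row sums to 1
    return attn @ v, attn


torch.manual_seed(0)
num_tokens, dim = 196, 64                     # e.g. a 14x14 grid of patches (assumed sizes)
x = torch.randn(num_tokens, dim)
# Hypothetical random projections, scaled so the scores stay in a moderate range.
w_q, w_k, w_v = (torch.randn(dim, dim) / dim ** 0.5 for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
print(attn.shape)   # one weight per (query token, key token) pair
print(out.shape)    # same shape as the input sequence
```

The size of `attn` is also where the cost comes from: it grows quadratically with the number of tokens.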
The idea of ViT is exactly this: **Cut the image into small patches, treat them like words in NLP, and feed them directly into a standard Transformer encoder to see what can be learned.**
2. ViT’s core architecture
2.1 Understand the workflow of ViT with one picture
The entire ViT process has only six steps and is very straightforward:
- Input image: A 224×224 color picture.
- Patch Embedding: Cut the image into 16×16-pixel patches like cutting tofu, which gives a 14×14 grid of 196 patches. Each patch is mapped to a 768-dimensional vector by a convolution layer whose kernel size and stride both equal the patch size.
- Class Token: Insert a learnable vector specifically for global classification at the front of the sequence, just like adding a "summary sentence" at the beginning of an article.
- Position encoding: In order to let the model know the spatial relationship between patches, a learnable vector is added to each position.
- Transformer Encoder: Standard multi-layer Transformer, each block contains multi-head self-attention and MLP.
- Classification head: Take the output of the Class Token, pass it through layer normalization and a linear layer, and output the class scores (logits), which softmax turns into the final classification probabilities.
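The first four steps can be traced with tensor shapes alone. The sketch below is only a shape walk-through under the ViT-B/16 assumptions above (224×224 input, patch size 16, embedding dimension 768); the variable names are illustrative, the parameters are left untrained, and the encoder and classification head (steps 5 and 6) are omitted.

```python
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2             # 14 * 14 = 196

patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # learnable class token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # learnable positions

images = torch.randn(2, 3, image_size, image_size)        # step 1: (B, 3, 224, 224)
patches = patch_embed(images)                             # step 2: (B, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)               #         (B, 196, 768)
cls = cls_token.expand(images.shape[0], -1, -1)           # step 3: one class token per image
tokens = torch.cat([cls, tokens], dim=1)                  #         (B, 197, 768)
tokens = tokens + pos_embed                               # step 4: broadcast over the batch

print(num_patches, tuple(tokens.shape))
```

The encoder never changes this shape: it takes a (B, 197, 768) sequence in and returns one of the same size, and only the first token is passed to the classification head.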
2.2 Implementing ViT-B/16 from scratch using PyTorch
This section targets a ViT-B/16 with the same configuration as the original paper: 12 encoder layers, an embedding dimension of 768, 12 attention heads, and an MLP hidden dimension of 3072. Each part of the implementation corresponds to one of the steps in 2.1.
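A minimal sketch of such a ViT-B/16 follows. The class names (`PatchEmbedding`, `Attention`, `Block`, `VisionTransformer`) are illustrative choices, and dropout and the paper's exact weight initialization are left out for brevity, so treat it as a readable reference under those assumptions and not as a training-ready implementation.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split the image into patches and project each patch to embed_dim."""

    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, D)


class Attention(nn.Module):
    """Multi-head self-attention with a single fused Q/K/V projection."""

    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)    # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, D)   # merge the heads again
        return self.proj(x)


class Block(nn.Module):
    """Pre-LN encoder block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        hidden = int(embed_dim * mlp_ratio)
        self.norm1 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.attn = Attention(embed_dim, num_heads)
        self.norm2 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, embed_dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


class VisionTransformer(nn.Module):
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = PatchEmbedding(patch_size, 3, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.blocks = nn.Sequential(*[Block(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim, eps=1e-6)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                            # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed    # (B, 197, 768)
        x = self.norm(self.blocks(x))
        return self.head(x[:, 0])                          # logits from the class token


model = VisionTransformer()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
num_params = sum(p.numel() for p in model.parameters())
print(tuple(logits.shape), f"{num_params / 1e6:.1f}M parameters")
```

The parameter count lands at roughly 86 million, which is the size usually quoted for ViT-B/16 and a quick sanity check that the layer dimensions are wired as intended.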
2.3 Code Tips
- Patch Embedding is essentially convolution: we directly use a 2D convolution with kernel_size=stride=16, eliminating the trouble of manual slicing and linear projection.
- One-time generation of Q, K, and V for multi-head attention: Compared with writing three linear layers separately, merging them into one nn.Linear(embed_dim, embed_dim*3) and then splitting the result is more efficient.
- Positional encoding is fully learnable: The original ViT paper uses learnable positional encoding instead of the sinusoidal positional encoding common in NLP, allowing the model to fit the best spatial relationship by itself.
- Pre-LN architecture: LayerNorm is placed before attention/MLP, making training more stable.
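The first tip can be checked numerically. The sketch below compares a strided `nn.Conv2d` against the manual route (slice the patches with `F.unfold`, flatten each one, and apply the same weights as a linear projection); the variable names are illustrative. If the tip holds, the two token sequences should agree up to floating-point error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
patch, dim = 16, 768
conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
images = torch.randn(2, 3, 224, 224)

with torch.no_grad():
    # Path A: strided convolution, then flatten the 14x14 grid into a sequence.
    tokens_conv = conv(images).flatten(2).transpose(1, 2)          # (2, 196, 768)

    # Path B: slice the patches by hand, then apply a linear projection.
    patches = F.unfold(images, kernel_size=patch, stride=patch)    # (2, 3*16*16, 196)
    patches = patches.transpose(1, 2)                              # (2, 196, 768)
    weight = conv.weight.reshape(dim, -1)                          # (768, 3*16*16)
    tokens_linear = patches @ weight.T + conv.bias                 # (2, 196, 768)

max_diff = (tokens_conv - tokens_linear).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

For this configuration a flattened patch has 3×16×16 = 768 values, which happens to equal the embedding dimension, so the projection matrix is square; that is a coincidence of ViT-B/16 and not a requirement.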
3. ViT vs CNN: intuitive comparison
One-sentence summary: **ViT needs more data but has a higher ceiling; CNNs are friendlier to small datasets but tend to fall behind in large-scale scenarios.** This is why many of the large vision models in industry today are built on the Transformer.
4. Tips on training and use
4.1 Core training skills
- Pre-training data must be large enough: At least 1 million images are recommended, and the best choices are ImageNet-21K (about 14 million images) or JFT-300M (used in the original paper).
- Data augmentation should be aggressive: Don't rely only on simple random cropping and flipping. Strong augmentation methods such as RandAugment, MixUp, and CutMix are crucial to how well ViT trains.
- **What about small and medium-sized datasets?** Use knowledge distillation. DeiT uses RegNetY-16GF as the teacher model and can train a strong ViT on ImageNet-1K without additional data.
- Learning rate setting: During pre-training, the peak learning rate of ViT-B can be set to about 3e-3, combined with warmup and cosine annealing.
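The learning-rate advice can be sketched as linear warmup followed by cosine annealing, built on `torch.optim.lr_scheduler.LambdaLR`. The helper `warmup_cosine`, the step counts, the stand-in model, and the weight decay value are illustrative assumptions; only the 3e-3 peak comes from the text above.

```python
import math
import torch


def warmup_cosine(step, warmup_steps, total_steps):
    """Multiplier on the peak learning rate: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(768, 1000)     # stand-in for a ViT (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.05)
warmup_steps, total_steps = 10, 100    # tiny numbers so the schedule is easy to inspect
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: warmup_cosine(step, warmup_steps, total_steps))

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()                   # a real loop would compute a loss and call backward() first
    scheduler.step()

print(f"start={lrs[0]:.1e}  peak={max(lrs):.1e}  end={lrs[-1]:.1e}")
```

In a real run the scheduler is usually stepped once per optimizer update, with warmup covering the first few thousand updates.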
4.2 Directly use the pre-trained model
In daily use, we almost never need to train a ViT from scratch. Both the torchvision and timm libraries provide a wealth of pre-trained weights.
If you want more flexible model selection and configuration, the timm library is recommended.
Summary
ViT is not trying to "overthrow CNN", but to introduce a more general and scalable modeling paradigm for computer vision. Many of the top vision models we see today, such as CLIP, SAM, and Mask2Former, use ViT or its variants (Swin, DeiT, etc.) as the backbone network, and DETR brings the same Transformer machinery to object detection on top of a CNN backbone.
If you want to master ViT in depth, here is a suggested path:
- Run through the PyTorch code above and feel the flow of data for yourself.
- Use timm to load a pretrained model and fine-tune it on CIFAR-100 or your own dataset.
- Read the original paper An Image is Worth 16x16 Words for more detailed experiments and analysis.
- Then read the DeiT and Swin Transformer papers to see how researchers address ViT's data hunger and its lack of hierarchical structure.

