Vision Transformer: Detailed explanation from image slicing to Patch Embedding
Introduction
Vision Transformer (ViT) is a landmark model in computer vision. It applies the Transformer architecture, already highly successful in natural language processing, directly to image classification and achieves impressive results. The core idea of ViT is simple but ingenious: cut the image into small patches, feed these patches to the Transformer like words in a sentence, and let the model learn on its own which parts deserve attention.
This tutorial breaks down the core design of ViT step by step in plain language: how to cut images into patches, how patch embedding works, how positional encoding is added, and how multi-head attention is computed. Alongside the theory, it gives PyTorch-based implementation code and shows how to use pre-trained models. Whether you are new to Vision Transformer or want to fill gaps in your understanding, I hope this article helps you get started.
1. The core idea of Vision Transformer
1.1 Treating images as a "language"
Traditional convolutional neural networks (CNNs) expand the receptive field slowly by stacking convolutional layers, gradually aggregating global semantics from local textures. The Vision Transformer, by contrast, sees the entire image from the very first layer: each patch can interact directly with all other patches. This global view is its strength.
The main innovations of ViT can be summarized in three sentences:
- Image patching: Cut the entire image into small fixed-size squares (patches). Once flattened, each patch is like a "token" (word) in NLP.
- Sequence Modeling: Treat these patches as a sequence, send them to the Transformer encoder, and use the self-attention mechanism to find the relationship between them.
- Global connection: Starting from the first layer, each patch can see all other patches, which is naturally suitable for capturing long-distance dependencies.
1.2 ViT development timeline
- 2017: The Transformer architecture was proposed in the paper "Attention Is All You Need".
- 2018: Models such as BERT make Transformer shine in the field of NLP.
- 2020: ViT is released, showing that a pure Transformer, given large-scale pre-training, can match or surpass strong CNNs in image classification.
- 2021 to present: Hierarchical ViTs such as Swin Transformer and PVT have emerged and are gradually becoming popular in detection and segmentation tasks.
2. Detailed explanation of ViT architecture
ViT's workflow can be summarized into four steps:
- Cut the image into patches and map them into fixed-length vectors (Patch Embedding).
- Add a class token specifically for classification at the beginning of the sequence.
- Add positional encoding so the model knows the original position of each patch in the image.
- Send the sequence to the Transformer encoder, and finally use the output of the class token for classification.
Below we break down each step in detail and pair it with code to deepen our understanding.
2.1 Image Patching and Patch Embedding
This is the first step in ViT and the key to converting an image from a "grid structure" to a "sequence".
Assume the input image is 224×224 and the patch size is 16×16. The image is then cut into 14 patches along each of the horizontal and vertical directions, for a total of 14×14 = 196 patches. Each patch contains 16×16×3 = 768 pixel values, which a linear projection layer maps to a new vector space. The output dimension does not have to match the input; ViT-Base uses an embedding dimension of 768, which happens to equal the flattened patch length.
Patch Embedding shape flow:
- Input image: (B, 3, 224, 224)
- Split and flatten: (B, 196, 768)
- Each patch vector length: 768
The following is the code to implement image tiles and projection using PyTorch and einops:
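A minimal sketch of this step, assuming the 224×224 input and 16×16 patches used above. The class name PatchEmbedding and its patchify helper are illustrative names, not from any library. The einops rearrange pattern is shown as a comment, with an equivalent reshape/permute so the sketch depends only on PyTorch.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one linearly."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        assert img_size % patch_size == 0, "image size must be divisible by patch size"
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        patch_dim = patch_size * patch_size * in_chans      # 16 * 16 * 3 = 768
        self.proj = nn.Linear(patch_dim, embed_dim)

    def patchify(self, x):
        # Equivalent einops call:
        #   rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
        B, C, H, W = x.shape
        p = self.patch_size
        x = x.reshape(B, C, H // p, p, W // p, p)   # (B, C, h, p1, w, p2)
        x = x.permute(0, 2, 4, 3, 5, 1)             # (B, h, w, p1, p2, C)
        return x.reshape(B, (H // p) * (W // p), p * p * C)

    def forward(self, x):
        # (B, C, H, W) -> (B, num_patches, patch_dim) -> (B, num_patches, embed_dim)
        return self.proj(self.patchify(x))


x = torch.randn(2, 3, 224, 224)
embed = PatchEmbedding()
patches = embed.patchify(x)   # (2, 196, 768): raw flattened patches
tokens = embed(x)             # (2, 196, 768): projected patch embeddings
```

Many implementations replace the split-and-project pair with a single nn.Conv2d whose kernel size and stride both equal the patch size, which computes the same kind of per-patch linear map.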
In the code, einops' rearrange turns the original (B, C, H, W) tensor into (B, num_patches, patch_dim): a single call handles both the splitting and the flattening, which keeps the code clean.
2.2 Complete implementation of ViT
After mastering Patch Embedding, we can build a complete Vision Transformer model. Here is a clean and readable PyTorch implementation:
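A compact sketch of such a model, under a few assumptions: the encoder blocks reuse PyTorch's built-in nn.TransformerEncoderLayer (pre-norm, GELU) rather than hand-written blocks, the patch embedding uses a strided convolution, and the class name VisionTransformer and its default hyperparameters (ViT-Base-like) are illustrative choices.

```python
import torch
import torch.nn as nn


class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000,
                 dim=768, depth=12, heads=12, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch size acts as
        # "cut into patches, flatten, apply one shared linear layer".
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.dropout = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=int(dim * mlp_ratio),
                dropout=dropout, activation="gelu", batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)                  # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        cls = self.cls_token.expand(B, -1, -1)   # one class token per sample
        x = torch.cat([cls, x], dim=1)           # (B, num_patches + 1, dim)
        x = self.dropout(x + self.pos_embed)     # inject position information
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        return self.head(x[:, 0])                # classify from the class token


# Small configuration so the example runs quickly on CPU.
model = VisionTransformer(img_size=32, patch_size=8, num_classes=10, dim=64, depth=2, heads=4)
logits = model(torch.randn(2, 3, 32, 32))        # shape: (2, 10)
```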
This code basically replicates the structure of ViT. It is recommended that beginners read it line by line to understand the changes in the tensor shape at each step.
2.3 Detailed explanation of multi-head self-attention mechanism
Self-attention is the core of Transformer and the key to ViT’s ability to “see the whole picture”. Here I give an implementation that is closer to the original formula to help everyone understand the internal calculation process:
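A minimal sketch that follows softmax(QKᵀ / √d_k)·V step by step. The class name MultiHeadSelfAttention is illustrative, and returning the attention weights alongside the output is a convenience added here for inspection.

```python
import math
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by the number of heads"
        self.heads = heads
        self.head_dim = dim // heads
        self.qkv = nn.Linear(dim, dim * 3)   # one projection producing Q, K and V
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N, head_dim)
        # Similarity of every token with every other token, scaled by sqrt(d_k).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)   # (B, heads, N, N)
        attn = scores.softmax(dim=-1)                 # each row sums to 1
        ctx = attn @ v                                # weighted sum of values
        ctx = ctx.transpose(1, 2).reshape(B, N, D)    # merge the heads back
        return self.out(ctx), attn


mhsa = MultiHeadSelfAttention(dim=64, heads=4)
out, attn = mhsa(torch.randn(2, 197, 64))   # out: (2, 197, 64), attn: (2, 4, 197, 197)
```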
You can see the idea directly in this code: the attention mechanism lets each position in the sequence aggregate information from all positions, weighted by its similarity to each of them.
3. Positional encoding and Class Token
3.1 Position encoding: Let the model know "where you are"
The Transformer itself has no sense of input order, so position information must be injected into each patch. ViT uses learnable positional encoding: a set of parameters is initialized directly and adjusted by the model during training.
Commonly used position encoding types are:
- Learnable: ViT's default approach, simple and direct.
- Sinusoidal: Requires no additional parameters, though the original ViT uses learned embeddings instead.
- Two-dimensional position encoding (2D PE): retains the horizontal and vertical coordinate information of the patch, which is more suitable for images.
- Rotary Position Encoding (Rotary PE): Works well with large models and long sequences.
In ViT, the positional encoding is added directly to the patch embeddings and has shape (num_patches + 1, dim). The +1 is needed because one slot is reserved for the class token.
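A small sketch of the learnable variant, assuming 196 patches and a 768-dimensional embedding. The leading dimension of 1 and the truncated-normal initialization with std 0.02 are common implementation choices assumed here, not something the text above prescribes.

```python
import torch
import torch.nn as nn

num_patches, dim = 196, 768

# One learnable vector per sequence slot: 196 patches + 1 class token.
# The leading 1 lets the same table broadcast over the batch dimension.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
nn.init.trunc_normal_(pos_embed, std=0.02)   # assumed initialization

tokens = torch.randn(4, num_patches + 1, dim)   # (B, 197, 768): class token + patches
tokens_with_pos = tokens + pos_embed            # same shape, position info added
```

Because pos_embed is an nn.Parameter, it receives gradients and is updated together with the rest of the model.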
3.2 Class Token: a special role that “oversees the overall situation”
ViT places a learnable class token at the beginning of the sequence. This token does not come from any image patch, but as it passes through the Transformer layers it gradually aggregates information from all patches and becomes the "spokesperson" for the entire image. In the end, we only need to take the output vector of the class token and send it to the classification head to complete the classification.
The advantages of this design are:
- Centrally aggregate image-level semantic information.
- Avoids the need to design additional global pooling layers.
- Shares its origin with the [CLS] token in BERT, so it is easy to understand and to transfer.
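The class token mechanics can be sketched as follows. The encoder is replaced by a placeholder so only the prepend-and-read-out logic is visible; the variable names and the 1000-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, num_patches, dim, num_classes = 4, 196, 768, 1000

cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # a single learnable vector
patch_tokens = torch.randn(B, num_patches, dim)    # output of the patch embedding

# Prepend a copy of the class token to every sequence in the batch.
x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)   # (B, 197, dim)

encoded = x   # placeholder: a real model would run the Transformer encoder here

cls_out = encoded[:, 0]                # (B, dim): read out position 0 only
head = nn.Linear(dim, num_classes)     # classification head
logits = head(cls_out)                 # (B, num_classes)
```

The pooling-based alternative would instead average the patch outputs, for example encoded[:, 1:].mean(dim=1), and feed that to the head.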
4. Variations and improvements of ViT
4.1 DeiT: Data-efficient version of ViT
DeiT mainly addresses ViT's reliance on pre-training with massive datasets. It introduces a teacher model (usually a convolutional network) that guides the ViT through knowledge distillation, so that a good model can be trained on ImageNet-1k alone (roughly 1.3 million images). It also uses stronger data augmentation and regularization.
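One way to picture the distillation objective is the hard-label variant sketched below. It assumes the student exposes two sets of logits, one from a class token and one from a separate distillation token; that two-output setup and the function name hard_distillation_loss are assumptions of this sketch, not details given in the text above.

```python
import torch
import torch.nn.functional as F


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Average of two cross-entropies: ground truth and teacher's hard decision."""
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's predicted class
    loss_cls = F.cross_entropy(cls_logits, targets)           # class head vs. true labels
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation head vs. teacher
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy tensors standing in for real model outputs (batch of 8, 10 classes).
torch.manual_seed(0)
cls_logits = torch.randn(8, 10)
dist_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets)
```

In a real training loop the teacher would run under torch.no_grad() so that only the student receives gradients.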
4.2 More efficient ViT variants
The self-attention cost of a pure ViT grows with the square of the number of patches, which makes high-resolution images expensive. Subsequent research proposed many lightweight or efficient versions, such as:
- MobileViT: Incorporates the local advantages of convolution and is suitable for mobile devices.
- PVT (Pyramid Vision Transformer): uses gradually shrinking feature maps to form a pyramid structure similar to CNN.
- Swin Transformer: Computes self-attention within local windows and achieves cross-window interaction by shifting the windows between layers, significantly reducing the amount of computation.
- Twins: Combines locally grouped attention with global sub-sampled attention, covering both local and global context.
These variants allow ViT not only to perform well in classification, but also to be efficiently used for downstream tasks such as detection and segmentation.
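The quadratic growth mentioned above is easy to check with a little arithmetic. The sketch below counts attention-score entries per head per layer for global attention and, for comparison, for attention restricted to non-overlapping windows. The 16×16 patch size and 7×7 window are illustrative assumptions, and the helper names are made up for this example.

```python
def num_tokens(img_size, patch_size=16):
    """Number of patches for a square image."""
    return (img_size // patch_size) ** 2


def global_attention_entries(n):
    """Every token attends to every token: n * n scores."""
    return n * n


def window_attention_entries(n, window=7):
    """Each token attends only to the window * window tokens in its own window."""
    return n * window * window


for size in (224, 448, 896):
    n = num_tokens(size)
    print(size, n, global_attention_entries(n), window_attention_entries(n))
# 224 196 38416 9604
# 448 784 614656 38416
# 896 3136 9834496 153664
```

Doubling the image side quadruples the number of patches, so the global attention matrix grows 16-fold while the windowed count grows only 4-fold.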
5. Use pre-trained models
As developers, we rarely need to train ViT from scratch. Both PyTorch (through torchvision) and Hugging Face provide pre-trained models out of the box.
5.1 Using PyTorch official model
5.2 Using Hugging Face Transformers
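A minimal sketch with the transformers library, assuming the public google/vit-base-patch16-224 checkpoint. The helper names load_vit and predict are illustrative. Passing pretrained=False builds a randomly initialized model from a ViTConfig, which is handy for offline experiments.

```python
import torch
from transformers import ViTConfig, ViTForImageClassification

CHECKPOINT = "google/vit-base-patch16-224"   # ViT-B/16 checkpoint on the Hugging Face Hub


def load_vit(pretrained: bool = True, **config_overrides):
    """Load the pretrained checkpoint, or build a random model from a config.

    config_overrides are only used when pretrained is False.
    """
    if pretrained:
        model = ViTForImageClassification.from_pretrained(CHECKPOINT)
    else:
        model = ViTForImageClassification(ViTConfig(**config_overrides))
    return model.eval()


@torch.no_grad()
def predict(model, pixel_values):
    """pixel_values: (B, 3, H, W) preprocessed images -> predicted class indices."""
    logits = model(pixel_values=pixel_values).logits
    return logits.argmax(dim=-1)


# Typical usage with a PIL image `img` (not run here because it downloads weights):
#   from transformers import ViTImageProcessor
#   processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
#   inputs = processor(images=img, return_tensors="pt")
#   model = load_vit()
#   idx = predict(model, inputs["pixel_values"])
#   label = model.config.id2label[idx.item()]
```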
The Hugging Face interface is simple and quick to pick up, which makes it well suited to quickly validating ideas or building prototypes.
6. Comparison between ViT and CNN
ViT and CNN are not an either-or relationship. They each have their own strengths. Let’s compare them from several key dimensions:
Selection suggestions:
- Small data sets, real-time applications, and mobile deployment are still the areas of strength of CNN.
- Where data is plentiful, higher accuracy is the goal, or multi-modal fusion is required, ViT tends to have the edge.
7. Practical skills and tuning
To train or fine-tune a ViT, you can usually use the following techniques:
- Optimizer: Prioritize using AdamW, combined with weight decay.
- Learning rate scheduling: warmup + cosine annealing to make training more stable.
- Data augmentation: RandAugment, Mixup, CutMix, etc. can effectively improve generalization.
- Regularization: Dropout, Stochastic Depth, label smoothing.
- Knowledge Distillation: Use a large model or a CNN teacher to guide a small ViT; this often brings a clear improvement.
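The optimizer and schedule items above can be sketched as follows. A linear layer stands in for the ViT, and every hyperparameter value (learning rate, weight decay, warmup length, label smoothing) is an illustrative assumption rather than a recommended setting. The helper warmup_cosine is a name made up for this example.

```python
import math
import torch
import torch.nn as nn


def warmup_cosine(step, warmup_steps, total_steps):
    """LR multiplier: linear warmup to 1.0, then cosine decay towards 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = nn.Linear(768, 10)   # stand-in for a ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda s: warmup_cosine(s, warmup_steps=10, total_steps=100))
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing as regularization

lrs = []
for step in range(100):
    x, y = torch.randn(8, 768), torch.randint(0, 10, (8,))   # random toy batch
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])   # LR used for this step
    scheduler.step()                              # advance the schedule per step
```

Plotting lrs shows the learning rate rising linearly for the first 10 steps and then following a cosine curve down to nearly zero.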
If you want to deploy to a production environment:
- Model Quantization: INT8 quantization can significantly reduce model size and speed up inference.
- Mixed Precision: Saves GPU memory and increases speed.
- Sparse Attention: Reduces the amount of computation by limiting the attention span.
- Distilled small model: Can reach performance close to that of large models, even on mobile devices.
8. Summary
Vision Transformer showed that image classification can be done as well as, or even better than, CNNs without any convolutions. Its core innovations can be summarized in three points:
- Image patches: Turn images into sequences, bridging visual and language modeling.
- Global self-attention: Each patch can directly model global dependencies and capture long-distance features.
- Scalability: Models can be made deeper and wider like stacking Lego bricks, further improving performance when large amounts of data are available.
Whether you are engaged in computer vision research or want to implement cutting-edge technology into products, Vision Transformer is a technology worthy of careful understanding.
💡 Important reminder: ViT helped open an era of unified modeling across vision and language, and is closely tied to the rise of phenomenal multi-modal models such as CLIP and DALL·E.