title: Vision Transformer (ViT) Detailed Explanation: A Complete Guide from Theory to Practice | Daoman PythonAI
description: A complete Vision Transformer (ViT) tutorial, with an in-depth analysis of ViT architecture, self-attention mechanism, PyTorch implementation, and comparative analysis with CNN, including code examples and practical application scenarios.
keywords: [Vision Transformer, ViT, computer vision, Transformer, self-attention mechanism, PyTorch, image recognition, deep learning]

Vision Transformer (ViT) Explained: A complete guide from theory to practice

Introduction

If you had asked computer vision researchers in 2019 whether a pure Transformer could replace CNNs for image classification, most would have said no. The lack of inductive bias and the cost of self-attention over long sequences looked like fundamental flaws. In 2020, a Google Research (Brain Team) paper, "An Image is Worth 16x16 Words", overturned that view: given large-scale pre-training (roughly 14M to 300M images), a pure attention model matched or surpassed the CNN state of the art of the time on image classification, and kicked off the "Attention Era" in computer vision.

This article starts from scratch: it walks through the design ideas behind ViT step by step, implements ViT-B/16 in PyTorch, and closes with training suggestions and a guide to using ready-made pre-trained models. There are no complicated formulas along the way. Basic knowledge of convolutional neural networks and Python is enough to follow along.

1. Background and motivation of ViT

1.1 Three core limitations of traditional CNN

CNNs dominated computer vision for almost a decade, but they come with several built-in limitations:

Local receptive field: A convolution only sees a local neighborhood, so long-range dependencies are hard to model directly. To relate a cat's eyes to its tail, information has to pass slowly through dozens or even hundreds of stacked layers. This "local-first" inductive bias is what keeps CNNs stable on small datasets, but it also limits how well the model captures global context.
Efficiency tied to depth: To cover the global content of a 224×224 image, a ResNet needs 50 layers or more. Gradients tend to attenuate through very deep stacks during backpropagation (residual connections ease this but do not remove it), which makes training and optimization harder.
Cost grows quickly with input resolution: A CNN's receptive field is fixed in pixels and grows only with the number of layers and their strides. Raising the resolution from 224×224 to 448×448 multiplies the convolution cost by about 4, while each unit sees a smaller fraction of the image, so even more layers are needed for global coverage.
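
To put a rough number on the receptive-field point, here is a small sketch of the standard receptive-field recurrence for a stack of convolutions. The layer configurations are illustrative assumptions, not the layout of any particular published network.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def layers_to_cover(size, kernel=3, stride=1):
    """Smallest number of identical layers whose receptive field spans `size` pixels."""
    n = 0
    while receptive_field([(kernel, stride)] * n) < size:
        n += 1
    return n


print(receptive_field([(3, 1)] * 10))  # ten 3x3 stride-1 convs -> 21
print(layers_to_cover(224))            # -> 112
print(layers_to_cover(448))            # -> 224
```

Plain 3×3 convolutions add only two pixels of context per layer. Real CNNs use strides and pooling to grow the receptive field faster, but the dependence on depth remains.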

1.2 Why can Transformer cross borders?

In NLP, the Transformer showed that "treat everything as a sequence and let the model learn the relationships itself" is a remarkably general recipe. Carried over to CV, it brings three natural advantages:

Global receptive field from the first layer: Self-attention lets any two positions interact directly instead of passing information along layer by layer. The patch in the upper-left corner of the image can attend directly to the patch in the lower-right corner.
Flexible architecture that scales easily: Want a larger model? Adjust the number of layers, the embedding dimension, and the number of attention heads. With slight modifications, the same Transformer skeleton covers classification, detection, and segmentation.
Parallel-friendly computation: Unlike the sequential processing of an RNN, self-attention over all positions is computed at once, which greatly improves training efficiency.
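
The "global from the first layer" claim can be checked with a toy experiment. The sketch below perturbs only the bottom-right position of a small random input, then measures how much the top-left output changes after one 3×3 convolution and after one self-attention layer. The tensor sizes and the use of `nn.MultiheadAttention` are assumptions chosen to keep the demo small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy 8x8 "image" with 4 channels (sizes are illustrative assumptions).
x = torch.randn(1, 4, 8, 8)
x_perturbed = x.clone()
x_perturbed[:, :, -1, -1] += 1.0  # change only the bottom-right position

conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)
attn = nn.MultiheadAttention(embed_dim=4, num_heads=1, batch_first=True)


def to_tokens(img):
    # (B, C, H, W) -> (B, H*W, C): one token per spatial position
    return img.flatten(2).transpose(1, 2)


with torch.no_grad():
    # Top-left output of the convolution, before and after the perturbation
    conv_delta = (conv(x) - conv(x_perturbed))[0, :, 0, 0].abs().max().item()

    # Output for the first (top-left) token of the attention layer
    t, tp = to_tokens(x), to_tokens(x_perturbed)
    out, _ = attn(t, t, t)
    out_p, _ = attn(tp, tp, tp)
    attn_delta = (out - out_p)[0, 0].abs().max().item()

print(f"top-left change after one 3x3 conv:        {conv_delta:.6f}")
print(f"top-left change after one attention layer: {attn_delta:.6f}")
```

The convolution difference should come out as zero, because the top-left output never sees the far corner, while the attention difference is non-zero after a single layer.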

The idea of ViT is exactly this: **cut the image into small patches, treat them like words in NLP, and feed them straight into a standard Transformer encoder to see what it can learn.**

2. ViT’s core architecture

2.1 Understand the workflow of ViT with one picture

The entire ViT process has only six steps and is very straightforward:

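```mermaid
graph LR
A[Input image<br/>224x224x3] --> B[Patch Embedding<br/>split into 16x16 patches<br/>project each to 768 dims]
B --> C[Prepend Class Token<br/>for global classification]
C --> D[Add learnable position embeddings<br/>to retain spatial position]
D --> E[Standard Transformer encoder<br/>12 layers × 12 heads]
E --> F[Take Class Token output<br/>pass through MLP Head to classify]
```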

Input image: A 224×224 color image.
Patch Embedding: Cut the image into patches of 16×16 pixels, like slicing tofu into cubes, which gives a 14×14 grid of 196 patches. Each patch is mapped to a 768-dimensional vector by a convolution layer whose kernel size and stride both equal the patch size.
Class Token: Prepend a learnable vector dedicated to global classification to the sequence, like adding a "summary sentence" at the start of an article.
Position encoding: Add a learnable vector to each position so the model knows the spatial relationship between patches.
Transformer Encoder: A standard multi-layer Transformer. Each block contains multi-head self-attention and an MLP.
Classification head: Take the Class Token output, pass it through a LayerNorm and a linear layer, and output the class logits (apply softmax to get probabilities).
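
The shapes in the steps above follow from simple arithmetic. The sketch below (the helper name `vit_sequence_stats` is made up for illustration) computes the patch grid, the sequence length including the Class Token, and the size of one attention matrix at two resolutions.

```python
def vit_sequence_stats(img_size, patch_size=16):
    """Return (grid side, number of patches, sequence length, attention entries per head)."""
    grid = img_size // patch_size
    n_patches = grid * grid
    seq_len = n_patches + 1          # +1 for the Class Token
    attn_entries = seq_len * seq_len  # one seq_len x seq_len matrix per head
    return grid, n_patches, seq_len, attn_entries


for size in (224, 448):
    grid, n_patches, seq_len, attn_entries = vit_sequence_stats(size)
    print(f"{size}x{size}: {grid}x{grid} grid, {n_patches} patches, "
          f"sequence length {seq_len}, {attn_entries} attention entries per head")
```

Doubling the resolution quadruples the number of patches (196 to 784) and grows each attention matrix by roughly 16 times (38,809 to 616,225 entries). This is the long-sequence self-attention cost mentioned in the introduction.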

2.2 Implementing ViT-B/16 from scratch using PyTorch

Below we write the code directly, targeting a ViT-B/16 with the same configuration as the original paper. Detailed comments are added to each part, and the code can be run directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ----------------------------
# 1. Patch Embedding module
# ----------------------------
class PatchEmbedding(nn.Module):
    """
    Split the image into patches and linearly project each one to embed_dim.
    Equivalent to a 2D convolution with kernel_size = stride = patch_size.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        # Number of patches, e.g. 224/16 = 14 and 14×14 = 196
        self.n_patches = (img_size // patch_size) ** 2
        
        # A convolution replaces manual slicing: concise and efficient
        self.conv_proj = nn.Conv2d(
            in_channels, embed_dim, 
            kernel_size=patch_size, stride=patch_size
        )
        
    def forward(self, x):
        # x: (batch_size, 3, 224, 224)
        x = self.conv_proj(x)   # -> (batch_size, 768, 14, 14)
        x = x.flatten(2)        # -> (batch_size, 768, 196)
        x = x.transpose(1, 2)   # -> (batch_size, 196, 768)
        return x

# ----------------------------
# 2. Multi-head self-attention (MHSA)
# ----------------------------
class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention that produces Q, K, and V in one pass."""
    def __init__(self, embed_dim=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        assert self.head_dim * n_heads == embed_dim, "embed_dim must be divisible by n_heads"
        
        # Three linear projections fused into one for efficiency
        self.qkv_proj = nn.Linear(embed_dim, embed_dim * 3)
        self.attn_dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()
        
        # 1. Produce Q, K, V and split them into heads
        qkv = self.qkv_proj(x)  # (batch, seq_len, 3*embed_dim)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, n_heads, seq_len, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        
        # 2. Scaled dot-product attention
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn_probs = F.softmax(attn_scores, dim=-1)
        attn_probs = self.attn_dropout(attn_probs)
        
        # 3. Weighted sum, then merge the heads back to embed_dim
        context = torch.matmul(attn_probs, v)  # (batch, n_heads, seq_len, head_dim)
        context = context.transpose(1, 2).reshape(batch_size, seq_len, embed_dim)
        return self.out_proj(context)

# ----------------------------
# 3. MLP block
# ----------------------------
class MLPBlock(nn.Module):
    """The ViT MLP: two fully connected layers, 4x wider hidden layer, GELU activation."""
    def __init__(self, embed_dim=768, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, mlp_dim)
        self.gelu = nn.GELU()
        self.fc2 = nn.Linear(mlp_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.gelu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.dropout(x)
        return x

# ----------------------------
# 4. Transformer encoder layer (Pre-LN)
# ----------------------------
class TransformerEncoderLayer(nn.Module):
    """Pre-LN block: LayerNorm first, then attention or MLP, then add the input as a residual."""
    def __init__(self, embed_dim=768, n_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.mhsa = MultiHeadSelfAttention(embed_dim, n_heads, dropout)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = MLPBlock(embed_dim, mlp_dim, dropout)
        
    def forward(self, x):
        x = x + self.mhsa(self.ln1(x))   # self-attention + residual
        x = x + self.mlp(self.ln2(x))    # feed-forward network + residual
        return x

# ----------------------------
# 5. The complete Vision Transformer
# ----------------------------
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, num_classes=1000,
                 embed_dim=768, depth=12, n_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        
        # Patch embedding
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        n_patches = self.patch_embed.n_patches
        
        # Learnable Class Token (the "CLS" marker of the sequence)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        
        # Learnable position embeddings (no fixed sinusoidal encoding; the model learns them)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))
        self.pos_dropout = nn.Dropout(dropout)
        
        # Stack of Transformer encoder layers
        self.encoder = nn.Sequential(*[
            TransformerEncoderLayer(embed_dim, n_heads, mlp_dim, dropout)
            for _ in range(depth)
        ])
        
        # Classification head
        self.ln_head = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)
        
        # Weight initialization
        self._init_weights()
        
    def _init_weights(self):
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.trunc_normal_(m.weight, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.LayerNorm):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
        
    def forward(self, x):
        batch_size = x.size(0)
        
        # Step 1: patch embedding -> (batch, 196, 768)
        x = self.patch_embed(x)
        
        # Step 2: prepend the Class Token -> (batch, 197, 768)
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)
        
        # Step 3: add position embeddings, then dropout
        x = x + self.pos_embed
        x = self.pos_dropout(x)
        
        # Step 4: run the encoder
        x = self.encoder(x)
        
        # Step 5: classify from the Class Token output
        x = self.ln_head(x[:, 0])   # keep only the first position
        x = self.head(x)
        
        return x

# ----------------------------
# 6. Test the model
# ----------------------------
if __name__ == "__main__":
    # Instantiate ViT-B/16
    vit_base = VisionTransformer(
        img_size=224, patch_size=16,
        embed_dim=768, depth=12, n_heads=12, mlp_dim=3072,
        num_classes=1000
    )
    
    # Count the parameters
    total_params = sum(p.numel() for p in vit_base.parameters())
    print(f"ViT-B/16 total parameters: {total_params / 1e6:.1f}M")  # about 86.6M
    
    # Forward pass test
    dummy_img = torch.randn(1, 3, 224, 224)
    output = vit_base(dummy_img)
    print(f"Input shape: {dummy_img.shape}")
    print(f"Output shape: {output.shape}")  # should be (1, 1000)
```
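
As a cross-check on the parameter count printed by the test code, the same total can be derived by hand from the layer shapes. The sketch below is plain arithmetic that mirrors the modules defined above. The helper names `linear_params` and `vit_param_count` are made up for illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def vit_param_count(img_size=224, patch_size=16, in_channels=3, num_classes=1000,
                    embed_dim=768, depth=12, mlp_dim=3072):
    n_patches = (img_size // patch_size) ** 2
    # The patch convolution has the same parameter count as a linear layer on flattened patches
    patch_embed = linear_params(in_channels * patch_size ** 2, embed_dim)
    cls_token = embed_dim
    pos_embed = (n_patches + 1) * embed_dim
    layer_norm = 2 * embed_dim  # scale and shift
    attention = linear_params(embed_dim, 3 * embed_dim) + linear_params(embed_dim, embed_dim)
    mlp = linear_params(embed_dim, mlp_dim) + linear_params(mlp_dim, embed_dim)
    per_layer = 2 * layer_norm + attention + mlp
    head = layer_norm + linear_params(embed_dim, num_classes)
    return patch_embed + cls_token + pos_embed + depth * per_layer + head


total = vit_param_count()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")  # 86,567,656 parameters (86.6M)
```

Almost all of the parameters sit in the 12 encoder layers (about 7.1M each). The patch embedding, position embeddings, and classification head together account for well under 2M.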

2.3 Code Tips

Patch Embedding is essentially a convolution: We use a 2D convolution with `kernel_size=stride=16`, which avoids manual slicing followed by a separate linear projection.
Q, K, and V are generated in one pass: Instead of three separate linear layers, multi-head attention uses a single `nn.Linear(embed_dim, embed_dim*3)` and splits the result, which is more efficient.
Positional encoding is fully learnable: The original ViT paper uses learnable position embeddings rather than the sinusoidal encoding common in NLP, so the model fits the spatial relationships by itself.
Pre-LN architecture: LayerNorm is placed before the attention and MLP sub-layers, which makes training more stable.
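
The first tip can be verified numerically. The sketch below compares the strided convolution with the "manual" route of cutting patches using `F.unfold`, flattening them, and applying the same weights as a linear map. The small `embed_dim` and the 32×32 input are assumptions chosen to keep the check fast.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
patch_size, in_channels, embed_dim = 16, 3, 8  # small embed_dim keeps the demo fast

conv = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
x = torch.randn(2, in_channels, 32, 32)  # 32/16 = 2, so 2x2 = 4 patches per image

with torch.no_grad():
    # Route 1: strided convolution, as in the PatchEmbedding class
    via_conv = conv(x).flatten(2).transpose(1, 2)                      # (2, 4, 8)

    # Route 2: cut patches by hand, flatten each, reuse the conv weights as a linear map
    patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (2, 768, 4)
    patches = patches.transpose(1, 2)                                  # (2, 4, 768)
    weight = conv.weight.reshape(embed_dim, -1)                        # (8, 768)
    via_linear = patches @ weight.T + conv.bias                        # (2, 4, 8)

same = torch.allclose(via_conv, via_linear, atol=1e-4)
print(via_conv.shape, same)
```

Both routes give the same tokens up to floating-point error, so the convolution is simply a compact way to write "slice, flatten, project".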

3. ViT vs CNN: intuitive comparison

| Dimension | ViT-B/16 | ResNet-50 |
| --- | --- | --- |
| Total parameters | ~86M | ~25M |
| Global receptive field | Available from layer 1 | Requires many stacked layers (more than 30) |
| Inductive bias | Almost none | Strong (locality, translation equivariance) |
| Performance on small datasets (such as CIFAR-10) | Overfits easily; needs strong augmentation or knowledge distillation | Stable |
| Large-scale pre-training (such as ImageNet-21K) | Surpasses comparable ResNets | Gains level off |
| Interpretability | Attention weights can highlight regions of interest | Feature maps, Grad-CAM |

One-sentence summary: **ViT needs more data but has a higher ceiling; CNNs are friendlier to small datasets but have less headroom at large scale.** This is a large part of why most large models in industry today are built on the Transformer.

4. Tips on training and use

4.1 Core training skills

Pre-training data must be large enough: At least 1 million images is recommended. The best options are ImageNet-21K (about 14 million images) or JFT-300M (used in the original paper, but internal to Google).
Use aggressive data augmentation: Don't stop at simple random cropping and flipping. Strong augmentation such as RandAugment, MixUp, and CutMix is crucial to ViT's performance.
**What about small and medium-sized datasets?** Use knowledge distillation. DeiT uses RegNetY-16GF as the teacher model and trains a strong ViT on ImageNet-1K without additional data.
Learning rate: For pre-training, a ViT-B learning rate of 3e-3 with warmup and cosine annealing is a reasonable starting point. The original paper paired this value with a very large batch size (4096), so smaller batches usually call for a smaller learning rate.
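
The "warmup plus cosine annealing" schedule from the last tip can be written in a few lines. This is a minimal sketch: the function name `lr_at_step`, the 10,000 warmup steps, and the 100,000 total steps are assumptions for illustration, not values this article prescribes.

```python
import math


def lr_at_step(step, total_steps, base_lr=3e-3, warmup_steps=10_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 100_000  # hypothetical training length
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(f"step {step:>7}: lr = {lr_at_step(step, total_steps):.6f}")
```

In a PyTorch training loop, the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio `lr_at_step(step, total_steps) / base_lr`.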

4.2 Directly use the pre-trained model

In daily use, we almost never need to train a ViT from scratch. Both `torchvision` and `timm` provide a wealth of pre-trained weights.

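```python
import torchvision.models as models
import torch.nn as nn

# Load ViT-B/16 pre-trained on ImageNet-1K
vit_b = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
print(vit_b)  # inspect the architecture

# For your own classification task (say 10 classes), replace only the final linear layer
in_features = vit_b.heads.head.in_features  # 768 for ViT-B/16
vit_b.heads.head = nn.Linear(in_features, 10)
```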

If you want more flexible model selection and configuration, the `timm` library is recommended:

```python
import timm

# timm ships a large collection of ViT variants
model = timm.create_model('vit_base_patch16_224', pretrained=True)
print(model)
```

Summary

ViT is not trying to "overthrow CNN". It introduces a more general and scalable modeling paradigm for computer vision. Many of the top vision models we see today, such as DETR, Mask2Former, CLIP, and SAM, build on Transformers, and several use ViT or one of its variants (Swin, DeiT, etc.) as the backbone.

If you want to master ViT in depth, it is recommended to follow the following path:

Run through the PyTorch code above and feel the flow of data for yourself.
Use `timm` to load a pre-trained model and fine-tune it on CIFAR-100 or your own dataset.
Read the original paper, "An Image is Worth 16x16 Words", for more detailed experiments and analysis.
Then read the DeiT and Swin Transformer papers to see how researchers address ViT's data hunger and its lack of hierarchical structure.

Try a "DeiT-style" fine-tuning run on CIFAR-100: load the `deit_base_patch16_224` pre-trained weights from `timm`, replace the classification head, and train with RandAugment and MixUp. Even on a small dataset with 100 categories, you can reach good accuracy.
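
For the MixUp part of that recipe, here is a minimal sketch of the idea: blend pairs of images and their one-hot labels with a weight drawn from a Beta distribution, then train against the soft labels. The helper names and `alpha=0.8` are assumptions for illustration, and the random `logits` tensor stands in for a real model's output.

```python
import torch
import torch.nn.functional as F


def mixup_batch(images, labels, num_classes, alpha=0.8):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_labels


def soft_target_cross_entropy(logits, soft_targets):
    """Cross-entropy against a probability distribution instead of a class index."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


torch.manual_seed(0)
images = torch.randn(8, 3, 224, 224)   # stand-in for a batch of resized CIFAR-100 images
labels = torch.randint(0, 100, (8,))
mixed_images, mixed_labels = mixup_batch(images, labels, num_classes=100)

logits = torch.randn(8, 100)           # stand-in for model(mixed_images)
loss = soft_target_cross_entropy(logits, mixed_labels)
print(mixed_images.shape, mixed_labels.sum(dim=-1), loss.item())
```

Each row of `mixed_labels` still sums to 1, so it remains a valid target distribution. Library implementations (for example the mixup utilities in `timm`) add further options such as CutMix switching and label smoothing.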
