Vision Transformer: Detailed explanation from image slicing to Patch Embedding

Introduction

Vision Transformer (ViT) is a landmark model in computer vision. It applies the Transformer architecture, already highly successful in natural language processing, directly to image classification and achieves impressive results. The core idea is simple but ingenious: cut the image into small patches, feed these patches to the Transformer like words in a sentence, and let the model learn on its own which parts deserve attention.

This tutorial walks through the core design of ViT step by step in plain language: how to split an image into patches, how patch embedding works, how positional encoding is added, and how multi-head attention is computed. Alongside the theory, it gives PyTorch implementation code and shows how to use pre-trained models. Whether you are new to Vision Transformer or want to fill gaps in your understanding, I hope this article helps you get started.


1. The core idea of Vision Transformer

1.1 Understand pictures as “language”

Traditional convolutional neural networks (CNNs) expand the receptive field slowly by stacking convolutional layers, gradually aggregating global semantics from local textures. The Vision Transformer sees the entire image from the very first layer: each image patch can interact directly with every other patch. This global view is its strength.

The main innovations of ViT can be summarized in three sentences:

Image blocking: Cut the entire image into small squares (patches) of a fixed size. Once flattened, each patch plays the role of a "token" (word) in NLP.
Sequence Modeling: Treat these patches as a sequence, send them to the Transformer encoder, and use the self-attention mechanism to find the relationship between them.
Global connection: Starting from the first layer, each patch can see all other patches, which is naturally suitable for capturing long-distance dependencies.

1.2 ViT development timeline

2017: The Transformer architecture was proposed in the paper "Attention Is All You Need".
2018: Models such as BERT make Transformer shine in the field of NLP.
2020: ViT is released, showing for the first time that a pure Transformer can match or surpass CNNs in image classification when pre-trained on large datasets.
2021 to present: Hierarchical ViTs such as Swin Transformer and PVT have emerged and are gradually becoming popular in detection and segmentation tasks.

2. Detailed explanation of ViT architecture

ViT's workflow can be summarized into four steps:

Cut the image into patches and map them into fixed-length vectors (Patch Embedding).
Add a class token specifically for classification at the beginning of the sequence.
Add positional encoding so the model knows the original position of each patch in the image.
Send the sequence to the Transformer encoder, and finally use the output of the class token for classification.

Below we walk through each step in detail, with code to deepen understanding.

2.1 Image Blocking and Patch Embedding

This is the first step in ViT and the key to converting an image from a "grid structure" to a "sequence".

Assume the input image is 224×224 and the patch size is 16×16. The image is then cut into 14 pieces along each axis, for a total of 14×14 = 196 patches. Each patch contains 16×16×3 = 768 pixel values, which a linear projection layer maps into the embedding space (the embedding dimension is a free choice; ViT-Base in the original paper uses 768, which happens to equal the flattened patch size).

Patch Embedding process:
Input image: (B, 3, 224, 224)
Split and flatten: (B, 196, 768)
Each patch vector length: 768
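The arithmetic above is easy to check in a few lines. The sketch below is only illustrative; the helper name patch_grid is hypothetical and not part of any library, and it assumes a square image and square patches.

```python
def patch_grid(image_size=224, patch_size=16, channels=3):
    """Return (num_patches, patch_dim) for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size          # patches along one axis
    num_patches = per_side ** 2                  # total number of patches
    patch_dim = channels * patch_size ** 2       # pixel values per flattened patch
    return num_patches, patch_dim


print(patch_grid())              # (196, 768)
print(patch_grid(224, 32))       # (49, 3072)
```

Doubling the patch size cuts the sequence length by a factor of four while making each flattened patch four times longer.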

The following code implements patch splitting and projection using PyTorch and einops:

import torch
import torch.nn as nn
from einops import rearrange

class ImageToPatches(nn.Module):
    """
    Converts an image into a sequence of projected patches.
    """
    def __init__(self, image_size=224, patch_size=16, channels=3):
        super().__init__()
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        self.patch_dim = channels * patch_size ** 2

        # Linear projection: maps each patch's raw pixels to the target dimension
        self.projection = nn.Linear(self.patch_dim, self.patch_dim)

    def forward(self, x):
        """
        x: (batch, channels, height, width)
        return: (batch, num_patches, patch_dim)
        """
        batch_size, channels, height, width = x.shape

        # Basic check of the input size
        assert height == self.image_size and width == self.image_size, \
            f"Input image size should be ({self.image_size}, {self.image_size})"

        # Split into patches and flatten with einops rearrange
        x = rearrange(
            x,
            'b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
            p1=self.patch_size,
            p2=self.patch_size
        )

        # Linear projection
        x = self.projection(x)

        return x

In the code, rearrange turns the original (B, C, H, W) tensor into (B, num_patches, patch_dim): a single call performs both the splitting and the flattening.
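To see what the rearrange pattern does, the same split can be written with plain reshape and permute. This is a sketch for illustration only; the function name to_patches is hypothetical, and it assumes the height and width are divisible by the patch size.

```python
import torch


def to_patches(x, p):
    """Split (B, C, H, W) into (B, num_patches, p*p*C), matching the
    'b c (h p1) (w p2) -> b (h w) (p1 p2 c)' pattern."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p)   # (B, C, h, p1, w, p2)
    x = x.permute(0, 2, 4, 3, 5, 1)             # (B, h, w, p1, p2, C)
    return x.reshape(b, (h // p) * (w // p), p * p * c)


img = torch.arange(2 * 3 * 8 * 8, dtype=torch.float32).reshape(2, 3, 8, 8)
patches = to_patches(img, p=4)
print(patches.shape)   # torch.Size([2, 4, 48])
```

Patches are ordered row by row across the image, and within each patch the channel index varies fastest.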

2.2 Complete implementation of ViT

After mastering Patch Embedding, we can build a complete Vision Transformer model. Here is a clean and readable PyTorch implementation:

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """
    A complete Vision Transformer implementation.
    """
    def __init__(
        self,
        image_size=224,
        patch_size=16,
        num_classes=1000,
        dim=768,             # embedding dimension
        depth=12,            # number of Transformer layers
        heads=12,            # number of attention heads
        mlp_dim=3072,        # hidden dimension of the feed-forward network
        dropout=0.1,
        emb_dropout=0.1
    ):
        super(VisionTransformer, self).__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size ** 2

        # Split and project in one step with a convolution
        # (kernel size and stride both equal patch_size)
        self.to_patch_embedding = nn.Conv2d(
            3, dim, kernel_size=patch_size, stride=patch_size
        )

        # Learnable class token and positional encoding
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

        self.dropout = nn.Dropout(emb_dropout)

        # Transformer encoder (built from PyTorch's TransformerEncoderLayer)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim,
            nhead=heads,
            dim_feedforward=mlp_dim,
            dropout=dropout,
            activation='gelu',
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Final classification head
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )

    def forward(self, img):
        # 1. Get patch embeddings
        x = self.to_patch_embedding(img)    # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        b, n, _ = x.shape

        # 2. Prepend the class token
        cls_tokens = self.cls_token.repeat(b, 1, 1)   # (B, 1, dim)
        x = torch.cat([cls_tokens, x], dim=1)         # (B, num_patches+1, dim)

        # 3. Add positional encoding
        x = x + self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        # 4. Run the Transformer encoder
        x = self.transformer(x)

        # 5. Classify from the output at the class token position
        cls_output = x[:, 0]    # (B, dim)
        output = self.mlp_head(cls_output)

        return output

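This code reproduces the overall structure of ViT. Beginners are encouraged to read it line by line and track how the tensor shape changes at each step. One difference to be aware of: nn.TransformerEncoderLayer applies LayerNorm after each sub-block by default, whereas the original ViT applies it before; passing norm_first=True selects the pre-norm variant.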
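A convolution whose kernel size and stride both equal the patch size is equivalent to flattening each patch and applying a linear layer. The sketch below checks this numerically on a small random input; the sizes are arbitrary and chosen only to keep the example fast.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p, dim = 4, 8                                    # small, illustrative sizes
conv = nn.Conv2d(3, dim, kernel_size=p, stride=p)
linear = nn.Linear(3 * p * p, dim)

# Copy the conv kernel into the linear layer: (dim, 3, p, p) -> (dim, 3*p*p)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(dim, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 3, 8, 8)

# Path 1: strided convolution, then flatten the spatial grid
out_conv = conv(x).flatten(2).transpose(1, 2)    # (2, 4, dim)

# Path 2: cut patches by hand in (C, p1, p2) order, then project
b, c, h, w = x.shape
patches = (
    x.reshape(b, c, h // p, p, w // p, p)
     .permute(0, 2, 4, 1, 3, 5)                  # (B, h, w, C, p1, p2)
     .reshape(b, -1, c * p * p)
)
out_linear = linear(patches)

print(torch.allclose(out_conv, out_linear, atol=1e-5))
```

The two paths differ only in how the pixels inside a patch are ordered before projection, which a learned weight matrix absorbs.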

2.3 Detailed explanation of multi-head self-attention mechanism

Self-attention is the core of Transformer and the key to ViT’s ability to “see the whole picture”. Here I give an implementation that is closer to the original formula to help everyone understand the internal calculation process:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """
    Multi-head self-attention.
    """
    def __init__(self, d_model=768, num_heads=12, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear layers that produce Q, K, V, plus the output projection
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        # Scaling factor sqrt(d_k), kept as a plain float so it works on any device
        self.scale = self.d_k ** 0.5

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Linear projections
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # Split into heads: (B, seq_len, num_heads, d_k) -> (B, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        attention = F.softmax(scores, dim=-1)
        attention = self.dropout(attention)

        # Weighted sum of the values
        output = torch.matmul(attention, V)   # (B, num_heads, seq_len, d_k)

        # Merge the heads
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

        # Final linear projection
        output = self.W_o(output)

        return output

You can see directly from this code that the attention mechanism lets each position in the sequence aggregate information from all positions, weighted by its similarity to each of them.
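The same computation can be traced by hand on a tiny example. The sketch below is a single-head, pure-Python version with no learned projections, meant only to show how the weights are formed; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors (one head)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, V))
                        for i in range(len(V[0]))])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy "patch" vectors
out, attn = attention(tokens, tokens, tokens)
for row in attn:
    print([round(a, 3) for a in row], "sum =", round(sum(row), 3))
```

Each row of weights sums to 1, and a token puts less weight on tokens it is orthogonal to than on tokens it overlaps with.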

3. Positional encoding and Class Token

3.1 Position encoding: Let the model know "where you are"

The Transformer itself has no sense of input order, so position information must be injected into each patch. ViT uses learnable positional encoding: a set of parameters is initialized directly and adjusted by the model during training.

Commonly used position encoding types are:

Learnable: ViT's default approach, simple and direct.
Sinusoidal: No additional parameters required, but usually considered less effective in ViT.
Two-dimensional position encoding (2D PE): retains the row and column coordinates of each patch, a natural fit for images, although the original ViT paper reported no significant gain over 1D learnable encoding.
Rotary Position Embedding (RoPE): Works well with large models and long sequences.

In ViT, the positional encoding is added directly to the patch embeddings and has shape (num_patches + 1, dim). The +1 is needed because one position is reserved for the class token.
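For comparison with the learnable default, the fixed sinusoidal variant can be sketched in pure Python. This follows the sin/cos formula from "Attention Is All You Need" and is not what ViT uses by default; the function name is hypothetical and an even embedding dimension is assumed.

```python
import math


def sinusoidal_position_encoding(num_positions, dim):
    """Fixed sin/cos encoding: even indices use sin, odd indices use cos."""
    assert dim % 2 == 0, "dim is assumed to be even in this sketch"
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# 196 patches + 1 class token, embedding dimension 768
pe = sinusoidal_position_encoding(196 + 1, 768)
print(len(pe), len(pe[0]))   # 197 768
```

The table has the same (num_patches + 1, dim) shape as the learnable encoding, so it could be added to the token sequence in the same way.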

3.2 Class Token: a special role that “oversees the overall situation”

ViT places a learnable class token at the very beginning of the sequence. This token does not come from any image patch, but after passing through multiple Transformer layers it gradually aggregates information from all patches and becomes the "spokesperson" for the entire image. In the end we only need to take the output vector of the class token and send it to the classification head.

The advantages of this design are:

Centrally aggregate image-level semantic information.
Avoids the need to design additional global pooling layers.
Shares its origin with the [CLS] token in BERT, making it easy to understand and transfer.
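As a shape-level illustration, the sketch below prepends a class token to a batch of patch embeddings and reads it back out, with mean pooling over the patch tokens shown as the alternative it replaces. Random tensors stand in for real embeddings and no encoder is applied, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

batch, num_patches, dim = 2, 196, 768
patch_tokens = torch.randn(batch, num_patches, dim)    # stand-in for patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # one learnable vector shared by all images

# Prepend the class token to every sequence in the batch
x = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)
print(x.shape)                          # torch.Size([2, 197, 768])

# (a Transformer encoder would be applied to x here)

cls_readout = x[:, 0]                   # class-token readout: (2, 768)
gap_readout = x[:, 1:].mean(dim=1)      # global average pooling alternative: (2, 768)
print(cls_readout.shape, gap_readout.shape)
```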

4. Variations and improvements of ViT

4.1 DeiT: Data-efficient version of ViT

DeiT mainly addresses ViT's reliance on pre-training with massive datasets. It introduces a teacher model (usually a convolutional network) that guides the ViT through knowledge distillation, so that a good model can be trained on a dataset of only about a million images, such as ImageNet-1k. It also uses stronger data augmentation and regularization.
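A generic soft-label distillation loss can be sketched as follows. This is the classic temperature-scaled formulation, not necessarily DeiT's exact recipe (the DeiT paper also adds a dedicated distillation token, which is not shown); the function name and the values of alpha and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=3.0):
    """Mix cross-entropy on true labels with KL divergence to the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd


torch.manual_seed(0)
student = torch.randn(4, 10)            # stand-in for ViT student logits
teacher = torch.randn(4, 10)            # stand-in for CNN teacher logits
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
print(loss.item())
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross-entropy remains.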

4.2 More efficient ViT variants

The computational cost of self-attention in a pure ViT grows with the square of the number of patches, making it expensive on high-resolution images. Subsequent research proposed many lightweight or efficient variants, such as:

MobileViT: Incorporates the local advantages of convolution and is suitable for mobile devices.
PVT (Pyramid Vision Transformer): uses gradually shrinking feature maps to form a pyramid structure similar to CNN.
Swin Transformer: Computes self-attention within local windows and achieves cross-window interaction by shifting the windows, significantly reducing computation.
Twins: Alternates locally-grouped self-attention with global sub-sampled attention, capturing both local and global context.

These variants allow ViT not only to perform well in classification, but also to be efficiently used for downstream tasks such as detection and segmentation.
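The quadratic growth, and the saving from window-based attention, can be made concrete by counting query-key pairs. The sketch below is a rough count for one head in one layer; the function names and the window size of 7 are illustrative assumptions, and details such as window shifting are ignored.

```python
def global_attention_pairs(image_size, patch_size):
    """Query-key pairs when every patch attends to every patch."""
    n = (image_size // patch_size) ** 2
    return n * n


def windowed_attention_pairs(image_size, patch_size, window):
    """Query-key pairs when attention is restricted to window x window blocks."""
    side = image_size // patch_size
    assert side % window == 0, "grid side must be divisible by the window size"
    num_windows = (side // window) ** 2
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


print(global_attention_pairs(224, 16))        # 38416
print(global_attention_pairs(448, 16))        # 614656
print(windowed_attention_pairs(224, 16, 7))   # 9604
```

Doubling the image side multiplies the global count by 16, while the windowed count grows only linearly with the number of patches.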

5. Use pre-trained models

As developers, we don't need to train ViT from scratch in most cases. Both torchvision (the official PyTorch vision library) and Hugging Face provide pre-trained models out of the box.

5.1 Using PyTorch official model

import torch
from torchvision import models, transforms

# Load the pre-trained ViT-B/16
model = models.vit_b_16(weights='IMAGENET1K_V1')
model.eval()

# Image preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assume input_tensor has already been prepared, with shape (1, 3, 224, 224)
with torch.no_grad():
    output = model(input_tensor)
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
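To turn logits into readable predictions, the top-k classes can be extracted as below. This sketch uses made-up logits in place of a real model output; the helper name top_k_predictions is hypothetical.

```python
import torch


def top_k_predictions(logits, k=5):
    """Return [(class_index, probability), ...] for the k most likely classes."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, k)
    return list(zip(top_idx.tolist(), top_probs.tolist()))


fake_logits = torch.tensor([0.1, 2.0, -1.0, 1.0])   # stand-in for a model output row
for idx, p in top_k_predictions(fake_logits, k=2):
    print(f"class {idx}: {p:.3f}")
```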

5.2 Using Hugging Face Transformers

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Load the processor and the model
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

# Read the image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

# Inference
outputs = model(**inputs)
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

The Hugging Face interface is simple and quick to pick up, which makes it well suited to validating ideas or building prototypes.

6. Comparison between ViT and CNN

ViT and CNN are not an either-or relationship. They each have their own strengths. Let’s compare them from several key dimensions:

Features	CNN	ViT
Receptive field	Grows gradually from local to global as layers stack	Global from the first layer
Parameter efficiency	Usually fewer parameters and high computational efficiency	More parameters; needs more data
Data requirements	Good results are easier to obtain on small datasets	Relies on large-scale pre-training or knowledge distillation
Interpretability	Feature maps are hard to interpret intuitively	Attention weights can be visualized directly
Computational overhead	Grows linearly with the number of pixels	Grows quadratically with the number of patches
Inductive bias	Strong (translation equivariance, locality)	Weak (fewer built-in assumptions; more is learned from data)

Selection suggestions:

Small datasets, real-time applications, and mobile deployment remain CNN strengths.
When data is plentiful, higher accuracy is the goal, or multi-modal fusion is required, ViT tends to stand out.

7. Practical skills and tuning

To train or fine-tune a ViT, you can usually use the following techniques:

Optimizer: Prioritize using AdamW, combined with weight decay.
Learning rate scheduling: warmup + cosine annealing to make training more stable.
Data augmentation: RandAugment, Mixup, CutMix, etc. can effectively improve generalization.
Regularization: Dropout, Stochastic Depth, label smoothing.
Knowledge Distillation: Use large models or CNN teacher models to guide a small ViT; the effect is often significant.
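The warmup plus cosine annealing schedule mentioned above can be written as a small function. This is a minimal sketch of one common formulation; the function name and the step counts and learning rates in the usage lines are illustrative assumptions, not recommended settings.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only: 1000 steps, 100 of them warmup, peak LR 1e-3
for s in (0, 50, 99, 100, 550, 1000):
    print(s, f"{lr_at_step(s, 1000, 100, 1e-3):.6f}")
```

The learning rate rises linearly for the first 100 steps, peaks at the base value, and then follows half a cosine wave down to the minimum.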

If you want to deploy to a production environment:

Model Quantization: INT8 quantization can significantly reduce model size and accelerate inference.
Mixed Precision Training: Saves GPU memory and increases speed.
Sparse Attention: Reduce the amount of computation by limiting the attention span.
Distilled small models: Can reach performance close to that of large models even on mobile devices.
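The idea behind INT8 quantization can be shown with plain arithmetic. The sketch below uses symmetric per-tensor quantization on a handful of made-up weights; real toolchains add calibration, per-channel scales, and quantized kernels, none of which is shown, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floats."""
    return [qi * scale for qi in q]


weights = [0.82, -1.27, 0.03, 0.5, -0.64]        # made-up weights for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte instead of four, at the price of a rounding error of at most half the scale.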

ViT is an important milestone in computer vision. It is worth first taking the time to understand the basic principles of the Transformer (especially self-attention) and then returning to the ViT implementation, which will be much easier to follow. Running the code and loading a pre-trained model for prediction are also good ways to build intuition quickly.

8. Summary

Vision Transformer showed that image classification can be done well, or even better, without convolutions. Its core innovations can be summarized in three points:

Image Patch: Turns images into sequences, bridging visual and language modeling.
Global self-attention: Each patch can directly model global dependencies and capture long-distance features.
Scalability: Models can be made deeper and wider like stacking Lego bricks, with performance improving further given large amounts of data.

Whether you are engaged in computer vision research or want to implement cutting-edge technology into products, Vision Transformer is a technology worthy of careful understanding.

💡 Important reminder: ViT helped open an era of unified modeling for vision and language, and paved the way for well-known multi-modal models such as CLIP and DALL·E.
