title: The Complete Transformer Architecture Explained: From Self-Attention to Encoder-Decoder Design and a PyTorch Implementation | Daoman PythonAI
description: An in-depth look at the core components of the Transformer architecture, including self-attention, multi-head attention, positional encoding, residual connections, and layer normalization. Contains a PyTorch implementation, the mathematical principles, and practical application scenarios.
keywords: [Transformer, self-attention, multi-head attention, position encoding, residual connection, layer normalization, NLP, deep learning, PyTorch, machine learning]

The Complete Transformer Architecture Explained: From Self-Attention to Encoder-Decoder Design and a PyTorch Implementation


Transformer Architecture Overview

GPT-4o, Gemini, and Claude, the models currently attracting so much attention, are all built on the same core architecture: the Transformer, proposed in Google's 2017 paper "Attention Is All You Need".

It abandons the recurrent RNN/LSTM/GRU structure that traditional NLP relied on, as well as the local receptive field of CNNs, and relies entirely on the attention mechanism. This was a paradigm shift in deep learning: it made training very large models practical, and it lets every position in a sequence attend directly to every other position within the context window, so long texts are processed in one pass rather than token by token.

Core structure breakdown

The Transformer consists of a stack of N encoder layers and a matching stack of N decoder layers (N=6 in the paper). The overall flow is as follows:

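flowchart LR
    subgraph Input [Input]
        I[Input sequence] --> W[Token embedding]
        W --> P[Positional encoding]
        P --> IW[Embedding with position]
    end

    subgraph Encoder [Encoder x N layers]
        IW --> MHA[Multi-head self-attention]
        MHA --> ADD1[Residual + LayerNorm]
        ADD1 --> FFN[Feed-forward network]
        FFN --> ADD2[Residual + LayerNorm]
        ADD2 --> EO[Encoder output]
    end

    subgraph Decoder [Decoder x N layers]
        TI[Target sequence prefix] --> TW[Token embedding]
        TW --> TP[Positional encoding]
        TP --> TWI[Embedding with position]
        TWI --> MMHA[Masked multi-head self-attention]
        MMHA --> TADD1[Residual + LayerNorm]
        TADD1 --> CA[Cross-attention]
        CA --> TADD2[Residual + LayerNorm]
        TADD2 --> TFFN[Feed-forward network]
        TFFN --> TADD3[Residual + LayerNorm]
        TADD3 --> DO[Decoder output]
    end

    EO -.->|K/V| CA
    DO --> F[Final linear layer + Softmax]
    F --> O[Output probability distribution]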

Four major advantages over traditional models

  • ✅ Highly parallel training: Unlike an RNN, which must finish one word before starting the next, the Transformer processes all positions of a training sequence at the same time, which makes far better use of GPUs and greatly shortens training. (Autoregressive generation at inference time is still step by step.)
  • ✅ Long-range dependencies are easy to capture: The attention mechanism directly computes the association between any two words, with no need to "pass messages along step by step" as an RNN does.
  • ✅ Some built-in interpretability: The attention weights can be visualized to get a rough picture of which words the model is "looking at".
  • ✅ Straightforward to scale: Stack more layers, widen the dimensions, and add parameters. The same recipe covers everything from the 12-layer BERT-base to GPT-3 with 96 layers.

Detailed explanation of self-attention mechanism

Self-attention is the heart of the Transformer. It lets each position in the sequence distribute its attention, as needed, over every position in the sequence (itself included).

Intuition: shopping with Query, Key, and Value

Imagine you are looking for snacks in the supermarket:

  • You (Query): what you currently want to know, e.g. "Are there any salty potato chips?"
  • Shelf label (Key): the characteristics of each snack, used to match against your need, e.g. "Sweet Biscuits", "Salty Potato Chips", "Sugar-Free Coke"
  • The snack itself (Value): the actual content behind each label, e.g. "Lay's Cucumber Flavor", "Oreo Original Flavor"...
  • Final selection (weighted sum): each snack contributes in proportion to how well its label matches (the attention weight), so the best matches dominate the result

This mechanism allows the model to independently decide which words in the context to focus on when processing a certain word.
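
The supermarket analogy can be turned into a few lines of plain Python. This is only a toy sketch: the two-dimensional "saltiness/sweetness" features, the one-hot values, and the helper names `softmax` and `attend` are made up for illustration and are not part of the PyTorch code in this article.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: score every key, then blend the values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Hypothetical features: [saltiness, sweetness]
query = [4.0, 0.0]            # "salty potato chips, please"
keys = [[0.0, 4.0],           # label: sweet biscuits
        [4.0, 0.0],           # label: salty potato chips
        [0.0, 1.0]]           # label: sugar-free cola
values = [[1.0, 0.0, 0.0],    # one-hot stand-ins for the snacks themselves
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

output, weights = attend(query, keys, values)
print(weights)  # almost all of the weight lands on the salty chips
```

Because the values are one-hot here, the output vector is simply the attention weights, which makes it easy to see that the result is a blend dominated by the best-matching item rather than a hard pick.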

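Minimal PyTorch implementation

import torch
import torch.nn as nn
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Core implementation of scaled dot-product attention.
    Q/K/V: (..., seq_len, d_k/d_v), e.g. (batch_size, seq_len, d_k)
           or (batch_size, num_heads, seq_len, d_k)
    mask:  broadcastable to the score shape; positions where mask == 0 are hidden
    """
    d_k = Q.size(-1)
    # 1. Similarity between Q and K (the attention scores).
    # Dividing by sqrt(d_k) keeps the scores from growing so large
    # that softmax saturates and its gradients vanish
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    # 2. Apply the mask (e.g. the decoder must not look at future words).
    # The out-of-place masked_fill avoids modifying a tensor autograd may still need
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # 3. softmax turns the scores into weights in [0, 1] that sum to 1
    attention_weights = torch.softmax(scores, dim=-1)
    
    # 4. The weighted sum of V gives the final output
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights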

The division by sqrt(d_k) in the code is the scaling step. When d_k is large, the raw dot products grow large in magnitude, which pushes softmax into saturated regions where its gradients are extremely small; scaling counteracts this vanishing-gradient problem.
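
A small numeric experiment makes the effect of scaling concrete. The sketch below uses only the standard library; the choice of d_k = 64, the random vectors, and the three example scores are assumptions picked for illustration.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64

# Part 1: dot products of random unit-variance vectors have variance of about d_k,
# so their typical magnitude grows like sqrt(d_k).
random.seed(0)
dots = []
for _ in range(2000):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    dots.append(sum(a * b for a, b in zip(q, k)))
mean = sum(dots) / len(dots)
variance = sum((d - mean) ** 2 for d in dots) / len(dots)

# Part 2: the same three scores with and without scaling.
raw_scores = [8.0, 0.0, -8.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [1.0, 0.0, -1.0]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

# For softmax, d(weight_i) / d(score_i) = p_i * (1 - p_i)
raw_grad = raw_weights[0] * (1 - raw_weights[0])
scaled_grad = scaled_weights[0] * (1 - scaled_weights[0])

print(f"variance of dot products: {variance:.1f} (d_k = {d_k})")
print(f"top weight without scaling: {raw_weights[0]:.4f}, gradient: {raw_grad:.5f}")
print(f"top weight with scaling:    {scaled_weights[0]:.4f}, gradient: {scaled_grad:.5f}")
```

Without scaling, the largest score takes nearly all of the weight and its gradient is close to zero; after dividing by sqrt(d_k) the distribution is softer and the gradient is orders of magnitude larger.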


Multi-head attention mechanism

A single attention head produces only one set of attention weights, so it tends to capture one kind of "semantic pattern", such as the relationship between subject and predicate. Multi-head attention runs several "semantic radars" at once: one may track subject and predicate, another pronoun reference, another sentiment words. Combining their results lets the model capture several kinds of relationships in the same layer, at roughly the same total cost as one full-width head.

Implementation idea

Multi-head attention first projects the input into Q, K, and V, splits each of them along the feature dimension into several subspaces (heads), computes attention independently in each head, and finally concatenates the heads' results and applies one more linear transformation.
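
The split and merge steps are just reshapes, which is easiest to see on a tiny tensor. The sizes below (batch of 2, 5 tokens, d_model = 8, 2 heads) are arbitrary values chosen for illustration.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 2
d_k = d_model // num_heads  # 4 features per head

# A tensor filled with 0, 1, 2, ... so individual entries are easy to track
x = torch.arange(batch_size * seq_len * d_model, dtype=torch.float32)
x = x.view(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

# Merge: the inverse reshape restores the original layout
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(heads.shape)             # torch.Size([2, 2, 5, 4])
print(torch.equal(merged, x))  # True
```

Head 0 sees the first d_k features of every token and head 1 sees the next d_k, so each head works in its own slice of the projected feature space.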

PyTorch implementation

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Four linear maps: one each to produce Q/K/V, plus one to merge the heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        # Note: this dropout is not applied in the minimal forward pass below
        self.dropout = nn.Dropout(dropout)
        
    def split_heads(self, x, batch_size):
        """Split the wide vector into several smaller per-head vectors, computed in parallel"""
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # becomes (batch_size, num_heads, seq_len, d_k)
    
    def combine_heads(self, x, batch_size):
        """Concatenate the per-head vectors back into the original shape"""
        x = x.transpose(1, 2).contiguous()
        return x.view(batch_size, -1, self.d_model)
    
    def forward(self, Q, K, V, mask=None):
        # mask must broadcast to (batch_size, num_heads, seq_len_q, seq_len_k)
        batch_size = Q.size(0)
        
        # 1. Linear projections produce the initial Q/K/V
        Q = self.W_q(Q)
        K = self.W_k(K)
        V = self.W_v(V)
        
        # 2. Split into heads
        Q = self.split_heads(Q, batch_size)
        K = self.split_heads(K, batch_size)
        V = self.split_heads(V, batch_size)
        
        # 3. Scaled dot-product attention
        scaled_attn, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
        
        # 4. Merge the heads
        concat_attn = self.combine_heads(scaled_attn, batch_size)
        
        # 5. Final linear projection
        output = self.W_o(concat_attn)
        
        return output, attn_weights

Positional encoding

The Transformer has no recurrent structure, so by itself it "cannot see the order of words": self-attention treats its input as an unordered set, and "I hit you" and "You hit me" contain exactly the same tokens in its eyes. We therefore need to explicitly attach a "position tag" to each word, which is what positional encoding does.
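
This order-blindness can be checked directly. The sketch below uses a stripped-down `self_attention` helper (an assumption for illustration: no learned projections, Q = K = V = the input) and shows that shuffling the tokens merely shuffles the outputs in the same way.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 4)  # 3 tokens with 4 features each, no positional information

def self_attention(x):
    """Toy self-attention with Q = K = V = x (hypothetical helper)."""
    scores = x @ x.T / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

perm = torch.tensor([2, 0, 1])       # reorder the tokens
out = self_attention(x)
out_perm = self_attention(x[perm])

# Each token gets the same output vector no matter where it stands in the sequence
print(torch.allclose(out_perm, out[perm], atol=1e-6))

# Any order-independent summary, such as the mean over tokens, is unchanged
print(torch.allclose(out.mean(dim=0), out_perm.mean(dim=0), atol=1e-6))
```

Adding a position-dependent vector to each token before attention breaks this symmetry, which is the job of the positional encoding.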

Fixed position encoding in the paper

The paper uses sine/cosine functions to generate fixed position codes. This solution has two major benefits:

  • No additional trainable parameters are required; the encoding is generated entirely by the function
  • It may generalize to sequences longer than those seen during training (the paper states this as a hypothesis rather than a guarantee)

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Build a positional-encoding matrix large enough for any expected sequence
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        
        # sin on even dimensions, cos on odd dimensions (assumes d_model is even)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register as a buffer so the optimizer does not update it
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        """Add the positional encoding to the word embeddings"""
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
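
For reference, the formula behind this class is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The plain-Python sketch below computes it for a single position; the helper name `positional_encoding` and the tiny d_model = 4 are assumptions for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position as a list of d_model floats."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))      # even index
        if i + 1 < d_model:
            pe.append(math.cos(angle))  # odd index
    return pe

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)  # [sin(1), cos(1), sin(0.01), cos(0.01)]
```

The first pair of dimensions oscillates quickly with position while later pairs change more and more slowly, so every position gets a distinct pattern across the feature dimensions.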

Encoder architecture

The encoder is responsible for understanding the contextual information of the input sequence and converting each word into a "vector containing the semantics of the entire sequence".

Single encoder layer

Each encoder layer consists of two core modules: multi-head self-attention and a feed-forward network (FFN). Each module is followed by a residual connection and layer normalization (both are discussed separately later).

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads, dropout)
        # No trailing Dropout here: forward() already applies dropout to the
        # FFN output, and a second one would drop it twice
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Order of residual + layer normalization: LayerNorm(x + SubLayer(x))
        # First, self-attention
        attn_out, _ = self.mha(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        
        # Then the feed-forward network
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        
        return x

Complete encoder

The full encoder first goes through word embedding + positional encoding, and then stacks N identical encoder layers.

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, vocab_size, max_len, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len, dropout)
        self.enc_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        
    def forward(self, x, mask=None):
        # The paper multiplies the embeddings by sqrt(d_model); a common explanation
        # is that this keeps them from being drowned out by the positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        
        for layer in self.enc_layers:
            x = layer(x, mask)
            
        return x
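
The `mask` argument of the encoder is typically a padding mask that hides padded positions. Here is a minimal sketch of how such a mask could be built so that it follows the `mask == 0` convention and broadcasts against scores of shape (batch, heads, seq_q, seq_k). The padding id 0 and the helper name `make_padding_mask` are assumptions, not part of the original code.

```python
import torch

PAD_ID = 0  # hypothetical padding token id

def make_padding_mask(token_ids, pad_id=PAD_ID):
    """Return a (batch, 1, 1, seq_len) mask: 1 for real tokens, 0 for padding."""
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2).int()

token_ids = torch.tensor([[5, 7, 9, 0, 0],
                          [3, 4, 0, 0, 0]])
mask = make_padding_mask(token_ids)

# Stand-in attention scores: (batch, heads, seq_q, seq_k), all equal
scores = torch.zeros(2, 4, 5, 5)
masked = scores.masked_fill(mask == 0, -1e9)
weights = torch.softmax(masked, dim=-1)

print(mask.shape)        # torch.Size([2, 1, 1, 5])
print(weights[0, 0, 0])  # weight is spread over the 3 real tokens only
```

The two singleton dimensions let one mask per sentence be shared by all heads and all query positions.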

Decoder architecture

The decoder is responsible for generating the output sequence autoregressively - every time a word is generated, it is added back to the input to generate the next one.

Single decoder layer

The decoder layer has one more module than the encoder layer, cross-attention, so it is made up of three sub-layers:

  • Masked multi-head self-attention: a mask prevents the decoder from seeing future words, which is what makes autoregressive generation valid
  • Cross-attention: Q comes from the decoder, while K and V come from the encoder output, letting the decoder "consult" the information in the input sequence
  • Feed-forward network: exactly the same as in the encoder

Each sub-layer also uses a residual connection and layer normalization. The code structure is similar to the encoder layer and will not be repeated here.

The core difference is the self-attention mask: everything above the main diagonal (the strict upper triangle) is blocked, so position i can only see position i and the words before it. With the mask == 0 convention used in the attention code above, that corresponds to a lower-triangular matrix of ones.
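
For readers who want to see the three sub-layers wired together, here is one possible sketch. To stay self-contained it uses PyTorch's built-in `nn.MultiheadAttention` rather than the `MultiHeadAttention` class from this article, and the names `causal_mask` and `DecoderLayerSketch` are hypothetical. Note that the built-in module uses the opposite mask convention: in a boolean `attn_mask`, `True` marks positions that must not be attended to.

```python
import torch
import torch.nn as nn

def causal_mask(size):
    """Boolean (size, size) mask; True marks future positions to block."""
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

class DecoderLayerSketch(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out):
        # 1. Masked self-attention: each position sees itself and earlier positions
        mask = causal_mask(x.size(1)).to(x.device)
        attn, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn))
        # 2. Cross-attention: Q from the decoder, K and V from the encoder output
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + self.dropout(attn))
        # 3. Feed-forward network
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x

layer = DecoderLayerSketch(d_model=16, num_heads=4, d_ff=32, dropout=0.0).eval()
tgt = torch.randn(2, 5, 16)      # (batch, target length, d_model)
enc_out = torch.randn(2, 7, 16)  # (batch, source length, d_model)
with torch.no_grad():
    out = layer(tgt, enc_out)
print(out.shape)  # torch.Size([2, 5, 16])
```

A quick way to convince yourself the mask works is to change only the last target token and confirm that the outputs at all earlier positions stay the same.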


Residual connection and layer normalization

These two are the key to making deep Transformers trainable. Without them, gradients tend to vanish or explode once more than a few layers are stacked.

Residual connection

The idea is very simple: add the input of the sub-layer directly to the output of the sub-layer, that is, input + sub-layer output. In this way, even if the sub-layer learning effect is not good, the input information can be passed on directly to avoid information loss and greatly alleviate the vanishing gradient problem.

Layer normalization

Unlike batch normalization (BN), which normalizes each feature across the samples in a batch, layer normalization (LN) normalizes across the feature dimension of each token on its own. Sequence lengths in NLP tasks are often inconsistent and batches are padded, which makes BN statistics unstable; LN does not depend on the batch or on the sequence length, so it has become standard in Transformers.

Together, the two make it practical to stack dozens or even hundreds of layers.
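
The "add, then normalize" step can be written out by hand for a single token vector. This sketch is deliberately simplified: it omits the learnable gain and bias that `nn.LayerNorm` carries, and the numbers and the helper names `layer_norm` and `add_and_norm` are made up for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

token = [1.0, 2.0, 3.0, 4.0]             # input to the sub-layer
sublayer_out = [0.5, -0.5, 0.5, -0.5]    # pretend output of attention or the FFN
y = add_and_norm(token, sublayer_out)

y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(y)
print(y_mean, y_var)  # mean is about 0, variance is about 1
```

Only the features of this one token enter the statistics, which is why the result does not depend on the batch or on how long the sequence is.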


Complete Transformer model implementation

Assembling the encoder and decoder, plus the final linear layer and softmax, the complete Transformer model is obtained. Due to space limitations, only the core splicing idea is given here: the encoder processes the source sequence, the decoder takes the prefix of the target sequence and the encoder output as input, gradually predicts the next word, and finally outputs a probability distribution.

If you want to run complete code directly, you can refer to the classic "The Annotated Transformer" project, or use PyTorch's built-in nn.Transformer module for quick verification.
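
As a rough sketch of the quick-verification route, the snippet below pushes random token ids through `nn.Transformer`. All sizes are toy values chosen for illustration. `nn.Transformer` itself contains neither embeddings nor positional encoding, so the embedding layer and the output projection (`generator`, a hypothetical name) are added here, and positional encoding is left out for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 32, 100  # toy sizes

embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=64, dropout=0.0, batch_first=True,
)
generator = nn.Linear(d_model, vocab_size)  # final linear layer before softmax

src = torch.randint(0, vocab_size, (2, 7))  # (batch, source length)
tgt = torch.randint(0, vocab_size, (2, 5))  # (batch, target prefix length)

# Causal mask in the additive float form nn.Transformer accepts:
# 0 where attention is allowed, -inf above the diagonal
T = tgt.size(1)
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

hidden = model(embed(src), embed(tgt), tgt_mask=tgt_mask)
probs = torch.softmax(generator(hidden), dim=-1)
print(probs.shape)  # torch.Size([2, 5, 100])
```

Each target position ends up with a probability distribution over the vocabulary, which is the final output described in the architecture diagram.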


Practical applications and variations

The current mainstream large models are all variants of Transformer, which are mainly divided into three categories:

| Variant type | Encoder-only | Decoder-only | Full encoder-decoder |
| --- | --- | --- | --- |
| Representative models | BERT, RoBERTa | GPT series, LLaMA | T5, BART |
| Applicable scenarios | Text classification, question answering, named entity recognition (understanding tasks) | Text generation, dialogue, code completion (generation tasks) | Machine translation, summarization, text rewriting (conversion tasks) |

The Transformer is the cornerstone of modern NLP, but beginners do not need to hand-write the complete model from day one. Start by fine-tuning a pre-trained model with the Hugging Face Transformers library, then come back to the core code after finishing a few small projects. You will learn much more efficiently that way!

Summary

The Transformer's success lies in its simplicity and versatility:

  • Self-attention handles long-range dependencies and allows parallel computation
  • Residual connections plus layer normalization solve the problem of training deep networks
  • The architecture is modular and can be easily expanded to very large models

Understand Transformer, and you will open the door to modern large models!