A detailed explanation of positional encoding: the core technique for injecting positional information into the attention mechanism | Daoman PythonAI

# Positional Encoding: Injecting a "Sense of Position" into an Unordered Matrix

📂 Phase: Phase 3, Transformer Revolution (Core) 🔗 Related Chapters: Multi-Head Attention · Complete Transformer Architecture


1. Why do we need to add a "position signal"?

1.1 Self-Attention is inherently "unordered"

The most remarkable property of Self-Attention in the Transformer is that it processes the entire sequence in parallel, in one pass. However long the sentence, every word computes its attention weights against every other word at the same time, which removes the "word-by-word" serial bottleneck of RNNs.

But this efficiency comes at a cost: the attention computation itself knows nothing about word order. It only cares about "how relevant is this word to that word." Who is the subject, who is the object, who comes first and who comes last: it does not ask at all.

To give the most intuitive example:

Input 1: 我 打 你 (I hit you)    → "I" is the attacker, "you" is the victim
Input 2: 你 打 我 (you hit me)   → the meaning is completely reversed!

If these two sentences are fed into a Self-Attention layer without position information, each word ends up with exactly the same attention weights and output vector in both sentences; only the order of the rows changes. The model therefore cannot tell "I hit you" from "You hit me." In this state a Transformer is essentially an "advanced bag-of-words model" and cannot handle tasks such as translation, summarization, and dialogue, where word order matters.
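This behavior can be checked numerically. The sketch below is a minimal single-head self-attention with no positional signal; the projection matrices and token embeddings are random stand-ins chosen for illustration (they are assumptions, not part of the original article). Swapping the first and last tokens only permutes the output rows.

```python
import torch

torch.manual_seed(0)
d_model = 8

# Hypothetical random projections standing in for trained W_q, W_k, W_v.
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention, no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy embeddings for the three tokens "I", "hit", "you".
emb_i, emb_hit, emb_you = torch.randn(3, d_model)

out_1 = self_attention(torch.stack([emb_i, emb_hit, emb_you]))  # I hit you
out_2 = self_attention(torch.stack([emb_you, emb_hit, emb_i]))  # you hit I

# Each token receives the same output vector in both sentences;
# only the row order differs, so we undo the swap before comparing.
same_vectors = torch.allclose(out_1, out_2[[2, 1, 0]], atol=1e-4)
print(same_vectors)
```

Any pooling over the rows (a mean, for example) would therefore give the same sentence representation for both inputs.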

1.2 The most direct solution: label each word with a "position tag"

Since Self-Attention has no built-in sense of order, we simply generate a unique "position vector" for each position and combine it with the semantic vector of the word itself. This way:

  • The word vector remembers "what this word means"
  • The position vector remembers "where in the sentence this word sits"
  • Adding the two gives a final representation that carries both semantic and positional information.

The original Transformer and most subsequent models chose direct addition rather than concatenation, because concatenation enlarges the vector (doubling it when both vectors have the same width) and increases the downstream computation.
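A small shape check makes the trade-off concrete. This is only a sketch: the tensors are random placeholders, and the two `nn.Linear` layers stand in for whatever layer consumes the combined vector.

```python
import torch
import torch.nn as nn

d_model, seq_len = 512, 10
word_vec = torch.randn(1, seq_len, d_model)
pos_vec = torch.randn(1, seq_len, d_model)  # stand-in position vectors

added = word_vec + pos_vec                        # shape (1, 10, 512)
spliced = torch.cat([word_vec, pos_vec], dim=-1)  # shape (1, 10, 1024)

# A projection back to d_model needs twice the weights after concatenation.
proj_added = nn.Linear(added.size(-1), d_model, bias=False)
proj_spliced = nn.Linear(spliced.size(-1), d_model, bias=False)
n_added = proj_added.weight.numel()      # 512 * 512
n_spliced = proj_spliced.weight.numel()  # 1024 * 512
print(added.shape, spliced.shape, n_added, n_spliced)
```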


2. The classic solution of the original Transformer: sine/cosine absolute position encoding

The idea of absolute position encoding is straightforward: assign a fixed, unique vector to "Position 0", "Position 1", and so on up to "Position N". Rather than generating these vectors randomly, the designers of the original Transformer used a combination of sine and cosine functions with convenient mathematical properties.
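For reference, the encoding from the original Transformer paper can be written as follows, where pos is the position index, i indexes a pair of dimensions, and d_model is the embedding width:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

Even dimensions use the sine and odd dimensions the cosine of the same angle, and the wavelength grows geometrically from 2π to 10000 · 2π as i increases.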

2.1 Complete PyTorch implementation

Here is a version commonly used in practice: the position encoding is stored as a buffer, so it does not participate in backpropagation but is saved with the model; Dropout is also applied to reduce overfitting.

import torch
import torch.nn as nn
import math

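class SinCosPositionalEncoding(nn.Module):
    """Sine/cosine absolute positional encoding from the original Transformer paper."""
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        """
        Args:
            d_model: embedding dimension of each token (512 in the paper)
            max_len: maximum sequence length to precompute
            dropout: dropout probability applied after adding the encoding
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Position encoding matrix initialized to zeros, shape (max_len, d_model)
        position_enc = torch.zeros(max_len, d_model)

        # Position indices [0, 1, 2, ..., max_len-1], shape (max_len, 1)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # Frequency decay term: each dimension pair oscillates at a different frequency
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        # Even dimensions use sin, odd dimensions use cos, so every position gets a distinct vector
        position_enc[:, 0::2] = torch.sin(position * div_term)
        # Slicing div_term keeps the shapes aligned when d_model is odd
        position_enc[:, 1::2] = torch.cos(position * div_term[: d_model // 2])

        # Expand to (1, max_len, d_model) so it broadcasts over the batch dimension
        position_enc = position_enc.unsqueeze(0)

        # Register as a buffer: not trained, but saved with the model
        self.register_buffer("position_enc", position_enc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: input token embeddings, shape (batch_size, seq_len, d_model)
        Returns:
            The embeddings with the positional encoding added
        """
        # Take only the rows for the current sequence length (seq_len must not exceed max_len)
        x = x + self.position_enc[:, :x.size(1), :]
        return self.dropout(x)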
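To illustrate what registering a buffer means in practice, the sketch below uses a tiny stand-alone module (`BufferDemo` is a hypothetical name used only for this example, not part of the code above). It checks that a buffer is saved in `state_dict()` but is not returned by `named_parameters()`, so an optimizer never updates it.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):
    """Hypothetical module: one fixed table (buffer) and one trainable layer."""
    def __init__(self, d_model: int = 4, max_len: int = 6):
        super().__init__()
        self.register_buffer("position_enc", torch.zeros(1, max_len, d_model))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x + self.position_enc[:, : x.size(1), :])

demo = BufferDemo()
param_names = [name for name, _ in demo.named_parameters()]
state_keys = list(demo.state_dict().keys())

print(param_names)                      # trainable tensors only
print(state_keys)                       # trainable tensors plus the buffer
print(demo.position_enc.requires_grad)  # buffers do not track gradients
```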

2.2 Why choose sine and cosine?

You may ask, "Why not just use a set of randomly generated fixed vectors? That would be simpler." The answer is that sine/cosine functions have three key advantages that random vectors lack:

  1. Relative positions can be expressed linearly. For any fixed offset k, the encoding of position pos + k is a linear transformation of the encoding of position pos, so the model can easily learn relative information such as "position 3 is 3 steps after position 0." This matters for tasks such as translation and summarization, where the distance between words is important.

  2. It can generalize to sequences longer than those seen in training. Even if max_len is set to 5000 during training and a 6000-token sentence shows up at inference time, the same formula yields the encodings for the extra 1000 positions without retraining. (The module above precomputes only max_len rows, so its buffer would need to be rebuilt with a larger max_len first.)

  3. It is computationally cheap and adds zero parameters. The encoding is fixed throughout training, receives no gradients, takes no backpropagation resources, and is fast at inference.
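The first two points can be verified numerically. The sketch below is an illustration under stated assumptions: the helper names `sincos_encoding` and `shift_matrix` are made up for this example, and d_model is assumed to be even. It builds the block-diagonal rotation matrix for an offset of 3 and checks that the same matrix maps position 0 to 3 and position 5000 to 5003, with the positions beyond 5000 coming from the same formula.

```python
import math
import torch

def _div_term(d_model: int) -> torch.Tensor:
    """Per-pair frequencies, as in the module above (float64 for a tight check)."""
    return torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model)
    )

def sincos_encoding(positions: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sine/cosine encoding for arbitrary positions, shape (len(positions), d_model)."""
    angles = positions.to(torch.float64).unsqueeze(1) * _div_term(d_model)
    enc = torch.zeros(len(positions), d_model, dtype=torch.float64)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

def shift_matrix(k: int, d_model: int) -> torch.Tensor:
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos) for every pos."""
    m = torch.zeros(d_model, d_model, dtype=torch.float64)
    for i, w in enumerate(_div_term(d_model)):
        c, s = torch.cos(w * k), torch.sin(w * k)
        # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
        m[2 * i, 2 * i], m[2 * i, 2 * i + 1] = c, s
        # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        m[2 * i + 1, 2 * i], m[2 * i + 1, 2 * i + 1] = -s, c
    return m

d_model = 16
pe = sincos_encoding(torch.arange(0, 6000), d_model)  # rows past 5000 use the same formula
m3 = shift_matrix(3, d_model)

ok_near = torch.allclose(m3 @ pe[0], pe[3])
ok_far = torch.allclose(m3 @ pe[5000], pe[5003])
print(ok_near, ok_far)
```

The matrix depends only on the offset k, not on the absolute position, which is what makes the relative relationship linear.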


3. A more task-adaptive solution: learnable absolute position encoding

Sine/cosine encoding works well, but it is a hand-designed fixed value and may not suit every task perfectly. A more flexible approach therefore emerged: treat the position vectors as model parameters and train them end-to-end together with the word vectors.

3.1 Minimal PyTorch implementation

With nn.Embedding, this takes only a few lines of code:

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Absolute positional encoding learned end-to-end."""
    def __init__(self, d_model: int, max_len: int = 5000):
        """
        Args:
            d_model: embedding dimension
            max_len: maximum sequence length (inference inputs must not exceed it!)
        """
        super().__init__()
        # An Embedding table stores one learnable vector per position
        self.position_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: input token embeddings, shape (batch_size, seq_len, d_model)
        Returns:
            The embeddings with the positional encoding added
        """
        seq_len = x.size(1)

        # Position indices, shape (1, seq_len); broadcasting covers the batch dimension
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)

        # Look up the position vectors and add them
        position_vectors = self.position_emb(positions)
        return x + position_vectors

3.2 Applicable scenarios and limitations

On short-text tasks (such as text classification and named entity recognition), learnable positional encoding often performs as well as or better than sine/cosine encoding, because it can adapt the positional representation to the data.

But it has one flaw: the preset max_len is a hard ceiling. If a longer sequence appears at inference time, it can only be truncated (or the extra positions zero-filled); unlike sine/cosine encoding, it cannot be extrapolated by formula.
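The ceiling is easy to see with a bare `nn.Embedding`. The sketch below runs on CPU, where an out-of-range index raises `IndexError`; it also counts the parameters that the position table adds. The sizes are the defaults used earlier in this article.

```python
import torch
import torch.nn as nn

max_len, d_model = 5000, 512
position_emb = nn.Embedding(max_len, d_model)

# The table adds max_len * d_model trainable parameters.
num_params = sum(p.numel() for p in position_emb.parameters())
print(num_params)  # 2560000

# Positions 0 .. max_len-1 are fine; position max_len has no row in the table.
in_range = position_emb(torch.tensor([max_len - 1]))
try:
    position_emb(torch.tensor([max_len]))
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
print(out_of_range_failed)
```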


4. Standard configuration of modern large language models: RoPE (rotary position embedding)

The two absolute position encodings above, whether fixed or learnable, directly "add" the position vector to the word vector. When the sequence length is stretched to tens or even hundreds of thousands of tokens (such as 128K for LLaMA 3.1 or 200K for Claude 3.5 Sonnet), this approach tends to lose long-distance position information.

RoPE (Rotary Position Embedding), proposed by Su Jianlin's team in 2021, offers an elegant solution: instead of "adding" a position vector, it uses a rotation matrix to "rotate" Q (query) and K (key) inside the attention mechanism, so that relative position information enters the dot product naturally.

4.1 Core logic (plain-language version)

There is no need to wrestle with the full mathematical derivation; two points are enough:

  • Group the dimensions of Q and K into pairs (for example, a 512-dimensional vector is split into 256 pairs)
  • Rotate each pair by an angle proportional to the token's position, with each pair using its own frequency. The later the position, the larger the rotation angle

In this way, when we calculate the dot product of Q and K (the core step of computing attention weights), the result automatically depends on the position difference, and the signal decays only slowly as the distance grows, which makes it well suited to ultra-long contexts.
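The key property can be checked on a single dimension pair. The sketch below is a toy illustration: the vectors `q` and `k`, the frequency `theta`, and the helper `rotate_pair` are all made up for this example. Rotating a query at position m and a key at position n gives a dot product that depends only on the offset m - n.

```python
import math
import torch

def rotate_pair(vec: torch.Tensor, position: int, theta: float) -> torch.Tensor:
    """Rotate a 2-D vector by the angle position * theta."""
    angle = position * theta
    rot = torch.tensor(
        [[math.cos(angle), -math.sin(angle)],
         [math.sin(angle), math.cos(angle)]],
        dtype=torch.float64,
    )
    return rot @ vec

q = torch.tensor([1.0, 2.0], dtype=torch.float64)   # one pair taken from a query vector
k = torch.tensor([0.5, -1.0], dtype=torch.float64)  # the matching pair from a key vector
theta = 0.1                                          # illustrative frequency for this pair

# Same offset (m - n = 3) at very different absolute positions.
score_near = rotate_pair(q, 5, theta) @ rotate_pair(k, 2, theta)
score_far = rotate_pair(q, 1005, theta) @ rotate_pair(k, 1002, theta)
# A different offset (m - n = 1) gives a different score.
score_other = rotate_pair(q, 5, theta) @ rotate_pair(k, 4, theta)

print(torch.allclose(score_near, score_far))
print(torch.allclose(score_near, score_other))
```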

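4.2 Minimal PyTorch implementation (rotation core)

from typing import Optional

import torch
import torch.nn as nn

class RotaryPositionalEmbedding(nn.Module):
    """RoPE rotary position embedding, standard in modern LLMs (simplified core)."""
    def __init__(self, head_dim: int, base: int = 10000):
        """
        Args:
            head_dim: vector dimension of each attention head (must be even)
            base: frequency decay base (10000, as in the original Transformer)
        """
        super().__init__()
        # Precompute the inverse frequencies; as a buffer they move with the module across devices
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, x: torch.Tensor, seq_len: Optional[int] = None) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            x: Q/K tensor, shape (batch_size, num_heads, seq_len, head_dim)
            seq_len: current sequence length (inferred from x if omitted)
        Returns:
            cos_emb, sin_emb: cosine and sine tables used for the rotation,
            each of shape (seq_len, head_dim)
        """
        if seq_len is None:
            seq_len = x.size(2)

        # Position indices 0 .. seq_len-1
        t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)

        # Rotation angle for every (position, dimension pair), shape (seq_len, head_dim / 2)
        freqs = torch.outer(t, self.inv_freq)

        # Duplicate to shape (seq_len, head_dim), which broadcasts against Q/K
        cos_emb = torch.cat([freqs.cos(), freqs.cos()], dim=-1)
        sin_emb = torch.cat([freqs.sin(), freqs.sin()], dim=-1)

        return cos_emb, sin_emb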

💡 In actual use, the returned cos_emb and sin_emb are used to rotate the Q and K vectors before attention is computed. Since this part belongs to the internal implementation of the Attention layer, it is not covered in detail here. Interested readers can refer to the original paper listed at the end of the article.
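Although the full Attention-layer integration is out of scope, a minimal sketch of the rotation step itself may help. It assumes the "rotate half" convention, which matches the duplicated `[freqs, freqs]` layout above by pairing dimension i with dimension i + head_dim / 2. The helper names `rope_cos_sin`, `rotate_half`, and `apply_rope` are illustrative and do not come from the original article.

```python
import torch

def rope_cos_sin(seq_len: int, head_dim: int, base: float = 10000.0):
    """Same angles as the module above: cos and sin tables of shape (seq_len, head_dim)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos(), emb.sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map (x1, x2) to (-x2, x1), where x1 and x2 are the two halves of the last dim."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate x of shape (batch, heads, seq_len, head_dim); cos/sin broadcast over batch and heads."""
    return x * cos + rotate_half(x) * sin

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 1, 2, 16, 8
cos, sin = rope_cos_sin(seq_len, head_dim)

# Use the same query/key content at every position so that only the position varies.
q = torch.randn(batch, heads, 1, head_dim).expand(-1, -1, seq_len, -1)
k = torch.randn(batch, heads, 1, head_dim).expand(-1, -1, seq_len, -1)

q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
scores = q_rot @ k_rot.transpose(-1, -2)  # shape (batch, heads, seq_len, seq_len)

# Entries with the same offset (query position minus key position) match.
same_offset = torch.allclose(scores[..., 5, 2], scores[..., 13, 10], atol=1e-4)
print(same_offset)
```

Because the rotation changes only the direction of each pair and not its length, the norms of Q and K are unchanged; only their relative angle carries the position information.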


5. Comparison of three mainstream position encodings

| Feature | Sine/cosine absolute position encoding | Learnable absolute position encoding | RoPE (rotary position embedding) |
| --- | --- | --- | --- |
| Long-context support | Limited, but the formula extends to longer sequences | Cannot exceed the predefined maximum length | Excellent; the usual choice for models with 100K+ token contexts |
| Relative position learning ability | Strong (can be learned through linear relationships) | Weak (highly dependent on the coverage of the training data) | Very strong (relative position enters the dot product directly) |
| Computational efficiency | High (no extra parameters, no gradients) | Medium (adds max_len × d_model learnable parameters) | High (no extra parameters, only rotation operations) |
| Applicable scenarios | General pre-training, tasks that need generalization to long texts | Short-text tasks (classification, NER, etc.) | Almost all modern open-source and commercial large language models |
| Representative models | Original Transformer | BERT, GPT-1 | LLaMA series, ChatGLM series, Qwen series |

💡 One-sentence summary: a Transformer without positional encoding is equivalent to an advanced bag-of-words model that cannot even distinguish "you hit me" from "I hit you"; as of 2026, if you are working on long texts and large models, go straight to RoPE.


🔗 Extended reading