Detailed explanation of Positional Encoding - the core technology of injecting positional information into the attention mechanism | Daoman PythonAI
# Positional Encoding: Injecting a "sense of position" into an unordered matrix
1. Why do we need to add a "position signal"?
1.1 Self-Attention is inherently order-blind
The most striking property of Self-Attention in the Transformer is that it processes the entire sequence in parallel in one pass. No matter how long a sentence is, all words compute their attention weights with one another at the same time, which removes the word-by-word serial bottleneck of an RNN.
But this efficiency has a cost: the attention computation itself knows nothing about word order. It only cares about whether this word and that word are semantically similar. Which word is the subject, which is the object, which opens the sentence and which closes it are questions it never asks.
The most intuitive example is the pair of sentences "I hit you" and "You hit me" (treating "I" and "me" as the same token, they contain the same three words in a different order).
If these two sentences are fed into Self-Attention without position information, each word receives exactly the same attention weights and output vector in both sentences; only the order of the rows differs. The model cannot tell whether it is "I hit you" or "You hit me." At this point the Transformer is essentially just an "advanced bag-of-words model" and cannot handle tasks such as translation, summarization, and dialogue, where word order matters.
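The order-blindness described above can be checked with a small sketch. It assumes a deliberately simplified attention in which Q = K = V = X (no learned projections) and uses made-up 3-dimensional embeddings; the names `self_attention` and `emb` are illustrative only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Simplified self-attention with Q = K = V = X and no position signal."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical embeddings, chosen only for illustration.
emb = {"I": [1.0, 0.0, 0.5], "hit": [0.0, 1.0, 0.2], "you": [0.3, 0.4, 1.0]}

s1 = ["I", "hit", "you"]
s2 = ["you", "hit", "I"]  # same tokens, reversed order
out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))

# Each word ends up with the same output vector in both sentences.
for w in s1:
    same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(out1[w], out2[w]))
    print(w, same)
```

Reordering the input only reorders the output rows, so nothing in the result records which word came first.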
1.2 The most direct solution: label the word with a "position tag"
Since Self-Attention inherently has no sense of order, we simply manually generate a unique "position vector" for each position, and then combine it with the semantic vector of the word itself. This way:
- Word vector is responsible for remembering "what does this word mean"
- Position vector is responsible for remembering "which position of the sentence is this word in?"
- After the two are added, a final representation with both semantic and positional information is formed.
The original Transformer and most subsequent models add the two vectors directly rather than concatenating them, because concatenation enlarges the vector dimension (doubling it when the position vector has the same size as the word vector) and so increases the computation in every later layer.
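As a tiny illustration of that trade-off, the sketch below compares the two ways of combining a word vector and a position vector. The values and the names `word_vec` and `pos_vec` are made up for the example.

```python
# Hypothetical word embedding and position vector, both of size 4.
word_vec = [0.2, -0.1, 0.7, 0.4]
pos_vec = [0.0, 1.0, 0.0, 1.0]

added = [w + p for w, p in zip(word_vec, pos_vec)]  # element-wise addition
concatenated = word_vec + pos_vec                   # list concatenation

print(len(added))         # 4: the dimension is unchanged
print(len(concatenated))  # 8: the dimension has doubled
```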
2. The classic solution of the original Transformer: sine/cosine absolute position encoding
The idea of absolute position encoding is straightforward: assign a fixed, unique vector to "Position 0", "Position 1", and so on up to "Position N". Rather than generating these vectors randomly, the designers of the original Transformer built them from sine and cosine functions at different frequencies, a combination with convenient mathematical properties.
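For reference, the encoding defined in the original Transformer paper can be written as follows. The symbols are not introduced in this article: pos is the position index, i indexes the dimension pairs, and d_model is the embedding dimension.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Even dimensions use the sine and odd dimensions the cosine, with the wavelength growing geometrically from one pair to the next.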
2.1 Complete PyTorch implementation
We go straight to a version tuned for practical use: the position encoding is stored as a buffer, so it does not take part in backpropagation but is still saved with the model; Dropout is also applied to reduce overfitting.
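The original listing is not reproduced here, so the following is only a minimal sketch of a module matching that description, under some assumptions: the class name, the `(batch, seq_len, d_model)` input layout, and the default values are illustrative choices, not the article's own code.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Sketch: fixed sine/cosine position encoding kept in a buffer."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        if d_model % 2 != 0:
            raise ValueError("this sketch assumes an even d_model")
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model), computed in log space for stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        # A buffer is saved in state_dict but is not a trainable parameter.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the first seq_len encodings.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```

A possible usage: `SinusoidalPositionalEncoding(d_model=512)(torch.zeros(2, 10, 512))` returns a tensor of shape `(2, 10, 512)`.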
2.2 Why choose sine and cosine?
You may ask: "Why not just use a set of randomly generated fixed vectors? It would be less trouble." The answer is that sine/cosine functions have three key advantages that random vectors lack:
- Relative positions can be expressed linearly: for any fixed offset k, the encoding of position pos + k is a linear function of the encoding of position pos. This lets the model easily learn relative information such as "position 3 is 3 steps after position 0", which is crucial for tasks such as translation and summarization that depend on the distance between words.
- It generalizes to longer, unseen sequences: even if `max_len` is set to 5000 during training, a 6000-word sentence at inference time is no cause for panic. The same formula yields the encodings of the extra 1000 positions without retraining.
- It is computationally efficient and adds zero parameters: the position encoding is fixed throughout training, produces no gradients, consumes no backpropagation compute, and is fast at inference.
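The first two properties can be checked numerically. The sketch below is plain Python and assumes the standard sine/cosine formula with base 10000; `sinusoidal_pe` and `shift` are illustrative helper names. `shift` applies a map that depends only on the offset k, never on the position itself.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Encoding of one position: [sin(w0*pos), cos(w0*pos), sin(w1*pos), ...]."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))
        pe.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return pe

def shift(pe, k, d_model, base=10000.0):
    """Linear map (a rotation per pair) that sends PE(pos) to PE(pos + k)."""
    out = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        sk, ck = math.sin(k * freq), math.cos(k * freq)
        # sin(a + b) = sin a cos b + cos a sin b; cos(a + b) = cos a cos b - sin a sin b
        out.extend([s * ck + c * sk, c * ck - s * sk])
    return out

d_model = 8
shifted = shift(sinusoidal_pe(10, d_model), 3, d_model)
direct = sinusoidal_pe(13, d_model)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, direct)))

# The formula is defined for any position, including ones beyond a
# training-time max_len of 5000.
far = sinusoidal_pe(5999, d_model)
print(len(far), all(-1.0 <= v <= 1.0 for v in far))
```

Because the shift is the same linear map at every position, a model can in principle learn to attend by relative offset.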
3. A more task-adaptive alternative: learnable absolute position encoding
Sine/cosine encoding works well, but it is a hand-designed fixed value and may not suit every task perfectly. This motivated a more flexible approach: treat the position vectors as model parameters and train them end-to-end together with the word vectors.
3.1 Minimalist PyTorch implementation
Using `nn.Embedding`, this takes only a few lines of code.
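The original listing is not reproduced here; what follows is a minimal sketch of the idea, assuming a `(batch, seq_len, d_model)` input. The class name, the default `max_len` of 512, and the explicit length check are illustrative choices.

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Sketch: one trainable vector per position, looked up by index."""

    def __init__(self, d_model: int, max_len: int = 512, dropout: float = 0.1):
        super().__init__()
        self.max_len = max_len
        self.pos_embedding = nn.Embedding(max_len, d_model)  # trainable table
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        if seq_len > self.max_len:
            # No vector was ever trained for positions >= max_len.
            raise ValueError(f"sequence length {seq_len} exceeds max_len {self.max_len}")
        positions = torch.arange(seq_len, device=x.device)  # (seq_len,)
        # (seq_len, d_model) broadcasts over the batch dimension.
        return self.dropout(x + self.pos_embedding(positions))
```

Unlike the sine/cosine version, the table here is an `nn.Parameter` inside `nn.Embedding`, so it receives gradients and adds `max_len * d_model` parameters to the model.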
3.2 Applicable scenarios and limitations
On short-text tasks (such as text classification and named entity recognition), learnable positional encoding often performs at least as well as sine/cosine encoding, because it can adapt the positional representation to the characteristics of the data.
But it has a flaw: the preset `max_len` is a hard ceiling. If a longer sequence appears at inference time, it can only be truncated or have its extra positions filled with zeros; it cannot be extrapolated freely the way sine/cosine encoding can.
4. The standard choice in modern large language models: RoPE (Rotary Position Embedding)
Both absolute position encodings above, whether fixed or learnable, directly "add" the position vector to the word vector. When the sequence length is stretched to tens or even hundreds of thousands of tokens (such as 128K for LLaMA 3.1 or 200K for Claude 3.5 Sonnet), this additive scheme tends to lose positional information over long distances.
RoPE (Rotary Position Embedding), proposed by Su Jianlin's team in 2021, offers an elegant solution: instead of "adding" a position vector, it uses a rotation matrix to "rotate" Q (query) and K (key) inside the attention mechanism, so that relative position information enters the dot product naturally.
4.1 Core logic (plain-language version)
There is no need to wrestle with the full mathematical derivation; two points are enough:
- Group the dimensions of each Q and K vector into pairs (for example, a 512-dimensional vector is split into 256 pairs)
- Rotate each pair by its own angle, which depends on the pair's frequency and on the token's position. The later the position, the larger the rotation angle
As a result, when we compute the dot product of Q and K (the core step of attention weighting), the position difference is included automatically, and the signal decays only very slowly as the distance grows, which makes RoPE especially suitable for ultra-long contexts.
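The relative-position property can be verified with a small plain-Python sketch. It assumes adjacent dimensions are paired and the usual base of 10000; `rotate_pairs` and the vectors `q` and `k` are made up for illustration.

```python
import math

def rotate_pairs(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))  # each pair has its own frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dimensional query and key.
q = [0.5, -1.0, 0.3, 0.8]
k = [1.2, 0.4, -0.7, 0.1]

score_a = dot(rotate_pairs(q, 5), rotate_pairs(k, 2))      # positions 5 and 2
score_b = dot(rotate_pairs(q, 105), rotate_pairs(k, 102))  # positions 105 and 102
score_c = dot(rotate_pairs(q, 5), rotate_pairs(k, 4))      # positions 5 and 4

print(math.isclose(score_a, score_b, abs_tol=1e-9))  # same offset, same score
print(math.isclose(score_a, score_c, abs_tol=1e-6))  # different offset, different score
```

The absolute positions cancel out in the dot product: only the offset between the query and the key affects the score.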
4.2 Minimalist PyTorch implementation (rotating core)
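The original listing is not reproduced here. As a stand-in, the following is a minimal sketch of a module that precomputes the rotation angles and returns their cosines and sines. The class name `RotaryEmbedding`, the `head_dim` and `base` parameters, and the `(seq_len, head_dim / 2)` output shape are assumptions; real implementations differ in layout details.

```python
import torch
import torch.nn as nn

class RotaryEmbedding(nn.Module):
    """Sketch: build the cos/sin tables used to rotate Q and K."""

    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        if head_dim % 2 != 0:
            raise ValueError("head_dim must be even so dimensions can be paired")
        # One frequency per pair of dimensions: base^(-2i / head_dim)
        inv_freq = 1.0 / (
            base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
        )
        self.register_buffer("inv_freq", inv_freq)  # fixed, not trainable

    def forward(self, seq_len: int):
        positions = torch.arange(
            seq_len, dtype=torch.float32, device=self.inv_freq.device
        )
        # angles[p, i] = p * inv_freq[i]: the angle grows with the position
        angles = torch.outer(positions, self.inv_freq)  # (seq_len, head_dim / 2)
        cos_emb = angles.cos()
        sin_emb = angles.sin()
        return cos_emb, sin_emb
```

Because the tables are computed from `seq_len` on each call, this sketch has no fixed maximum length.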
💡 In actual use, the returned `cos_emb` and `sin_emb` are used to rotate the Q and K vectors before attention is computed. Since this belongs to the internal implementation details of the Attention layer, it is not expanded here. Interested readers can refer to the original paper at the end of the article.
5. Comparison of three mainstream position encodings
💡 One-sentence summary: a Transformer without positional encoding is equivalent to an advanced bag-of-words model that cannot even distinguish "you hit me" from "I hit you"; as of 2026, if you are building long-context large models, RoPE is the default choice.
🔗 Extended reading

