Self-Attention calculation: the mathematical essence of the Q, K, V matrices
📂 Stage: Stage 3, Transformer Revolution (Core) 🔗 Prerequisites: Introduction to standard Seq2Seq attention · Word embeddings and positional encoding basics 🧱 Next modules: Multi-Head Attention · Transformer encoder-decoder
1. Getting started: What is the difference between Self-Attention and standard Seq2Seq attention?
The core idea of the attention mechanism fits in one phrase: "pick out the key points". Whether reading a book, looking at a picture, or processing a sequence, we want the model to focus its "eyes" on the most important places. But although both "pick out the key points", Self-Attention and the attention in classic Seq2Seq select from completely different ranges.
1.1 Two kinds of attention at a glance
One-sentence summary: **Standard attention is "the decoder looks at the encoder"; Self-Attention is "the sequence looks at itself".**
🧠 A quick thought: Because Self-Attention works entirely within one sequence and has no step-to-step recurrence, a Transformer encoder can compute the updated representations of all words in parallel, whereas an RNN must proceed one step at a time. This is the key to the Transformer's performance revolution.
2. Core: Physical meaning and complete calculation of Q, K, V matrices
The core tools of Self-Attention are three learnable projection matrices: W_q, W_k, and W_v. They act like three different pairs of "glasses" that let the same word vector play three different roles.
2.1 A "human" metaphor for Q/K/V
Suppose you have a pile of sticky notes to organize. Each note represents a word (a vector x), and your task is to write a richer, more context-aware version of each note.
- 🕵️ Query (Q): the note says, "What kind of information am I looking for right now?"
- 🏷️ Key (K): the note says, "What core labels describe me?"
- 📦 Value (V): the note says, "If someone picks me, what content can I share with them?"
The whole Self-Attention operation works like an "everyone-meets-everyone matching session":
- Write Q, K, V on every note (by projecting with W_q, W_k, W_v).
- For note A, take its Q and do "similarity matching" (a dot product) with the K of every note to get matching scores.
- Smooth the scores and convert them into weights between 0 and 1 that add up to 1.
- Use these weights to take a weighted sum of the V of all notes, giving a new representation of note A.
In this way, every word is rewritten with the "collective wisdom" of the entire sequence.
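The matching steps above can be written compactly. The following is a sketch in standard notation. The symbols X (the stacked word vectors), n (sequence length), d_k (dimension of each key vector), and the 1/sqrt(d_k) scaling factor are assumptions borrowed from the original Transformer paper; the metaphor above does not mention them.

```latex
\begin{aligned}
Q &= X W_q, \qquad K = X W_k, \qquad V = X W_v
  && \text{(project every note)} \\
S &= \frac{Q K^{\top}}{\sqrt{d_k}} \in \mathbb{R}^{n \times n}
  && \text{(match every Q against every K)} \\
A_{ij} &= \frac{\exp(S_{ij})}{\sum_{j'=1}^{n} \exp(S_{ij'})}
  && \text{(row-wise softmax: each row sums to 1)} \\
Z &= A V
  && \text{(weighted sum of the V vectors)}
\end{aligned}
```

Row i of Z is the new representation of word i; row i of A shows how much word i draws from each word in the sequence.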
2.2 Single-head Self-Attention in pure PyTorch
The code below implements single-head Self-Attention in full and annotates the tensor shape at every key step. For understanding the Transformer, shapes are the lifeline.
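A minimal sketch of such an implementation is given here, under a few assumptions that are not stated in the text: the class name SingleHeadSelfAttention, the embed_dim and head_dim parameters, the bias-free projections, and the 1/sqrt(head_dim) scaling (taken from the original Transformer paper) are all illustrative choices.

```python
import math

import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, embed_dim: int, head_dim: int):
        super().__init__()
        # Three learnable projections: the W_q, W_k, W_v "glasses".
        self.W_q = nn.Linear(embed_dim, head_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, head_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, head_dim, bias=False)
        self.head_dim = head_dim

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, embed_dim)
        Q = self.W_q(x)  # (batch, seq_len, head_dim)
        K = self.W_k(x)  # (batch, seq_len, head_dim)
        V = self.W_v(x)  # (batch, seq_len, head_dim)

        # (batch, seq_len, head_dim) @ (batch, head_dim, seq_len)
        #   -> (batch, seq_len, seq_len): the "relationship matrix"
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)

        # Normalize each row so the weights over all words sum to 1.
        weights = torch.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)

        # (batch, seq_len, seq_len) @ (batch, seq_len, head_dim)
        #   -> (batch, seq_len, head_dim)
        out = weights @ V
        return out, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # batch of 2 sequences, 5 words, 16-dim embeddings
attn = SingleHeadSelfAttention(embed_dim=16, head_dim=8)
out, weights = attn(x)
print(out.shape)      # torch.Size([2, 5, 8])
print(weights.shape)  # torch.Size([2, 5, 5])
```

Returning the weights alongside the output is a convenience for inspection; a production layer would usually also handle masking and dropout, which this sketch leaves out.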
💡 **Lost in the dimensions? No problem, just remember one rule:**
- The dot product of Q and K produces a "relationship matrix" of shape (seq_len × seq_len); each row holds the attention scores of one word toward all words.
- After normalization the scores become weights, which are multiplied by V to obtain a new representation that incorporates global information.
3. Advanced: Multi-Head Attention
Single-head Self-Attention can already solve many problems, but it learns only one "attention pattern" at a time. It is like looking at the world with one eye closed: you can still see what is there, but you cannot judge depth well.
3.1 Why multiple heads?
Multi-head attention is like wearing several different pairs of glasses at the same time, each focusing on a different linguistic feature. The roles below are illustrative; heads acquire their specialization during training rather than being assigned one:
- 🧐 Head 1: captures grammatical relationships (subject-predicate pairing)
- 🧐 Head 2: captures semantic relationships (cat-meow)
- 🧐 Head 3: captures referential relationships (it → animal)
- 🧐 Head 4: captures long-range dependencies (because...so...)
Each attention head has its own independent set of projection matrices (W_q, W_k, W_v), so each can learn completely different matching rules during training. Finally, the outputs of all heads are concatenated and passed through a linear transformation to obtain a semantically richer word representation.
3.2 Multi-head attention in pure PyTorch
Below is a multi-head attention module that can be used directly. For generality, it accepts Q, K, V from different inputs (Encoder-Decoder Attention) as well as from a single shared input (Self-Attention).
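One way such a module could look is sketched here. Several details are assumptions rather than anything the text specifies: the constructor arguments embed_dim and num_heads, the output projection name W_o, the helper _split_heads, and the mask convention (entries equal to 0 are blocked). The per-head projection matrices are implemented as one large linear layer whose output is split into heads, which is mathematically equivalent to keeping a separate matrix per head.

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention (illustrative sketch)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Each Linear holds the stacked per-head projection matrices.
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)  # mixes the concatenated heads

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        Q = self._split_heads(self.W_q(query))  # (b, h, q_len, head_dim)
        K = self._split_heads(self.W_k(key))    # (b, h, k_len, head_dim)
        V = self._split_heads(self.W_v(value))  # (b, h, k_len, head_dim)

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)  # (b, h, q_len, k_len)
        if mask is not None:
            # Assumed convention: mask broadcasts to scores, 0 means "do not attend".
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)

        ctx = weights @ V  # (b, h, q_len, head_dim)
        b, _, q_len, _ = ctx.shape
        # Concatenate heads: (b, h, q_len, head_dim) -> (b, q_len, embed_dim)
        ctx = ctx.transpose(1, 2).contiguous().view(b, q_len, -1)
        return self.W_o(ctx), weights


torch.manual_seed(0)
mha = MultiHeadAttention(embed_dim=32, num_heads=4)
x = torch.randn(2, 6, 32)       # e.g. decoder-side sequence
memory = torch.randn(2, 9, 32)  # e.g. encoder output

self_out, self_w = mha(x, x, x)              # Self-Attention
cross_out, cross_w = mha(x, memory, memory)  # Encoder-Decoder Attention
print(self_out.shape, self_w.shape)    # torch.Size([2, 6, 32]) torch.Size([2, 4, 6, 6])
print(cross_out.shape, cross_w.shape)  # torch.Size([2, 6, 32]) torch.Size([2, 4, 6, 9])
```

Note that a mask which blocks every key for some query would leave that row with no finite score and produce NaN after the softmax; the sketch does not guard against this.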
✅ Usage suggestions:
- When building a Transformer layer, MultiHeadAttention is the core building block.
- PyTorch also ships an official torch.nn.MultiheadAttention, but its default input dimension order is (seq_len, batch, embed_dim), which differs from the (batch, seq_len, embed_dim) order we are used to. Remember to pass batch_first=True when using it.
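As a brief illustration of the second suggestion, the built-in layer can be called with batch-first tensors as follows. The tensor sizes are arbitrary example values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# batch_first=True makes the layer accept (batch, seq_len, embed_dim).
builtin_mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

x = torch.randn(2, 6, 32)  # (batch, seq_len, embed_dim)
out, attn_weights = builtin_mha(x, x, x)  # query, key, value all from x: Self-Attention

print(out.shape)           # torch.Size([2, 6, 32])
print(attn_weights.shape)  # torch.Size([2, 6, 6]), averaged over heads by default
```

Without batch_first=True the same call would interpret x as 2 time steps of a batch of 6, which runs without error but silently computes attention over the wrong axis.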
4. Summary: one flow + two major advantages
4.1 The minimal single-head Self-Attention process
Every pass through this process (project to Q/K/V, score, normalize, take the weighted sum) lets each word in the sequence "communicate" with all the words.
4.2 Why is Self-Attention so strong?
- **Long-range dependencies in one step.** For any two words in the sequence, no matter how far apart, the interaction path length is 1. By comparison, an RNN needs O(n) steps, and a CNN needs O(log n) stacked layers with dilated convolutions (O(n/k) with ordinary kernels of width k). Self-Attention can directly capture a dependency such as "subject at the beginning, predicate at the end".
- **Fully parallel, GPU friendly.** The Q/K/V projections, similarity computation, and weight normalization for all words can be expressed as a few large tensor operations and run in parallel on the GPU. This is key to why the Transformer trains quickly and scales easily to large models.
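To make the parallelism point concrete, the sketch below computes the same attention output twice: once word by word in a Python loop, and once as a single matrix expression. The Q, K, V tensors are random stand-ins for already-projected vectors, and the sizes and the 1/sqrt(d) scaling are assumptions chosen for illustration.

```python
import math

import torch

torch.manual_seed(0)
seq_len, d = 6, 8
Q = torch.randn(seq_len, d)
K = torch.randn(seq_len, d)
V = torch.randn(seq_len, d)

# Sequential view: update one word at a time.
rows = []
for i in range(seq_len):
    scores_i = K @ Q[i] / math.sqrt(d)          # (seq_len,) scores of word i vs. all words
    weights_i = torch.softmax(scores_i, dim=0)  # weights for word i, summing to 1
    rows.append(weights_i @ V)                  # (d,) new representation of word i
looped = torch.stack(rows)                      # (seq_len, d)

# Parallel view: all words at once, as two matrix multiplications.
weights = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)  # (seq_len, seq_len)
batched = weights @ V                                     # (seq_len, d)

print(torch.allclose(looped, batched, atol=1e-5))  # True
```

No iteration of the loop depends on the result of another, which is exactly why the whole computation collapses into matrix products that a GPU can execute in parallel. An RNN's step t, by contrast, needs the hidden state from step t-1.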
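🔗 Recommended further reading
- The Illustrated Transformer (an illustrated walkthrough of the Transformer, a must-read for beginners)
- PyTorch official MultiheadAttention documentation (note the default dimension order)
- Attention Is All You Need (the original Transformer paper)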
📘 After mastering Q/K/V and multi-head attention, the next step is to build a complete Transformer encoder, so stay tuned!

