Sequence-to-sequence model (Seq2Seq): Encoder-Decoder architecture
📂 Stage: Stage 2 - Deep Learning and Sequence Models (Advanced) 🔗 Related chapters: Long Short-Term Memory networks (LSTM/GRU) · Attention mechanism
If you have used translation software, chatted with an AI, or seen images automatically turned into text descriptions, you have most likely encountered a Sequence-to-Sequence (Seq2Seq) model. It is a classic deep learning architecture for tasks that map variable-length input to variable-length output, and it is one of the ancestors of the Transformer that later swept the AI community. Working through it carefully will help you understand why the Attention mechanism had to emerge.
1. What is Seq2Seq?
1.1 One-sentence definition + mainstream scenarios
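The core logic of Seq2Seq can be summarized in one sentence: read an input sequence, then generate an output sequence whose length need not match the input's.
This framework covers almost all "sequence conversion" tasks. Familiar examples include machine translation, conversational AI, and image captioning.
Simply put, Seq2Seq comes in handy whenever a task's input and output are sequences whose lengths are not fixed.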
1.2 Disassembly of the classic Encoder-Decoder architecture
Although Seq2Seq is applied in many different scenarios, the classic core structure is always a pair of recurrent neural networks (RNN/LSTM/GRU), an Encoder and a Decoder:
We use Chinese→English translation ("我爱NLP" → "I love NLP") as a concrete example:
- Encoder: reads the input Chinese token sequence ("我", "爱", "NLP") one token at a time. Early systems commonly used a bidirectional LSTM so that each token's representation sees both its left and right context. After reading the entire sentence, the model's final hidden state (and cell state, if using an LSTM) is packed into a fixed-length context vector. You can think of this context vector as a "summary card" into which the model compresses the semantics of the entire input sequence.
- Decoder: takes this "summary card" as its initial state and, starting from a special <START> token, generates the English translation word by word. Each newly generated word becomes the input for the next step, until the <END> token is produced.
This "compress, then generate" pipeline is the classic Seq2Seq process.
2. PyTorch minimalist Seq2Seq implementation
Enough theory; time to practice. We use PyTorch to write a small model with a bidirectional-LSTM Encoder and a simple unidirectional-LSTM Decoder, and attach the two most commonly used decoding methods.
2.1 Complete basic model code
When training the Decoder, the model feeds the ground-truth target word as the next step's input with a certain probability (for example, 50%), instead of the word the model itself predicted. It is like a teacher guiding students through writing a sentence, supplying the correct answer at each step so the model converges faster. This method is called Teacher Forcing.
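A minimal sketch of such a model with Teacher Forcing, under stated assumptions: the class names (`Encoder`, `Decoder`, `Seq2Seq`), the projection layers `fc_h`/`fc_c`, and all sizes are illustrative choices, not a reference implementation. It assumes the first column of `tgt` holds the <START> token id.

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        # Assumption: project the concatenated forward/backward final states
        # down to the size expected by the unidirectional decoder.
        self.fc_h = nn.Linear(hid_dim * 2, hid_dim)
        self.fc_c = nn.Linear(hid_dim * 2, hid_dim)

    def forward(self, src):                          # src: (batch, src_len)
        _, (h, c) = self.lstm(self.embedding(src))   # h, c: (2, batch, hid_dim)
        h = torch.tanh(self.fc_h(torch.cat([h[0], h[1]], dim=1)))
        c = torch.tanh(self.fc_c(torch.cat([c[0], c[1]], dim=1)))
        return h.unsqueeze(0), c.unsqueeze(0)        # the fixed-length context

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, state):                 # token: (batch,)
        emb = self.embedding(token).unsqueeze(1)     # (batch, 1, emb_dim)
        output, state = self.lstm(emb, state)        # (batch, 1, hid_dim)
        return self.out(output.squeeze(1)), state    # logits: (batch, vocab_size)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch, tgt_len = tgt.shape
        vocab_size = self.decoder.out.out_features
        logits = torch.zeros(batch, tgt_len, vocab_size, device=src.device)
        state = self.encoder(src)                    # context as initial state
        token = tgt[:, 0]                            # <START> ids
        for t in range(1, tgt_len):
            step_logits, state = self.decoder(token, state)
            logits[:, t] = step_logits
            # Teacher Forcing: sometimes feed the ground truth, sometimes
            # the model's own prediction.
            use_teacher = random.random() < teacher_forcing_ratio
            token = tgt[:, t] if use_teacher else step_logits.argmax(dim=1)
        return logits

# Usage with random token ids (source vocab 20, target vocab 30).
model = Seq2Seq(Encoder(20, 8, 16), Decoder(30, 8, 16))
src = torch.randint(0, 20, (4, 7))
tgt = torch.randint(0, 30, (4, 5))
logits = model(src, tgt)
print(logits.shape)  # torch.Size([4, 5, 30])
# Position 0 is <START>, so the loss is computed from position 1 onward.
loss = nn.CrossEntropyLoss()(logits[:, 1:].reshape(-1, 30), tgt[:, 1:].reshape(-1))
```

Setting `teacher_forcing_ratio=0.0` makes the decoder consume only its own predictions, which is also how it has to run at inference time, when no target sequence is available.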
2.2 Two core decoding strategies
After the model is trained, how do we generate a reasonable target sequence from its output probability distribution? The two most commonly used methods are described below.
① Greedy Decode
The most direct strategy: at each step, select only the word with the highest probability, until <END> is produced or the maximum length is reached.
Advantages: fast and simple to implement. Disadvantages: because it only looks at the immediate gain, it can get stuck in a local optimum and miss a word combination that would be better overall.
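Greedy decoding can be sketched independently of any particular network. The example below is a minimal illustration, not tied to a real model: `toy_step` and its probability table `TOY_TABLE` are made-up stand-ins for a trained Decoder that returns next-token log-probabilities for a given prefix.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.4, "dog": 0.3, "<END>": 0.3},
    ("a",): {"cat": 0.9, "<END>": 0.1},
}

def toy_step(prefix):
    """Stand-in for a decoder step: returns {token: log-probability}."""
    probs = TOY_TABLE.get(tuple(prefix), {"<END>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def greedy_decode(step_fn, max_len=10, end_token="<END>"):
    prefix, score = [], 0.0
    for _ in range(max_len):
        log_probs = step_fn(prefix)
        token = max(log_probs, key=log_probs.get)   # best token at this step only
        score += log_probs[token]
        if token == end_token:
            break
        prefix.append(token)
    return prefix, score

tokens, score = greedy_decode(toy_step)
print(tokens, round(math.exp(score), 2))  # ['the', 'cat'] 0.24
```

In this toy table the sequence "a cat" has total probability 0.4 × 0.9 = 0.36, higher than the 0.24 that greedy decoding reaches. Greedy never finds it because "the" looked better at the first step, which is exactly the local-optimum problem.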
② Beam Search decoding
To alleviate the short-sightedness of greedy decoding, Beam Search keeps the top-k best candidate sequences so far at every step; k is called the "beam size".
The essence of Beam Search: keep several paths alive at each step, and at the end choose the one with the highest overall score. It is slower than greedy decoding, but its output is usually more fluent and reasonable.
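A minimal sketch of this idea, under the same kind of assumptions as any toy example: `toy_step` and `TOY_TABLE` are hypothetical stand-ins for a trained Decoder, and scores are plain summed log-probabilities (practical systems often add length normalization, which is omitted here).

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.4, "dog": 0.3, "<END>": 0.3},
    ("a",): {"cat": 0.9, "<END>": 0.1},
}

def toy_step(prefix):
    """Stand-in for a decoder step: returns {token: log-probability}."""
    probs = TOY_TABLE.get(tuple(prefix), {"<END>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(step_fn, beam_size=2, max_len=10, end_token="<END>"):
    beams = [([], 0.0)]          # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + lp)
            for tokens, score in beams
            for tok, lp in step_fn(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:   # keep only the top k
            if tokens[-1] == end_token:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)       # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

tokens, score = beam_search(toy_step, beam_size=2)
print(tokens, round(math.exp(score), 2))       # ['a', 'cat'] 0.36
print(beam_search(toy_step, beam_size=1)[0])   # ['the', 'cat']
```

With a beam of 2, the hypothesis starting with "a" survives the first step and ends up winning with probability 0.36. With a beam of 1, the search follows the single best token at every step and returns "the cat", the same path greedy decoding takes.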
:::tip How to choose beam size?
- beam size = 1: degenerates into greedy decoding.
- beam size too large: computation grows significantly, and the output may become repetitive or bland.
- Machine translation commonly uses a beam size between 4 and 8; tune it experimentally on a validation set.
:::
3. The fatal problem of classic Seq2Seq → Leading to Attention
In our implementation above, a fixed-length Context Vector is used to compress the entire input sequence. Whether the input sentence is 3 words or 100 words, the Encoder must stuff all the information into a vector of the same dimension.
This is the biggest information bottleneck of classic Seq2Seq:
- For long sentences, information from early in the input is easily "swamped" by what comes later, so by the time the model is generating the second half of the output it has "almost forgotten" it.
- When translating, the model can rely only on this compressed memory and cannot dynamically refer back to different parts of the original text.
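The bottleneck is easy to see in terms of tensor shapes. The sketch below uses an untrained bidirectional LSTM with arbitrary, illustrative sizes (vocabulary 1000, embedding 32, hidden 64) and the hypothetical helper `context_shape`; it only demonstrates that the context vector has the same size for a 3-token and a 100-token input.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 32)
encoder = nn.LSTM(32, 64, batch_first=True, bidirectional=True)

def context_shape(src_len):
    """Encode one random sequence of src_len tokens; return the context shape."""
    src = torch.randint(0, 1000, (1, src_len))
    with torch.no_grad():
        _, (h, _) = encoder(embedding(src))        # h: (2, 1, 64)
    context = torch.cat([h[0], h[1]], dim=1)       # forward + backward states
    return tuple(context.shape)

print(context_shape(3))    # (1, 128)
print(context_shape(100))  # (1, 128)
```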
To solve this problem, the Attention mechanism was proposed. It lets the Decoder "actively look back" at every position of the input sequence each time it generates a word, and decide for itself which positions are most relevant to the current step. In this way, the Decoder is no longer limited by the capacity of a single context vector.
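As a rough preview, one decoder step of attention might look like the sketch below. It assumes simple dot-product scoring and equal encoder and decoder hidden sizes; the function name `attend` is illustrative, and other scoring functions are equally valid.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_outputs):
    """One attention step.

    decoder_state:   (batch, hid)          current decoder hidden state
    encoder_outputs: (batch, src_len, hid) encoder state at every input position
    """
    # Relevance score of each input position for the current decoder state.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)                 # (batch, src_len)
    # Weighted sum of encoder states: a context rebuilt at every step.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

context, weights = attend(torch.randn(2, 16), torch.randn(2, 5, 16))
print(context.shape, weights.shape)  # torch.Size([2, 16]) torch.Size([2, 5])
```

The weights sum to 1 over the input positions, so each decoding step gets its own mixture of encoder states instead of reusing one fixed summary.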
This is exactly what our next article (attention mechanism) will focus on.
💡 Small summary
- Classic Seq2Seq = bidirectional Encoder compression + unidirectional Decoder word-by-word generation + Teacher Forcing to speed up training
- Greedy decoding (fast, but prone to local optima) and Beam Search (slower, but more fluent) are the commonly used decoding strategies.
- It is essential background for understanding the Transformer, although RNN-based Seq2Seq has by now largely been replaced by the Transformer.

