Sequence-to-sequence model (Seq2Seq): Encoder-Decoder architecture

📂 Stage: Stage 2 - Deep Learning and Sequence Models (Advanced) 🔗 Related chapters: Long Short-Term Memory networks (LSTM/GRU) · Attention mechanism

If you have used translation software, chatted with an AI assistant, or seen images automatically turned into text descriptions, you have almost certainly encountered the Sequence-to-Sequence (Seq2Seq) model. It is the classic deep learning architecture for tasks that map variable-length input to variable-length output, and it is one of the ancestors of the Transformer, which later swept the AI field. Working through it carefully will help you understand why the Attention mechanism had to emerge.


1. What is Seq2Seq?

1.1 One-sentence definition + mainstream scenarios

The core logic of Seq2Seq can be summarized in one sentence:

A sequence of any length goes in → the model "digests" it into a unified representation → a target sequence of any length comes out

This framework covers almost all "sequence conversion" tasks. Here are some familiar examples:

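├── Machine translation: Chinese "你好,NLP爱好者!" → English "Hello, NLP lovers!"
├── Text summarization: a 2,000-word tech news article → a 300-word list of key points
├── Dialogue systems: user asks "What's the weather like today?" → the bot replies automatically
├── Code generation: natural language "write a Python function that removes duplicates from a list" → a Python code block
├── Speech recognition: a 30-second Chinese audio clip → the corresponding text transcript
└── Image captioning (variant): a picture of a cat jumping a railing → "An orange cat is jumping over a small white railing"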

Simply put, Seq2Seq applies whenever both the input and the output are sequences and neither length is fixed.

1.2 Breaking down the classic Encoder-Decoder architecture

Although Seq2Seq is applied in many different scenarios, its classic form always has the same core structure: a pair of recurrent neural networks (RNN/LSTM/GRU):

Seq2Seq = an Encoder that "reads" + a Decoder that "writes" + a Context Vector that connects the two

We use Chinese→English translation ("我爱NLP" → "I love NLP") as a concrete example:

  1. Encoder: reads the input Chinese word sequence ("我", "爱", "NLP") one token at a time. Early systems commonly used a bidirectional LSTM so that each word can see both its left and right context. After the whole sentence has been read, the model's final hidden state (and cell state, if using LSTM) is packed into a fixed-length Context Vector.
  • You can think of this Context Vector as a summary card: the model "compresses" the semantics of the entire input sequence into it.
  2. Decoder: takes this "summary card" as its initial state, starts from a special <START> token, and generates the English translation word by word. Each newly generated word is fed back in as the input of the next step, until the <END> token is produced.

This "compress, then generate" pipeline is the most classic Seq2Seq workflow.


2. PyTorch minimalist Seq2Seq implementation

Talk is cheap, so let's build one. We use PyTorch to write a small model with a bidirectional-LSTM Encoder and a simple unidirectional-LSTM Decoder, together with the two most commonly used decoding methods.

2.1 Complete basic model code

import torch
import torch.nn as nn
import random

# --------------------------
# Bidirectional LSTM Encoder
# --------------------------
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2):
        super().__init__()
        # Embedding layer: maps token ids to vectors (id 0 is reserved for padding)
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM: each input token sees both its left and right context
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers,
            batch_first=True, bidirectional=True,
            dropout=0.2 if num_layers > 1 else 0
        )
        # Project the concatenated forward/backward states of the last layer
        # down to hidden_dim so they fit the unidirectional Decoder
        self.hidden_fc = nn.Linear(hidden_dim * 2, hidden_dim)
        self.cell_fc = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, input_ids):
        # input_ids shape: (batch_size, seq_len)
        embedded = self.embedding(input_ids)  # (batch_size, seq_len, embed_dim)

        # outputs: outputs at every time step; (hidden, cell): final states,
        # each of shape (num_layers * 2, batch_size, hidden_dim)
        outputs, (hidden, cell) = self.lstm(embedded)

        # Concatenate the last layer's two directions
        # (hidden[-2] is forward, hidden[-1] is backward)
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        cell = torch.cat([cell[-2], cell[-1]], dim=-1)
        # Reduce to the size the unidirectional Decoder expects
        hidden = self.hidden_fc(hidden).unsqueeze(0)  # (1, batch_size, hidden_dim)
        cell = self.cell_fc(cell).unsqueeze(0)        # (1, batch_size, hidden_dim)

        return outputs, (hidden, cell)


# --------------------------
# Unidirectional LSTM Decoder with a simple concatenated Context
# --------------------------
class Decoder(nn.Module):
    # The Encoder above hands over a single-layer state, so num_layers stays at 1
    # unless that state is also expanded to match.
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Input at each step = current token embedding + fixed Context Vector
        self.lstm = nn.LSTM(
            embed_dim + hidden_dim, hidden_dim, num_layers,
            batch_first=True,
            dropout=0.2 if num_layers > 1 else 0
        )
        self.fc = nn.Linear(hidden_dim, vocab_size)  # logits over the vocabulary

    def forward(self, input_t, hidden, cell, context):
        # input_t shape: (batch_size, 1) → the single current token
        embedded = self.embedding(input_t)  # (batch_size, 1, embed_dim)
        # Append the context vector; context shape: (batch_size, hidden_dim)
        lstm_input = torch.cat([embedded, context.unsqueeze(1)], dim=-1)

        output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))
        logits = self.fc(output.squeeze(1))  # (batch_size, vocab_size)
        return logits, hidden, cell


# --------------------------
# Full Seq2Seq model
# --------------------------
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_ids, target_ids, teacher_forcing_ratio=0.5):
        batch_size = input_ids.size(0)
        target_len = target_ids.size(1)
        target_vocab_size = self.decoder.fc.out_features

        # Pre-allocate the output logits. outputs[:, t] is the prediction for target_ids[:, t];
        # position 0 (<START>) is never predicted and stays zero, so skip it in the loss.
        outputs = torch.zeros(batch_size, target_len, target_vocab_size,
                              device=input_ids.device)

        # Run the Encoder to get the compressed state and the Context
        _, (hidden, cell) = self.encoder(input_ids)
        context = hidden[-1]  # last-layer hidden state, used as the fixed Context

        # First decoding step: feed the <START> token
        decoder_input = target_ids[:, 0:1]

        # Generate token by token: step t consumes token t-1 and predicts token t.
        # Starting at 1 keeps the teacher-forced input aligned with the prediction.
        for t in range(1, target_len):
            logits, hidden, cell = self.decoder(decoder_input, hidden, cell, context)
            outputs[:, t] = logits

            # Teacher Forcing: randomly feed the ground truth or the model's own prediction
            teacher_force = random.random() < teacher_forcing_ratio
            top1_token = logits.argmax(1)
            decoder_input = target_ids[:, t:t+1] if teacher_force else top1_token.unsqueeze(1)

        return outputs

What is Teacher Forcing?

When training the Decoder, the model above feeds the real target word as the next step's input with a certain probability (for example, 50%) instead of the word the model predicted itself. It is like a teacher guiding a student through writing a sentence: supplying the correct answer at each step keeps early mistakes from compounding and helps the model converge faster. This technique is called Teacher Forcing.
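
The choice made at each step can be isolated into a few lines. This is a minimal sketch rather than part of the model above: the helper name `next_decoder_input` and the toy tensors are made up for illustration, and the ratio is set to 1.0 and 0.0 only to make both branches deterministic.

```python
import random
import torch

def next_decoder_input(logits, target_ids, t, teacher_forcing_ratio):
    """Pick the token fed to the decoder at the next step (hypothetical helper).

    logits:     (batch_size, vocab_size) scores predicted for position t
    target_ids: (batch_size, target_len) ground-truth token ids
    """
    if random.random() < teacher_forcing_ratio:
        return target_ids[:, t:t + 1]           # ground-truth token at position t
    return logits.argmax(dim=1, keepdim=True)   # the model's own top prediction

# Toy batch of 2: the logits favor token id 1 for row 0 and token id 0 for row 1
logits = torch.tensor([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]])
target_ids = torch.tensor([[2, 2, 1], [2, 1, 0]])

always_teacher = next_decoder_input(logits, target_ids, t=1, teacher_forcing_ratio=1.0)
never_teacher = next_decoder_input(logits, target_ids, t=1, teacher_forcing_ratio=0.0)
```

With a ratio of 1.0 the decoder always sees the reference token; with 0.0 it always sees its own prediction, which is the situation it faces at inference time.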

2.2 Two core decoding strategies

Once the model is trained, how do we turn its output probability distributions into a reasonable target sequence? The two most commonly used methods are described below.

① Greedy Decode

The most direct strategy: at each step, pick only the word with the highest probability, until <END> is produced or the maximum length is reached.

def greedy_decode(model, input_ids, tokenizer, max_len=50,
                  start_token=2, end_token=3):
    # Assumes a single input sequence (batch_size = 1), since .item() is used below
    model.eval()
    with torch.no_grad():
        # Run the Encoder first
        _, (hidden, cell) = model.encoder(input_ids)
        context = hidden[-1]

        # Start with only <START>
        decoder_input = torch.tensor([[start_token]]).to(input_ids.device)
        generated_tokens = []

        for _ in range(max_len):
            logits, hidden, cell = model.decoder(
                decoder_input, hidden, cell, context
            )
            # Pick the highest-scoring token
            next_token = logits.argmax(1).item()

            if next_token == end_token:
                break
            generated_tokens.append(next_token)
            decoder_input = torch.tensor([[next_token]]).to(input_ids.device)

        # Use the tokenizer to turn ids back into text
        return tokenizer.decode(generated_tokens)

Advantages: fast and simple to implement. Disadvantages: it only looks at the immediate gain at each step, so it can easily miss a combination that is better overall (it gets stuck in a local optimum).
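
A tiny worked example makes the local-optimum problem concrete. The probability tables below are invented purely for illustration (they do not come from any trained model): a two-step "model" where the most likely first token leads to a weaker overall sequence.

```python
from itertools import product

# Hypothetical toy distributions: P(first token) and P(second token | first token)
first_step = {"A": 0.5, "B": 0.4, "C": 0.1}
second_step = {
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},
    "C": {"x": 0.4, "y": 0.3, "z": 0.3},
}

def sequence_prob(first, second):
    """Probability of the whole two-token sequence."""
    return first_step[first] * second_step[first][second]

# Greedy: commit to the most probable token at every step
greedy_first = max(first_step, key=first_step.get)
greedy_second = max(second_step[greedy_first], key=second_step[greedy_first].get)
greedy_seq = (greedy_first, greedy_second)

# Exhaustive search over all 9 sequences (only feasible because the toy is tiny)
best_seq = max(product(first_step, "xyz"), key=lambda s: sequence_prob(*s))

print(greedy_seq, sequence_prob(*greedy_seq))
print(best_seq, sequence_prob(*best_seq))
```

Greedy commits to "A" because 0.5 > 0.4, and ends with a sequence probability of about 0.2, while the sequence starting with "B" reaches about 0.36. Exhaustive search is not an option for real vocabularies and lengths, which is the gap Beam Search tries to fill.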

② Beam Search decoding

To reduce the short-sightedness of greedy decoding, Beam Search keeps the k best candidate sequences found so far at every step; k is called the "beam size".

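def beam_search_decode(model, input_ids, tokenizer, max_len=50,
                        beam_size=5, start_token=2, end_token=3):
    # Assumes a single input sequence (batch_size = 1)
    model.eval()
    with torch.no_grad():
        # Run the Encoder first
        _, (hidden, cell) = model.encoder(input_ids)
        context = hidden[-1]

        # Each beam records its tokens, total score, and current hidden/cell states
        beams = [
            {"tokens": [start_token], "score": 0.0,
             "hidden": hidden, "cell": cell}
        ]

        for _ in range(max_len):
            all_candidates = []
            # Expand every current beam
            for beam in beams:
                # A beam that already produced <END> is carried over unchanged
                if beam["tokens"][-1] == end_token:
                    all_candidates.append(beam)
                    continue

                # Predict the next token
                decoder_input = torch.tensor(
                    [[beam["tokens"][-1]]]
                ).to(input_ids.device)
                logits, new_hidden, new_cell = model.decoder(
                    decoder_input, beam["hidden"], beam["cell"], context
                )
                # Work in log-probabilities: summing them is more stable than
                # multiplying many small probabilities, which can underflow
                log_probs = torch.log_softmax(logits, dim=-1)
                # Take the top-k candidate tokens (beam_size must not exceed the vocabulary size)
                topk_log_probs, topk_tokens = log_probs.topk(beam_size, dim=-1)

                # Create k new candidates from this beam
                for i in range(beam_size):
                    token = topk_tokens[0, i].item()
                    total_score = beam["score"] + topk_log_probs[0, i].item()
                    all_candidates.append({
                        "tokens": beam["tokens"] + [token],
                        "score": total_score,
                        "hidden": new_hidden,
                        "cell": new_cell,
                    })

            # Sort by total score (descending) and keep the top-k as the new beams
            all_candidates.sort(key=lambda x: x["score"], reverse=True)
            beams = all_candidates[:beam_size]

            # Stop early once every surviving beam has produced <END>
            if all(beam["tokens"][-1] == end_token for beam in beams):
                break

        # Pick the highest-scoring beam, then drop the leading <START>
        # and a trailing <END> (if present) before converting to text
        best_tokens = beams[0]["tokens"][1:]
        if best_tokens and best_tokens[-1] == end_token:
            best_tokens = best_tokens[:-1]
        return tokenizer.decode(best_tokens)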

The essence of Beam Search: keep several paths alive at each step, and at the end choose the one with the highest overall score. It is slower than greedy decoding, but its output is usually more fluent and more reasonable.
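
The same idea can be traced by hand on a toy problem. The sketch below is a simplified stand-in, not the PyTorch function from this article: `NEXT_TOKEN_PROBS` is an invented table in which the next-token distribution depends only on the previous token, and `toy_beam_search` is a hypothetical helper name.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the previous token
NEXT_TOKEN_PROBS = {
    "<START>": {"A": 0.5, "B": 0.4, "<END>": 0.1},
    "A": {"A": 0.3, "B": 0.3, "<END>": 0.4},
    "B": {"A": 0.05, "B": 0.05, "<END>": 0.9},
}

def toy_beam_search(beam_size, max_len=5):
    """Return (tokens, summed log-probability) of the best beam."""
    beams = [(["<START>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<END>":
                candidates.append((tokens, score))  # finished: carry over unchanged
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_size highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(tokens[-1] == "<END>" for tokens, _ in beams):
            break
    return beams[0]

greedy_tokens, greedy_score = toy_beam_search(beam_size=1)
beam_tokens, beam_score = toy_beam_search(beam_size=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With `beam_size=1` the search keeps only "A" after the first step and finishes with a total probability of about 0.2. With `beam_size=2` the "B" path survives the first step and wins with about 0.36.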

:::tip How to choose beam size?

  • beam size = 1: equivalent to greedy decoding.
  • beam size too large: computation grows significantly, and the output may become repetitive or generic.
  • For machine translation, a beam size of 4 to 8 is common; tune it experimentally on a validation set.

:::

3. The fatal flaw of classic Seq2Seq → the road to Attention

In our implementation above, a fixed-length Context Vector is used to compress the entire input sequence. Whether the input sentence is 3 words or 100 words, the Encoder must stuff all the information into a vector of the same dimension.
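
The fixed size is easy to check directly. The following sketch uses a bare `nn.LSTM` with random inputs standing in for embedded tokens; the sizes (8 and 16) and the helper name `encode` are illustrative assumptions, not values from the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, hidden_dim = 8, 16  # illustrative sizes
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

def encode(seq_len):
    """Encode one random sequence and return its final hidden state."""
    inputs = torch.randn(1, seq_len, embed_dim)  # stand-in for embedded tokens
    _, (hidden, _) = lstm(inputs)
    return hidden[-1]                            # (batch_size, hidden_dim)

short_context = encode(3)    # a 3-token "sentence"
long_context = encode(100)   # a 100-token "sentence"
print(short_context.shape, long_context.shape)
```

Both calls return a tensor of shape `(1, 16)`: the 100-token input gets exactly as much room as the 3-token one.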

This is the biggest information bottleneck of classic Seq2Seq:

  • For long sentences, information read early on is easily "swamped" by what comes later, so by the time the Decoder generates the second half of the output, much of the source has been "almost forgotten".
  • When translating, the model can only rely on this compressed memory and cannot dynamically refer back to different parts of the original text.

The Attention mechanism was proposed to solve this problem. When generating each word, it lets the Decoder "actively look back" at the information at every position of the input sequence and decide for itself which positions are most relevant to the current step. The Decoder is therefore no longer limited by the capacity of a single Context Vector.
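
As a preview, the "look back and weigh every position" idea can be sketched in a few lines. This is one simple variant (dot-product scoring), shown under the assumption that the encoder outputs have already been projected to the same dimension as the decoder hidden state; the function name and tensor sizes are illustrative, and it is not necessarily the formulation used in the follow-up article.

```python
import torch

def dot_product_attention(decoder_hidden, encoder_outputs):
    """Minimal dot-product attention sketch.

    decoder_hidden:  (batch_size, hidden_dim)          current decoder state
    encoder_outputs: (batch_size, src_len, hidden_dim) one vector per source position
    """
    # Similarity between the decoder state and every source position
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    # Normalize into weights that sum to 1 over the source positions
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of encoder outputs: a fresh context vector for this step
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (batch, hidden_dim)
    return context, weights

torch.manual_seed(0)
encoder_outputs = torch.randn(2, 5, 16)  # batch of 2, 5 source positions
decoder_hidden = torch.randn(2, 16)
context, weights = dot_product_attention(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)
```

Unlike the fixed Context Vector, this context is recomputed at every decoding step, so each output word can draw on a different mix of source positions.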

This is exactly what our next article (attention mechanism) will focus on.

💡 Small summary

  1. Classic Seq2Seq (as implemented here) = bidirectional Encoder for compression + unidirectional Decoder for word-by-word generation + Teacher Forcing to speed up training.
  2. Greedy decoding (fast, but prone to local optima) and Beam Search (slower, but usually more fluent) are the commonly used decoding strategies.
  3. It is an essential stepping stone to understanding the Transformer, even though pure RNN-based Seq2Seq has now largely been replaced by it.

🔗 Extended reading and papers