Recurrent Neural Network (RNN): logic for processing sequence data

📂 Stage: Stage 2 - Deep Learning and Sequence Model (Advanced) 🔗 Related chapters: PyTorch Basics · Long Short-Term Memory Networks LSTM/GRU


1. Why do we need RNN?

1.1 Limitations of traditional neural networks

Traditional fully connected networks (Dense) and convolutional networks (CNN) essentially process each input sample independently, with no state carried from one input to the next. When text is fed to such a model as word-frequency features, word order is lost: however the words are arranged, similar word frequencies can yield very similar feature vectors.

A counterexample from sentiment analysis

Text sequence: "This movie is ugly" vs "This movie is good". A word-frequency model mostly counts the shared words "this", "movie" and "is", so the feature vectors of the two sentences may end up almost the same, and the model can struggle to separate "good" from "ugly"!

This "only look at word frequency, not order" approach falls short on sequence data with an obvious order, such as natural language, speech signals, and stock prices.
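To make the loss of order concrete, here is a minimal sketch using only the Python standard library. The helper `bag_of_words` and the two review sentences are made up for illustration; they are not from the original chapter.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Count how often each lowercase word occurs; word order is discarded."""
    return Counter(sentence.lower().split())

# Two hypothetical reviews with opposite sentiment but exactly the same words.
a = bag_of_words("the plot is good and the acting is not bad")
b = bag_of_words("the plot is bad and the acting is not good")

print(a == b)  # True: a word-frequency model cannot tell these two apart
```

Any model that only sees these counts receives identical input for both reviews, which is the gap a sequence model is meant to close.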

1.2 The core breakthrough of RNN: adding "memory"

The secret of the Recurrent Neural Network (RNN for short) is in its name: it reuses the same neuron unit in a loop and introduces a hidden state that stores "previously seen information". Simply put, as an RNN reads a sequence, each step fuses the current input with the "memory" left by the previous step to produce a new memory and an output.

For a more intuitive picture, the RNN can be unrolled along the time steps (one step per element of the sequence):

graph LR
    subgraph unrolled [RNN unrolled over time]
        h0((h₀))
        x0[x₀]
        r0[RNN unit]
        h1((h₁))
        y0[y₀]
        x1[x₁]
        r1[RNN unit]
        h2((h₂))
        y1[y₁]
        x2[x₂]
        r2[RNN unit]
        y2[y₂]
    end

    h0 --> r0
    x0 --> r0
    r0 --> h1
    r0 --> y0
    h1 --> r1
    x1 --> r1
    r1 --> h2
    r1 --> y1
    h2 --> r2
    x2 --> r2
    r2 --> y2

The processing logic of each time step is the same:

  1. Receive the current input x_t and the hidden state of the previous step, h_{t-1}.
  2. Pass both through the same RNN unit to produce a new hidden state h_t and the current output y_t.

Inside, the RNN unit is very simple: it multiplies the current input and the previous hidden state by their respective weight matrices, adds a bias, and squashes the result into (-1, 1) with the hyperbolic tangent (tanh) activation function to obtain the new hidden state. Every step runs exactly the same computation with the same set of parameters, no matter how long the sequence is, so an RNN can naturally handle sequences of any length.
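The update described above can be written as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). Below is a minimal pure-Python sketch of that formula. The names `rnn_step` and `run_rnn` and all weight values are hypothetical toy choices, and the output y_t is left out to keep the focus on the hidden state.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h), using plain lists."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(w * x for w, x in zip(W_xh[i], x_t))     # input contribution
        total += sum(w * h for w, h in zip(W_hh[i], h_prev))  # memory contribution
        h_t.append(math.tanh(total))                          # squash into (-1, 1)
    return h_t

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Apply the same cell, with the same parameters, at every time step."""
    h = [0.0] * len(b_h)  # h_0: all-zero initial hidden state
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h

# Hypothetical toy parameters: input size 2, hidden size 3.
W_xh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
W_hh = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.2]]
b_h = [0.0, 0.1, -0.1]

short_seq = [[1.0, 0.0], [0.0, 1.0]]  # 2 time steps
long_seq = short_seq * 10             # 20 time steps, same parameters

h_short = run_rnn(short_seq, W_xh, W_hh, b_h)
h_long = run_rnn(long_seq, W_xh, W_hh, b_h)
print(len(h_short), len(h_long))                      # 3 3
print(all(-1.0 < v < 1.0 for v in h_short + h_long))  # True
```

The same three parameter objects serve both the 2-step and the 20-step sequence, which is why the parameter count does not grow with sequence length.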


2. The fatal problem of RNN

Although RNN solves the problem of "independent processing", it has two inherent flaws, vanishing and exploding gradients, which is why native RNN is rarely used for long-sequence tasks today.

2.1 Gradient vanishing and exploding

The core of training a neural network is backpropagation: working backward step by step from the output layer to compute how much each parameter affects the final loss, then updating the parameters. What is special about RNN is that the parameters are reused at every time step, so during backpropagation the gradient is multiplied by a similar factor at each step (the number of multiplications grows with the sequence length).

  • Vanishing gradient: If the factors being multiplied are generally smaller than 1, the gradient shrinks with every multiplication and approaches 0. The model then can barely learn how information from long ago affects the present.
  • Exploding gradient: If the factors are larger than 1, the gradient grows exponentially, which can turn the training loss into NaN and crash training.

Intuitive example of long-term dependency failure

"I was born in China ... (1000 completely unrelated words mixed in) ... I can speak ___." A native RNN will most likely have forgotten "China" from the first clause, making it hard to fill in "Chinese"!
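The repeated multiplication can be checked numerically. This sketch assumes a one-dimensional RNN, h_t = tanh(w * h_{t-1} + x_t), and all-zero inputs so that tanh stays in its linear region. That is an idealization chosen so the per-step factor is exactly w; the function name is hypothetical.

```python
import math

def grad_h_T_wrt_h_0(w: float, inputs, h0: float = 0.0) -> float:
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain-rule factor d h_t / d h_{t-1}
    return grad

steps = 50
zeros = [0.0] * steps  # idealized inputs: keep tanh in its linear region

vanishing = grad_h_T_wrt_h_0(0.5, zeros)  # per-step factor 0.5 < 1
exploding = grad_h_T_wrt_h_0(1.5, zeros)  # per-step factor 1.5 > 1

print(f"{vanishing:.3e}")  # a value vanishingly close to 0
print(f"{exploding:.3e}")  # a value in the hundreds of millions
```

With only 50 steps, the two gradients already differ by more than twenty orders of magnitude. In a real network tanh also saturates, which tends to push the per-step factor further below 1.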

Gradient explosion can be alleviated by Gradient Clipping: setting a threshold and forcibly scaling down the gradient when it exceeds the threshold. However, the vanishing gradient problem is almost impossible to fix for native RNN, which directly gave rise to improved models such as LSTM and GRU.
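A minimal sketch of clipping by norm, written in plain Python so the arithmetic is visible. The function name and gradient values are illustrative only. In PyTorch the built-in `torch.nn.utils.clip_grad_norm_` does a similar job and is typically called between `loss.backward()` and `optimizer.step()`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)        # small enough: leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

exploded = [300.0, -400.0]        # L2 norm is 500
clipped = clip_by_global_norm(exploded, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)      # roughly [3.0, -4.0] with norm 5.0
```

Scaling every component by the same factor keeps the gradient's direction and only limits its length, which is why clipping stabilizes training without changing which way the parameters move.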

2.2 Other shortcomings

In addition to the gradient problem, native RNN also has some minor flaws:

  • Serial calculation, limited efficiency: The calculation of each time step must wait for the completion of the previous step. It cannot be parallelized on a large scale like CNN, and the training speed is slow.
  • Sensitive to initial states: The choice of the initial hidden state (usually an all-zero vector) affects the learning effect at early time steps.

3. PyTorch RNN rapid implementation

Although native RNN is not commonly used, it is the basis for understanding LSTM and GRU. Next, we use PyTorch to implement a simple text classifier and experience how to use RNN.

3.1 Unidirectional RNN text classifier

The code below builds a unidirectional RNN for binary sentiment classification. The input is the token id sequence of a sentence, and the output is the logits for the positive/negative classes.

Unidirectional
import torch
import torch.nn as nn

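class RNNTextClassifier(nn.Module):
    """Simple binary sentiment classifier: input is a token id sequence, output is positive/negative logits."""
    def __init__(
        self,
        vocab_size: int,        # vocabulary size
        embed_dim: int = 128,   # word embedding dimension
        hidden_dim: int = 128,  # RNN hidden state dimension
        num_layers: int = 2,    # number of stacked RNN layers
        num_classes: int = 2,   # number of output classes
        dropout: float = 0.3    # dropout rate, to reduce overfitting
    ):
        super().__init__()
        # 1. Embedding layer: turns token ids into dense vectors
        self.embedding = nn.Embedding(
            vocab_size, embed_dim, padding_idx=0  # the embedding for padding id 0 is not updated during training
        )
        # 2. Unidirectional RNN layer
        self.rnn = nn.RNN(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,  # input and output tensors are laid out as (batch_size, seq_len, feature)
            dropout=dropout if num_layers > 1 else 0.0,  # dropout only applies between stacked layers
            bidirectional=False  # unidirectional: each step only sees the preceding context
        )
        # 3. Fully connected classification layer
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        input_ids: (batch_size, seq_len), one row of token ids per sentence
        return: (batch_size, num_classes), positive/negative logits per sentence
        """
        # Step 1: embed the tokens
        embedded = self.embedding(input_ids)  # (batch_size, seq_len, embed_dim)

        # Step 2: run the RNN
        # output: (batch_size, seq_len, hidden_dim), top-layer hidden state at every time step
        # hidden: (num_layers, batch_size, hidden_dim), final hidden state of every stacked layer
        output, hidden = self.rnn(embedded)

        # Step 3: classify from the hidden state at the last time step
        # With batch_first=True, index -1 on dim 1 is the last time step; for this
        # unidirectional RNN it equals hidden[-1]. With padded batches, that last
        # position may be a padding token rather than the last real token.
        last_hidden = output[:, -1, :]
        logits = self.classifier(last_hidden)
        return logits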

# Quick test
if __name__ == "__main__":
    vocab_size = 10000  # assume a vocabulary of 10000 tokens
    model = RNNTextClassifier(vocab_size=vocab_size)
    # Fake data: one batch of 32 sentences, 50 tokens each (ids 1 to 9999; 0 is padding)
    dummy_input = torch.randint(1, vocab_size, (32, 50))
    dummy_logits = model(dummy_input)
    print(f"Output shape: {dummy_logits.shape}")  # expected: torch.Size([32, 2])

3.2 Bidirectional RNN text classifier

Sometimes understanding a word requires looking not only at what came before but also at what comes after. For example, in "I feel very __ today because I didn't get to eat hot pot", the word in the blank is clearly negative, and the clue is the clause that follows it.

A bidirectional RNN trains two RNNs that read the sequence in opposite directions at the same time:

  • Forward RNN: processes the sequence from left to right;
  • Backward RNN: processes the sequence from right to left.

Finally, the final hidden states of the two directions are concatenated to represent the entire sequence.
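The two-direction idea can be sketched with a one-dimensional toy RNN in plain Python. The function names, the weights, and the hidden size of 1 are all assumptions made for illustration. Note that each direction has its own parameters.

```python
import math

def final_state(sequence, w_x, w_h):
    """Final hidden state of the scalar RNN h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    for x_t in sequence:
        h = math.tanh(w_x * x_t + w_h * h)
    return h

def bidirectional_summary(sequence, fwd_params, bwd_params):
    """Concatenate the final state of a left-to-right pass and a right-to-left pass."""
    h_forward = final_state(sequence, *fwd_params)                   # reads left to right
    h_backward = final_state(list(reversed(sequence)), *bwd_params)  # reads right to left
    return [h_forward, h_backward]

seq = [0.2, -0.5, 0.9, 0.1]
summary = bidirectional_summary(seq, fwd_params=(0.8, 0.5), bwd_params=(0.6, -0.3))
print(len(summary))  # 2: hidden size 1 per direction, doubled by concatenation
```

The doubling of the representation size is the reason a classifier placed on top of a bidirectional RNN takes an input of twice the hidden dimension.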

Bidirectional
import torch
import torch.nn as nn

class BiRNNTextClassifier(nn.Module):
    """Bidirectional binary sentiment classifier: uses the context on both sides."""
    def __init__(
        self, 
        vocab_size: int,
        embed_dim: int = 128,
        hidden_dim: int = 128,
        num_layers: int = 2,
        num_classes: int = 2,
        dropout: float = 0.3
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Key point: set bidirectional=True
        self.rnn = nn.RNN(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
            bidirectional=True
        )
        # Both directions are concatenated, so the classifier input is twice hidden_dim
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(input_ids)
        # nn.RNN returns a single hidden tensor; when bidirectional, its shape is
        # (num_layers * 2, batch_size, hidden_dim), so it cannot be unpacked into two parts
        output, hidden = self.rnn(embedded)
        # The last two entries are the top layer's forward and backward final states; concatenate them
        combined_hidden = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        logits = self.classifier(combined_hidden)
        return logits

# Quick test, same as before
if __name__ == "__main__":
    vocab_size = 10000
    model = BiRNNTextClassifier(vocab_size=vocab_size)
    dummy_input = torch.randint(1, vocab_size, (32, 50))
    dummy_logits = model(dummy_input)
    print(f"Bidirectional RNN output shape: {dummy_logits.shape}")  # still torch.Size([32, 2])
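When reading the tensors that `nn.RNN` returns, it helps to work out the expected shapes by hand. With `batch_first=True`, `output` is (batch_size, seq_len, hidden_size * dirs) and `hidden` is (num_layers * dirs, batch_size, hidden_size), where dirs is 2 for a bidirectional RNN and 1 otherwise. The helper below only encodes that arithmetic. `rnn_shapes` is a hypothetical name, not part of PyTorch.

```python
def rnn_shapes(batch_size, seq_len, hidden_size, num_layers=1, bidirectional=False):
    """Expected (output, hidden) shapes of nn.RNN with batch_first=True."""
    dirs = 2 if bidirectional else 1
    output_shape = (batch_size, seq_len, hidden_size * dirs)     # every time step, top layer
    hidden_shape = (num_layers * dirs, batch_size, hidden_size)  # last step, every layer and direction
    return output_shape, hidden_shape

print(rnn_shapes(32, 50, 128, num_layers=2, bidirectional=False))
# ((32, 50, 128), (2, 32, 128))
print(rnn_shapes(32, 50, 128, num_layers=2, bidirectional=True))
# ((32, 50, 256), (4, 32, 128))
```

In the bidirectional case the first dimension of `hidden` is 4, not 2, so the forward and backward final states of the top layer have to be selected by index.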

4. Summary and quick review

4.1 Review of core knowledge points

  1. The role of RNN: it adds "memory" for sequence data, fixing the traditional-network habit of "only looking at word frequency and ignoring order".
  2. Unrolled view: unrolled over time steps, the same unit is reused at every step, and its input is the current word plus the memory from the previous step.
  3. Fatal problems:
     • Vanishing gradients, so long-distance dependencies are not captured;
     • Exploding gradients, which make training unstable.
  4. Improvement directions: LSTM/GRU target vanishing gradients; gradient clipping targets exploding gradients.

4.2 PyTorch RNN quick check

| Parameter/Output | Description |
| --- | --- |
| `input_size` | Feature dimension of the RNN input (such as the word embedding dimension) |
| `hidden_size` | Hidden state dimension |
| `num_layers` | Number of stacked RNN layers |
| `batch_first=True` | The first dimension of input and output is `batch_size` (recommended) |
| `bidirectional=True` | Turns on the bidirectional RNN |
| `dropout` | Dropout rate between stacked layers (only effective when `num_layers > 1`) |
| `output` | Hidden state for all time steps: `(batch_size, seq_len, hidden_size * dirs)` |
| `hidden` | Hidden state of all layers at the last time step: `(num_layers * dirs, batch_size, hidden_size)` |

Here `dirs` is 2 when `bidirectional=True` and 1 otherwise. The `output` shape shown assumes `batch_first=True`; the shape of `hidden` does not depend on `batch_first`.

Practical Advice

Native RNN performs very poorly on long-sequence tasks (long text classification, machine translation, etc.). In real projects, use LSTM or GRU directly. The next article analyzes LSTM in depth.


🔗 Extended reading