Long Short-Term Memory Networks (LSTM/GRU): Mitigating Vanishing Gradients and Capturing Long-Range Dependencies
📂 Stage: Stage 2 - Deep Learning and Sequence Models (Advanced) 🎯 Prerequisite knowledge: Recurrent Neural Network (RNN) basics 🔗 Related chapters: Recurrent Neural Network (RNN) · Sequence-to-Sequence Model (Seq2Seq)
1. The core idea of LSTM: giving the RNN an "information safe"
1.1 Pain point: the "short-sightedness" of an ordinary RNN
When you ask an ordinary RNN to read a long article, it often "forgets" the beginning very quickly. Take a review as an example: "The first 20 minutes of 'The Beginning' are a bit slow, but after that it is intense all the way through, and I could not stop watching the last 10 episodes." An ordinary RNN is likely to remember only the "could not stop watching" at the end. The negative signal from "the first 20 minutes are a bit slow" has to be propagated back through every later time step, and it shrinks at each one until almost nothing is left. This is the typical symptom of the vanishing gradient problem.
The design goal of LSTM is to cure this "amnesia". It introduces a cell state that runs through the entire sequence; you can think of it as an information conveyor belt. The conveyor belt carries long-term memory stably, and three learnable "gates" decide:
- What old information should be forgotten from the conveyor belt?
- What new information should be written to the conveyor belt?
- In the final output, what content on the conveyor belt should be taken out and used?
These three gates act like traffic lights for the data flow, giving the model fine-grained control over which information moves where.
1.2 Breaking down the LSTM computation
To make this concrete, we take sentiment analysis as the example task and trace, step by step, how one review is processed: "The beginning of this movie is a bit boring, but the ending is so healing and tearful!"
Step 1: Forget gate - clean up historical memory
The forget gate decides which old information in the cell state should be discarded. For example, on reading "but the ending", the model needs to recognize that the weight of the earlier "a bit boring" should be reduced or even erased.
How it works: concatenate the hidden state from the previous time step, `h_prev` (the temporary memory of the previous step), with the current input `x_now` (the word vector), pass the result through a set of learnable weights, and then apply a "switch function" whose output lies in [0, 1] (the sigmoid).
- Output close to 1 means "keep completely";
- Output close to 0 means "can be forgotten";
- Intermediate values mean "keep partially".
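In equation form, the forget gate described above is usually written as below. The symbol names are assumptions introduced here for illustration: $h_{t-1}$ stands for `h_prev`, $x_t$ for `x_now`, $W_f$ and $b_f$ for the learnable weights and bias, and $\sigma$ for the sigmoid "switch function".

```latex
\[
f_t = \sigma\!\left(W_f\,[h_{t-1},\, x_t] + b_f\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
```

Each component of $f_t$ acts as a separate dial for one slot of the cell state, so the model can keep some memories while discarding others at the same time step.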
Step 2: Input gate + candidate state - prepare new information
This stage decides what new knowledge to add to the cell state, and is handled by two components:
- Input gate: again concatenates `h_prev` and `x_now`, and uses the switch function to select which new information is worth remembering.
- Candidate state: the same concatenation of `h_prev` and `x_now`, but passed through an activation function with range [-1, 1] (tanh) to produce a "draft of new content".
- The two are multiplied together, so only the content "lit up" by the input gate is actually written to the conveyor belt.
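Written as equations, the input gate and candidate state might look like this. As before, the symbols are illustrative assumptions: $h_{t-1}$ is `h_prev`, $x_t$ is `x_now`, and each component has its own weights ($W_i, b_i$ for the gate and $W_c, b_c$ for the candidate).

```latex
\[
\begin{aligned}
i_t &= \sigma\!\left(W_i\,[h_{t-1},\, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1},\, x_t] + b_c\right)
\end{aligned}
\]
```

The element-wise product $i_t \odot \tilde{c}_t$ is what actually gets written to the cell state.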
Step 3: Update the cell state - refresh the conveyor belt
This is the core calculation of LSTM:
- Multiply the output of the forget gate by the old cell state (the content on the conveyor belt at the previous time step);
- Add the content written by the input gate;
- The result is the updated cell state.
In this way, unimportant old information is forgotten, fresh and important information is written, and the conveyor belt always carries the most critical global memory at the moment.
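The update can be summarized in one line. The notation is an assumption for illustration: $c_{t-1}$ is the old cell state, $f_t$ the forget gate output, $i_t$ the input gate output, $\tilde{c}_t$ the candidate state, and $\odot$ element-wise multiplication.

```latex
\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\]
\[
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\quad \text{(direct path only, treating the gates as constants)}
\]
```

The second line is a simplified view of why this helps with vanishing gradients: along the conveyor belt the gradient is scaled by the forget gate rather than squeezed through a weight matrix and a nonlinearity at every step, so when $f_t$ stays close to 1 the signal from early words can survive many time steps.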
Step 4: Output gate + hidden state - decide what to output to the next layer
Finally, the model decides which information to pick from the conveyor belt to generate the current output (hidden state):
- Output gate: the same concatenation of `h_prev` and `x_now`, passed through the switch function to select which part of the conveyor belt to expose.
- Squashed conveyor-belt content: pass the cell state through an activation function with range [-1, 1] (tanh) to keep the values from growing too large.
- Multiply the two to get the hidden state at the current time step. It carries both the most relevant information at this moment and long-range context, and is passed on to the next time step or to a subsequent fully connected classification layer.
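The four steps above can be put together into a single time step. The sketch below is an illustration, not the article's original code: the function name `lstm_step` and the tensor sizes are assumptions, and it borrows the weights of a PyTorch `nn.LSTMCell` so the hand-written result can be compared with the library's. Multiplying `x_now` and `h_prev` by separate weight matrices and summing is equivalent to multiplying their concatenation by one larger matrix.

```python
import torch
import torch.nn as nn


def lstm_step(x_now, h_prev, c_prev, cell):
    """One LSTM time step written out gate by gate, using an nn.LSTMCell's weights."""
    # nn.LSTMCell stacks its four weight blocks in the order:
    # input gate, forget gate, candidate state, output gate.
    pre_activations = (x_now @ cell.weight_ih.T + cell.bias_ih
                       + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i_raw, f_raw, g_raw, o_raw = pre_activations.chunk(4, dim=1)

    forget_gate = torch.sigmoid(f_raw)   # Step 1: what to drop from the belt
    input_gate = torch.sigmoid(i_raw)    # Step 2: what new content to admit
    candidate = torch.tanh(g_raw)        # Step 2: draft of new content
    c_now = forget_gate * c_prev + input_gate * candidate  # Step 3: refresh the belt
    output_gate = torch.sigmoid(o_raw)   # Step 4: what to expose
    h_now = output_gate * torch.tanh(c_now)
    return h_now, c_now


torch.manual_seed(0)
cell = nn.LSTMCell(input_size=8, hidden_size=16)  # sizes chosen arbitrarily
x_now = torch.randn(2, 8)     # batch of 2 word vectors
h_prev = torch.randn(2, 16)
c_prev = torch.randn(2, 16)

h_manual, c_manual = lstm_step(x_now, h_prev, c_prev, cell)
h_ref, c_ref = cell(x_now, (h_prev, c_prev))
print(torch.allclose(h_manual, h_ref, atol=1e-5),
      torch.allclose(c_manual, c_ref, atol=1e-5))
```

In practice you would call `nn.LSTM` or `nn.LSTMCell` directly; writing the step by hand is only useful for seeing where each gate enters the computation.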
1.3 PyTorch LSTM in practice: a bidirectional sentiment classifier
This section builds a bidirectional LSTM text classification model. A bidirectional LSTM scans the sequence from left to right and from right to left at the same time, which suits tasks such as sentiment analysis that need an understanding of the whole sentence.
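A minimal sketch of such a classifier is shown here. It is not the article's original code: the class name `BiLSTMClassifier`, the default hyperparameters, and the `lengths` argument (the true length of each padded sequence) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class BiLSTMClassifier(nn.Module):
    """Embedding -> bidirectional LSTM -> dropout -> linear classifier."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_classes=2, num_layers=1, dropout=0.3, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # Forward and backward final states are concatenated, hence hidden_dim * 2.
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, token_ids, lengths):
        # token_ids: (batch, seq_len) padded with pad_idx; lengths: (batch,)
        embedded = self.embedding(token_ids)
        # Packing lets the LSTM stop at each sequence's true end instead of
        # running over the padding tokens.
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n: (num_layers * 2, batch, hidden_dim). The last two entries are the
        # top layer's forward and backward final hidden states.
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(self.dropout(features))  # raw logits


model = BiLSTMClassifier(vocab_size=100)
token_ids = torch.randint(1, 100, (4, 12))   # 4 fake reviews, 12 tokens each
lengths = torch.tensor([12, 9, 5, 3])        # assumed true lengths
logits = model(token_ids, lengths)
print(logits.shape)
```

The model returns raw logits rather than probabilities, which is what `nn.CrossEntropyLoss` expects during training.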
2. GRU: a "lightweight" version of LSTM
2.1 The idea behind GRU
In 2014, Cho et al. proposed the Gated Recurrent Unit (GRU), which achieves results comparable to LSTM on many tasks with a simpler structure. GRU reduces the three gates of LSTM to two (an update gate and a reset gate) and removes the separate cell state: "long-term memory" and "short-term temporary memory" are folded into a single hidden state.
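One common way to write the GRU update is given below. The notation is an assumption for illustration ($z_t$ is the update gate, $r_t$ the reset gate, $\tilde{h}_t$ the candidate hidden state), and conventions vary: some presentations and libraries swap the roles of $z_t$ and $1 - z_t$, or apply the reset gate after the matrix multiplication rather than before it.

```latex
\[
\begin{aligned}
z_t &= \sigma\!\left(W_z\,[h_{t-1},\, x_t] + b_z\right) \\
r_t &= \sigma\!\left(W_r\,[h_{t-1},\, x_t] + b_r\right) \\
\tilde{h}_t &= \tanh\!\left(W_h\,[r_t \odot h_{t-1},\, x_t] + b_h\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\]
```

In this form the update gate $z_t$ does the work of both the forget gate and the input gate of LSTM: whatever share of the old state is kept, the remaining share is filled with new content.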
2.2 PyTorch GRU in practice: same task, lighter choice
Swapping the LSTM model above for a GRU is simple: replace `nn.LSTM` with `nn.GRU`, and note that GRU does not return a cell state `c_n`.
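The swap might look like the sketch below. The class name `BiGRUClassifier`, the hyperparameters, and the `lengths` argument are illustrative assumptions rather than the article's original code; the point to notice is that `nn.GRU` returns `(output, h_n)` with no `c_n`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class BiGRUClassifier(nn.Module):
    """Same layout as a BiLSTM classifier, with nn.GRU as the recurrent layer."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_classes=2, dropout=0.3, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, token_ids, lengths):
        embedded = self.embedding(token_ids)
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        # nn.GRU returns (output, h_n); there is no cell state to unpack.
        _, h_n = self.gru(packed)
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)  # forward + backward
        return self.fc(self.dropout(features))


gru_model = BiGRUClassifier(vocab_size=100)
token_ids = torch.randint(1, 100, (3, 10))
lengths = torch.tensor([10, 7, 4])
gru_logits = gru_model(token_ids, lengths)
print(gru_logits.shape)
```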
3. Practical snippets: a quick run through sentiment classification training
3.1 Single-epoch training and validation functions
To keep training stable, we usually apply gradient clipping to guard against exploding gradients, and compute accuracy alongside the loss in the same pass.
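A single function can cover both training and validation, as in this sketch. Everything here is an assumption for illustration: the name `run_epoch`, the batch layout `(token_ids, lengths, labels)`, the model signature `model(token_ids, lengths)`, and the clipping threshold of 1.0.

```python
import torch
import torch.nn as nn


def run_epoch(model, loader, criterion, optimizer=None, max_grad_norm=1.0):
    """Run one pass over loader. Trains if an optimizer is given, else evaluates.

    Returns (average loss, accuracy) over all examples seen.
    """
    training = optimizer is not None
    model.train(training)  # enables dropout when training, disables it otherwise
    total_loss, total_correct, total_count = 0.0, 0, 0

    with torch.set_grad_enabled(training):
        for token_ids, lengths, labels in loader:
            logits = model(token_ids, lengths)
            loss = criterion(logits, labels)

            if training:
                optimizer.zero_grad()
                loss.backward()
                # Rescale gradients so their global norm is at most max_grad_norm.
                nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
                optimizer.step()

            batch_size = labels.size(0)
            total_loss += loss.item() * batch_size
            total_correct += (logits.argmax(dim=1) == labels).sum().item()
            total_count += batch_size

    return total_loss / total_count, total_correct / total_count
```

A typical call would be `run_epoch(model, train_loader, nn.CrossEntropyLoss(), optimizer)` for training and the same call without the optimizer for validation, where `train_loader` is any iterable of batches in the assumed layout.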
3.2 Minimal inference function
At inference time, you only need to load the trained model, turn off gradient calculation, and apply `softmax` to the output to get the probability of each class.
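A minimal version of that inference step could look like this. The function name `predict_proba` and the model signature `model(token_ids, lengths)` are assumptions for illustration, and the model is assumed to return raw logits.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()  # no gradients are needed for inference
def predict_proba(model, token_ids, lengths):
    """Return class probabilities of shape (batch, num_classes)."""
    model.eval()  # switch off dropout and other training-only behavior
    logits = model(token_ids, lengths)
    return F.softmax(logits, dim=-1)
```

Loading the weights beforehand is typically done with `model.load_state_dict(torch.load(path, map_location="cpu"))`, where `path` is a hypothetical location of a checkpoint saved with `torch.save(model.state_dict(), path)`.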
4. Choosing a model in 2026
4.1 LSTM vs GRU simple comparison
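One concrete difference that is easy to check is parameter count. An LSTM layer has four weight blocks (three gates plus the candidate state), while a GRU layer has three (two gates plus the candidate), so for the same sizes a GRU should have about three quarters as many recurrent parameters. The sketch below counts them in PyTorch; the layer sizes are arbitrary assumptions.

```python
import torch.nn as nn


def count_parameters(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


# Arbitrary sizes chosen for illustration.
input_size, hidden_size = 128, 256
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)
gru = nn.GRU(input_size=input_size, hidden_size=hidden_size)

lstm_params = count_parameters(lstm)
gru_params = count_parameters(gru)
print(f"LSTM: {lstm_params}  GRU: {gru_params}  ratio: {gru_params / lstm_params:.2f}")
```

Fewer parameters usually means somewhat faster training and inference, but which architecture is more accurate depends on the task, so it is worth trying both on your own data.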
4.2 Practical use in 2026
I must be honest: although LSTM/GRU is a "required course" for every deep learning practitioner getting started with sequence modeling, in today's NLP and speech fields, pre-trained models based on the Transformer architecture (BERT, GPT, Whisper, etc.) have largely taken over the mainstream. These models learn powerful general-purpose representations through large-scale unsupervised pre-training, and after careful downstream fine-tuning their performance far exceeds that of an LSTM/GRU trained from scratch. Moreover, self-attention processes all positions of a sequence in parallel, whereas an LSTM must step through them one at a time, and it is heavily optimized on hardware such as the A100 and H100, so the training efficiency of a Transformer can be as good as that of a multi-layer LSTM.
4.3 When are LSTM/GRU still used?
Nonetheless, LSTM and GRU are still active in the following scenarios:
- Edge devices and low-latency scenarios: small parameter scale, fast inference speed, and low hardware requirements.
- Small tasks with a strongly sequential nature: for example, anomaly detection on sensor time series, or part-of-speech tagging for low-resource languages.
- Academic baseline comparison: in research, LSTM/GRU is one of the most classic baseline models and an important reference for measuring the effectiveness of new methods.