PyTorch basics and NLP adaptation: building the first text classifier

📂 Stage: Stage 2 - Deep Learning and Sequence Models (Advanced) 🔗 Related chapters: Word Vector Space · Recurrent Neural Networks (RNN)


Hello, I'm Daoman! 🤖 In the last article we covered the concept of word vectors, but theory alone is not enough. Today we pick up PyTorch, the "Swiss Army knife" of the deep learning world, and go step by step from Tensor basics and automatic differentiation to a working Chinese/English text classifier!

(A quiet aside: in 2026, production work usually goes straight to Hugging Face, but learning the basics is what gives you the confidence to build your own wheels, so this lesson is worth catching up on!)


1. PyTorch core: NumPy upgraded into a dedicated deep learning library

If you know NumPy, getting started with PyTorch happens at close to light speed. A Tensor is PyTorch's version of ndarray, with two extra superpowers:

  1. It can run on a GPU, which is often much faster than a CPU for large tensor operations;
  2. It comes with automatic differentiation, which saves you from deriving gradient formulas by hand.
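
As a quick, hedged sketch of the first point: the snippet below picks a GPU when PyTorch can see one and falls back to the CPU otherwise. The variable names are only for illustration.

```python
import torch

# Pick the GPU if PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor on the CPU, then move it to the chosen device
cpu_tensor = torch.randn(2, 3)
moved = cpu_tensor.to(device)

# Tensors must live on the same device before they can be combined
result = moved + torch.ones(2, 3, device=device)
print(result.device)
```

On a machine without a GPU everything simply stays on the CPU, so the same code runs in both environments.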

1.1 Learn to create Tensors in one minute

Let's first look at the most commonly used creation methods, all shown as hands-on code 👇

import torch
import numpy as np

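# 1️⃣ From a native Python list (the most common way)
# Deep learning mostly uses float32 (sometimes float64); class labels use long (int64)
text_emb_sample = torch.tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=torch.float32)
print(f"List to Tensor:\n{text_emb_sample}\nShape: {text_emb_sample.shape}\n")

# 2️⃣ Seamless conversion from a NumPy array
# Careful: torch.from_numpy() shares memory with the array, so changing one changes the other.
# Call .clone() when you want an independent copy.
# Here .float() converts integers to float32, which already allocates a new tensor,
# so torch_emb does not share memory with np_emb.
np_emb = np.array([[1, 2], [3, 4]])
torch_emb = torch.from_numpy(np_emb).float()  # convert to float32
print(f"NumPy to Tensor:\n{torch_emb}\n")

# 3️⃣ Predefined placeholder Tensors (often used to initialize model weights)
zero_emb = torch.zeros(2, 3)          # all zeros
randn_emb = torch.randn(2, 3)         # standard normal (mean 0, variance 1)
range_ids = torch.arange(0, 10, 2)    # like range(0, 10, 2) → [0, 2, 4, 6, 8]; handy for token IDs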
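
The memory-sharing behaviour of torch.from_numpy is easy to get wrong, so here is a small sketch that makes it visible. The array values are made up for illustration.

```python
import numpy as np
import torch

np_arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
shared = torch.from_numpy(np_arr)               # shares memory with np_arr
independent = torch.from_numpy(np_arr).clone()  # owns its own copy

np_arr[0] = 100.0                # modify the NumPy array in place
print(shared[0].item())          # 100.0, the tensor sees the change
print(independent[0].item())     # 1.0, the clone is unaffected

# A dtype conversion also produces a new tensor, so sharing stops there
int_arr = np.array([1, 2, 3])
converted = torch.from_numpy(int_arr).float()
int_arr[0] = 100
print(converted[0].item())       # 1.0
```

If a tensor must never change behind your back, calling .clone() explicitly is the safer habit.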

1.2 Tensor operations: 90% consistent with NumPy, plus NLP-friendly features

The operations you will use most in NLP are reshaping, transposing, and mean pooling. They show up constantly when you turn the word vectors of a whole batch of sentences into one vector per sentence.

# Suppose we have a tensor shape common in NLP: (batch_size=32, seq_len=10, embed_dim=768)
x = torch.randn(32, 10, 768)

# 1️⃣ Reshaping (view works like reshape, but requires contiguous memory)
flattened = x.view(32, -1)  # -1 infers the remaining dimension → (32, 7680)
print(f"Shape after flattening: {flattened.shape}\n")

# 2️⃣ Transposing / swapping dimensions (transpose swaps two dims, permute reorders all of them)
batch_seq_swapped = x.transpose(0, 1)  # swap dims 0 and 1 → (10, 32, 768)
permuted = x.permute(2, 0, 1)          # full reorder → (768, 32, 10)
print(f"Shape after swapping dimensions: {batch_seq_swapped.shape}\n")

# 3️⃣ NLP-friendly aggregation (merge the word vectors of each sentence into one)
sentence_emb = x.mean(dim=1)          # average along the seq_len dimension → (32, 768)
print(f"Shape after sentence pooling: {sentence_emb.shape}")

💡 Tip: transpose only swaps two dimensions, while permute can rearrange any number of dimensions at once. You will see permute a lot in Transformer code.
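
One practical consequence: transpose and permute return non-contiguous views, and view refuses to work on those. Here is a small sketch (the tensor sizes are made up for illustration) showing the failure and two common ways around it.

```python
import torch

x = torch.randn(4, 3, 8)
swapped = x.transpose(0, 1)        # shape (3, 4, 8), no data is copied
print(swapped.is_contiguous())     # False

# view() needs a compatible memory layout, so it fails here
try:
    swapped.view(3, -1)
    view_failed = False
except RuntimeError:
    view_failed = True

# Option 1: reshape() copies the data only when it has to
flat_a = swapped.reshape(3, -1)
# Option 2: make the memory contiguous first, then view()
flat_b = swapped.contiguous().view(3, -1)
print(flat_a.shape, torch.equal(flat_a, flat_b))
```

When in doubt, reshape is the more forgiving choice; view is useful when you want a guarantee that no copy happens.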


2. PyTorch's trump card: the automatic differentiation mechanism (autograd)

In the past, writing a neural network meant working through the chain rule by hand and deriving the gradient of every weight; a few CNN layers were enough to make you go bald. Now PyTorch's autograd does all of that for you: you just build the model, and it computes the gradients on its own.

2.1 Understanding autograd in 10 lines of code

Let's take a simple function, y = x² + 2x + 1, and ask for its derivative at x = [2, 3]:

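# 1️⃣ Create a tensor whose gradient should be tracked (requires_grad=True)
x = torch.tensor([2.0, 3.0], requires_grad=True)

# 2️⃣ Forward pass: PyTorch quietly records every operation and builds a computation graph
y = x ** 2 + 2 * x + 1  # applied element-wise
loss = y.sum()          # backward() without arguments needs a scalar

# 3️⃣ Backward pass: the gradients are computed automatically
loss.backward()

# 4️⃣ Inspect the gradient of x (dy/dx = 2x + 2, so x=2 → 6 and x=3 → 8)
print(f"Gradient of x: {x.grad}")  # tensor([6., 8.])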

The core is just three lines: compute loss in the forward pass, call backward(), and read .grad. It's that simple!
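
One detail worth seeing with your own eyes is that .grad accumulates across backward() calls. A minimal sketch, reusing the same function y = x² + 2x + 1; the helper name `loss_fn` is only for illustration:

```python
import torch

def loss_fn(x):
    # Same function as above, summed to a scalar
    return (x ** 2 + 2 * x + 1).sum()

x = torch.tensor([2.0, 3.0], requires_grad=True)

loss_fn(x).backward()
first = x.grad.clone()     # tensor([6., 8.])

loss_fn(x).backward()      # a second call ADDS to the stored gradient
second = x.grad.clone()    # tensor([12., 16.])

x.grad.zero_()             # reset the gradient by hand
loss_fn(x).backward()
third = x.grad.clone()     # tensor([6., 8.]) again

print(first, second, third)
```

This is why training loops reset gradients before every update.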

2.2 Wrap the simplest classifier with nn.Module

Now that we know how automatic differentiation works, we can assemble a network from PyTorch's torch.nn module instead of writing every matrix multiplication and activation function by hand:

import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self, input_size=768, hidden_size=128, num_classes=2):
        super().__init__()  # the parent __init__ must be called
        # Stack layers like building blocks
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes)
        )

    def forward(self, x):
        # Define the forward pass; autograd derives the backward pass automatically
        return self.layers(x)

# Try the model out
model = SimpleMLP()
test_input = torch.randn(32, 768)  # 32 samples, 768 dimensions each
test_output = model(test_input)
print(f"Model output shape: {test_output.shape}")  # (32, 2) → logits for two classes per sample
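
To get a feel for what such a model contains and returns, here is a hedged sketch that rebuilds the same three layers with nn.Sequential, counts the trainable parameters, and turns logits into probabilities. The random input only stands in for real sentence vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer sizes as SimpleMLP: 768 → 128 → 2
mlp = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))

# Each Linear layer holds a weight matrix plus a bias vector
num_params = sum(p.numel() for p in mlp.parameters())
print(num_params)  # 768*128 + 128 + 128*2 + 2 = 98690

logits = mlp(torch.randn(4, 768))        # raw scores, shape (4, 2)
probs = torch.softmax(logits, dim=1)     # each row now sums to 1
preds = probs.argmax(dim=1)              # predicted class per sample
print(probs.sum(dim=1), preds)
```

Softmax does not change which class scores highest, so for prediction alone you can take argmax of the logits directly.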

3. Hands-on time! Build a text classifier from scratch

The foundation is laid, so now let's build a binary text classifier (for example, judging whether a review is positive or negative). The whole process has three steps: data preparation → model construction → training loop.

3.1 Data preparation: process text with Dataset + DataLoader

PyTorch's standard workflow for processing text is:

  1. Write a custom Dataset: it reads a single sample, tokenizes it, converts tokens to IDs, and pads to a fixed length;
  2. Use a DataLoader to pack samples into batches, shuffle them automatically, and speed up loading.

import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# --------------------------
# Simulate a batch of Chinese review data
# --------------------------
texts = [
    "这部电影太好看了!演员演技在线", "剧情太拖沓,完全看不下去",
    "推荐给所有喜欢科幻片的朋友", "导演拍的什么东西,浪费时间",
] * 25  # repeat to reach 100 samples
labels = [1, 0, 1, 0] * 25  # 1 = positive review, 0 = negative review

# --------------------------
# 1️⃣ A simple Chinese tokenizer + vocabulary
# --------------------------
class SimpleTokenizer:
    def __init__(self, vocab_size=10000):
        # Reserve 0 for PAD (padding) and 1 for UNK (unknown token)
        self.vocab = {"<PAD>": 0, "<UNK>": 1}
        self.vocab_size = vocab_size

    def fit(self, texts):
        # Count frequencies (here we simply split into single characters)
        word_counts = {}
        for text in texts:
            for char in text:
                word_counts[char] = word_counts.get(char, 0) + 1
        # Keep only the vocab_size-2 most frequent characters
        sorted_words = sorted(word_counts.items(), key=lambda x: -x[1])[:self.vocab_size-2]
        for word, _ in sorted_words:
            self.vocab[word] = len(self.vocab)

    def tokenize(self, text):
        # Split into a list of single characters
        return [char for char in text]

    def convert_tokens_to_ids(self, tokens):
        # Characters never seen during fit() map to the UNK ID
        return [self.vocab.get(token, 1) for token in tokens]

# --------------------------
# 2️⃣ Custom Dataset
# --------------------------
class ReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=32):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize → convert to IDs → truncate → pad
        tokens = self.tokenizer.tokenize(text)[:self.max_len]
        token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        padding_len = self.max_len - len(token_ids)
        token_ids += [0] * padding_len
        return {
            "input_ids": torch.tensor(token_ids, dtype=torch.long),
            "label": torch.tensor(label, dtype=torch.long)
        }

# --------------------------
# 3️⃣ Split the dataset and create the DataLoaders
# --------------------------
X_train, X_val, y_train, y_val = train_test_split(texts, labels,
                                                  test_size=0.2, random_state=42)
tokenizer = SimpleTokenizer(vocab_size=1000)
tokenizer.fit(X_train)

train_dataset = ReviewDataset(X_train, y_train, tokenizer, max_len=32)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataset = ReviewDataset(X_val, y_val, tokenizer, max_len=32)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
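
If you are curious what a DataLoader actually hands to the training loop, this sketch uses a hypothetical `ToyDataset` with random IDs (not the review data above) to show how dictionary samples are collated into batched tensors.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in dataset: random token IDs and random binary labels."""
    def __init__(self, n=10, max_len=5):
        self.ids = torch.randint(1, 50, (n, max_len))
        self.labels = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        # Same dictionary layout as ReviewDataset
        return {"input_ids": self.ids[idx], "label": self.labels[idx]}

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))

# The default collate function stacks each dictionary field along a new batch dimension
batch_shapes = (tuple(batch["input_ids"].shape), tuple(batch["label"].shape))
num_batches = len(loader)   # 10 samples in batches of 4 → 3 batches, the last one smaller
print(batch_shapes, num_batches)
```

The last batch holds only the leftover samples, which is why the training loop later weights each loss by the actual batch size.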

3.2 Model construction: a text classifier with an Embedding layer

The SimpleMLP above is missing the "word → vector" step; for NLP we need to add an nn.Embedding layer:

class ReviewClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=32, num_classes=2, padding_idx=0):
        super().__init__()
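        # Embedding layer: turns each token ID into a dense vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
        # Dropout: randomly zeroes some activations to reduce overfitting
        self.dropout = nn.Dropout(0.5)
        # Fully connected classification head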
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            self.dropout,
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, input_ids):
        # input_ids: (batch_size, max_len)
        embedded = self.embedding(input_ids)          # → (batch_size, max_len, embed_dim)
        # Simplest sentence vector: average over all positions (padding, which embeds to zeros, is included)
        pooled = embedded.mean(dim=1)                 # → (batch_size, embed_dim)
        logits = self.fc(pooled)                      # → (batch_size, num_classes)
        return logits

📌 Here we use mean pooling to aggregate a sentence. It is simple, but it is enough for many short-text tasks.
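
One caveat with plain mean(dim=1): padded positions are counted in the average, so short sentences get diluted vectors. A possible refinement, sketched here as a hypothetical helper `masked_mean` (not part of the classifier above), averages only over real tokens.

```python
import torch
import torch.nn as nn

def masked_mean(embedded, input_ids, pad_id=0):
    """Average embeddings over non-padding positions only (illustrative helper)."""
    mask = (input_ids != pad_id).unsqueeze(-1).to(embedded.dtype)  # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)                          # (batch, embed_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                        # avoid division by zero
    return summed / counts

torch.manual_seed(0)
embedding = nn.Embedding(10, 4, padding_idx=0)      # the PAD row is all zeros
input_ids = torch.tensor([[3, 5, 0, 0],             # 2 real tokens + 2 PAD
                          [2, 7, 8, 9]])            # 4 real tokens
embedded = embedding(input_ids)

plain = embedded.mean(dim=1)             # divides by seq_len, padding included
masked = masked_mean(embedded, input_ids)

# For the first sentence the plain mean is half the masked mean (2 real tokens out of 4)
print(plain[0] / masked[0])
```

Whether this matters in practice depends on how much sentence lengths vary; for the fixed toy data in this article the plain mean works fine.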

3.3 Training loop: the core "five steps"

No matter how complex the model is, the structure of a PyTorch training loop stays essentially the same. Just remember these five steps:

import torch.optim as optim

# --------------------------
# Initialize the model, loss function, and optimizer
# --------------------------
model = ReviewClassifier(vocab_size=len(tokenizer.vocab), embed_dim=64, hidden_dim=32)
criterion = nn.CrossEntropyLoss()   # the standard choice for classification
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # a very common default optimizer

# --------------------------
# Training loop
# --------------------------
num_epochs = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)   # move the model to the GPU (if there is one)

print(f"Starting training on device: {device}")

for epoch in range(num_epochs):
    # --------------------------
    # Training phase
    # --------------------------
    model.train()   # switch to training mode (Dropout active, BatchNorm uses batch statistics)
    train_loss = 0.0
    for batch in train_loader:
        # ① Move the data to the same device as the model
        input_ids = batch["input_ids"].to(device)
        labels = batch["label"].to(device)
        # ② Zero the gradients (otherwise they accumulate)
        optimizer.zero_grad()
        # ③ Forward pass
        logits = model(input_ids)
        # ④ Compute the loss
        loss = criterion(logits, labels)
        # ⑤ Backward pass + parameter update
        loss.backward()
        optimizer.step()
        # Accumulate the loss, weighted by batch size (averaged later)
        train_loss += loss.item() * input_ids.size(0)

    # --------------------------
    # Validation phase
    # --------------------------
    model.eval()   # switch to evaluation mode (turns off Dropout, etc.)
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():   # no gradients needed for validation, which saves memory
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            labels = batch["label"].to(device)
            logits = model(input_ids)
            loss = criterion(logits, labels)
            val_loss += loss.item() * input_ids.size(0)
            # Compute accuracy
            predicted = logits.argmax(dim=1)   # class with the highest logit
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    # --------------------------
    # Print the results for this epoch
    # --------------------------
    avg_train_loss = train_loss / len(train_dataset)
    avg_val_loss = val_loss / len(val_dataset)
    val_acc = correct / total
    print(f"Epoch {epoch+1}/{num_epochs} | train loss: {avg_train_loss:.4f} | "
          f"val loss: {avg_val_loss:.4f} | val accuracy: {val_acc:.4f}")

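Watching the loss go down and the validation accuracy go up during training is genuinely addictive 😄. One caveat: this toy dataset repeats the same four sentences, so the validation split contains sentences the model has already seen during training. The numbers only confirm that the pipeline runs, not that the model generalizes.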
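
Why bother switching between model.train() and model.eval()? A small sketch with a standalone nn.Dropout layer (illustrative values only) shows the difference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.5)
x = torch.ones(1000)

drop.train()              # training mode: elements are randomly zeroed
train_out = drop(x)       # survivors are scaled by 1 / (1 - 0.5) = 2.0

drop.eval()               # evaluation mode: Dropout becomes a no-op
eval_out = drop(x)

print(train_out.unique(), torch.equal(eval_out, x))
```

Forgetting model.eval() during validation therefore makes the reported metrics noisy, because part of the network is randomly switched off on every forward pass.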


4. Practical Summary for 2026

Today we walked the whole path from Tensor basics through automatic differentiation to a complete text classifier. Still, I have to add a dose of realism here:

💡 Best practice for production environments in 2026: apart from low-level research or very small private datasets, most scenarios can start directly from a pre-trained model in Hugging Face Transformers. A few lines of fine-tuning code usually give much better results than a small model of our own trained from scratch.

Of course, the basics you learned today give you the confidence to build your own wheels and to understand the underlying principles, which in turn helps you handle those "big models" much better.

Finally, I have prepared a PyTorch NLP cheat sheet for you. Save it for a rainy day 👇

# PyTorch NLP cheat sheet (core parts)
import torch
import torch.nn as nn
import torch.optim as optim

# 1️⃣ Tensor creation (common in NLP)
token_ids = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.long)
zero_pad  = torch.zeros(2, 3, dtype=torch.long)
randn_emb = torch.randn(2, 3, 64)

# 2️⃣ Model definition (standard template)
class MyNLPModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Declare the layers here
        self.embedding = nn.Embedding(10000, 64, padding_idx=0)
    def forward(self, x):
        # Define how the data flows
        return self.embedding(x).mean(dim=1)

# 3️⃣ Training loop (five steps that never change)
# Assumes model, criterion, optimizer, inputs, and target are already defined, as in section 3.3
optimizer.zero_grad()      # zero the gradients
output = model(inputs)     # forward pass
loss = criterion(output, target)  # compute the loss
loss.backward()            # backward pass
optimizer.step()           # update the weights
