title: Practical text classification: A complete development guide for enterprise-level sentiment analysis engine based on BERT | Daoman PythonAI
description: Build an enterprise-level Chinese sentiment analysis system from scratch, covering the complete development process of data preprocessing, model fine-tuning, evaluation indicators, model deployment and performance optimization. Contains detailed code implementation and best practices.
keywords: [Sentiment analysis, text classification, BERT, deep learning, natural language processing, model fine-tuning, enterprise applications, NLP, machine learning]

Text Classification in Action: Complete Development Guide for Enterprise-Level Sentiment Analysis Engine Based on BERT

E-commerce customer service teams receive thousands of after-sales reviews every day. How do you quickly pick out the complaints that need urgent handling? After a new product launches, how is it being received on social media? Problems like these are text classification tasks at heart, and a lightweight BERT-based sentiment analysis engine is an efficient way to handle them.

This tutorial walks you through building an enterprise-level Chinese sentiment analysis system from scratch. You will learn the whole process, from data cleaning and model fine-tuning to evaluation and deployment, and come away with a reusable set of engineering templates.

  • Binary classification accuracy > 95%
  • Single-request response time < 100 ms
  • Supports three types of text: e-commerce, social media, and customer service
  • Extensible to fine-grained sentiment
  • Model: bert-base-chinese
  • Training: Transformers + PyTorch
  • Deployment: FastAPI + Docker
  • Monitoring: Grafana, integrated as needed


Data pipeline construction

Data cleaning and exploration

Data quality sets the ceiling for model quality. We start with business-driven text preprocessing that removes noise and unifies the format.

import pandas as pd
import re
from sklearn.model_selection import train_test_split
from typing import List, Tuple

class DataPreprocessor:
    def __init__(self):
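        # For real projects, plug in a professional stop-word list (e.g. from HIT or Baidu).
        # Note: clean_text below does not apply this set; it is kept for optional use.
        self.stop_words = {'的', '了', '在', '是', '很', '也', '这', '那', '都'}
    
    def clean_text(self, text: str) -> str:
        # 1. Keep Chinese characters, English letters and digits; replace everything else with a space
        text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', ' ', text)
        # 2. Collapse repeated whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def split_data(self, df: pd.DataFrame, text_col: str='text', label_col: str='label') -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        # Stratified sampling keeps the positive/negative ratio consistent across the
        # training, validation and test sets (70% / 15% / 15%)
        train, temp = train_test_split(df, test_size=0.3, random_state=42, stratify=df[label_col])
        val, test = train_test_split(temp, test_size=0.5, random_state=42, stratify=temp[label_col])
        return train, val, test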

# Demo data set (in practice, load from CSV, a database, or an API)
def load_demo_data():
    data = {
        'text': [
            '这个产品质量很好,值得推荐!',
            '服务态度很差,完全不推荐。',
            '物流很快,包装也不错',
            '质量一般,性价比不高',
            '非常满意,下次还会购买',
            '商品与描述不符,很失望'
        ],
        'label': [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative
    }
    return pd.DataFrame(data)

# Pipeline demo
df = load_demo_data()
preprocessor = DataPreprocessor()
df['cleaned_text'] = df['text'].apply(preprocessor.clean_text)
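
The heading promises exploration as well as cleaning, so here is a minimal sketch of what a first look at the cleaned data could cover: the label balance and the text-length distribution, which later informs the choice of `max_length`. The helper name `summarize_corpus` and the four sample texts are illustrative assumptions, and the function expects a non-empty list.

```python
from collections import Counter
from typing import Dict, List


def summarize_corpus(texts: List[str], labels: List[int]) -> Dict[str, object]:
    """Return label counts and simple character-length statistics (hypothetical helper)."""
    lengths = sorted(len(t) for t in texts)
    # Nearest-rank 95th percentile: index = ceil(0.95 * n) - 1
    p95_index = max(0, -(-95 * len(lengths) // 100) - 1)
    return {
        "label_counts": dict(Counter(labels)),
        "min_len": lengths[0],
        "max_len": lengths[-1],
        "p95_len": lengths[p95_index],
    }


# Illustrative cleaned texts (punctuation already replaced by spaces)
demo_texts = ["物流很快 包装也不错", "质量一般 性价比不高", "非常满意 下次还会购买", "商品与描述不符 很失望"]
demo_labels = [1, 0, 1, 0]
stats = summarize_corpus(demo_texts, demo_labels)
print(stats)
```

A strongly skewed `label_counts` suggests looking at class weighting or resampling, and `p95_len` gives a rough upper bound for truncation. Character counts only approximate BERT token counts, but for mostly Chinese text they tend to be close.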

Stratified splitting and lightweight augmentation

Small data sets are especially prone to overfitting. On top of the stratified split, we add lightweight synonym replacement to expand the training samples; it costs little and can noticeably help when data is scarce.

import random

# Small hand-written synonym table (for a richer dictionary, try `pip install synonyms`)
def light_augment(text: str, n_aug: int=2) -> List[str]:
    synonyms_dict = {
        "好": ["优秀", "棒", "不错", "良好"],
        "坏": ["差", "糟糕", "不好", "恶劣"],
        "喜欢": ["喜爱", "欣赏", "钟爱", "青睐"],
        "讨厌": ["厌恶", "反感", "嫌弃", "不满"]
    }
    augmented = []
    for _ in range(n_aug):
        new_text = text
        # Match whole dictionary keys instead of single characters, so that
        # two-character words such as "喜欢" can be replaced as well
        for word, candidates in synonyms_dict.items():
            if word in new_text and random.random() < 0.2:  # replace with 20% probability
                new_text = new_text.replace(word, random.choice(candidates))
        augmented.append(new_text)
    return augmented

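# Split the data set, then augment the training set only.
# Note: the 6-row demo set is too small for this stratified split. The 2-row hold-out has one
# sample per class, so the second train_test_split raises ValueError; use a larger data set here.
train_df, val_df, test_df = preprocessor.split_data(df)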
augmented_train = []
for _, row in train_df.iterrows():
    augmented_train.append({'cleaned_text': row['cleaned_text'], 'label': row['label']})
    for aug in light_augment(row['cleaned_text']):
        augmented_train.append({'cleaned_text': aug, 'label': row['label']})
augmented_train_df = pd.DataFrame(augmented_train)
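
Because each replacement only fires with 20% probability, many augmented copies come back identical to their source text, which silently over-weights those samples. A minimal sketch of removing exact duplicates before training follows; the helper name `drop_duplicate_samples` is an assumption for illustration.

```python
from typing import Dict, List


def drop_duplicate_samples(samples: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep the first occurrence of every (text, label) pair, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        key = (sample["cleaned_text"], sample["label"])
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Illustrative input: the second entry is an unchanged "augmented" copy of the first
samples = [
    {"cleaned_text": "物流很快 包装也不错", "label": 1},
    {"cleaned_text": "物流很快 包装也不错", "label": 1},
    {"cleaned_text": "物流很快 包装也棒", "label": 1},
]
deduplicated = drop_duplicate_samples(samples)
print(len(samples), "->", len(deduplicated))
```

On the DataFrame built above, `augmented_train_df.drop_duplicates()` has the same effect in one call.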

Quick BERT fine-tuning

Loading the pre-trained model and tokenizing the data

With the Hugging Face Trainer API, a dozen or so lines of code are enough to start fine-tuning. First load bert-base-chinese, then tokenize the text.

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from datasets import Dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# Pre-trained model to use
MODEL_NAME = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["cleaned_text"],
        truncation=True,
        padding="max_length",
        max_length=128          # Adjust to the length of your business texts; 128 is usually enough
    )

# Convert to Hugging Face Dataset format and tokenize
train_ds = Dataset.from_pandas(augmented_train_df[['cleaned_text', 'label']])
val_ds = Dataset.from_pandas(val_df[['cleaned_text', 'label']])
tokenized_train = train_ds.map(tokenize_function, batched=True).remove_columns(['cleaned_text']).rename_column('label', 'labels')
tokenized_val = val_ds.map(tokenize_function, batched=True).remove_columns(['cleaned_text']).rename_column('label', 'labels')

Training configuration and tuning

Configure early stopping and mixed-precision training (when a GPU is available) to get a usable model quickly:

Note: Fine-tuning BERT on a CPU-only machine is very slow. A GPU with at least 8 GB of memory is strongly recommended. Alternatively, switch to a lighter model such as `distilbert-base-chinese` (first check that a checkpoint is available to you under that name).

# Custom evaluation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1).numpy()
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average='weighted')
    }

# Training arguments
training_args = TrainingArguments(
    output_dir="./sentiment_model",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./sentiment_model/logs",
    logging_steps=20,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
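    load_best_model_at_end=True,           # Load the best checkpoint automatically when training ends
    metric_for_best_model="f1",            # Select the best checkpoint by F1
    greater_is_better=True,
    save_total_limit=2,                    # Limit the number of checkpoints kept on disk
    seed=42,
    fp16=torch.cuda.is_available(),        # Enable mixed precision on GPU to speed up training
    report_to="none"                       # Disables logging integrations (None would enable all installed ones); use 'tensorboard' or 'wandb' to record logs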
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]  # Stop after 2 consecutive evaluations without improvement
)

# Practical tip: once the GPU environment is confirmed, uncomment the lines below to start training
# trainer.train()
# trainer.save_model("./sentiment_model/best_model")
# tokenizer.save_pretrained("./sentiment_model/best_model")
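
The step-based settings above (`warmup_steps=100`, `eval_steps=50`) only make sense relative to the total number of optimizer steps, which depends on the data set size. The sketch below works through that arithmetic; the helper name `training_schedule` is an assumption, and it assumes a single device with no gradient accumulation.

```python
import math
from typing import Dict


def training_schedule(n_samples: int, batch_size: int, epochs: int,
                      warmup_steps: int, eval_steps: int) -> Dict[str, float]:
    """Estimate step counts for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Share of training spent in learning-rate warm-up (capped at 100%)
        "warmup_fraction": min(1.0, warmup_steps / total_steps),
        # Evaluations triggered at every multiple of eval_steps
        "n_evaluations": total_steps // eval_steps,
    }


# A 10,000-sample training set with the batch size and epochs used above
full = training_schedule(10_000, batch_size=16, epochs=5, warmup_steps=100, eval_steps=50)
# A toy set of 12 samples, e.g. a small demo set after augmentation
toy = training_schedule(12, batch_size=16, epochs=5, warmup_steps=100, eval_steps=50)
print(full)
print(toy)
```

With 10,000 samples, warm-up covers about 3% of training and evaluation runs dozens of times, so early stopping has something to act on. With the toy set, training ends before the first evaluation step and the learning rate never leaves warm-up, so on very small data sets these values need to be scaled down.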

Model evaluation and engineering

Key metrics and lightweight visualization

After training completes, run a final evaluation on the test set to see how well the model generalizes:

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Practical tip: uncomment to evaluate the best model on the test set (负面 = negative, 正面 = positive)
# test_ds = Dataset.from_pandas(test_df[['cleaned_text', 'label']])
# tokenized_test = test_ds.map(tokenize_function, batched=True).remove_columns(['cleaned_text']).rename_column('label', 'labels')
# preds = trainer.predict(tokenized_test).predictions.argmax(-1)
# print(classification_report(test_df['label'], preds, target_names=['负面', '正面']))
# sns.heatmap(confusion_matrix(test_df['label'], preds), annot=True, cmap='Blues')
# plt.show()
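
To make the report and the heatmap easier to read, here is a minimal sketch of how per-class precision, recall and F1 follow from a confusion matrix laid out as scikit-learn does (rows are true labels, columns are predictions). The helper name `class_metrics` and the counts in the example matrix are made up for illustration and are not results from this model.

```python
from typing import Dict, List


def class_metrics(cm: List[List[int]], cls: int) -> Dict[str, float]:
    """Precision, recall and F1 for one class; cm[i][j] = true label i predicted as j."""
    n = len(cm)
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(n) if r != cls)  # other classes predicted as cls
    fn = sum(cm[cls][c] for c in range(n) if c != cls)  # cls predicted as something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 50 negative and 50 positive test samples
example_cm = [
    [45, 5],   # true negative: 45 predicted negative, 5 predicted positive
    [10, 40],  # true positive: 10 predicted negative, 40 predicted positive
]
negative = class_metrics(example_cm, cls=0)
positive = class_metrics(example_cm, cls=1)
print("negative:", negative)
print("positive:", positive)
```

For the complaint-screening use case from the introduction, recall on the negative class is the number to watch, since a missed complaint costs more than a false alarm.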

Serving the model with FastAPI

How quickly an inference service can go live often matters more for adoption than the model itself. FastAPI makes it quick to write a high-performance service interface.

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Initialize the FastAPI application
app = FastAPI(title="Enterprise Sentiment Analysis API", version="1.0.0")

# Request/response data models
class SingleRequest(BaseModel):
    text: str
class BatchRequest(BaseModel):
    texts: list[str]
class SentimentResponse(BaseModel):
    text: str
    sentiment: str
    confidence: float

# Practical tip: uncomment to load the trained model
# MODEL_PATH = "./sentiment_model/best_model"
# tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
# model.eval()
# label_map = {0: "负面", 1: "正面"}  # 0 = negative, 1 = positive

# Single-text prediction
@app.post("/predict", response_model=SentimentResponse)
async def predict(req: SingleRequest):
    # Practical tip: the keyword rule below is a placeholder; replace it with real model inference
    return {
        "text": req.text,
        "sentiment": "正面" if "好" in req.text or "棒" in req.text else "负面",
        "confidence": 0.85
    }

# Batch prediction
@app.post("/batch_predict")
async def batch_predict(req: BatchRequest):
    # Practical tip: the keyword rule below is a placeholder; replace it with real batched inference
    return [
        {"text": t, "sentiment": "正面" if "好" in t or "棒" in t else "负面", "confidence": 0.85}
        for t in req.texts
    ]

# Health check
@app.get("/")
async def health_check():
    return {"status": "healthy"}
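
The endpoints above return a keyword-based placeholder. As a minimal sketch of the post-processing that real inference needs, the code below turns a pair of raw logits into the `sentiment` and `confidence` fields. The function names are assumptions; in the service the logits would come from the loaded model, for example `model(**inputs).logits[0].tolist()` inside `torch.no_grad()`.

```python
import math
from typing import Dict, List

# Same mapping as the commented-out label_map: 0 = negative (负面), 1 = positive (正面)
LABEL_MAP = {0: "负面", 1: "正面"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_response(text: str, logits: List[float]) -> Dict[str, object]:
    """Map raw logits to the fields of SentimentResponse (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"text": text, "sentiment": LABEL_MAP[best], "confidence": round(probs[best], 4)}


# Made-up logits standing in for a model output
response = to_response("物流很快 包装也不错", [-1.0, 2.0])
print(response)
```

The confidence is the softmax probability of the predicted class, which is often overconfident after fine-tuning, so any alerting threshold built on it should be checked against validation data.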

Docker container deployment

Package the service into a Docker image so that the sentiment analysis service can be started anywhere with a single command.

# Dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install build tools (some Python packages need gcc to compile)
RUN apt-get update && apt-get install -y gcc && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

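# Copy the code and the already-trained model into the image (avoids re-downloading on every build).
# The model ends up in /app/model, so MODEL_PATH in main.py must point to that directory.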
COPY main.py .
COPY ./sentiment_model/best_model /app/model

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

A reference requirements.txt:

fastapi==0.109.0
uvicorn==0.27.0
transformers==4.37.2
torch==2.2.0
datasets==2.17.1
scikit-learn==1.4.0
pandas==2.2.0
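
Once the container is running, the service can be called from any HTTP client. The sketch below builds a request for the `/predict` endpoint using only the standard library. The base URL is an assumption: it presumes the container's port 8000 is published on localhost. The helper name `build_predict_request` is illustrative.

```python
import json
import urllib.request


def build_predict_request(text: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request matching the SingleRequest schema ({"text": ...})."""
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("物流很快,包装也不错")

# To send it against a running service:
# with urllib.request.urlopen(request, timeout=5) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```

Timing a batch of such calls from the client side is a simple way to check the single-request latency target stated at the top of this guide.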

Practical summary and best practices

Core Points

  1. Data is king: first spend time cleaning and annotating more than 10,000 business texts; quality matters more than quantity.
  2. Model choice: start with bert-base-chinese; if latency requirements are very strict, consider a lightweight model such as distilbert-base-chinese.
  3. Fast iteration: with the Trainer's early stopping, 3 to 5 epochs are usually enough to converge to good results.
  4. Engineering first: build the API with FastAPI and package the environment with Docker so that the service is portable and reproducible.
  5. Continuous monitoring: once the service is live, regularly check for data distribution drift and model accuracy, and fine-tune incrementally as needed.
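
For the monitoring point, one common option (not something this guide prescribes) is the population stability index (PSI) computed over the predicted-label distribution: compare the share of each label in a reference window with the share in recent traffic. The function names and the sample windows below are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def label_shares(labels: Sequence[int], classes: Sequence[int]) -> List[float]:
    """Fraction of samples per class, in the order given by `classes`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def population_stability_index(expected: List[float], actual: List[float],
                               eps: float = 1e-6) -> float:
    """PSI = sum((a - e) * ln(a / e)); shares are floored at eps to avoid log(0)."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Made-up windows of predicted labels: 30% negative at launch, 50% negative last week
baseline = label_shares([0] * 30 + [1] * 70, classes=[0, 1])
recent = label_shares([0] * 50 + [1] * 50, classes=[0, 1])
drift = population_stability_index(baseline, recent)
print(round(drift, 4))
```

A PSI of zero means the two distributions match, and larger values mean a larger shift. The alert threshold is a judgment call to calibrate on your own traffic, and a shift in predictions can reflect a real change in customer sentiment rather than a model problem, so it is a trigger for inspection, not for automatic retraining.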
