title: Hugging Face in practice: A complete guide to the Transformers library, Pipeline, and pre-trained models | Daoman PythonAI
description: Gain an in-depth understanding of the Hugging Face ecosystem, including how to use the Transformers library, the Datasets library, the Tokenizers library, and Pipeline. Covers hands-on topics such as Chinese pre-trained models, model fine-tuning, and data processing.
keywords: [Hugging Face, Transformers, Pipeline, pre-trained model, fine-tuning, Datasets, Tokenizers, NLP, machine learning, deep learning]

Hugging Face in action: A complete guide to the Transformers library, Pipeline, and pre-trained models

Yesterday the AI product manager told me, "Add review sentiment classification and ship it tomorrow." If you wrote a Transformer model from scratch, you would never make the deadline, no matter how much hair you lost. With Hugging Face, you can have a prototype in 10 minutes.

Today's walkthrough covers the core operations, from calling Pipeline in one line to putting Chinese pre-trained models to work. Tips on mirror sources, tokenization, and fine-tuning pitfalls for users in China are included as well👇



Hugging Face Ecosystem Quick Start

Hugging Face is no longer just a tool for "doing NLP"; it is closer to a Swiss Army knife platform for the AI era. From finding models and data to training, inference, and even online deployment, it covers almost the entire workflow of modern machine learning development. If you are just getting started, remember these four core components:

| Component | Core function | In one sentence |
| --- | --- | --- |
| Hub | The world's largest community for models, datasets, and demos | Browsing models is like browsing an app store |
| Transformers | 100,000+ pre-trained models, callable in one line | Ready-made wheels, so you don't have to reinvent them |
| Datasets | Efficient loading and processing of NLP datasets | Read data and split off a training set in one line of code |
| Tokenizers | Very fast tokenizers implemented in Rust | Turns text into the numeric IDs the model understands, quickly |

Installation and configuration (must-read for users in China)

Installing the whole toolset takes a single pip command. To speed up downloads, adding the Tsinghua mirror is strongly recommended:

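# Basic install (PyTorch backend by default)
pip install transformers datasets torch

# Optional extras (quoted so the brackets survive shells such as zsh)
pip install "transformers[tf]"           # if you want TensorFlow (the extra is named "tf")
pip install "transformers[accelerate]"   # needed for distributed training, and by Trainer on PyTorch

# ✅ Speed up every install with the Tsinghua mirror (strongly recommended in China)
pip install transformers datasets torch -i https://pypi.tuna.tsinghua.edu.cn/simple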

After installation, run this small script to verify that the environment is ready:

```python
from transformers import pipeline

print("🎉 Hugging Face is installed!")
```

If the emoji and the message print without errors, your Hugging Face journey has officially begun.
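
When the check above fails, it helps to know which package is missing. The sketch below uses only the standard library to report installed versions without importing the packages themselves. The helper name `installed_versions` is made up for this example.

```python
from importlib import metadata


def installed_versions(packages=("transformers", "datasets", "torch")):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


report = installed_versions()
for name, version in report.items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Any line that shows `NOT INSTALLED` points to the package to reinstall with the pip command above.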


Pipeline: a prototype in 10 minutes

Pipeline is the friendliest entry point in the whole ecosystem. It bundles the tedious steps of tokenization, model loading, inference, and output parsing, so you can finish a task in one line of code. For a product manager who asks today and wants it live tomorrow, Pipeline is a lifesaver.

Common Chinese task demos

The examples below cover the most frequently requested tasks. Each uses a pre-trained model fine-tuned for Chinese, so the results are dependable.

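1. Sentiment classification of Chinese reviews

Load a model that is already fine-tuned for sentiment on Chinese e-commerce reviews, which saves you from fine-tuning a general model yourself. Note that the similarly named `uer/roberta-base-finetuned-chinanews-chinese` classifies news topics, not sentiment, so it is the wrong choice for this task:

import torch
from transformers import pipeline

# Load a model fine-tuned for binary sentiment on JD.com product reviews
classifier = pipeline(
    "sentiment-analysis",
    model="uer/roberta-base-finetuned-jd-binary-chinese",
    device=0 if torch.cuda.is_available() else -1  # use the GPU automatically if one is available
)

# Batch test
test_texts = [
    "这个耳机降噪绝了,地铁上完全听不到杂音!",
    "快递员态度太差,东西也摔了个角,差评。"
]

results = classifier(test_texts)
for text, res in zip(test_texts, results):
    # Label strings come from the model's config; this model is expected to use
    # English labels starting with "positive" or "negative". Print res["label"] once to confirm.
    emoji = "👍" if res["label"].lower().startswith("positive") else "👎"
    print(f"{emoji} {text}")
    print(f"   Confidence: {res['score']*100:.1f}%\n")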

Each result is a dict with a `label` and a confidence `score`, so you rarely need to think about what happens inside the model.
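
In a prototype it is often useful to set aside low-confidence predictions for manual review. The sketch below works on the list of `label`/`score` dicts that the pipeline returns. The helper name `split_by_confidence`, the 0.8 threshold, and the sample results are all made up for illustration.

```python
def split_by_confidence(texts, results, threshold=0.8):
    """Separate predictions into confident ones and ones to review by hand."""
    confident, needs_review = [], []
    for text, res in zip(texts, results):
        row = {"text": text, "label": res["label"], "score": res["score"]}
        (confident if res["score"] >= threshold else needs_review).append(row)
    return confident, needs_review


# Made-up results in the shape the pipeline returns; real labels and scores will differ.
sample_texts = ["这个耳机降噪绝了", "还行吧"]
sample_results = [
    {"label": "positive", "score": 0.98},
    {"label": "negative", "score": 0.55},
]

confident, needs_review = split_by_confidence(sample_texts, sample_results)
print(f"confident: {len(confident)}, needs review: {len(needs_review)}")
```

The threshold is a judgment call: pick it by looking at a sample of real predictions rather than trusting the default.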

2. Chinese Named Entity Recognition (NER)

Need to extract names of people, places, and organizations from a long piece of text? One Pipeline handles it. With `aggregation_strategy="simple"` (the replacement for the older, deprecated `grouped_entities=True`), consecutive subword tokens are merged automatically. For example, "Chaoyang District, Beijing" is recognized as one complete location.

from transformers import pipeline

ner = pipeline(
    "ner",
    model="uer/roberta-base-finetuned-cluener2020-chinese",
    aggregation_strategy="simple"  # merge consecutive tokens into one entity (e.g. "北京朝阳区" as a single place name)
)

text = "2024年道满PythonAI在杭州举办了线下AI沙龙。"
entities = ner(text)
print("Recognized entities:")
for e in entities:
    # Merged Chinese words may contain spaces between characters, so strip them
    print(f"- {e['entity_group']}: {e['word'].replace(' ', '')}")

The recognized entities are easy to read at a glance, which makes this a good fit for quickly building information-extraction features.
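
For information extraction you usually want the entities grouped by type. The sketch below assumes the aggregated NER output format, a list of dicts with `entity_group` and `word` keys. The helper name `group_entities_by_type` and the sample values are made up, and stripping spaces is only safe for Chinese text.

```python
from collections import defaultdict


def group_entities_by_type(entities):
    """Collect unique entity strings under their entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        word = ent["word"].replace(" ", "")  # drop spaces between Chinese characters
        if word not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(word)
    return dict(grouped)


# Made-up output in the shape the NER pipeline returns; real labels and scores will differ.
sample_entities = [
    {"entity_group": "organization", "word": "道 满 PythonAI", "score": 0.91},
    {"entity_group": "address", "word": "杭 州", "score": 0.99},
    {"entity_group": "address", "word": "杭 州", "score": 0.97},
]

by_type = group_entities_by_type(sample_entities)
print(by_type)
```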


Transformers core components revealed

Pipeline is convenient, but if you need finer-grained control, such as preparing the model inputs yourself or extracting feature vectors from intermediate layers, you need the three core components of the Transformers library. They are the foundation of the whole library, and each has "Auto" in its name, meaning it picks the correct implementation automatically based on the model name.

Overview of the three core components

  • AutoTokenizer: Automatically loads the tokenizer that matches the model and converts text into a sequence of token IDs.
  • AutoModelForXxx: Automatically loads a model with a task-specific head. For example, `AutoModelForSequenceClassification` is a model with a classification head.
  • AutoConfig: Reads or modifies the model configuration, such as the hidden size or the number of labels.

General usage examples

Take `hfl/chinese-roberta-wwm-ext`, released jointly by Harbin Institute of Technology and iFLYTEK (HFL), as an example. It is an excellent Chinese pre-trained model that suits most Chinese understanding tasks. The following code shows how to load the tokenizer and model, then extract semantic features from text:

import torch
from transformers import AutoTokenizer, AutoModel

# 🌟 Recommended general-purpose Chinese pre-trained model
model_name = "hfl/chinese-roberta-wwm-ext"

# 1. Load the matching tokenizer automatically
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load the model automatically (AutoModel is a plain feature extractor with no task head)
model = AutoModel.from_pretrained(model_name)

# 3. Encode the text (supports batching, truncation, and padding)
texts = ["自然语言处理很有趣", "Hugging Face真香"]
inputs = tokenizer(
    texts,
    return_tensors="pt",  # return PyTorch tensors
    padding=True,
    truncation=True,
    max_length=32
)

# 4. Run inference and take the features
with torch.no_grad():  # disable gradient tracking during inference to save memory
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state  # hidden states of the last layer (the features)

print(f"Feature shape: {last_hidden_state.shape}")  # (batch_size, seq_len, hidden_size)

These few lines give you a contextual vector for every token of each text. Pooled into one vector per text (for example, by averaging over the non-padding tokens), they can feed downstream tasks such as vector retrieval, clustering, and semantic similarity.
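
`last_hidden_state` holds one vector per token, including padding positions, so a per-text vector needs a pooling step. The sketch below shows masked mean pooling, one common choice among several. The helper name `mean_pool` is made up, and the tiny tensors stand in for real model output.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per text, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over hidden_size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Stand-in for model output: 1 text, 3 token positions, hidden_size 2.
# The third position is padding and must not affect the average.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

sentence_vec = mean_pool(hidden, mask)
print(sentence_vec)  # tensor([[2., 3.]])
```

With the code above, the equivalent call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, which yields one vector per input text.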


Chinese pre-trained models in practice

There is one iron rule in Chinese NLP: never run Chinese data through an English pre-trained model. The vocabulary of an English model is completely different from a Chinese one, so Chinese text gets broken into single characters, fragments, or unknown tokens, and the results suffer accordingly. Fortunately, many models have been carefully trained on Chinese corpora. The table below summarizes the most commonly used and best-regarded ones on the Hugging Face Hub:

| Model name | Parameters | Use case | Rating |
| --- | --- | --- | --- |
| bert-base-chinese | 110M | Basic general-purpose Chinese NLP tasks | ⭐⭐⭐ |
| hfl/chinese-roberta-wwm-ext | 110M | General Chinese understanding (much better than BERT) | ⭐⭐⭐⭐⭐ |
| uer/roberta-base-finetuned-chinanews-chinese | 110M | Chinese news topic classification (for e-commerce review sentiment, use the sibling `uer/roberta-base-finetuned-jd-binary-chinese`) | ⭐⭐⭐⭐ |
| fnlp/bart-base-chinese | 110M | Chinese text generation (summarization/continuation) | ⭐⭐⭐⭐ |

Chinese e-commerce review classification in practice

Suppose you need to process a batch of e-commerce user reviews. You can add a simple classification head on top of `chinese-roberta-wwm-ext` to build a binary classifier (for example, positive/negative). This keeps the strong semantic understanding of the pre-trained model while adapting it to your own classification task:

# 1. Load the base model; a randomly initialized 2-class head (positive/negative) is added on top
from transformers import AutoModelForSequenceClassification

model_name = "hfl/chinese-roberta-wwm-ext"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # number of labels for binary classification
)
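
By default a freshly added head reports generic label names such as `LABEL_0` and `LABEL_1`. Passing `id2label` and `label2id` when loading stores readable names in the model config, so later predictions come back with meaningful labels. The sketch below shows one way to build those maps. The helper names `build_label_maps` and `load_classifier` are made up, and `load_classifier` is defined but not called here because it downloads the checkpoint.

```python
def build_label_maps(label_names):
    """Build the id->name and name->id dictionaries stored in the model config."""
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(model_name, label_names):
    """Load a base checkpoint with a new classification head and readable labels."""
    from transformers import AutoModelForSequenceClassification  # imported lazily

    id2label, label2id = build_label_maps(label_names)
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id,
    )


# Index 0 must match the label value 0 in your training data, and so on.
id2label, label2id = build_label_maps(["negative", "positive"])
print(id2label)  # {0: 'negative', 1: 'positive'}
```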

The only step left is fine-tuning. The next section will provide a minimalist fine-tuning framework for you to use directly.


Fine-tuning: minimal principles and a quick framework

The essence of fine-tuning is to take a general model that knows a little about everything and use your own data to turn it into a specialist in one vertical domain. Before you start, keep the following three principles in mind; they help you avoid most common pitfalls:

✅ 3 principles to avoid pitfalls before fine-tuning

  1. Data requirements: Prepare at least 100 high-quality labeled examples. If you have fewer, try prompt engineering or few-shot learning first.
  2. Hyperparameters: Start with a small batch size (e.g. `per_device_train_batch_size=8`) and a small learning rate (2e-5 to 5e-5 is a safe range).
  3. Validation set: Always hold out part of the data as a validation set. Monitor validation performance during training to keep the model from simply memorizing the training data (overfitting).
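
The first and third principles can be checked before any training starts. The sketch below warns when the dataset is small and holds out a validation set while keeping class ratios similar. It uses only the standard library. The helper name `stratified_split`, its defaults, and the toy data are made up for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, val_fraction=0.2, seed=42, min_examples=100):
    """Split (text, label) pairs into train/validation with similar class ratios."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if len(texts) < min_examples:
        print(f"Warning: only {len(texts)} examples; consider prompting or few-shot first")

    # Group example indices by class so each class is split separately
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label, indices in indices_by_label.items():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val += [(texts[i], label) for i in indices[:n_val]]
        train += [(texts[i], label) for i in indices[n_val:]]
    return train, val


# Toy data: 10 positive and 10 negative examples
texts = [f"好评 {i}" for i in range(10)] + [f"差评 {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

train, val = stratified_split(texts, labels)
print(Counter(label for _, label in train), Counter(label for _, label in val))
```

Note that a class with a single example ends up entirely in the validation set, which is another sign that more data is needed.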

🚀 Quick framework (just swap in your data)

The code below is a complete but very compact fine-tuning workflow. Replace `your_data` with your own data (loaded from a CSV file or built as a dictionary), and the rest can be copied almost as-is:

from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from sklearn.metrics import accuracy_score  # requires: pip install scikit-learn

# ----------------------
# 1. Replace with your labeled data (must have "text" and "label" columns)
# ----------------------
your_data = {
    "text": ["这个好用", "这个太差了", "物流很快", "包装破损"],
    "label": [1, 0, 1, 0]
}
dataset = Dataset.from_dict(your_data).train_test_split(test_size=0.2)  # hold out a validation split

# ----------------------
# 2. Tokenization
# ----------------------
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
def tokenize_func(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=64)
tokenized_ds = dataset.map(tokenize_func, batched=True)

# ----------------------
# 3. Training configuration and training
# ----------------------
training_args = TrainingArguments(
    output_dir="./my_chinese_classifier",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-5,
    eval_strategy="epoch",  # evaluate once per epoch (named evaluation_strategy in Transformers before 4.41)
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-roberta-wwm-ext", num_labels=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    compute_metrics=compute_metrics
)

# trainer.train()  # uncomment to start training
# trainer.save_model("./my_best_classifier")  # save the best model
# tokenizer.save_pretrained("./my_best_classifier")  # save the tokenizer alongside it so the directory loads on its own

The framework is short and commented throughout, so you can follow it even if you are new to Hugging Face. Once training finishes, the directory you specified contains a model that can be loaded directly for deployment.


Model deployment and inference optimization

Serving the trained model straight from the raw PyTorch checkpoint is rarely a good idea: inference tends to be slow, memory use is high, and the user experience suffers. A few lightweight optimizations make the model run fast and reliably in production.

🎯 Local rapid deployment (using Pipeline)

A saved model can be loaded directly with Pipeline and used like an off-the-shelf product:

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="./my_best_classifier",
    tokenizer="./my_best_classifier"
)

One call gives you a ready-to-use classifier, and wrapping it in an endpoint of your web framework of choice exposes it as an API.
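
To make the API step concrete, the sketch below shows the request-handling logic such an endpoint needs, independent of any web framework. Everything here is an assumption made for illustration: the helper name `handle_request`, the `{"texts": [...]}` request shape, the batch limit, and the `fake_classifier` stand-in, which you would replace with the real pipeline object.

```python
import json


def handle_request(body, classifier, max_texts=32):
    """Turn a JSON request body into (status_code, JSON response body)."""
    try:
        texts = json.loads(body)["texts"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": 'expected JSON like {"texts": ["..."]}'})
    if not isinstance(texts, list) or not texts or not all(isinstance(t, str) for t in texts):
        return 400, json.dumps({"error": '"texts" must be a non-empty list of strings'})
    if len(texts) > max_texts:
        return 413, json.dumps({"error": f"at most {max_texts} texts per request"})

    results = classifier(texts)  # same call shape as a Hugging Face pipeline
    predictions = [
        {"text": t, "label": r["label"], "score": float(r["score"])}
        for t, r in zip(texts, results)
    ]
    return 200, json.dumps({"predictions": predictions}, ensure_ascii=False)


def fake_classifier(texts):
    """Stand-in for the real pipeline so the sketch runs without a model."""
    return [{"label": "positive", "score": 0.9} for _ in texts]


status, response = handle_request('{"texts": ["这个好用"]}', fake_classifier)
print(status, response)
```

Keeping this logic in a plain function makes it easy to unit-test and to call from whichever web framework you deploy with.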

✨ 3 common inference optimizations

Choose an optimization strategy based on your deployment environment:

| Optimization | Effect | Use case |
| --- | --- | --- |
| FP16 half precision | Roughly 2x faster, half the memory | Servers or PCs with an NVIDIA GPU |
| INT8 quantization | 1.5x to 2x faster, memory reduced to about 1/4 | CPU deployment, or when GPU memory is insufficient |
| ONNX conversion | Cross-platform, slightly faster | When you need to deploy across different devices/systems |

The official documentation has detailed code examples for each of these methods, and the conversion usually takes only a few lines of code. Don't skip this optimization step before going live.
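
The memory figures in the table follow from the number of bytes used per weight. The sketch below does that arithmetic for a model of about 110M parameters, the size listed for the models in this guide. It counts weights only, so real memory use is higher once activations and runtime overhead are included, and quantization that covers only some layers saves less than the ideal figure.

```python
# Bytes needed to store one weight at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params, dtype):
    """Approximate memory for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


num_params = 110_000_000  # roughly the size of a BERT-base style model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(num_params, dtype):.0f} MB")
# fp32: 440 MB
# fp16: 220 MB
# int8: 110 MB
```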


Summary

Hugging Face has lowered the barrier to natural language processing from "master's thesis level" to "you can get started if you know Python". It all comes down to four points:

  1. **Want to quickly test an idea?** Use Pipeline and have a prototype in 10 minutes. Your product manager will be impressed.
  2. **Working on Chinese tasks?** Steer clear of English models and search the Hub for pre-trained models with "chinese" in the name.
  3. **Need a dedicated model?** If you have enough data, fine-tune. The three principles and the quick framework above make it straightforward.
  4. **Ready to go live?** Don't ship unoptimized; do a round of optimization with FP16, INT8, or ONNX first.

With this toolbox, the next time you hear "ship it tomorrow", you can calmly reply: "Get me a cup of coffee."

