title: Detailed explanation of word segmentation technology: Chinese Jieba word segmentation and English WordPiece algorithm principles and PyTorch implementation | Daoman PythonAI
description: Deeply master Chinese Jieba word segmentation, English WordPiece and the BPE algorithm used in modern large models, and understand the essence of Token. Contains detailed principle analysis, Python implementation and practical application scenarios.
keywords: [Word segmentation technology, Jieba word segmentation, WordPiece, BPE algorithm, Tokenization, NLP, natural language processing, Chinese word segmentation, English word segmentation, PyTorch]
Detailed explanation of word segmentation technology: Chinese Jieba word segmentation and English WordPiece algorithm principles and PyTorch implementation
What is word segmentation?
The essence and importance of word segmentation
Word segmentation (tokenization) is the first step in natural language processing (NLP). Its core task is to split a continuous text sequence into tokens, the smallest meaningful units the model works with.
The essence of word segmentation: split text into the smallest semantic units that a model can process.
- Original text:
"我爱自然语言处理技术"
- Chinese word segmentation results:
["我", "爱", "自然语言", "处理", "技术"]
- English word segmentation results (for the equivalent sentence "I love natural language processing"):
["I", "love", "natural", "language", "processing"]
Difficulty of word segmentation in different languages:
- Chinese is the hardest (no spaces between words)
- Japanese comes next (a mix of kanji and kana)
- English is relatively simple (spaces are the main separator)
Why do we need word segmentation?
Machine learning models cannot understand text directly and must first convert it into numerical vectors. The word segmentation strategy directly determines how the text is "translated" to the model:
Text → token sequence → numeric ID sequence → embedding vectors
💡 Important Tip: The quality of word segmentation directly affects the performance of downstream NLP tasks.
For example, the word "机器学习" (machine learning) can be segmented in three ways:
- Tokenizer A:
["机器学习"] → 1 token
- Tokenizer B:
["机器", "学习"] → 2 tokens
- Tokenizer C:
["机", "器", "学", "习"] → 4 tokens
Each granularity gives the model different units to learn from, so the same text can end up with quite different learned representations.
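To make the text → token → ID → vector pipeline concrete, here is a minimal sketch using only the standard library. The vocabulary, the `[UNK]` convention, and the random 4-dimensional embedding table are illustrative assumptions, not what any particular model uses.

```python
import random

def build_vocab(token_lists):
    """Assign each distinct token an integer ID in first-seen order (ID 0 is reserved for unknown tokens)."""
    vocab = {"[UNK]": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def tokens_to_ids(tokens, vocab):
    """Map tokens to IDs, falling back to [UNK] for tokens outside the vocabulary."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

# The three hypothetical segmentations of "机器学习" from the example above.
segmentations = {
    "A": ["机器学习"],
    "B": ["机器", "学习"],
    "C": ["机", "器", "学", "习"],
}

vocab = build_vocab(segmentations.values())

# Toy embedding table: one random 4-dimensional vector per vocabulary entry.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in vocab]

for name, tokens in segmentations.items():
    ids = tokens_to_ids(tokens, vocab)
    vectors = [embedding_table[i] for i in ids]  # embedding lookup is plain indexing
    print(name, tokens, ids, f"{len(vectors)} vector(s)")
```

The same four characters become one, two, or four vectors depending on the tokenizer, which is why the segmentation strategy shapes what the model can learn.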
Chinese word segmentation: Jieba detailed explanation
Jieba is one of the most widely used open-source Chinese word segmentation tools. It is known for being fast and accurate, and it is easy to integrate into Python projects.
Installation and basic usage
Jieba can be installed with pip install jieba. It provides three core word segmentation modes, each suited to different scenarios:
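import jieba

text = "自然语言处理是人工智能的重要分支"

# 1. Precise mode (most common): suited to text analysis, aims for accuracy
words_precise = jieba.lcut(text)
print("Precise mode:", words_precise)
# Example output (may vary with the Jieba version and dictionary):
# ['自然语言处理', '是', '人工智能', '的', '重要', '分支']

# 2. Full mode: scans out every possible word; fast, but the output is redundant
words_full = jieba.lcut(text, cut_all=True)
print("Full mode:", words_full)
# Example output (may vary):
# ['自然', '自然语言', '语言', '言处', '处理', '是', '人工智能', '人工', '智能', '的', '重要', '分支']

# 3. Search-engine mode: re-splits long words from precise mode, which helps recall
words_search = jieba.cut_for_search(text)  # returns a generator
print("Search-engine mode:", list(words_search))
# Example output (may vary):
# ['自然', '语言', '自然语言', '处理', '人工智能', '人工', '智能', '是', '的', '重要', '分支']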
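Under the hood, Jieba's precise mode builds a directed acyclic graph (DAG) of every dictionary word found in the sentence, then uses dynamic programming to pick the path with the highest total word probability. The sketch below illustrates that idea with a made-up frequency table. It is a simplification under stated assumptions: the frequencies are invented, unknown characters are given a count of 1, and the HMM step that Jieba uses for out-of-vocabulary words is left out.

```python
import math

# Toy word frequencies (invented numbers, not Jieba's dictionary).
FREQ = {
    "自然": 50, "语言": 60, "自然语言": 30, "处理": 80,
    "自然语言处理": 10, "是": 500, "重要": 90, "分支": 40,
}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, list every end index where sentence[start:end + 1] is a dictionary word."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # fall back to the single character
    return dag

def best_route(sentence, dag):
    """Right-to-left dynamic programming over the DAG, maximizing the sum of log probabilities."""
    n = len(sentence)
    log_total = math.log(TOTAL)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - log_total + route[end + 1][0], end)
            for end in dag[start]
        )
    return route

def segment(sentence):
    """Follow the best route from left to right and collect the words."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words

print(segment("自然语言处理是重要分支"))
# ['自然语言处理', '是', '重要', '分支']
```

With these toy counts the single long word wins over "自然语言" + "处理" because every extra word multiplies in another probability below 1. Full mode, by contrast, simply emits every word in the DAG.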
Custom dictionaries and enhanced functions
For specialized domains (such as medicine, law, and finance), Jieba's default dictionary may not cover the terminology. We can add custom vocabulary:
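import jieba

# Option 1: add words dynamically
jieba.add_word("自然语言处理")
jieba.add_word("大模型")
jieba.add_word("Transformer")

# Option 2: load a custom dictionary file
# Format of custom_dict.txt: word frequency part-of-speech (frequency and POS are optional)
# Example content:
# 自然语言处理 5 nz
# 大模型 10 nz
# Transformer 3 eng
# jieba.load_userdict("custom_dict.txt")

# Check the effect
text = "自然语言处理和大模型是AI领域的热点技术"
print(jieba.lcut(text))
# Example output (may vary with the Jieba version and dictionary):
# ['自然语言处理', '和', '大模型', '是', 'AI', '领域', '的', '热点', '技术']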
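When the domain vocabulary lives in code or a database, the dictionary file can be generated rather than written by hand. The helper below is a small sketch that produces the "word frequency part-of-speech" format described above. The function name `format_userdict` and the sample entries are hypothetical, and the final `jieba.load_userdict` call is left commented out so the snippet runs without Jieba installed.

```python
import tempfile
from pathlib import Path

def format_userdict(entries):
    """Render (word, freq, pos) tuples as user-dictionary lines; freq and pos may be None."""
    lines = []
    for word, freq, pos in entries:
        parts = [word]
        if freq is not None:
            parts.append(str(freq))
        if pos is not None:
            parts.append(pos)
        lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"

# Hypothetical domain terms; the frequencies are illustrative.
entries = [
    ("自然语言处理", 5, "nz"),
    ("大模型", 10, "nz"),
    ("Transformer", None, None),  # word only: frequency and POS omitted
]

dict_path = Path(tempfile.mkdtemp()) / "custom_dict.txt"
dict_path.write_text(format_userdict(entries), encoding="utf-8")
print(dict_path.read_text(encoding="utf-8"))

# With Jieba installed, the file could then be loaded with:
# jieba.load_userdict(str(dict_path))
```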
Part-of-speech tagging and keyword extraction
In addition to word segmentation, Jieba also provides part-of-speech tagging and keyword extraction functions, which can provide rich features for subsequent tasks.
import jieba.posseg as pseg
import jieba.analyse

# 1. Part-of-speech tagging
text = "小明在北京大学的图书馆里读书"
words_with_pos = pseg.cut(text)
print("POS tagging results:")
for word, flag in words_with_pos:
    print(f"{word} / {flag}")
# 小明 / nr (person name)
# 在 / p (preposition)
# 北京大学 / nt (organization name)
# ...

# 2. Keyword extraction
text = """
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
"""

# Based on TF-IDF
keywords_tfidf = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
print("\nTF-IDF keywords:")
for kw, w in keywords_tfidf:
    print(f"{kw}: {w:.4f}")

# Based on TextRank
keywords_textrank = jieba.analyse.textrank(text, topK=5, withWeight=True)
print("\nTextRank keywords:")
for kw, w in keywords_textrank:
    print(f"{kw}: {w:.4f}")
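To see what a TF-IDF keyword score measures, the following sketch computes it by hand on a tiny pre-segmented corpus. It is an illustration under assumptions: the corpus and the smoothed IDF formula are chosen here for the example, whereas Jieba's `extract_tags` relies on its own precomputed IDF table.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Score each term in doc_tokens by term frequency times smoothed inverse document frequency."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)      # number of documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed IDF (an assumption of this sketch)
        scores[term] = (count / len(doc_tokens)) * idf
    # Sort by descending score, breaking ties alphabetically for determinism
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy corpus, already segmented into words.
corpus = [
    ["自然语言", "处理", "是", "人工智能", "的", "方向"],
    ["计算机", "视觉", "是", "人工智能", "的", "方向"],
    ["自然语言", "处理", "研究", "自然语言", "通信"],
]

for term, score in tfidf_keywords(corpus[2], corpus):
    print(f"{term}: {score:.4f}")
```

Terms that are frequent in the document but rare across the corpus rank highest, which is why "研究" and "通信" outrank "处理" in the third document.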
English word segmentation: WordPiece and BPE
Although English words are separated by spaces, a tokenizer still has to handle abbreviations and contractions, inflected forms, and rare words. Modern large models usually use subword tokenization.
Traditional whitespace tokenization and preprocessing
The simplest English tokenization splits on whitespace, but punctuation and capitalization have to be handled first:
import re

def basic_tokenize(text):
    """Basic English tokenization: lowercase, replace punctuation with spaces, split on whitespace."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return text.split()

text = "Natural language processing is fascinating!"
print("Basic tokenization:", basic_tokenize(text))
# ['natural', 'language', 'processing', 'is', 'fascinating']
Subword segmentation: BPE vs WordPiece
Subword tokenization balances vocabulary size against the ability to handle unknown words. The core idea is to keep common words whole and split rare words into common subword units. We can try the mainstream tokenizers directly with the transformers library:
from transformers import BertTokenizer, RobertaTokenizer

text_en = "Natural language processing and tokenization"

# 1. BERT (WordPiece)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("BERT tokens:", bert_tokenizer.tokenize(text_en))

# 2. RoBERTa (byte-level BPE, the same scheme GPT-2 uses)
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print("RoBERTa tokens:", roberta_tokenizer.tokenize(text_en))

# 3. Chinese BERT (character-level tokenization)
chinese_bert = BertTokenizer.from_pretrained("bert-base-chinese")
print("Chinese BERT tokens:", chinese_bert.tokenize("自然语言处理很有趣"))
# ['自', '然', '语', '言', '处', '理', '很', '有', '趣']
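The BPE vocabularies loaded above are learned by repeatedly merging the most frequent adjacent pair of symbols. The sketch below shows that training loop on a toy word-frequency table. It is a simplified, character-level illustration: the corpus, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions of this example, and production tokenizers such as GPT-2's operate on bytes with many more merges.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of the given pair with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a word -> frequency table."""
    # Start from single characters plus an end-of-word marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the run is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, num_merges=4)
print("Merges:", merges)
for symbols, freq in corpus.items():
    print(freq, symbols)
```

After a few merges the shared suffix "est" becomes a single symbol, so "newest" and "widest" are each represented by a short stem plus one common subword.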
Comparison of mainstream tokenizers
The following table compares the currently popular subword segmenters. They differ in design concepts and applicable scenarios:
Comparison of modern tokenizers
We can write a simple function to visually compare the effects of different tokenizers:
from transformers import AutoTokenizer

def compare_tokenizers(text):
    """Compare the output of mainstream tokenizers on the same text."""
    models = [
        ("bert-base-uncased", "BERT"),
        ("roberta-base", "RoBERTa"),
        ("gpt2", "GPT-2"),
    ]
    print(f"Original text: {text}\n")
    for model_name, model_type in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokens = tokenizer.tokenize(text)
        ids = tokenizer.convert_tokens_to_ids(tokens)
        print(f"{model_type}:")
        print(f"  Tokens: {tokens}")
        print(f"  IDs: {ids}\n")

# Example
compare_tokenizers("Unconventional tokenization methods are cool!")
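The "##" pieces in BERT's output come from WordPiece's inference step, which greedily takes the longest vocabulary entry that matches at the current position. The sketch below illustrates that longest-match-first loop. The tiny vocabulary is hypothetical, so the splits it produces will not necessarily match those of the real bert-base-uncased vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece-style subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary (not the real BERT vocabulary).
vocab = {"token", "##ization", "##ize", "un", "##con", "##vent", "##ional", "cool"}

for word in ["tokenization", "unconventional", "cool", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, vocab))
```

Because a word with no matching piece collapses to a single unknown token, real vocabularies include individual characters so that almost any word can be decomposed.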
Implementing a universal word segmentation pipeline
Putting the pieces together, we can wrap everything in a universal tokenizer that detects Chinese and English automatically. It uses Jieba for Chinese and a Hugging Face tokenizer for English, and it can also produce tensor encodings that a model can consume directly.
import re
from typing import List

import jieba
from transformers import BertTokenizer

class UniversalTokenizer:
    """Universal Chinese/English tokenizer."""

    def __init__(self, lang="auto", zh_model="bert-base-chinese", en_model="bert-base-uncased"):
        self.lang = lang
        self.zh_tokenizer = BertTokenizer.from_pretrained(zh_model)
        self.en_tokenizer = BertTokenizer.from_pretrained(en_model)

    def _detect_lang(self, text: str) -> str:
        """Simple language detection: Chinese if more than 30% of word characters are Chinese."""
        chinese_chars = len(re.findall(r'[\u4e00-\u9fff]', text))
        total_chars = len(re.findall(r'\w', text))
        return "zh" if total_chars > 0 and chinese_chars / total_chars > 0.3 else "en"

    def tokenize(self, text: str) -> List[str]:
        """Main tokenization function."""
        actual_lang = self._detect_lang(text) if self.lang == "auto" else self.lang
        if actual_lang == "zh":
            # Chinese: simple cleaning + Jieba precise mode
            clean_text = re.sub(r'[^\u4e00-\u9fffa-zA-Z0-9\s]', ' ', text)
            # Drop the whitespace tokens left behind by the cleaning step
            return [tok for tok in jieba.lcut(clean_text) if tok.strip()]
        # English: BERT WordPiece tokenization
        return self.en_tokenizer.tokenize(text.lower())

    def encode(self, text: str, max_len=128):
        """Encode text as model input (PyTorch tensors).

        Note: this uses the BERT tokenizer's own segmentation, so for Chinese
        the IDs are character-level and do not correspond to the Jieba tokens.
        """
        actual_lang = self._detect_lang(text) if self.lang == "auto" else self.lang
        tokenizer = self.zh_tokenizer if actual_lang == "zh" else self.en_tokenizer
        return tokenizer(text, padding=True, truncation=True, max_length=max_len, return_tensors="pt")

# Usage example
tokenizer = UniversalTokenizer()
print("Chinese tokens:", tokenizer.tokenize("自然语言处理和大模型"))
print("English tokens:", tokenizer.tokenize("Natural language processing"))
Practical applications and cases
Word segmentation in sentiment analysis
In sentiment analysis tasks, we usually need to filter stop words after word segmentation to retain words with real emotional color:
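import jieba

def sentiment_tokenize(text: str):
    """Tokenization pipeline for sentiment analysis."""
    tokens = jieba.lcut(text)
    # Simplified stop-word list
    stopwords = {'的', '了', '在', '是', '我', '有', '和', '就', '不', '很'}
    # The length filter also drops punctuation and every other single-character token
    return [t for t in tokens if t not in stopwords and len(t) > 1]

text = "这个产品真的很好用,强烈推荐!"
print("Sentiment tokens:", sentiment_tokenize(text))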
Text similarity calculation
Based on the word segmentation results, we can use TF-IDF to calculate text similarity, which is very practical in scenarios such as document retrieval and deduplication:
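import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cal_similarity(text1, text2):
    """Cosine similarity over TF-IDF vectors of the segmented texts."""
    # Segment first, then join with spaces so the vectorizer can split on them
    tok1 = " ".join(jieba.lcut(text1))
    tok2 = " ".join(jieba.lcut(text2))
    # Compute TF-IDF. The default token_pattern drops single-character tokens
    # (such as '我' or '是'), so use a pattern that keeps them.
    vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    tfidf = vec.fit_transform([tok1, tok2])
    # Compute cosine similarity
    return cosine_similarity(tfidf[0:1], tfidf[1:2])[0][0]

text_a = "我喜欢学习自然语言处理"
text_b = "自然语言处理是我喜欢的方向"
print(f"Similarity: {cal_similarity(text_a, text_b):.4f}")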
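For intuition about what the cosine score measures, here is a dependency-free sketch that compares raw term-count vectors instead of TF-IDF weights. The two token lists are segmented by hand for illustration (an assumption, since Jieba's actual output may differ), and the function name `cosine_sim` is made up for this example.

```python
import math
from collections import Counter

def cosine_sim(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both texts contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hand-segmented versions of the two example sentences.
tokens_a = ["我", "喜欢", "学习", "自然语言处理"]
tokens_b = ["自然语言处理", "是", "我", "喜欢", "的", "方向"]

print(f"{cosine_sim(tokens_a, tokens_b):.4f}")
# 0.6124
```

Three of the four words in the first sentence also appear in the second, which gives 3 / (2 × √6) ≈ 0.61. TF-IDF weighting changes the exact number but not the principle.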
Word segmentation is the cornerstone of NLP. A good path is to master basic tools such as Jieba first and then study the subword tokenization algorithms in depth. In real projects, prefer Hugging Face's pre-trained tokenizers.
Summary
Word segmentation is a foundational step in NLP. This article covered:
- Chinese word segmentation: Jieba's three modes, custom dictionaries, and part-of-speech tagging
- English/subword segmentation: the principles of BPE and WordPiece, and how to use Hugging Face tokenizers
- Practical integration: a universal tokenizer that supports both Chinese and English
- Application scenarios: sentiment analysis and text similarity calculation
💡 Key takeaway: for Chinese scenarios, use Jieba (general purpose) or a domain-specific segmenter; for English and multilingual scenarios, use a Hugging Face pre-trained tokenizer directly.