---
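title: Text Cleaning and Normalization
description: Master the full text preprocessing workflow, from stop word filtering and advanced regular expressions to stemming, Unicode normalization, and case conversion.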
---

Text cleaning and normalization: stop word filtering, regular expressions and stemming

📂 Stage: Stage 1 - Text Preprocessing (Cornerstone) 🔗 Related chapters: Tokenization · Text Feature Engineering


1. Why is text cleaning needed?

The unstructured text we come into contact with is rarely “ready to use”. Whether it’s crawled web pages, social media posts, or customer service chat records, they are filled with all kinds of distracting information:

Raw promotional text from a web page:
"<p>【年度大促】🔥 手慢无!!!👉 跳转 https://example-discount.com 🎁 仅限新粉哦~</p>"

This noise distracts the model, wastes computation, and can skew the final results:

  • Links and emoji usually carry no business semantics, yet they still take up character positions;
  • HTML tags only describe layout and contribute nothing to semantic understanding;
  • Repeated punctuation and modal particles ("!!!", "哦~") introduce redundant features.

Ideally, we only retain the most core word sequences: ["年度", "大促", "手慢", "无", "仅限", "新粉"]

💡 The goal of text cleaning can be summarized in one sentence: **clean the text, keep only the information useful for the task, and throw away the noise.**


2. Basic cleaning “toolbox”

Most day-to-day cleaning can be done with Python's re module and the standard library. Reach for third-party libraries only in more complex scenarios.

2.1 Remove HTML tags

Simple tags can be handled with a regular expression, but for nested structures or incomplete tags it is better to parse the text with lxml.

import re
from html import unescape
from lxml import html

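def remove_html(text, use_lxml=False):
    """Strip HTML tags and decode entities (e.g. &lt; → <)."""
    if use_lxml:
        # lxml copes better with nested or broken markup; text_content()
        # already decodes entities, so no extra unescape() is needed here
        try:
            return html.fromstring(text).text_content()
        except Exception:
            pass  # parsing failed (e.g. empty input): fall back to the regex
    # Fallback: strip tags first, then decode entities, so that an escaped
    # "&lt;b&gt;" in the text is not mistaken for a real tag and removed
    return unescape(re.sub(r'<[^>]+>', '', text))

text = "<p>这是一段<strong>嵌套HTML</strong>文本,还有<未闭合标签"
print(remove_html(text, use_lxml=True))
# Example output (handling of the unclosed tag may vary with the lxml/libxml2 version):
# 这是一段嵌套HTML文本,还有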
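
A regex that only strips tags keeps whatever sits between `<script>` or `<style>` tags, so JavaScript or CSS can leak into the cleaned text. When lxml is not available, the standard library's `html.parser` can be used instead. The sketch below is one possible approach, not part of the original toolbox; the names `_TextExtractor` and `strip_html_stdlib` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html_stdlib(text):
    """Hypothetical helper: extract text using only the standard library."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = "<div><script>var x = 1;</script><p>价格 &lt; 100 元,<b>包邮</b></p></div>"
print(strip_html_stdlib(sample))
# → 价格 < 100 元,包邮
```

The script body is dropped and the escaped `&lt;` survives as a literal `<` instead of being treated as the start of a tag.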

2.2 Clean up URLs, email addresses and redundant placeholders

In general NLP tasks such as classification and summarization, this information can usually be removed outright because it carries little semantic content.

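def clean_noise(text):
    """Remove URLs, email addresses, and (optionally) placeholder numbers."""
    # URLs starting with http://, https://, or www.
    # Note: \S+ stops only at whitespace, so a URL glued to following text
    # takes that text with it
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Email addresses in the usual local@domain.tld form
    text = re.sub(r'\S+@\S+\.\S+', '', text)
    # Optional: placeholder-like digit runs (ticket IDs, ID-number fragments)
    # text = re.sub(r'\d{4,}', '', text)
    return text

text = "工单处理进度可戳 https://xxx.com/ticket ,或发邮件至 service@xxx.com"
print(repr(clean_noise(text)))
# '工单处理进度可戳  ,或发邮件至 '  (the spaces around the removed items remain)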

⚠️ Note: If the task itself is to extract links or email addresses, do not clean them this way; keep them.
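
Between deleting and keeping there is a middle ground: replace each link or address with a placeholder token, so the model still sees that one was present. The sketch below illustrates the idea under stated assumptions. The tokens `<URL>` and `<EMAIL>`, the function name `mask_noise`, and the stricter patterns (which stop at CJK characters so a URL glued to Chinese text does not swallow it) are choices made for this example, not part of the original code.

```python
import re

# Assumption: URLs end at whitespace, angle brackets, quotes, or CJK characters
URL_RE = re.compile(r'https?://[^\s<>"\u4e00-\u9fff]+|www\.[^\s<>"\u4e00-\u9fff]+')
# Assumption: a simplified email pattern, good enough for noise masking
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')


def mask_noise(text, url_token="<URL>", email_token="<EMAIL>"):
    """Hypothetical helper: replace emails and URLs with placeholder tokens."""
    # Mask emails first so a domain like www.example.com inside an address
    # is not picked up by the URL pattern
    text = EMAIL_RE.sub(email_token, text)
    text = URL_RE.sub(url_token, text)
    return text


sample = "详情见https://xxx.com/ticket或联系service@xxx.com"
print(mask_noise(sample))
# → 详情见<URL>或联系<EMAIL>
```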

2.3 Full-width to half-width + Unicode normalization

Different platforms and input methods produce look-alike characters: full-width "HELLO" and half-width "HELLO" should count as the same word, so they need to be unified.

import unicodedata

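def full_to_half(text):
    """Convert full-width letters, digits, spaces, and symbols to half-width."""
    result = []
    for char in text:
        code = ord(char)
        # Ideographic (full-width) space U+3000 → ordinary space
        if code == 12288:
            code = 32
        # Full-width ASCII block, from "!" (U+FF01) to "~" (U+FF5E)
        elif 65281 <= code <= 65374:
            code -= 65248
        result.append(chr(code))
    return "".join(result)

def normalize_text(text):
    """Apply Unicode normalization (NFC is the most common form)."""
    # NFC composes sequences: "a" + combining tilde (U+0303) → "ã"
    text = unicodedata.normalize("NFC", text)
    # Optional: strip accents ("ñ" → "n"; use with care). Decompose with NFD
    # first, otherwise precomposed characters contain no combining marks.
    # text = "".join(c for c in unicodedata.normalize("NFD", text)
    #                if not unicodedata.combining(c))
    return text

text = "HELLO~123,我今年25岁啦"
text = full_to_half(normalize_text(text))
print(text)
# HELLO~123,我今年25岁啦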
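
For comparison, Unicode's compatibility form NFKC performs the full-width to half-width folding on its own, on top of the composition that NFC does. The sketch below contrasts the two forms; the sample strings are made up for illustration. NFKC is more aggressive than a hand-written range mapping, because it also rewrites other compatibility characters such as circled digits and unit symbols, so whether it is appropriate depends on the task.

```python
import unicodedata

# Full-width letters, digits, and punctuation, plus "e" followed by a
# combining acute accent (U+0301)
sample = "HELLO~123,cafe\u0301"

nfc = unicodedata.normalize("NFC", sample)
nfkc = unicodedata.normalize("NFKC", sample)

print(nfc)   # full-width characters are kept; "e" + U+0301 becomes "é"
print(nfkc)  # full-width characters are folded to ASCII as well
# → HELLO~123,café
# → HELLO~123,café

# NFKC also rewrites other compatibility characters, which may be unwanted
print(unicodedata.normalize("NFKC", "①㎏"))
# → 1kg
```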

2.4 Compress repeated characters + extra whitespace

Stretched words such as "啊啊啊啊啊", consecutive line breaks, and tabs are common in social media text. They bloat the feature space and should be compressed.

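def clean_duplicate_and_space(text, max_char_repeat=2):
    """Compress runs of a repeated character and merge extra whitespace."""
    # Cap runs of one character at max_char_repeat ("啊啊啊啊" → "啊啊")
    text = re.sub(r'(.)\1{%d,}' % max_char_repeat, r'\1' * max_char_repeat, text)
    # Merge newlines/tabs/spaces into one space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

text = "  啊啊啊啊啊   今天真的!真的!真的!开心到飞起\n\n"
print(clean_duplicate_and_space(text))
# 啊啊 今天真的!真的!真的!开心到飞起
# (the repeated phrase "真的!" is untouched: the pattern only matches
#  runs of a single character)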
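
The single-character pattern above only collapses runs of one repeated character; a repeated phrase such as "真的!真的!真的!" needs a pattern whose group spans several characters. The sketch below is one way to do that. The function name `compress_repeated_phrases` and the `max_phrase_len` limit are assumptions made for this example. Be aware that it can also shorten legitimate repetition, such as repeated digit groups, so it is best applied only to informal text.

```python
import re


def compress_repeated_phrases(text, max_repeat=2, max_phrase_len=6):
    """Hypothetical helper: cap consecutive repeats of a short phrase."""
    # (.{2,N}?) lazily captures a phrase of 2..N characters;
    # \1{max_repeat,} requires at least max_repeat further copies of it
    pattern = re.compile(r'(.{2,%d}?)\1{%d,}' % (max_phrase_len, max_repeat))
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)


print(compress_repeated_phrases("今天真的!真的!真的!开心到飞起"))
# → 今天真的!真的!开心到飞起
```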

3. Stop word filtering: remove "semantic noise"

Stop words are words that appear frequently but carry little business meaning, such as "的", "了", and "和" in Chinese or "the", "a", and "is" in English. Filtering them out reduces the feature dimension and lets the model focus on the content that matters.

3.1 Basic use of Chinese stop words

We can first segment the text with jieba, then filter out unhelpful words using a stop word list.

import jieba

# Basic Chinese stop word list (extend it for your domain, e.g. generic
# promotional words such as "点击" and "了解")
STOPWORDS_ZH = {
    '的', '了', '和', '是', '就', '都', '而', '及', '与', '着',
    '或', '一个', '没有', '我们', '你们', '他们', '这个', '那个',
    '啊', '呀', '呢', '吧', '吗', '哦', '哈', '嗯', '哎',
    '过', '地', '得',
}

def filter_stopwords(tokens, stopwords, min_len=1):
    """Drop stop words and tokens shorter than min_len
    (e.g. single-character particles or leftover punctuation)."""
    return [
        t for t in tokens
        if t not in stopwords and len(t) >= min_len
    ]

# Test
text = "我今天在省图书馆认真学习自然语言处理技术"
tokens = jieba.lcut(text)
filtered_tokens = filter_stopwords(tokens, STOPWORDS_ZH, min_len=2)
print(filtered_tokens)
# Example output (exact segmentation depends on the jieba version and dictionary;
# for instance "自然语言处理" may be split into "自然语言" / "处理"):
# ['今天', '图书馆', '认真', '学习', '自然语言处理', '技术']
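
The same filtering idea applies to English. The sketch below uses a deliberately tiny stop word set and a simple regex tokenizer so that it runs without extra downloads; both are assumptions made for this example, and a real project would normally use a fuller list and a proper tokenizer.

```python
import re

# Assumption: a tiny illustrative list, not a complete English stop word list
STOPWORDS_EN = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}


def filter_stopwords_en(text, stopwords=STOPWORDS_EN):
    """Hypothetical helper: lowercase, tokenize with a regex, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stopwords]


print(filter_stopwords_en("The cat is on the mat and the dog is in the garden"))
# → ['cat', 'mat', 'dog', 'garden']
```

Lowercasing before the lookup matters: without it, "The" at the start of a sentence would slip past a lowercase stop word list.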

3.2 Custom stop word loading

In real projects, the stop word list needs domain-specific additions. For example, greetings such as "hello" and generic terms such as "patient" can be removed in the medical domain, and "亲" (dear) and "包邮" (free shipping) in e-commerce. These words are usually kept in a local file.

def load_custom_stopwords(file_path):
    """Load stop words from a local UTF-8 file (one word per line)."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        print(f"⚠️ Stop word file {file_path} not found, using an empty set")
        return set()

# Assume stopwords_zh_ecommerce.txt contains "亲", "包邮", "限时"
custom_stopwords = load_custom_stopwords("stopwords_zh_ecommerce.txt")

🧠 Tip: A stop word list is never final. Keep iterating on it based on the bad cases of the specific task. In sentiment analysis, for example, degree adverbs such as "very" and "extremely" carry real signal and should not be removed lightly.
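
One way to keep the list task-specific is to compose it from a base set, domain additions, and a set of protected words that must never be filtered. The sketch below shows the set arithmetic; all three word sets and the name `build_stopwords` are illustrative assumptions.

```python
# Assumption: small illustrative sets; real lists would be loaded from files
BASE_STOPWORDS = {"的", "了", "很", "非常", "是", "也"}
DOMAIN_STOPWORDS = {"亲", "包邮", "限时"}        # e-commerce additions
SENTIMENT_KEEP = {"很", "非常", "不", "没有"}    # degree adverbs and negations to protect


def build_stopwords(base, extra=(), keep=()):
    """Hypothetical helper: (base ∪ extra) minus the protected words."""
    return (set(base) | set(extra)) - set(keep)


stopwords = build_stopwords(BASE_STOPWORDS, DOMAIN_STOPWORDS, SENTIMENT_KEEP)
tokens = ["亲", "这个", "商品", "非常", "好", "的"]
print([t for t in tokens if t not in stopwords])
# → ['这个', '商品', '非常', '好']
```

Here "非常" survives even though it is in the base list, because the sentiment task marks it as protected.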


4. English only: stemming vs. lemmatization

Chinese has no complex inflection, but English words change form (running / runs / ran). Treated as three independent words, they inflate the feature dimension, so they need to be reduced to a common stem or base form (lemma).

4.1 Stemming

Stemming is fast, but the results are not guaranteed to be valid English words. It suits tasks that do not depend on exact word forms, such as keyword extraction and coarse classification.

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer

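# Stemming itself needs no downloads; the punkt tokenizer data is fetched here
# for nltk.word_tokenize, which the lemmatization example uses
nltk.download('punkt', quiet=True)
# Assumption: recent NLTK releases (3.9+) look for 'punkt_tab' instead
nltk.download('punkt_tab', quiet=True)

porter = PorterStemmer()        # milder, keeps more of the word
lancaster = LancasterStemmer()  # more aggressive, strips more

words = ["running", "runs", "ran", "runner", "easily", "fairly"]
print("Porter stems:", [porter.stem(w) for w in words])
print("Lancaster stems:", [lancaster.stem(w) for w in words])

# Output (may vary slightly across NLTK versions)
# Porter stems: ['run', 'run', 'ran', 'runner', 'easili', 'fairli']
# Lancaster stems: ['run', 'run', 'ran', 'run', 'easy', 'fair']
# Both are rule-based suffix strippers, so the irregular form "ran" stays unchanged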
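
To see why stems are not always real words, it helps to look at how suffix stripping works in principle. The sketch below is a toy stemmer with a handful of made-up rules. It is not the Porter or Lancaster algorithm, and `SUFFIX_RULES` and `toy_stem` are hypothetical names used only for illustration.

```python
# Assumption: a few hand-picked rules, checked in order (longest first)
SUFFIX_RULES = [("ing", ""), ("ily", "i"), ("ly", ""), ("s", "")]


def toy_stem(word, min_stem_len=3):
    """Toy suffix stripper: apply the first rule that leaves a long enough stem."""
    for suffix, replacement in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)] + replacement
        if len(stem) < min_stem_len:
            continue
        # Collapse a doubled final consonant: "runn" → "run"
        if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
            stem = stem[:-1]
        return stem
    return word  # no rule applied (e.g. irregular forms such as "ran")


words = ["running", "runs", "ran", "easily", "fairly"]
stems = [toy_stem(w) for w in words]
print(stems)
# → ['run', 'run', 'ran', 'easi', 'fair']
print(len(set(words)), "→", len(set(stems)))
# → 5 → 4
```

The rules know nothing about vocabulary, which is why "easily" becomes the non-word "easi" and "ran" is never linked to "run". The vocabulary still shrinks from five distinct forms to four.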

4.2 Lemmatization

Lemmatization relies on part-of-speech (POS) tagging to restore words accurately, and its results are real dictionary words. It suits tasks that demand higher accuracy, such as translation and question answering.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
# Assumption: recent NLTK releases (3.9+) use this resource name for the tagger
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag):
    """Map an NLTK (Penn Treebank) POS tag to the WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

text = "He is running faster than me because he runs every day and ran yesterday"
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

lemmas = [
    lemmatizer.lemmatize(token, get_wordnet_pos(tag))
    for token, tag in pos_tags
]
print(lemmas)
# ['He', 'be', 'run', 'faster', 'than', 'me', 'because', 'he', 'run', 'every', 'day', 'and', 'run', 'yesterday']

📌 Which method to choose?

  • Speed and a smaller feature dimension matter most: choose stemming;
  • Accuracy matters and downstream tasks are strict: choose lemmatization.


5. Quickly build a preprocessing pipeline

Wrapping the previous steps in a class makes them reusable and convenient for batch processing. The example below supports both a Chinese and an English mode.

from functools import reduce

class SimpleTextPreprocessor:
    def __init__(self, stopwords=None, lang='zh'):
        self.stopwords = stopwords or set()
        self.lang = lang

    # Cleaning shared by Chinese and English
    def _clean_general(self, text):
        steps = [
            lambda x: remove_html(x, use_lxml=True),
            clean_noise,
            normalize_text,
            lambda x: full_to_half(x) if self.lang == 'zh' else x,
            clean_duplicate_and_space,
        ]
        return reduce(lambda t, fn: fn(t), steps, text)

    # Chinese: segment + filter
    def _process_zh(self, text):
        tokens = jieba.lcut(text)
        # min_len=2 also drops single-character tokens (punctuation, emoji,
        # and one-character words such as "抢")
        return filter_stopwords(tokens, self.stopwords, min_len=2)

    # English: tokenize + lemmatize (or stem)
    def _process_en(self, text, use_lemmatizer=True):
        tokens = nltk.word_tokenize(text.lower())  # English is usually lowercased
        if use_lemmatizer:
            pos_tags = nltk.pos_tag(tokens)
            tokens = [
                lemmatizer.lemmatize(t, get_wordnet_pos(tag))
                for t, tag in pos_tags
            ]
        else:
            tokens = [porter.stem(t) for t in tokens]
        # Drop stop words and very short tokens, as in the Chinese branch
        return filter_stopwords(tokens, self.stopwords, min_len=2)

    # Main entry point
    def process(self, text, **kwargs):
        cleaned = self._clean_general(text)
        if self.lang == 'zh':
            return self._process_zh(cleaned)
        else:
            return self._process_en(cleaned, **kwargs)

# Chinese test
zh_processor = SimpleTextPreprocessor(stopwords=STOPWORDS_ZH, lang='zh')
zh_raw = "<p>🔥🔥🔥 年度大促!!!快来 https://xxx.com 抢!抢!抢!亲,仅限新粉哦~</p>"
print(zh_processor.process(zh_raw))
# Example output (exact tokens depend on the jieba version and dictionary):
# ['年度', '大促', '快来', '仅限', '新粉']

✅ After pipeline processing, the dirty text is converted into a clean word list that can be fed directly to the model.


6. Pitfalls to avoid: don't "over-clean"

Cleaner is not always better. In many cases you need to go easy on the text, depending on the task:

  • Sentiment analysis: exclamation marks, question marks, and emoji ("😠", "😊") often carry strong emotional signal and should not be deleted.
  • Named entity recognition (NER): digits and full-width characters may be part of person and company names (for example the full-width parentheses in "Zhang San(General Manager)" or full-width letters in "ABC Co., Ltd."), so they should not be forcibly converted to half-width or deleted.
  • Machine translation: repeated characters and specific punctuation (such as book title marks 《》 and quotation marks) may be part of the style and need to be retained.

🧼 The guiding principle of cleaning: start from the task requirements, and remove noise without losing important information.
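
One way to put this principle into practice is to make the list of cleaning steps a per-task setting rather than a fixed sequence. The sketch below is a minimal illustration using only the re module. The step names, the task names, and the rough emoji range are all assumptions made for this example.

```python
import re

# Each step is a small text → text function (simplified versions for illustration)
STEP_FUNCS = {
    "strip_html": lambda t: re.sub(r"<[^>]+>", "", t),
    "strip_urls": lambda t: re.sub(r"https?://\S+|www\.\S+", "", t),
    # Assumption: a rough emoji range, not a complete definition of emoji
    "strip_emoji": lambda t: re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", t),
    "squeeze_repeats": lambda t: re.sub(r"(.)\1{2,}", r"\1\1", t),
    "squeeze_spaces": lambda t: re.sub(r"\s+", " ", t).strip(),
}

# Hypothetical task profiles: sentiment keeps emoji and repeated punctuation
TASK_STEPS = {
    "classification": ["strip_html", "strip_urls", "strip_emoji",
                       "squeeze_repeats", "squeeze_spaces"],
    "sentiment": ["strip_html", "strip_urls", "squeeze_spaces"],
}


def clean_for_task(text, task):
    """Apply only the cleaning steps enabled for the given task, in order."""
    for name in TASK_STEPS[task]:
        text = STEP_FUNCS[name](text)
    return text


raw = "<p>太差了!!!😠 再也不买了 https://xxx.com</p>"
print(clean_for_task(raw, "classification"))
# → 太差了!! 再也不买了
print(clean_for_task(raw, "sentiment"))
# → 太差了!!!😠 再也不买了
```

The same raw text yields two different results: the classification profile strips the emoji and shortens "!!!", while the sentiment profile keeps both as signal.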


7. Quick summary

Operation                    Common tool / code snippet
---------------------------  ----------------------------------------------------------------
Remove HTML tags             lxml.html.fromstring(text).text_content()
Clean up URLs / emails       re.sub(r'https?://\S+|www\.\S+|\S+@\S+\.\S+', '', text)
Full-width to half-width     Custom full_to_half function
Unicode normalization        unicodedata.normalize("NFC", text)
Compress redundancy          re.sub(r'(.)\1{2,}', r'\1'*2, text) + re.sub(r'\s+', ' ', text)
Chinese stop word filtering  jieba.lcut(text) + custom stop word list
English lemmatization        WordNetLemmatizer + POS tagging


✅ At this point, you have mastered the entire process of text cleaning. The next step is to feed the clean data into feature engineering or models to start the real NLP journey!