Text cleaning and normalization: stop word filtering, regular expressions and stemming
📂 Stage: Stage 1 - Text Preprocessing (Cornerstone) 🔗 Related chapters: Word Segmentation · Text Feature Engineering
1. Why is text cleaning needed?
The unstructured text we work with is rarely “ready to use”. Crawled web pages, social media posts, and customer service chat logs are all full of distracting information such as links, emoticons, HTML tags, and repeated punctuation.
This noise distracts the model, reduces computational efficiency, and interferes with the final results:
- Links and emoticons do not contain business semantics at all, but occupy character positions;
- HTML tags are just layout tags and do not contribute to semantic understanding;
- Repeated punctuation and modal particles ("!!!", "Oh oh") introduce redundant features.
Ideally, we keep only the core word sequence:
["年度", "大促", "手慢", "无", "仅限", "新粉"]
💡 The goal of text cleaning can be summarized in one sentence: **"clean" the text, keeping only the information useful for the task and throwing away the noise.**
2. Basic cleaning “toolbox”
Most everyday cleaning can be done with Python's regular expressions plus the standard library. Reach for third-party libraries only in more complex scenarios.
2.1 Remove HTML tags
Simple tags can be handled with regular expressions, but for nested structures or incomplete tags it is better to parse the markup with lxml.
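As a minimal sketch of both routes, the block below strips tags with a regex and, for messier markup, with a parser. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for lxml; the function names are illustrative and not from the original chapter.

```python
import re
from html import unescape
from html.parser import HTMLParser

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(html_text: str) -> str:
    """Remove simple, well-formed tags with a regex, then decode entities."""
    return unescape(TAG_RE.sub("", html_text))


class _TextExtractor(HTMLParser):
    """Collects text nodes only; tolerates nesting and unclosed tags.

    Note: the contents of <script>/<style> are NOT filtered in this sketch.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_parser(html_text: str) -> str:
    """Parser-based stand-in for the lxml approach mentioned in the text."""
    extractor = _TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    return "".join(extractor.parts)
```

The regex version is enough for short, predictable snippets. The parser version is the safer default when the input comes from crawled pages.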
2.2 Clean up URLs, email addresses and redundant placeholders
In general NLP tasks such as classification and summarization, this information can usually be removed directly because it carries little semantic value.
⚠️ Note: If the task is specifically to extract links or email addresses, do not clean them this way. Keep them.
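A rough sketch of this step is shown below. The two patterns are deliberately simplified assumptions, not full URL or email grammars: the URL pattern only consumes printable ASCII, so Chinese text glued to a link is left intact.

```python
import re

# Simplified patterns (assumptions for illustration, not RFC-complete).
URL_RE = re.compile(r"(?:https?://|www\.)[\x21-\x7e]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def remove_urls_and_emails(text: str) -> str:
    """Replace URLs and email addresses with a space, then tidy whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Replacing with a space rather than an empty string keeps the words on either side of a link from being fused together.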
2.3 Full-width to half-width + Unicode normalization
Different platforms and input methods produce visually similar but distinct characters. For example, full-width "ＨＥＬＬＯ" and half-width "HELLO" should be treated as the same word, so they need to be unified.
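One way to do this, sketched below, is Unicode NFKC normalization from the standard library. A manual code-point mapping is included for comparison, so the mechanism is visible. Both function names are illustrative.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """NFKC folds full-width forms (and other compatibility characters,
    such as ligatures) into their canonical equivalents."""
    return unicodedata.normalize("NFKC", text)


def to_half_width(text: str) -> str:
    """Manual mapping: full-width ASCII variants U+FF01..U+FF5E sit exactly
    0xFEE0 above their half-width counterparts; U+3000 is the wide space."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)
```

NFKC is broader than a pure width conversion, so the manual version may be preferable when only full-width ASCII should change.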
2.4 Remove duplicate redundancy + extra spaces
Elongated expressions such as “ahhhhhh love”, runs of line breaks, and tabs are common in social media. They cause feature bloat and should be compressed.
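The compression can be sketched with two small regex helpers. The cap of two repeated characters and the decision to leave digits alone (so that "1000" is not damaged) are assumptions made for this example.

```python
import re


def squeeze_repeats(text: str, max_repeat: int = 2) -> str:
    """Collapse runs of the same non-digit character longer than max_repeat."""
    pattern = re.compile(r"(\D)\1{%d,}" % max_repeat)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)


def squeeze_whitespace(text: str) -> str:
    """Collapse spaces, tabs and line breaks into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

Run URL removal before `squeeze_repeats`, otherwise a prefix such as "www" would be shortened and the link would no longer match.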
3. Stop word filtering: remove "semantic noise"
Stop words are words that appear frequently but carry little business meaning, such as "的, 了, 和" in Chinese and "the, a, is" in English. Filtering them out reduces the feature dimension and lets the model focus on the content that matters.
3.1 Basic use of Chinese stop words
We can first segment the text with jieba, then filter out useless words using a stop word list.
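A minimal sketch of that flow follows. The five-word stop list is a toy assumption standing in for a real stop word file, and jieba is imported lazily so the filtering logic can be tried without it installed.

```python
# Tiny illustrative stop list; a real project would load a full file.
STOP_WORDS = {"的", "了", "和", "是", "在"}


def segment(text: str) -> list:
    """Segment Chinese text with jieba (third-party, imported lazily)."""
    import jieba

    return jieba.lcut(text)


def remove_stop_words(tokens, stop_words=STOP_WORDS) -> list:
    """Drop stop words and whitespace-only tokens, keeping the order."""
    return [t for t in tokens if t.strip() and t not in stop_words]
```

With jieba installed, the two steps chain as `remove_stop_words(segment(text))`.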
3.2 Custom stop word loading
In real projects, we need to add domain-specific stop words. For example, "Hello" and "patient" can be removed in the medical domain, and "Dear" and "free shipping" in e-commerce. These words are usually kept in local files.
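Loading such a file might look like the sketch below. The one-word-per-line UTF-8 format and both function names are assumptions, since the chapter does not specify a file layout.

```python
from pathlib import Path


def load_stop_words(path, encoding: str = "utf-8") -> set:
    """Read one stop word per line, ignoring blank lines and padding."""
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip()
        if word:
            words.add(word)
    return words


def merge_stop_words(base, *extras) -> set:
    """Combine a general list with any number of domain-specific lists."""
    merged = set(base)
    for extra in extras:
        merged |= set(extra)
    return merged
```

Keeping the general list and the domain lists in separate files makes it easier to switch domains without editing the base list.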
🧠 Tip: The stop word list is not static. Keep iterating on it based on the bad cases of each task. For example, in sentiment analysis, adverbs such as "very" and "extremely" are important and should not be deleted lightly.
4. English only: stemming vs. lemmatization
Chinese has no complex inflection, but English words change form (running, runs, ran). Treating these as three independent words inflates the feature dimension, so they need to be unified into a base form or stem.
4.1 Stemming
Stemming is fast, but the results are not guaranteed to be valid English words. It suits tasks that do not need precise word forms, such as keyword extraction and coarse classification.
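To make the trade-off concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy written for this illustration, not the Porter algorithm. In practice a library stemmer (for example NLTK's `PorterStemmer`) would normally be used.

```python
# Suffixes are tried in this order; the list is an assumption for the demo.
SUFFIXES = ("ing", "ies", "ed", "es", "ly", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("running", "runs", "ran"):
        print(w, "->", toy_stem(w))
```

With this toy, "running" becomes "runn" (not a real word), "runs" becomes "run", and the irregular form "ran" is left untouched. That is exactly the kind of imprecision that rule-based stemming accepts in exchange for speed.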
4.2 Lemmatization
Lemmatization relies on part-of-speech (POS) tagging to restore words accurately, and the results are real dictionary words. It suits translation, question answering, and other tasks that demand high accuracy.
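The role of the POS tag can be illustrated with a toy lookup table. The table and tag names below are hypothetical. Real lemmatizers such as NLTK's `WordNetLemmatizer` or spaCy rely on full dictionaries and a tagger instead.

```python
# Hypothetical mini dictionary keyed by (surface form, POS tag).
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())
```

The same surface form "meeting" maps to "meet" as a verb but stays "meeting" as a noun, which is why the POS tag is needed for an accurate result.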
📌 Which method to choose?
- For speed and a smaller feature dimension: choose stemming.
- For accuracy and strict downstream requirements: choose lemmatization.
5. Quickly build a preprocessing pipeline
Wrapping the previous steps in a class makes them reusable and simplifies batch processing of text in different formats. The example below supports both Chinese and English modes.
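A possible shape for such a class is sketched here. The class name, parameters, and step order are assumptions, not the chapter's original code. Chinese mode imports jieba lazily unless a tokenizer is supplied, and English mode falls back to a simple regex tokenizer.

```python
import re
import unicodedata
from html import unescape


class TextPreprocessor:
    """Hypothetical pipeline: clean -> tokenize -> filter stop words."""

    TAG_RE = re.compile(r"<[^>]+>")
    URL_RE = re.compile(r"(?:https?://|www\.)[\x21-\x7e]+", re.IGNORECASE)
    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
    REPEAT_RE = re.compile(r"(\D)\1{2,}")

    def __init__(self, mode="zh", stop_words=None, tokenizer=None):
        if mode not in ("zh", "en"):
            raise ValueError("mode must be 'zh' or 'en'")
        self.mode = mode
        self.stop_words = set(stop_words or ())
        self.tokenizer = tokenizer  # optional callable: str -> list of tokens

    def clean(self, text: str) -> str:
        text = unescape(self.TAG_RE.sub(" ", text))   # HTML tags and entities
        text = unicodedata.normalize("NFKC", text)    # full-width -> half-width
        text = self.URL_RE.sub(" ", text)             # links
        text = self.EMAIL_RE.sub(" ", text)           # email addresses
        text = self.REPEAT_RE.sub(r"\1\1", text)      # "!!!" -> "!!"
        return re.sub(r"\s+", " ", text).strip()      # whitespace

    def tokenize(self, text: str) -> list:
        if self.tokenizer is not None:
            return self.tokenizer(text)
        if self.mode == "zh":
            import jieba  # third-party, only needed for Chinese mode

            return jieba.lcut(text)
        return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

    def process(self, text: str) -> list:
        tokens = self.tokenize(self.clean(text))
        return [t for t in tokens if t.strip() and t not in self.stop_words]

    def process_batch(self, texts) -> list:
        return [self.process(t) for t in texts]
```

For example, `TextPreprocessor(mode="en", stop_words={"the", "is"}).process(text)` returns a token list for one document, and `process_batch` applies the same steps to a list of documents. Stemming or lemmatization could be added as one more step after tokenization.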
✅ After pipeline processing, the dirty text is converted into a clean word list that can be fed directly to the model.
6. Guide to avoid pitfalls: Don’t “over-clean”
Cleaner is not always better. In many cases you need to be more lenient, depending on the task:
- Sentiment Analysis: Exclamation marks, question marks, emoticons ("😠", "😊") often contain strong emotional tendencies and should not be deleted.
- Named Entity Recognition (NER): Digits and full-width characters may appear in person and company names (such as "Zhang San (General Manager)" or "ABC Co., Ltd."), so they should not be forcibly converted to half-width or deleted.
- Machine translation: Repeated characters and specific punctuation (such as book title marks 《》 and quotation marks) may be part of the style and need to be retained.
🧼 The ultimate principle of cleaning: start from the task requirements, and remove noise without losing important information.
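One way to encode this principle is a per-task switchboard, sketched below. The profile names and flag values are assumptions derived loosely from the list above, and would need tuning against real data.

```python
import re
import unicodedata

# Hypothetical per-task settings; adjust after inspecting bad cases.
TASK_PROFILES = {
    "classification": {"normalize_width": True, "squeeze_repeats": True, "strip_symbols": True},
    "sentiment": {"normalize_width": True, "squeeze_repeats": True, "strip_symbols": False},
    "ner": {"normalize_width": False, "squeeze_repeats": True, "strip_symbols": False},
    "translation": {"normalize_width": False, "squeeze_repeats": False, "strip_symbols": False},
}


def clean_for_task(text: str, task: str) -> str:
    """Apply only the cleaning steps enabled for the given task."""
    profile = TASK_PROFILES[task]
    if profile["normalize_width"]:
        text = unicodedata.normalize("NFKC", text)
    if profile["squeeze_repeats"]:
        text = re.sub(r"(\D)\1{2,}", r"\1\1", text)
    if profile["strip_symbols"]:
        # Unicode categories P* (punctuation) and S* (symbols, incl. emoji).
        text = "".join(ch for ch in text if unicodedata.category(ch)[0] not in "PS")
    return re.sub(r"\s+", " ", text).strip()
```

The same input "Great!!! 😊" comes out as "Great" for classification but keeps its punctuation and emoji as "Great!! 😊" for sentiment analysis.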
7. Quick summary
✅ At this point, you have mastered the entire process of text cleaning. The next step is to feed the clean data into feature engineering or models to start the real NLP journey!

