title: Detailed explanation of text feature engineering: TF-IDF algorithm, similarity calculation and bag-of-word model evolution and PyTorch implementation | Daoman PythonAI
description: Have an in-depth understanding of TF-IDF weight calculation, text vectorization, feature selection, and various similarity measurement methods such as cosine similarity and Euclidean distance. Contains detailed Python implementation and practical application scenarios.
keywords: [TF-IDF, text feature engineering, similarity calculation, bag-of-words model, text vectorization, cosine similarity, machine learning, NLP, natural language processing, PyTorch]
date: 2026-04-10
updated: 2026-04-10
author: DaomanPythonAI
tags: [TF-IDF, text feature engineering, similarity calculation, bag-of-words model, machine learning, NLP]

Detailed explanation of text feature engineering: TF-IDF algorithm, similarity calculation, bag-of-words model evolution, and PyTorch implementation

This article trims redundant content, omits mathematical formulas, and adds a minimalist PyTorch implementation. It is kept under 2,500 words and focuses on practical code and engineering scenarios.

What is text feature engineering?

Machines cannot directly understand strings like "natural language processing"; they only accept numerical data. The core task of text feature engineering is to convert raw text into numerical vectors that carry semantic clues, so that algorithms can process them.

Its four core goals:

  1. Numericalize text: Map text into numbers.
  2. Extract key features: For example, give keywords a higher weight.
  3. Reasonable dimensionality reduction: Avoid the explosive growth of vocabulary.
  4. Preserve semantics: Keep similar texts as close as possible in the vector space.

Bag-of-Words model (BoW)

The bag-of-words model is the "first generation" of text vectorization. The idea is very simple: **completely ignore word order and grammar, and only count how many times each word appears in the document**. It is like pouring all the words into a bag and looking only at how many of each there are.

Minimalist code implementation

With the help of sklearn and the Chinese word segmentation library jieba, this takes just a few lines of code:

from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Custom Chinese tokenizer
def zh_tokenizer(text):
    return jieba.lcut(text)

# Sample corpus
docs = [
    "自然语言处理是人工智能的重要分支",
    "机器学习和深度学习是核心技术",
    "自然语言处理和机器学习密切相关"
]

# Initialize the bag-of-words vectorizer
bow_vec = CountVectorizer(
    tokenizer=zh_tokenizer,
    lowercase=False,       # Chinese text needs no lowercasing
    token_pattern=None     # Unused because a custom tokenizer is supplied
)

# Fit and build the bag-of-words matrix
bow_matrix = bow_vec.fit_transform(docs)

print("Vocabulary:", bow_vec.get_feature_names_out())
print("Bag-of-words matrix:\n", bow_matrix.toarray())

List of advantages and disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Simple and easy to understand, extremely fast to implement | Complete loss of word order (the vectors of "dog bites man" and "man bites dog" are the same) |
| Computationally efficient | Unable to capture semantic associations ("car" and "vehicle" are treated as completely unrelated) |
| Suitable for quickly building text retrieval prototypes | Unable to automatically handle high-frequency stop words such as "的" and "是" |
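
The word-order limitation in the table is easy to verify by hand. The following is a minimal sketch using only the standard library; the helper `bow_vector` and the three-word English vocabulary are illustrative assumptions, not part of sklearn.

```python
from collections import Counter

def bow_vector(text, vocab):
    """Return the count of each vocabulary word in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical toy vocabulary, sorted alphabetically
vocab = ["bites", "dog", "man"]

v1 = bow_vector("dog bites man", vocab)
v2 = bow_vector("man bites dog", vocab)

print(v1)        # [1, 1, 1]
print(v2)        # [1, 1, 1]
print(v1 == v2)  # True: the two sentences are indistinguishable
```

Both sentences map to the same vector, so any model built on these features treats them as the same text.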

Detailed explanation of TF-IDF algorithm

TF-IDF can be seen as an upgraded version of the bag-of-words model, which assigns a weight to each word. The core idea is: If a word appears frequently in the current document but is rare in the entire corpus, then it can represent this document well.

For example, "machine learning" is very important in technical articles, while "of" is common in all articles. TF-IDF gives a high weight to the former and a low weight to the latter.

Core weight logic

  1. Term Frequency (TF): Measures how prominent a word is in the current document.
  2. Inverse Document Frequency (IDF): Measures the "scarcity" of a word in the entire corpus - the rarer the word, the more information it carries and the higher its weight.
  3. Final Weight: The product of TF and IDF.
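
The three steps above can be sketched in plain Python. This is a minimal illustration that assumes the textbook, unsmoothed definitions (TF as count divided by document length, IDF as the natural log of total documents over document frequency). sklearn's TfidfVectorizer uses a smoothed variant, so its numbers will differ. The toy English corpus and the function names are made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized by whitespace
corpus = [
    "machine learning is the core of modern ai".split(),
    "the history of the city".split(),
    "the color of the mat".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Unsmoothed inverse document frequency; assumes `term` occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    """Final weight: the product of TF and IDF."""
    return tf(term, doc) * idf(term, docs)

for term in ("learning", "of"):
    print(term, round(tfidf(term, corpus[0], corpus), 4))
```

In the first document, "learning" gets a positive weight (about 0.137), while "of" gets exactly 0 because it appears in every document and its IDF is log(1).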

Practical Sklearn implementation

TfidfVectorizer not only encapsulates the above logic but also supports advanced features such as n-grams and normalization, making it the first choice in projects:

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba
import numpy as np

def zh_tokenizer(text):
    return jieba.lcut(text)

# A richer corpus
docs = [
    "自然语言处理是人工智能的重要分支",
    "机器学习和深度学习是人工智能的核心技术",
    "深度学习在计算机视觉领域应用广泛",
    "数据科学结合统计学和计算机科学"
]

# Advanced vectorizer configuration
tfidf_vec = TfidfVectorizer(
    tokenizer=zh_tokenizer,
    token_pattern=None,   # Unused because a custom tokenizer is supplied
    max_features=1000,    # Cap the vocabulary size (a simple form of dimensionality reduction)
    min_df=1,             # Keep terms that appear in at least 1 document
    max_df=0.8,           # Drop terms that appear in more than 80% of documents (approximate stop words)
    ngram_range=(1,2),    # Use single tokens and pairs of adjacent tokens
    norm='l2',            # L2-normalize rows, convenient for cosine similarity later
    sublinear_tf=True     # Log-scale TF so very frequent terms do not dominate
)

# Build the TF-IDF matrix
tfidf_matrix = tfidf_vec.fit_transform(docs)

# Show the top 3 keywords of the first document
feature_names = tfidf_vec.get_feature_names_out()
doc1_tfidf = tfidf_matrix[0].toarray().flatten()
top3_idx = np.argsort(doc1_tfidf)[::-1][:3]
print("Top 3 keywords of document 1:")
for idx in top3_idx:
    print(f"  {feature_names[idx]}: {doc1_tfidf[idx]:.4f}")
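
To see what `sublinear_tf=True` in the configuration above does, the sketch below compares raw counts with the scaled value 1 + ln(tf), which is the transformation sklearn documents for this option. The function name `sublinear` is an illustrative assumption.

```python
import math

def sublinear(tf_count):
    """Scale a raw term count as 1 + ln(tf); zero counts stay zero."""
    return 1.0 + math.log(tf_count) if tf_count > 0 else 0.0

for count in (1, 10, 100):
    print(count, round(sublinear(count), 2))
```

A word that occurs 100 times ends up with roughly 5.6 times the weight of a word that occurs once, rather than 100 times the weight.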

Similarity measurement methods

After obtaining the text vectors, the most common operation is to calculate text similarity, which is used in document retrieval, question-answer matching, and similar scenarios.

1. Cosine similarity (most commonly used)

Cosine similarity measures how similar two vectors are by the angle between them: the smaller the angle, the higher the similarity. Its main advantage is that it is unaffected by vector length, so a long document and a short one on the same topic can still score high. For L2-normalized TF-IDF vectors it reduces to a plain dot product.

from sklearn.metrics.pairwise import cosine_similarity

# Compute the pairwise similarity matrix for the whole corpus
sim_matrix = cosine_similarity(tfidf_matrix)
print("Corpus similarity matrix:\n", sim_matrix.round(4))
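
For intuition about what `cosine_similarity` computes, here is a from-scratch sketch in plain Python. The helper `cosine` and the toy vectors are assumptions for illustration; for sparse matrices the sklearn function above is the more practical choice.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice as long
c = [0.0, 0.0, 3.0]   # shares no non-zero dimension with a

print(round(cosine(a, b), 4))  # 1.0
print(round(cosine(a, c), 4))  # 0.0
```

Doubling the length of a vector leaves the score unchanged, which is the length-invariance property described above.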

2. Comparison of other methods

| Method | Logic | Applicable scenarios |
| --- | --- | --- |
| Euclidean distance | Straight-line distance between vectors in Euclidean space | Scenarios where vector length has practical meaning (such as unnormalized word frequency) |
| Manhattan distance | Sum of absolute differences in each dimension | Sparse vectors; less sensitive to outliers than Euclidean distance |
| Jaccard similarity | Ratio of the size of the word-set intersection to the size of the union | Short text and keyword overlap analysis |
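
The three measures in the table can be sketched with the standard library alone. The function names and sample inputs are illustrative assumptions. Returning 1.0 for two empty token lists in `jaccard` is a convention chosen here, not a universal rule.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute per-dimension differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def jaccard(tokens_a, tokens_b):
    """Size of the word-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(union)

u = [1, 0, 2]
v = [0, 0, 4]
print(round(euclidean(u, v), 4))  # 2.2361
print(manhattan(u, v))            # 3
print(jaccard("deep learning model".split(),
              "machine learning model".split()))  # 0.5
```

Note that the first two are distances (smaller means more similar), while Jaccard is a similarity (larger means more similar).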

Minimalist PyTorch TF-IDF implementation

Since the title mentions PyTorch, here is a minimalist, directly runnable version that allows you to customize the underlying logic or embed it into a deep learning pipeline:

import jieba
import torch
from collections import defaultdict

def zh_tokenizer(text):
    return jieba.lcut(text)

class PyTorchTFIDF:
    def __init__(self, max_features=1000, min_df=1, max_df=0.8):
        self.max_features = max_features
        self.min_df = min_df
        self.max_df = max_df
        self.vocab = None
        self.idf = None

    def fit(self, texts):
        # Tokenize
        tokenized = [zh_tokenizer(t) for t in texts]
        total_docs = len(tokenized)

        # Count total occurrences and document frequency for each word
        word_doc_count = defaultdict(int)
        word_total_count = defaultdict(int)
        for tokens in tokenized:
            unique_tokens = set(tokens)
            for token in unique_tokens:
                word_doc_count[token] += 1
            for token in tokens:
                word_total_count[token] += 1

        # Filter the vocabulary by document frequency
        filtered_words = [
            w for w in word_total_count
            if self.min_df <= word_doc_count[w] <= self.max_df * total_docs
        ]
        # Keep the max_features words with the highest total count
        filtered_words.sort(key=lambda x: word_total_count[x], reverse=True)
        self.vocab = {w: i for i, w in enumerate(filtered_words[:self.max_features])}

        # Compute IDF with a smoothing term
        # (this variant differs slightly from sklearn's smoothed IDF)
        self.idf = torch.zeros(len(self.vocab))
        for w, idx in self.vocab.items():
            self.idf[idx] = torch.log(
                torch.tensor(total_docs / (word_doc_count[w] + 1))
            ) + 1
        return self  # allow chaining: PyTorchTFIDF().fit(docs).transform(docs)

    def transform(self, texts):
        tokenized = [zh_tokenizer(t) for t in texts]
        tfidf_matrix = torch.zeros(len(texts), len(self.vocab))

        for i, tokens in enumerate(tokenized):
            token_count = defaultdict(int)
            for t in tokens:
                if t in self.vocab:
                    token_count[t] += 1
            # Normalized TF (term count / total tokens in the document)
            doc_len = len(tokens) if len(tokens) > 0 else 1
            for t, cnt in token_count.items():
                tf = cnt / doc_len
                tfidf_matrix[i, self.vocab[t]] = tf * self.idf[self.vocab[t]]

        # L2 normalization
        norm = torch.norm(tfidf_matrix, p=2, dim=1, keepdim=True)
        norm[norm == 0] = 1.0   # avoid division by zero for empty documents
        tfidf_matrix /= norm
        return tfidf_matrix

# Test (reuses the docs list defined earlier)
pytorch_tfidf = PyTorchTFIDF()
pytorch_tfidf.fit(docs)
pytorch_matrix = pytorch_tfidf.transform(docs)
print("PyTorch TF-IDF matrix shape:", pytorch_matrix.shape)

Practical applications and cases

The following example is the closest to real engineering use: given a query, return the N most similar documents.

class SimpleDocSearch:
    def __init__(self, docs):
        self.docs = docs
        # Use the more robust sklearn implementation
        self.vec = TfidfVectorizer(
            tokenizer=zh_tokenizer,
            token_pattern=None,   # Unused because a custom tokenizer is supplied
            ngram_range=(1,2),
            norm='l2'
        )
        self.matrix = self.vec.fit_transform(docs)

    def search(self, query, top_k=2):
        # Vectorize the query with the fitted vocabulary, then rank documents by cosine similarity
        query_vec = self.vec.transform([query])
        sims = cosine_similarity(query_vec, self.matrix).flatten()
        top_idx = sims.argsort()[::-1][:top_k]
        return [(self.docs[i], sims[i].round(4)) for i in top_idx]

# Test
searcher = SimpleDocSearch(docs)
print("Search results (query: 计算机):", searcher.search("计算机"))

Limitations and modern alternatives

Three major limitations of TF-IDF

  1. Ignores word order: With single-word features, there is no way to distinguish between "deep learning" and "learning depth".
  2. Cannot capture semantics: "car" and "sedan" are unrelated dimensions in the vector space.
  3. High-dimensional and sparse: When the vocabulary is extremely large, computation and memory become bottlenecks.
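
Limitation 1 is strictest for single-word features. The `ngram_range=(1,2)` setting used earlier recovers some local order, as the plain-Python sketch below suggests. The helper `ngrams` is an illustrative assumption that only mimics the idea of joining adjacent tokens; it is not sklearn's implementation.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of adjacent tokens for n in [n_min, n_max], joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

a = "dog bites man".split()
b = "man bites dog".split()

# Unigrams alone cannot tell the two sentences apart
print(sorted(ngrams(a, 1, 1)) == sorted(ngrams(b, 1, 1)))  # True

# Adding bigrams yields different feature sets
print(sorted(ngrams(a)))  # ['bites', 'bites man', 'dog', 'dog bites', 'man']
print(sorted(ngrams(b)))  # ['bites', 'bites dog', 'dog', 'man', 'man bites']
```

The price is a larger vocabulary, which makes limitation 3 more pressing.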

Comparison of Modern Alternatives

| Method | Vector dimension | Semantic understanding | Computational complexity | Typical scenarios |
| --- | --- | --- | --- | --- |
| TF-IDF | High (vocabulary size) | ❌ None | Low | Fast search, keyword extraction |
| Sentence-BERT | Low (usually 768 dimensions) | ✅✅ Strong | Medium | Semantic similarity, question and answer matching |
| Doc2Vec | Low (100-300 dimensions) | ✅ Medium | Medium | Document clustering, topic analysis |

Summary

TF-IDF is a must-learn introductory algorithm in NLP text feature engineering, and it is also a practical tool that the industry has long relied on:

  1. The core logic is clear and easy to understand and debug.
  2. It is computationally very efficient and suitable for large-scale corpus processing.
  3. It is still widely used in document retrieval, keyword extraction and other scenarios.
  4. It is recommended to master TF-IDF first, and then gradually transition to more complex pre-trained models.

Bag-of-words model → TF-IDF → Similarity calculation → Pre-trained word vectors → Sentence-BERT