title: Detailed explanation of word vector space: from One-Hot to Word2Vec, GloVe and modern embedding technology principles and PyTorch implementation | Daoman PythonAI
description: Deeply understand the word vector mapping principles from One-Hot to Word2Vec, GloVe, and FastText, and master the distributed representation and vector space semantics of words. Contains detailed mathematical principles, Python implementation and practical application scenarios.
keywords: [Word vectors, Word2Vec, GloVe, FastText, One-Hot encoding, word embedding, distributed representation, NLP, natural language processing, PyTorch]

Detailed explanation of word vector space: from One-Hot to Word2Vec, GloVe and modern embedding technology principles and PyTorch implementation

Why do we need word vectors?

In natural language processing (NLP), we face one of the most basic problems: computers only understand numbers, not words. How do we turn a sentence like "I like machine learning" into something that a computer can process?

This is the core problem that word vector technology wants to solve. It converts discrete words into continuous vector representations, allowing the computer to not only "see" the words, but also "understand" the meaning of the words.

The importance of word vectors

Word vectors are the infrastructure of NLP, and almost all modern NLP models are built on good word representation. Imagine that we assign each word a string of numbers. These numbers are no longer random but carry semantic information:

  • Semantic Understanding: Similar words will "live" very close in the vector space. For example, "cat" and "dog" should be much closer than "cat" and "car".
  • Dimensionality reduction: From ultra-high-dimensional sparse representation (tens of thousands or even hundreds of thousands of dimensions) to low-dimensional dense representation (usually 100 to 300 dimensions), the computing efficiency is greatly improved.
  • Generalization ability: Because related words sit close together, a model can carry what it learned from one expression over to similar expressions it has not seen; subword methods such as FastText extend this even to unseen words.

Without good word vectors, downstream tasks such as sentiment analysis, machine translation, and question answering all struggle. Next, we start from basic One-Hot encoding and see how word vectors evolved step by step.


Problems with One-Hot encoding

One-Hot encoding principle

One-Hot encoding is the simplest word representation method. The idea is very simple: assign a unique serial number to each word, and then convert this serial number into a vector with only 0 and 1. The length of the vector is the size of the vocabulary.

import numpy as np

def one_hot_encode(word, vocab):
    """
    Return the One-Hot vector of `word` with respect to the list `vocab`.
    """
    if word not in vocab:
        raise ValueError(f"Word '{word}' is not in the vocabulary")

    # All zeros except for a single 1 at the word's index
    one_hot = np.zeros(len(vocab))
    one_hot[vocab.index(word)] = 1
    return one_hot

# Example
vocab = ["I", "love", "machine", "learning", "deep", "AI"]
machine_vec = one_hot_encode("machine", vocab)
print(f"One-Hot vector of 'machine': {machine_vec}")
# [0. 0. 1. 0. 0. 0.]

# The similarity of two different words is always 0, so no semantic relation can be expressed
learning_vec = one_hot_encode("learning", vocab)
similarity = np.dot(machine_vec, learning_vec)  # 0.0
print(f"Similarity of 'machine' and 'learning': {similarity}")

You can see that only one position in the vector is 1 and the rest are all 0. This representation works fine for small vocabularies, but it suffers from several serious problems.

Main issues with One-Hot encoding

The following code highlights the common problems of One-Hot encoding:

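# Demonstration of the problems with One-Hot encoding
def analyze_onehot_issues():
    """
    Show the main problems of One-Hot encoding: no similarity, high dimension, sparsity.
    """
    vocab = ["king", "queen", "man", "woman", "Paris", "France", "Berlin", "Germany"]

    # The similarity of any two different words is 0
    king_vec = np.array([1 if i == 0 else 0 for i in range(len(vocab))])   # king
    queen_vec = np.array([1 if i == 1 else 0 for i in range(len(vocab))])  # queen

    similarity = np.dot(king_vec, queen_vec)
    print(f"Similarity of king and queen: {similarity}")  # 0, although the words are strongly related

    print(f"Vector dimension: {len(vocab)}")
    print(f"Sparsity: {1 - 1/len(vocab):.4f}")  # approaches 100% as the vocabulary grows

analyze_onehot_issues()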

To summarize the fatal flaws of One-Hot encoding:

  1. Curse of dimensionality: The vectors have as many dimensions as the vocabulary has words. A vocabulary of 100,000 words requires 100,000-dimensional vectors, which makes computation and storage extremely expensive.
  2. No semantics: All word vectors are mutually orthogonal, so the similarity of any two different words is always 0. An analogy such as "King - Man + Woman = Queen" can never be expressed.
  3. Sparsity: Each vector is almost entirely zeros, so most of the storage space is wasted.

For computers to work with the meaning of language, we must move from sparse One-Hot vectors that carry no semantics to dense, semantically expressive distributed representations.
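To put numbers on the storage argument, the short sketch below compares the size of a full One-Hot table with a dense table for the 100,000-word vocabulary mentioned above. The 300-dimensional dense size and 4-byte floats are assumptions chosen for illustration. It also checks that multiplying a One-Hot vector by an embedding matrix simply selects one row, which is why embedding layers are implemented as table lookups.

```python
import numpy as np

vocab_size = 100_000   # vocabulary size used in the text above
embed_dim = 300        # assumed dense dimension, for illustration
bytes_per_float = 4    # float32

# One vector per word, for the whole vocabulary
one_hot_bytes = vocab_size * vocab_size * bytes_per_float
dense_bytes = vocab_size * embed_dim * bytes_per_float
print(f"One-Hot table: {one_hot_bytes / 1e9:.1f} GB")  # 40.0 GB
print(f"Dense table:   {dense_bytes / 1e6:.1f} MB")    # 120.0 MB

# A One-Hot vector times an embedding matrix picks out a single row
rng = np.random.default_rng(0)
small_vocab, small_dim = 6, 4
E = rng.normal(size=(small_vocab, small_dim))  # toy, untrained embedding matrix
one_hot = np.zeros(small_vocab)
one_hot[2] = 1
lookup_matches = np.allclose(one_hot @ E, E[2])
print(lookup_matches)  # True
```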


How Word2Vec works

Word2Vec is a method introduced by researchers at Google in 2013. It uses a shallow neural network to learn distributed representations of words from large amounts of text, which addresses the main weaknesses of One-Hot encoding. The core idea of Word2Vec can be summarized in one sentence: the meaning of a word is determined by the words around it.

Word2Vec has two main architectures: Skip-Gram and CBOW. Next, we break them down using code and plain-language explanations.

Skip-Gram model

The idea of Skip-Gram is to use the center word to predict its context. In plain terms, given the phrase "machine learning", the model learns to predict that words such as "artificial intelligence", "deep", and "algorithm" are likely to appear nearby. Skip-Gram tends to work well with small amounts of training data and represents rare words comparatively well.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import random

class SkipGramModel(nn.Module):
    """
    Skip-Gram model trained with negative sampling
    """
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        # Embeddings used when a word is the center word
        self.center_embed = nn.Embedding(vocab_size, embed_dim)
        # Embeddings used when a word is a context word (also used for negative samples)
        self.context_embed = nn.Embedding(vocab_size, embed_dim)

        # Initialize weights with small uniform values
        initrange = 0.5 / embed_dim
        self.center_embed.weight.data.uniform_(-initrange, initrange)
        self.context_embed.weight.data.uniform_(-initrange, initrange)

    def forward(self, center_words, context_words, neg_words):
        """
        Forward pass: score the positive and negative samples and return the loss.
        center_words, context_words: (batch_size,); neg_words: (batch_size, k)
        """
        # Center word embeddings
        center_embeds = self.center_embed(center_words)  # (batch_size, embed_dim)

        # Scores of positive samples (real context words)
        context_embeds = self.context_embed(context_words)  # (batch_size, embed_dim)
        pos_scores = torch.sum(center_embeds * context_embeds, dim=1)  # (batch_size,)

        # Scores of negative samples
        neg_embeds = self.context_embed(neg_words)  # (batch_size, k, embed_dim)
        neg_scores = torch.bmm(neg_embeds, center_embeds.unsqueeze(2)).squeeze(2)  # (batch_size, k)

        # Negative-sampling objective built from log-sigmoid terms
        pos_loss = F.logsigmoid(pos_scores)
        neg_loss = torch.sum(F.logsigmoid(-neg_scores), dim=1)

        return -(pos_loss + neg_loss).mean()

def generate_training_data(sentences, window_size=2):
    """
    Build training data: extract (center word, context word) pairs from sentences.
    """
    pairs = []
    for sentence in sentences:
        for i, center_word in enumerate(sentence):
            # Bounds of the context window, clipped to the sentence
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)

            # Collect the context words
            for j in range(start, end):
                if i != j:  # skip the center word itself
                    pairs.append((center_word, sentence[j]))
    return pairs

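# Example data (already tokenized)
sentences = [
    ["I", "love", "machine", "learning"],
    ["deep", "learning", "is", "a", "branch", "of", "AI"],
    ["natural", "language", "processing", "is", "an", "interesting", "field"]
]

# Build the vocabulary
word_to_idx = {}
idx_to_word = {}
all_words = []
for sentence in sentences:
    all_words.extend(sentence)

vocab = sorted(set(all_words))  # sorted so that indices are reproducible across runs
for i, word in enumerate(vocab):
    word_to_idx[word] = i
    idx_to_word[i] = word

print(f"Vocabulary size: {len(vocab)}")

# Generate (center word, context word) pairs for training
pairs = generate_training_data(sentences, window_size=2)
print(f"Number of training pairs: {len(pairs)}")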

CBOW model

The idea of CBOW (Continuous Bag of Words) is exactly the opposite of Skip-Gram: use the context words to predict the center word. Taking "machine learning" as an example, CBOW guesses that "learning" should appear in the middle based on surrounding words such as "artificial intelligence", "deep", and "algorithm". Because CBOW averages all context words and makes a single prediction per window, instead of one prediction per (center, context) pair, it usually trains faster than Skip-Gram, which makes it well suited to large corpora.
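To make the direction of prediction concrete, here is a minimal sketch of how (context words, center word) samples could be generated for CBOW. The function name `generate_cbow_samples` is hypothetical and not part of the original code; it mirrors the sliding-window logic of `generate_training_data`.

```python
def generate_cbow_samples(sentences, window_size=2):
    """Return (context_words, center_word) samples for CBOW training."""
    samples = []
    for sentence in sentences:
        for i, center_word in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            context = [sentence[j] for j in range(start, end) if j != i]
            if context:  # skip one-word sentences, which have no context
                samples.append((context, center_word))
    return samples

samples = generate_cbow_samples([["I", "love", "machine", "learning"]], window_size=1)
for context, center in samples:
    print(context, "->", center)
# ['love'] -> I
# ['I', 'machine'] -> love
# ['love', 'learning'] -> machine
# ['machine'] -> learning
```

This sentence yields 4 CBOW samples, whereas Skip-Gram with the same window produces 6 (center, context) pairs, which is one reason CBOW needs fewer updates per pass. Contexts at sentence edges are shorter than the rest, so a model that expects a fixed `(batch_size, context_size)` input would need padding or would have to drop those samples.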

class CBOWModel(nn.Module):
    """
    CBOW model with a full softmax output layer
    """
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, vocab_size)

        # Initialize weights with small uniform values
        initrange = 0.5 / embed_dim
        self.embeddings.weight.data.uniform_(-initrange, initrange)
        self.linear.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()

    def forward(self, context_words):
        """
        context_words: (batch_size, context_size)
        Returns log-probabilities over the vocabulary for the center word.
        """
        embeds = self.embeddings(context_words)  # (batch_size, context_size, embed_dim)
        # Average the context word vectors
        avg_embeds = torch.mean(embeds, dim=1)  # (batch_size, embed_dim)
        scores = self.linear(avg_embeds)  # (batch_size, vocab_size)
        return F.log_softmax(scores, dim=1)

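# Mock usage of pre-trained word vectors
def demonstrate_word2vec_usage():
    """
    Show what querying pre-trained Word2Vec vectors looks like.
    The mock below returns hard-coded results; no model is loaded.
    """
    # Stand-in for a trained model
    class MockWord2Vec:
        def most_similar(self, positive=None, negative=None, topn=5):
            if positive and "machine learning" in positive:
                return [("deep learning", 0.85), ("artificial intelligence", 0.82), ("natural language processing", 0.78)]
            else:
                return [("queen", 0.92)]

    mock_model = MockWord2Vec()

    # Find similar words
    similar = mock_model.most_similar(positive=["machine learning"], topn=3)
    print(f"Words similar to 'machine learning': {similar}")

    # Word analogy
    analogy = mock_model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
    print(f"king - man + woman = {analogy[0][0]}")

demonstrate_word2vec_usage()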
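The mock above hard-codes its answers. As a rough sketch of how such a query can actually be computed, the example below adds and subtracts unit-normalized vectors and ranks the remaining words by cosine similarity. The 3-dimensional vectors are hand-made for illustration, not trained, and the `most_similar` function here is a simplified stand-in rather than any library's implementation.

```python
import numpy as np

# Hand-made vectors whose dimensions loosely mean (royalty, maleness, femaleness).
# They are illustrative assumptions, not the output of a trained model.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.3, 0.3]),
}

def unit(v):
    """Scale a vector to length 1 so that dot products are cosine similarities."""
    return v / np.linalg.norm(v)

def most_similar(vectors, positive, negative=(), topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are not valid answers
    scores = {w: float(unit(v) @ query) for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

result = most_similar(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result[0][0])  # queen
```

Excluding the query words matters in practice: without that step the nearest neighbour of "king - man + woman" is frequently "king" itself.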

Word2Vec training process

The entire training of Word2Vec can be summarized into the following standard steps:

  1. Prepare a large amount of text corpus (Wikipedia, news, novels, etc. are all acceptable).
  2. Build a vocabulary, remove low-frequency words that appear too few times, and control the size of the vocabulary.
  3. Use a sliding window to generate (center word, context word) training samples.
  4. Randomly initialize word vectors.
  5. Optimize the objective function with gradient descent so that the model assigns a higher probability to real context words.
  6. Save the trained word vectors for use in downstream tasks.

After training, the learned vectors support semantic arithmetic such as "king" - "man" + "woman" ≈ "queen". The mock example above returns hard-coded results to show what the interface looks like; a real model computes them from the trained vectors.
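The six steps above can be sketched end to end in a few dozen lines. The version below is a deliberately small NumPy implementation of Skip-Gram with negative sampling, with the gradients written out by hand so that each update is visible. The corpus, the hyperparameters (`D`, `K`, `lr`, 200 epochs) and the uniform negative sampling are simplifying assumptions; real implementations sample negatives from a smoothed frequency distribution and train on far more text.

```python
import numpy as np

# Step 1: a tiny tokenized corpus (a real run would use millions of sentences)
corpus = [
    ["I", "love", "machine", "learning"],
    ["I", "love", "deep", "learning"],
    ["deep", "learning", "is", "machine", "learning"],
]

# Step 2: vocabulary (no frequency cut-off because the corpus is tiny)
vocab = sorted({w for sentence in corpus for w in sentence})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Step 3: (center, context) index pairs from a sliding window
window = 2
pairs = [
    (word_to_idx[s[i]], word_to_idx[s[j]])
    for s in corpus
    for i in range(len(s))
    for j in range(max(0, i - window), min(len(s), i + window + 1))
    if i != j
]

# Step 4: random center vectors; context vectors start at zero
rng = np.random.default_rng(0)
V, D, K, lr = len(vocab), 16, 3, 0.05
W_in = rng.uniform(-0.5 / D, 0.5 / D, size=(V, D))
W_out = np.zeros((V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context):
    """One negative-sampling update; returns the loss measured before the update."""
    negatives = rng.integers(0, V, size=K)        # uniform sampling (simplification)
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(K + 1)
    labels[0] = 1.0                               # only the first target is a real context word
    v = W_in[center].copy()
    u = W_out[targets]                            # (K + 1, D), a copy
    probs = sigmoid(u @ v)
    loss = -np.log(probs[0]) - np.log(1.0 - probs[1:]).sum()
    error = probs - labels                        # gradient of the loss w.r.t. the scores
    W_in[center] -= lr * (error @ u)
    np.subtract.at(W_out, targets, lr * np.outer(error, v))  # handles repeated targets
    return loss

# Step 5: gradient descent over all pairs for several epochs
epoch_losses = []
for epoch in range(200):
    order = rng.permutation(len(pairs))
    total = sum(sgd_step(*pairs[i]) for i in order)
    epoch_losses.append(total / len(pairs))

print(f"mean loss: {epoch_losses[0]:.3f} -> {epoch_losses[-1]:.3f}")

# Step 6: keep the trained center vectors for downstream use
embeddings = {w: W_in[i] for w, i in word_to_idx.items()}
```

With the PyTorch `SkipGramModel` shown earlier, steps 4 and 5 would instead be handled by the model's own initialization and an optimizer such as `torch.optim.SGD`, with `forward` returning the same loss.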


GloVe and FastText

Although Word2Vec is easy to use, it learns from local context windows one at a time and does not directly exploit global corpus statistics. Later researchers proposed GloVe and FastText, which improve on it from different angles.

GloVe Principle

GloVe (Global Vectors for Word Representation) combines global word co-occurrence statistics with local context information. Its core idea is:

  1. Scan the entire corpus, count how often each pair of words co-occurs within a context window, and build a large co-occurrence matrix.
  2. Fit word vectors to this matrix by weighted least-squares regression, so that the dot product of two word vectors approximates the logarithm of their co-occurrence count.

Because it trains on aggregated global statistics rather than on individual windows, GloVe often trains quickly and stably, and it can give usable results even on small corpora, although this depends on the corpus and the hyperparameters.

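def simulate_glove_training():
    """
    Illustrate the input to GloVe training: a word co-occurrence matrix
    """
    # Vocabulary for a simplified co-occurrence matrix
    vocab = ["I", "love", "machine", "learning"]

    # Illustrative co-occurrence matrix (real ones are very large and sparse)
    cooccurrence = np.array([
        [0, 1, 1, 0],  # I
        [1, 0, 0, 1],  # love
        [1, 0, 0, 1],  # machine
        [0, 1, 1, 0]   # learning
    ])

    print("Example co-occurrence matrix:")
    print(cooccurrence)

simulate_glove_training()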
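The matrix above is written by hand. As a hedged sketch of the two GloVe steps, the code below counts co-occurrences from a tokenized sentence, weighting each pair by the inverse of its distance, and then evaluates the weighted least-squares objective J = Σ f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)² for randomly initialized vectors. The function names are hypothetical; the constants `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper.

```python
import numpy as np
from collections import defaultdict

def build_cooccurrence(sentences, word_to_idx, window_size=2):
    """Symmetric co-occurrence counts; a pair d words apart contributes 1 / d."""
    counts = defaultdict(float)
    for sentence in sentences:
        ids = [word_to_idx[w] for w in sentence]
        for i, wi in enumerate(ids):
            for j in range(max(0, i - window_size), i):
                weight = 1.0 / (i - j)
                counts[(wi, ids[j])] += weight
                counts[(ids[j], wi)] += weight
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): limits the influence of very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(counts, W, W_ctx, b, b_ctx):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij."""
    total = 0.0
    for (i, j), x in counts.items():
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x)
        total += glove_weight(x) * diff ** 2
    return total

sentences = [["I", "love", "machine", "learning"]]
vocab = ["I", "love", "machine", "learning"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
counts = build_cooccurrence(sentences, word_to_idx, window_size=2)
print(counts[(0, 1)], counts[(0, 2)])  # 1.0 0.5

# Untrained vectors and biases; training would minimize glove_loss by gradient descent
rng = np.random.default_rng(0)
dim = 8
W, W_ctx = rng.normal(scale=0.1, size=(2, len(vocab), dim))
b, b_ctx = np.zeros(len(vocab)), np.zeros(len(vocab))
initial_loss = glove_loss(counts, W, W_ctx, b, b_ctx)
print(f"initial loss: {initial_loss:.4f}")
```

Only nonzero entries of the matrix enter the sum, which is what keeps training tractable even though the full matrix has one cell per word pair.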

FastText extension

FastText was proposed by Facebook. Its key idea is to decompose words into character n-grams. For example, with n = 3 the word "where" is split into the substrings <wh, whe, her, ere, re>, where < and > are boundary characters added to mark the start and end of the word.

This design brings three significant benefits:

  1. Handling OOV (out-of-vocabulary) words: Even if a word was never seen during training, a roughly reasonable vector can be composed from its character n-grams.
  2. Better results for morphologically rich languages: French, German, Turkish, and similar languages have many inflected word forms, and FastText can capture information carried by roots and affixes.
  3. Spelling informs meaning: The vector of a word is built by combining (summing or averaging) the vectors of its character n-grams, so "spelling" information assists the semantic representation.

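def explain_fasttext():
    """
    Summarize the characteristics of FastText and show the n-grams of one word
    """
    print("FastText characteristics:")
    print("1. Decomposes words into character n-grams")
    print("2. Can handle out-of-vocabulary (OOV) words")
    print("3. Especially effective for morphologically rich languages")
    print("4. A word vector is composed from its character n-gram vectors")

    # Example: character n-grams of the word "where"
    word = "where"
    n = 3  # trigram
    ngrams = []
    padded_word = "<" + word + ">"  # add boundary symbols
    for i in range(len(padded_word) - n + 1):
        ngram = padded_word[i:i+n]
        ngrams.append(ngram)

    print(f"\n{n}-grams of '{word}': {ngrams}")
    # ['<wh', 'whe', 'her', 'ere', 're>']

explain_fasttext()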

In production settings where the text contains many spelling errors or rare words, FastText is often an option worth considering.
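To illustrate how an unseen word can still receive a vector, the sketch below extracts n-grams for n from 3 to 6 (the range used in the FastText paper), hashes each one into a fixed-size table, and averages the selected rows. The class `SubwordEmbeddings`, the table size, and the use of `zlib.crc32` as the hash are assumptions made for this example, and the table is random rather than trained.

```python
import zlib
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of the boundary-padded word, for n in [min_n, max_n]."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]

class SubwordEmbeddings:
    """Toy subword table: n-grams are hashed into a fixed number of buckets."""

    def __init__(self, num_buckets=1000, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.num_buckets = num_buckets
        self.table = rng.normal(scale=0.1, size=(num_buckets, dim))  # untrained values

    def bucket(self, ngram):
        # crc32 is a deterministic stand-in for the hash function a real library would use
        return zlib.crc32(ngram.encode("utf-8")) % self.num_buckets

    def vector(self, word):
        """Average the bucket vectors of all n-grams of the word."""
        rows = [self.bucket(g) for g in char_ngrams(word)]
        return self.table[rows].mean(axis=0)

emb = SubwordEmbeddings()
oov_vec = emb.vector("wherever")  # works for any string, whether or not it was seen before
shared = set(char_ngrams("where")) & set(char_ngrams("wherever"))
print(oov_vec.shape)    # (8,)
print(sorted(shared))   # n-grams that both words have in common
```

Because "where" and "wherever" share many n-grams, their composed vectors draw on many of the same table rows, which is how related spellings end up with related vectors once the table has been trained.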


Modern word embedding technology

Word2Vec, GloVe, and FastText are all static word vectors: no matter what sentence a word appears in, its vector always remains unchanged. But language in reality is full of ambiguity - for example, "Apple" refers to completely different things in "I ate an apple" and "I bought an Apple computer." Static word vectors cannot differentiate between these two usages.

As a result, context-sensitive dynamic word vectors came into being. Pre-trained models such as ELMo, BERT, and GPT compute the vector of each word from the sentence it appears in, so the same word receives different representations in different sentences.
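The mechanism behind this can be shown with a toy example. The sketch below is not BERT: it applies a single self-attention step with no learned projections to random static vectors, which is enough to show that the output for "apple" depends on the other words in the sentence. All names and values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "ate", "an", "apple", "bought", "computer"]
dim = 8
static = {w: rng.normal(size=dim) for w in vocab}  # one fixed, untrained vector per word

def contextualize(tokens):
    """One self-attention step: each output is a softmax-weighted mix of all token vectors."""
    X = np.stack([static[t] for t in tokens])      # (seq_len, dim)
    scores = X @ X.T / np.sqrt(dim)                # pairwise similarity between tokens
    scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # (seq_len, dim)

sent_a = ["I", "ate", "an", "apple"]
sent_b = ["I", "bought", "an", "apple", "computer"]
apple_a = contextualize(sent_a)[sent_a.index("apple")]
apple_b = contextualize(sent_b)[sent_b.index("apple")]

# The static vector for "apple" is the same in both sentences; the contextual ones are not
print(np.allclose(apple_a, apple_b))  # False
```

Real models stack many such layers with learned query, key, and value projections, so the resulting differences reflect meaning rather than random mixing.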

Modern word embedding comparison

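def modern_embeddings_comparison():
    """
    Compare word embedding techniques
    """
    print("Evolution of word embedding techniques:")
    print("1. One-Hot: sparse, high-dimensional, no semantics")
    print("2. Word2Vec/GloVe: dense, low-dimensional, static semantics")
    print("3. FastText: can handle out-of-vocabulary words")
    print("4. ELMo/BERT: context-dependent, dynamic word vectors")
    print("5. GPT series: generative pre-training")

    print("\nCurrent best practice:")
    print("- Simple tasks: use pre-trained word vectors (Word2Vec, GloVe)")
    print("- Complex tasks: use the hidden states of a Transformer model as word embeddings")
    print("- Limited resources: use lightweight models (DistilBERT, TinyBERT)")

modern_embeddings_comparison()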

Modern word embedding using HuggingFace

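The most convenient way to work on NLP projects today is to load a pre-trained model through the Hugging Face transformers library. A few lines of code are enough to obtain the contextual embeddings produced by models such as BERT. The function below only describes the behaviour; it does not load a model.

def demonstrate_modern_embeddings():
    """
    Describe the context dependence of modern word embeddings
    """
    print("Modern embeddings are context-dependent: the same word gets different representations in different contexts")
    print("Consider:")
    print("- the vector of 'apple' in 'I ate an apple'")
    print("- the vector of 'Apple' in 'I bought an Apple computer'")
    print("BERT produces different vectors for these two occurrences!")

demonstrate_modern_embeddings()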

Dynamic word vectors allow NLP models to truly begin to understand "polysemy", greatly raising the ceiling for various tasks.


Practical applications and cases

Having talked about so many principles, let’s take a look at how word vectors are used in actual projects. This section uses two classic scenarios to demonstrate: text classification and similarity calculation.

Word vector application in text classification

Text classification is an entry-level NLP task that underlies sentiment analysis, spam detection, and news categorization. Text representations built from word vectors often outperform the traditional bag-of-words model, especially when labeled data is scarce.

def text_classification_with_embeddings():
    """
    Example: text classification with averaged word vectors
    """
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Mock word vectors. They are random, so the prediction below is not meaningful;
    # a real project would load pre-trained vectors instead.
    vocab = ["good", "excellent", "terrible", "poor", "like", "hate", "recommend", "avoid"]
    rng = np.random.default_rng(0)
    word_vectors = {word: rng.random(50) for word in vocab}

    def sentence_to_vector(sentence, word_vectors, dim=50):
        """
        Convert a sentence into a vector by averaging its word vectors.
        """
        # Whitespace tokenization is enough for these English examples;
        # Chinese text must be segmented into words first.
        words = sentence.split()
        vectors = [word_vectors[word] for word in words if word in word_vectors]

        if vectors:
            return np.mean(vectors, axis=0)
        else:
            return np.zeros(dim)  # no known words in the sentence

    # Example data
    texts = [
        "this product is good and excellent",
        "good quality recommend buying",
        "terrible and poor avoid",
        "quality is poor hate it"
    ]
    labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

    # Convert sentences to vectors
    X = np.array([sentence_to_vector(text, word_vectors) for text in texts])
    y = np.array(labels)

    print(f"Feature matrix shape: {X.shape}")

    # Train the classifier
    clf = LogisticRegression()
    clf.fit(X, y)

    # Predict a new text
    new_text = "excellent product quality recommend"
    new_vec = sentence_to_vector(new_text, word_vectors)
    prediction = clf.predict([new_vec])

    print(f"\nNew text: {new_text}")
    print(f"Predicted class: {'positive' if prediction[0] == 1 else 'negative'}")

text_classification_with_embeddings()

This code shows the simplest pipeline: average the word vectors of each sentence to obtain a sentence vector, then feed it to a logistic regression classifier. The vectors here are random, so the predicted label itself carries no meaning. With real pre-trained vectors, even this simple approach often gives a reasonable baseline.
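To see averaging work with vectors that do carry meaning, here is a small deterministic variant. The 2-dimensional vectors are hand-made so that the first axis loosely encodes sentiment, and the classifier is a nearest-centroid rule swapped in for logistic regression to keep the example dependency-free. Both choices are assumptions made for illustration.

```python
import numpy as np

# Hand-made vectors: first axis ~ sentiment, second axis ~ "product talk" (illustrative only)
word_vectors = {
    "good": np.array([0.9, 0.2]), "excellent": np.array([1.0, 0.1]),
    "recommend": np.array([0.8, 0.3]), "poor": np.array([-0.9, 0.2]),
    "terrible": np.array([-1.0, 0.1]), "hate": np.array([-0.8, 0.3]),
    "product": np.array([0.0, 1.0]), "quality": np.array([0.0, 0.9]),
}

def sentence_vector(sentence, dim=2):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

train = [
    ("good product excellent quality", 1),
    ("recommend good quality", 1),
    ("terrible product poor quality", 0),
    ("hate poor quality", 0),
]
X = np.stack([sentence_vector(text) for text, _ in train])
y = np.array([label for _, label in train])

# One centroid per class: the mean sentence vector of that class
centroids = {label: X[y == label].mean(axis=0) for label in (0, 1)}

def predict(sentence):
    """Return the label whose centroid is closest to the sentence vector."""
    v = sentence_vector(sentence)
    return min(centroids, key=lambda label: np.linalg.norm(v - centroids[label]))

print(predict("excellent product"))  # 1
print(predict("terrible quality"))   # 0
```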

Word vector similarity calculation

Another high-frequency application of word vectors is to calculate semantic similarity, which is used in recommendation systems, synonym search, information retrieval, etc.

def word_similarity_demo():
    """
    Example: cosine similarity between word vectors
    """
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Mock word vectors
    words = ["machine", "learning", "deep", "AI", "computer", "science"]
    vectors = np.random.rand(len(words), 100)  # random vectors (use pre-trained ones in practice)

    # Compute the pairwise similarity matrix
    sim_matrix = cosine_similarity(vectors)

    print("Word vector similarity matrix:")
    print(f"{'':<10}", end="")
    for word in words:
        print(f"{word:<10}", end="")
    print()

    for i, word in enumerate(words):
        print(f"{word:<10}", end="")
        for j in range(len(words)):
            print(f"{sim_matrix[i][j]:<10.3f}", end="")
        print()

word_similarity_demo()

In real projects, replacing the random vectors with pre-trained Word2Vec or GloVe vectors is enough to build a basic semantic search or related-item recommendation feature.
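As a sketch of that replacement step, the code below parses vectors from the plain-text layout that GloVe files use (one word per line followed by its values) and retrieves nearest neighbours by cosine similarity. The function names are hypothetical, and the four 3-dimensional vectors are made-up stand-ins for a real downloaded file.

```python
import io
import numpy as np

def load_text_vectors(handle):
    """Parse lines of the form 'word v1 v2 ... vd' into a word list and a matrix."""
    words, rows = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        words.append(parts[0])
        rows.append(np.array([float(x) for x in parts[1:]], dtype=np.float32))
    return words, np.vstack(rows)

def top_k(query, words, matrix, k=1):
    """Return the k words most similar to `query` by cosine similarity."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = [i for i in np.argsort(-sims) if words[i] != query][:k]
    return [(words[i], float(sims[i])) for i in order]

# Made-up values standing in for the contents of a real vector file
sample = io.StringIO(
    "cat 0.9 0.8 0.1\n"
    "dog 0.8 0.9 0.1\n"
    "car 0.1 0.2 0.9\n"
    "truck 0.2 0.1 0.9\n"
)
words, matrix = load_text_vectors(sample)
print([w for w, _ in top_k("cat", words, matrix)])  # ['dog']
print([w for w, _ in top_k("car", words, matrix)])  # ['truck']
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, encoding="utf-8")` with `path` pointing at the downloaded vectors.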


Word vectors are the basis of NLP. It is best to first understand the evolution from One-Hot to Word2Vec, and then learn the embedding methods of modern pre-trained models. In real projects, start from the pre-trained models available on Hugging Face; this is usually faster and more accurate than training from scratch.

Summary

Word vector technology is the core foundation of natural language processing, which successfully converts discrete words into continuous, semantic-rich vector representations:

  1. Evolution: From One-Hot sparse representation to Word2Vec/GloVe’s dense representation, to context-sensitive representation of models such as BERT.
  2. Core Technology: Word2Vec’s Skip-Gram/CBOW model, GloVe’s global statistics, FastText’s character n-gram.
  3. Practical Application: Various NLP tasks such as text classification, similarity calculation, and information retrieval can directly benefit from high-quality word vectors.
  4. Modern Practice: Prioritize using frameworks such as HuggingFace to load pre-trained models, and select appropriate word embedding solutions based on task type and resource conditions.

💡 Core Point: The quality of word vectors directly affects the performance of downstream NLP tasks. In modern NLP tasks, it is recommended to use the hidden states of pre-trained Transformer models as word embeddings to obtain context-aware representations.

