title: Detailed explanation of word vector space: from One-Hot to Word2Vec, GloVe and modern embedding technology principles and PyTorch implementation | Daoman PythonAI
description: Deeply understand the word vector mapping principles from One-Hot to Word2Vec, GloVe, and FastText, and master the distributed representation and vector space semantics of words. Contains detailed mathematical principles, Python implementation and practical application scenarios.
keywords: [Word vectors, Word2Vec, GloVe, FastText, One-Hot encoding, word embedding, distributed representation, NLP, natural language processing, PyTorch]
Detailed explanation of word vector space: from One-Hot to Word2Vec, GloVe and modern embedding technology principles and PyTorch implementation
Why do we need word vectors?
Natural language processing (NLP) starts from a very basic problem: computers only understand numbers, not words. How do we turn a sentence like "I like machine learning" into something a computer can process?
This is the core problem that word vector technology sets out to solve. It converts discrete words into continuous vector representations, so that the computer can not only "see" the words but also capture something of their meaning.
The importance of word vectors
Word vectors are the infrastructure of NLP, and almost all modern NLP models are built on good word representation. Imagine that we assign each word a string of numbers. These numbers are no longer random but carry semantic information:
- Semantic Understanding: Similar words will "live" very close in the vector space. For example, "cat" and "dog" should be much closer than "cat" and "car".
- Dimensionality reduction: From ultra-high-dimensional sparse representation (tens of thousands or even hundreds of thousands of dimensions) to low-dimensional dense representation (usually 100 to 300 dimensions), the computing efficiency is greatly improved.
- Generalization ability: Because similar words have similar vectors, a model can transfer what it has learned about one word to related words, so it copes better with paraphrases and rarely seen words.
It can be said that without good word vectors, downstream tasks such as sentiment analysis, machine translation, and question answering will struggle. Next, we start from the original One-Hot encoding and see how word vectors evolved step by step.
Problems with One-Hot encoding
One-Hot encoding principle
One-Hot encoding is the simplest word representation method. The idea is very simple: assign a unique serial number to each word, and then convert this serial number into a vector with only 0 and 1. The length of the vector is the size of the vocabulary.
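A minimal sketch of this idea, using the example sentence from the introduction as a toy vocabulary. The helper names `build_vocab` and `one_hot` are illustrative choices, not part of any library.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a vector of 0s with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


sentence = "I like machine learning".split()
vocab = build_vocab(sentence)

for word in sentence:
    print(f"{word:>8} -> {one_hot(word, vocab)}")
```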
Only one position in each vector is a 1, and the rest are all 0s. This representation works fine for small vocabularies, but it suffers from several serious problems.
Main issues with One-Hot encoding
The following code highlights the common problems of One-Hot encoding:
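What follows is a small sketch rather than the original listing. It assumes a three-word toy vocabulary to show the missing similarity, and a synthetic 100,000-entry vocabulary (placeholder names `word0`, `word1`, ...) to show dimensionality and sparsity.

```python
def one_hot(word, vocab):
    """Return a vector of 0s with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Problem 1: no semantics. Any two different words have dot product 0,
# so "cat" is exactly as far from "dog" as it is from "car".
vocab = {"cat": 0, "dog": 1, "car": 2}
cat, dog, car = (one_hot(w, vocab) for w in ("cat", "dog", "car"))
print(dot(cat, dog), dot(cat, car))  # 0 0

# Problems 2 and 3: the vector is as long as the vocabulary and almost all zeros.
big_vocab = {f"word{i}": i for i in range(100_000)}  # synthetic placeholder vocabulary
vec = one_hot("word42", big_vocab)
print(len(vec), vec.count(0))  # 100000 99999
```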
To summarize the fatal flaws of One-Hot encoding:
- Curse of Dimensionality: The vectors have as many dimensions as the vocabulary is large. A vocabulary of 100,000 words requires 100,000-dimensional vectors, which makes computation and storage extremely expensive.
- No semantic information: All word vectors are orthogonal, so the similarity between any two different words is always 0. An analogy such as "King - Man + Woman = Queen" therefore cannot be expressed.
- Sparsity: Almost every entry of the vector is 0, so most of the storage space is wasted.
Obviously, for computers to truly understand language, we must move from sparse, semantic-less One-Hot vectors to dense, semantically expressive distributed representations.
Detailed explanation of the principle of Word2Vec
Word2Vec is a highly influential method proposed by researchers at Google in 2013. It uses a shallow neural network to learn distributed representations of words from a large amount of text, addressing the main pain points of One-Hot. The core idea of Word2Vec can be summarized in one sentence: the meaning of a word is determined by the words around it.
Word2Vec has two main architectures: Skip-Gram and CBOW. Next, we look at each of them in plain language.
Skip-Gram model
The idea of Skip-Gram is to use the center word to predict its context. In plain terms, given the center word "machine learning" (treated here as a single token), the model learns to predict that words like "artificial intelligence", "deep", and "algorithm" are likely to appear around it. Because every (center word, context word) pair is a separate training example, Skip-Gram tends to work well with smaller amounts of data and to represent rare words comparatively well.
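To make the training examples concrete, the sketch below generates Skip-Gram pairs with a sliding window. The function name `skipgram_pairs` and the window size of 1 are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and each neighbor within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "I like machine learning".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(f"{center} -> {context}")
```

Each center word contributes one training example per neighbor, so four tokens with a window of 1 already produce six pairs.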
CBOW model
The idea of CBOW (Continuous Bag of Words) is exactly the opposite of Skip-Gram: use the context words to predict the center word. Continuing the same example, CBOW sees the surrounding "artificial intelligence", "deep", and "algorithm" and learns to guess that "machine learning" belongs in the middle. Because the context vectors are averaged and each window produces a single prediction, rather than one prediction per context word, CBOW usually trains faster than Skip-Gram, which makes it well suited to large-scale corpora.
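A comparable sketch for CBOW, again with illustrative helper names (`cbow_samples`, `average`) and a window of 1. It builds one (context words, center word) sample per position and shows the averaging step on two made-up 2-D vectors.

```python
def cbow_samples(tokens, window=2):
    """Return (context_words, center_word) samples, one per token position."""
    samples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:  # skip positions with no neighbors at all
            samples.append((context, center))
    return samples


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


tokens = "I like machine learning".split()
for context, center in cbow_samples(tokens, window=1):
    print(f"{context} -> {center}")

# The model input for one sample is the average of its context vectors.
print(average([[1.0, 0.0], [0.0, 1.0]]))  # [0.5, 0.5]
```

The same four tokens yield four CBOW samples, one per position, which is why a CBOW pass involves fewer prediction steps than a Skip-Gram pass over the same text.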
Word2Vec training process
The entire training of Word2Vec can be summarized into the following standard steps:
- Prepare a large amount of text corpus (Wikipedia, news, novels, etc. are all acceptable).
- Build a vocabulary, remove low-frequency words that appear too few times, and control the size of the vocabulary.
- Use a sliding window to generate (center word, context word) training samples.
- Randomly initialize word vectors.
- Optimize the objective function with gradient descent, so that the model assigns a higher probability to the real context.
- Save the trained word vectors for use in downstream tasks.
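The steps above can be sketched end to end in plain Python. This is a toy Skip-Gram with a full softmax, assuming a four-sentence corpus, 8-dimensional vectors, and a learning rate of 0.1. Real implementations rely on tricks such as negative sampling and on far more data, so the vectors learned here are only a demonstration of the mechanics.

```python
import math
import random
from collections import Counter

# Prepare a (tiny, made-up) corpus.
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man walks the dog",
    "the woman walks the dog",
]
sentences = [line.split() for line in corpus]

# Build the vocabulary, dropping words rarer than MIN_COUNT.
MIN_COUNT = 1
counts = Counter(w for s in sentences for w in s)
vocab = sorted(w for w, c in counts.items() if c >= MIN_COUNT)
word_to_id = {w: i for i, w in enumerate(vocab)}
sentences = [[w for w in s if w in word_to_id] for s in sentences]

# Sliding window -> (center id, context id) training pairs.
WINDOW = 2
pairs = []
for s in sentences:
    for i, center in enumerate(s):
        for j in range(max(0, i - WINDOW), min(len(s), i + WINDOW + 1)):
            if j != i:
                pairs.append((word_to_id[center], word_to_id[s[j]]))

# Random initialization of input (word) and output (context) vectors.
rng = random.Random(0)
DIM, V = 8, len(vocab)
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(V)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(V)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(lr=0.1):
    """One SGD pass over all pairs; returns the mean negative log-likelihood."""
    rng.shuffle(pairs)
    total_loss = 0.0
    for center, context in pairs:
        h = W_in[center][:]  # copy, so updates below use the old value
        probs = softmax([sum(a * b for a, b in zip(h, W_out[k])) for k in range(V)])
        total_loss -= math.log(probs[context])
        grad_h = [0.0] * DIM
        for k in range(V):
            err = probs[k] - (1.0 if k == context else 0.0)
            for d in range(DIM):
                grad_h[d] += err * W_out[k][d]   # accumulate before updating W_out
                W_out[k][d] -= lr * err * h[d]
        for d in range(DIM):
            W_in[center][d] -= lr * grad_h[d]
    return total_loss / len(pairs)


# Gradient descent on the objective.
losses = [train_epoch() for _ in range(50)]
print(f"mean loss: {losses[0]:.3f} -> {losses[-1]:.3f}")

# Keep the input vectors as the word vectors.
word_vectors = {w: W_in[i] for w, i in word_to_id.items()}
```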
After training on a sufficiently large corpus, the learned vectors support the well-known semantic arithmetic: "king" - "man" + "woman" lands close to "queen".
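The arithmetic itself is easy to sketch. The 2-D vectors below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender") and are not trained embeddings. With real vectors the result is only approximately "queen", which is why the nearest neighbor is looked up by cosine similarity.

```python
import math

# Hypothetical hand-made vectors: axis 0 ~ royalty, axis 1 ~ gender.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```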
GloVe and FastText
Although Word2Vec is easy to use, it learns from one local context window at a time and does not directly exploit corpus-wide statistics. Later researchers proposed GloVe and FastText, which improve on it from different angles.
GloVe Principle
GloVe (Global Vectors for Word Representation) cleverly combines global word co-occurrence statistics and local context information. Its core idea is:
- First scan the entire corpus, count the frequency of co-occurrence of word pairs, and construct a huge co-occurrence matrix.
- Then fit word vectors to this matrix with a weighted least-squares regression, so that the dot product of two word vectors approximates the logarithm of how often the two words co-occur.
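The two steps above can be sketched as follows. The sketch assumes a two-sentence toy corpus and illustrative function names. The 1/distance count weighting and the weighting function with x_max = 100 and alpha = 0.75 follow the settings described in the GloVe paper, and only the per-pair loss is shown, not a full training loop.

```python
import math
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Symmetric co-occurrence counts; a pair at distance d contributes 1/d."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                d = i - j
                counts[(word, tokens[j])] += 1.0 / d
                counts[(tokens[j], word)] += 1.0 / d
    return counts


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weight rare pairs, cap frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between (w_i . w_j + b_i + b_j) and log(x_ij)."""
    pred = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return weight(x_ij) * (pred - math.log(x_ij)) ** 2


sentences = [["ice", "is", "cold"], ["steam", "is", "hot"]]
X = cooccurrence(sentences)
print(X[("ice", "is")], X[("ice", "cold")])  # 1.0 0.5
```

Training then amounts to minimizing the sum of `pair_loss` over all word pairs with a nonzero count.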
Because it works from aggregated global statistics rather than one window at a time, GloVe is often efficient and stable to train, and it is reported to give good results even on relatively small corpora.
FastText extension
FastText was proposed by Facebook. Its biggest highlight is decomposing words into character n-grams. For example, with n = 3 the word "where" is split into the substrings <wh, whe, her, ere, and re>, where < and > are boundary characters that mark the start and end of the word.
This design brings three significant benefits:
- Handling out-of-vocabulary (OOV) words: Even if a word was never seen during training, a roughly reasonable vector can be assembled from its character n-grams.
- Better for morphologically rich languages: French, German, Turkish, and similar languages have many inflected word forms, and FastText can capture the information carried by roots and affixes.
- Spelling informs meaning: A word's vector is built by combining (summing or averaging) the vectors of its character n-grams, so words with similar spellings share part of their representation.
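A minimal sketch of the n-gram decomposition and of building a vector for an unseen word. The n-gram range of 3 to 6 matches fastText's default settings; `ngram_vector` is a hypothetical stand-in that returns a fixed pseudo-random vector per n-gram, where a real model would look up trained embeddings.

```python
import random
import zlib


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', where < and > mark the word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_vector(gram, dim=4):
    """Stand-in for a trained n-gram embedding: deterministic pseudo-random values."""
    rng = random.Random(zlib.crc32(gram.encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, dim=4):
    """Average the vectors of all character n-grams of the word."""
    vecs = [ngram_vector(g, dim) for g in char_ngrams(word)]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Any string gets a vector, whether or not it was seen in training.
print(word_vector("wherever"))
```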
If you work with production text that contains many spelling errors or rare words, FastText is often an option worth considering.
Modern word embedding technology
Word2Vec, GloVe, and FastText are all static word vectors: no matter what sentence a word appears in, its vector always remains unchanged. But language in reality is full of ambiguity - for example, "Apple" refers to completely different things in "I ate an apple" and "I bought an Apple computer." Static word vectors cannot differentiate between these two usages.
As a result, context-sensitive dynamic word vectors came into being. Pre-trained models such as BERT, ELMo, and GPT will dynamically generate vectors based on the context of the word, allowing the same word to have different representations in different sentences.
Modern word embedding comparison
Modern word embedding using HuggingFace
Now the most convenient way to do NLP projects is to call a pre-trained model through the HuggingFace transformers library. With a few lines of code, you can get high-quality word embeddings output by models such as BERT.
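A hedged sketch of what this can look like. It assumes the `torch` and `transformers` packages are installed and uses the `bert-base-uncased` checkpoint as an example model name. The helper `contextual_vectors` is an illustrative wrapper, not a library function, and the first call downloads the model weights.

```python
def contextual_vectors(sentence, model_name="bert-base-uncased"):
    """Return (tokens, hidden_states) for one sentence from a pre-trained model.

    hidden_states holds one row per token, taken from the model's last layer.
    """
    # Imported lazily: both packages are heavy, optional dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    # last_hidden_state has shape (batch, sequence_length, hidden_size)
    return tokens, outputs.last_hidden_state[0]


# Example usage (requires network access on first run):
# tokens, states = contextual_vectors("I bought an Apple computer")
# print(tokens)
# print(states.shape)
```

Running the helper on "I ate an apple" and on "I bought an Apple computer" and comparing the rows for the "apple" token is a simple way to see that the same word receives different vectors in different contexts.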
Dynamic word vectors allow NLP models to truly begin to understand "polysemy", greatly raising the ceiling for various tasks.
Practical applications and cases
Having talked about so many principles, let’s take a look at how word vectors are used in actual projects. This section uses two classic scenarios to demonstrate: text classification and similarity calculation.
Word vector application in text classification
Text classification is an entry-level NLP task that covers common applications such as sentiment analysis, spam detection, and news categorization. Text representations built from word vectors often outperform the traditional bag-of-words model.
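The example below is a dependency-free sketch, not a production setup. The 2-D word vectors are hand-made placeholders, the six training sentences are invented, and the logistic regression is written from scratch; a real project would load pre-trained vectors and would likely use a library classifier.

```python
import math

# Hypothetical 2-D word vectors: axis 0 ~ sentiment, axis 1 ~ "is about films".
word_vectors = {
    "great": [0.9, 0.1], "love": [0.8, 0.2], "good": [0.7, 0.0],
    "bad": [-0.8, 0.1], "hate": [-0.9, 0.2], "awful": [-0.7, 0.0],
    "movie": [0.0, 0.9], "film": [0.0, 0.8],
}


def sentence_vector(text, dim=2):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=200):
    """Logistic regression trained with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(text, w, b):
    """Return 1 (positive) or 0 (negative) for a piece of text."""
    x = sentence_vector(text)
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


train_texts = ["great movie", "love this film", "good film",
               "bad movie", "hate this film", "awful film"]
train_labels = [1, 1, 1, 0, 0, 0]

X = [sentence_vector(t) for t in train_texts]
w, b = train_logreg(X, train_labels)

print(predict("great film", w, b), predict("hate movie", w, b))
```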
This code shows the simplest process: first average the word vectors in each sentence to obtain the sentence vector, and then send it to the logistic regression classifier. Even with such a simple approach, good baseline scores can be obtained in many real scenarios.
Word vector similarity calculation
Another high-frequency application of word vectors is to calculate semantic similarity, which is used in recommendation systems, synonym search, information retrieval, etc.
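A minimal sketch of the lookup, assuming a five-word vocabulary with random 50-dimensional placeholder vectors. Because the vectors are random, the ranking printed here carries no meaning; only the mechanics of cosine similarity and nearest-neighbor search are shown.

```python
import math
import random

rng = random.Random(42)
words = ["cat", "dog", "car", "truck", "apple"]
# Random placeholders standing in for trained word vectors.
vectors = {w: [rng.gauss(0.0, 1.0) for _ in range(50)] for w in words}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, top_k=3):
    """Return the top_k (word, score) pairs most similar to `query`."""
    scores = [
        (w, cosine_similarity(vectors[query], v))
        for w, v in vectors.items()
        if w != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


for word, score in most_similar("cat"):
    print(f"{word}: {score:.3f}")
```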
In actual projects, just replace the random vectors with Word2Vec or GloVe's pre-trained vectors, and you can quickly build a semantic retrieval or related recommendation function.
Summary
Word vector technology is the core foundation of natural language processing, which successfully converts discrete words into continuous, semantic-rich vector representations:
- Evolution: From One-Hot sparse representation to Word2Vec/GloVe’s dense representation, to context-sensitive representation of models such as BERT.
- Core Technology: Word2Vec’s Skip-Gram/CBOW model, GloVe’s global statistics, FastText’s character n-gram.
- Practical Application: Various NLP tasks such as text classification, similarity calculation, and information retrieval can directly benefit from high-quality word vectors.
- Modern Practice: Prioritize using frameworks such as HuggingFace to load pre-trained models, and select appropriate word embedding solutions based on task type and resource conditions.
💡 Core Point: The quality of word vectors directly affects the performance of downstream NLP tasks. In modern NLP tasks, it is recommended to use the hidden states of pre-trained Transformer models as word embeddings to obtain context-aware representations.

