📘 The Complete Guide to Natural Language Processing (NLP)


What is NLP?

Natural Language Processing (NLP) sits at the intersection of artificial intelligence and linguistics. Its goal is simple to state: get computers to genuinely understand human language.

Note that this is more than converting speech into text or recognizing words in pictures. The core challenge of NLP is bridging the gap between the symbols of language and the meaning behind them. A capable NLP system needs to be able to:

  • Understand context, knowing that the same word can mean different things in different sentences;
  • Understand subtext, identifying irony, puns, and polite refusals;
  • Gauge sentiment, judging whether a passage is happy, angry, or neutral;
  • Beyond that, generate smooth, coherent responses the way a human would, or even write a complete article.

You actually deal with NLP every day, but you may not realize it:

  • You mistype a query in a search engine and it still returns accurate results. Behind that are spelling correction and fuzzy matching;
  • You ask your phone's assistant "Will it rain in Beijing tomorrow?" and it answers with the weather forecast. The key step is converting natural language into a structured API query;
  • The automatic subtitles on short videos and the abusive comments that get collapsed in the comment section are both NLP working behind the scenes.

Simply put, NLP is the technical bridge from merely reading words to understanding what you mean.


The evolution of NLP: from hard-coded rules to self-taught models

The history of NLP is often described as "three ups and three downs", but it is clearer to divide it into three stages. Each stage was a major upgrade that addressed the fatal shortcoming of the one before it.

1. Symbolism: "Machines learn only what humans teach them"

This was the mainstream approach from the 1950s to the 1980s. Linguists hand-wrote thousands of rules, using if-else logic and pattern matching to spell out grammar such as subject-predicate-object structure and tense changes for the computer. For example: if a sentence contains "can you" and ends with a question mark, treat it as a question.

Fatal flaw: rules can never keep up with how flexibly language changes. New Internet memes, dialects, and sentences with two possible readings make a rule system collapse at the first touch. Ask it to understand the irony in "You're really good at that" and it will most likely take the sentence at face value.
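
To make the brittleness concrete, here is a minimal sketch of a hand-written rule of this kind. The function name `is_question` and the rule itself are illustrative assumptions, not taken from any historical system.

```python
def is_question(sentence: str) -> bool:
    """Hypothetical hand-written rule: contains "can you" and ends with "?"."""
    s = sentence.strip().lower()
    return "can you" in s and s.endswith("?")

examples = [
    "Can you open the window?",                # the rule fires, as intended
    "Could you open the window?",              # same request, but the rule misses it
    "Any chance you can open the window",      # no question mark, missed again
    "Oh sure, can you believe how great it went?",  # rhetorical, but the rule fires
]

for sentence in examples:
    print(f"{is_question(sentence)!s:5}  {sentence}")
```

Every miss needs yet another rule, and every new rule risks firing on sentences it was never meant for, which is exactly how these systems grew to thousands of rules.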

2. Statistical school: "Mining patterns from massive text"

From the 1990s to the mid-2010s, the Internet accumulated massive amounts of text, and the field turned to statistical machine learning models (hidden Markov models, Naive Bayes, SVMs, etc.) to "guess" the regularities of language. For example, is the word that fills the blank in "I am really ____ today" more likely to be positive or negative? A model can estimate that automatically from data.

Key shortcomings: these models need a lot of manual feature engineering. You have to tell the model which "features" to look at, such as a word's part of speech, its frequency, and the words that appear before and after it, and choosing those features relies heavily on human experience. Worse, statistical models struggle with long-distance dependencies. Take "The cat jumped on the table, then jumped on the window sill, and finally knocked over the vase. It got into trouble." The "it" here refers to "cat", but more than ten words separate the two, and a statistical model that only looks at a short window of neighboring words will most likely guess wrong.
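
As a rough illustration of "deriving patterns from data", the sketch below estimates next-word probabilities from bigram counts. The four-sentence corpus is made up for the example, and a bigram model is only one simple stand-in for the statistical models named above.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "i am really happy today",
    "i am really tired today",
    "i am really happy now",
    "she is really happy today",
]

# Count how often each word follows each preceding word (bigram counts).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    """Maximum-likelihood estimate of P(next word | previous word)."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = next_word_probs("really")
print(probs)  # {'happy': 0.75, 'tired': 0.25}
```

Note that the model only ever conditions on the single previous word. That is the long-distance dependency problem in miniature: anything said more than one word back is invisible to it.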

3. Neural network school: "Learn deep patterns on your own from massive text"

After Word2Vec appeared in 2013, NLP moved firmly into the deep learning era. The arrival of the Transformer architecture in 2017 opened a new paradigm, large-scale pre-training followed by fine-tuning on downstream tasks, which in turn gave rise to what we now call large language models (LLMs).

Core breakthroughs:

  • The model learns deep semantic representations from data automatically, removing most of the need for hand-designed features;
  • The self-attention mechanism lets the model relate every word in a sentence to every other word directly, which greatly eases the long-standing problem of long-distance dependencies;
  • The Transformer's parallel processing makes training far more efficient, which is what made giant models with hundreds of billions of parameters practical.

The table below summarizes the key milestones of the past decade or so:

| Year | Milestone | Core impact |
| --- | --- | --- |
| 2013 | Word2Vec (Google) | Turned words into meaningful numerical vectors, laying a foundation for modern NLP |
| 2017 | Transformer (Google Brain) | Dropped the sequential processing of RNN/LSTM in favor of parallel computation and self-attention, a leap in both efficiency and performance |
| 2018 | BERT (Google) / GPT-1 (OpenAI) | Established the paradigm of pre-training on a large general corpus and then fine-tuning on small tasks, setting new records on almost all NLP benchmarks |
| 2020 | GPT-3 (OpenAI) | Reached 175 billion parameters and demonstrated "emergent capabilities" such as in-context learning and simple reasoning |
| 2022 | ChatGPT (OpenAI) | Used RLHF (Reinforcement Learning from Human Feedback) to greatly improve the conversation experience, bringing LLMs into daily life |

Core tasks and common tools of NLP

NLP covers a wide range of directions, but the most common tasks in actual development can be classified into the following categories. Whether you are doing academic experiments or engineering implementation, this classification can help you quickly determine which tools to use.

| Task category | Subtasks | Typical application scenarios | Common tools/libraries |
| --- | --- | --- | --- |
| Text preprocessing | Word segmentation, part-of-speech (POS) tagging, stop word removal, text cleaning | The first step in all NLP projects, turning raw text into clean input | spaCy, NLTK, Jieba (Chinese), Hugging Face Tokenizers |
| Text understanding | Named entity recognition (NER), syntactic parsing, semantic similarity, intent recognition | Search engine keyword matching, intelligent customer service intent parsing, automatic resume screening, knowledge base Q&A pre-processing | spaCy NER, BERT-NER, Sentence-BERT, Rasa NLU |
| Text analysis | Sentiment analysis, topic modeling, text summarization | E-commerce review monitoring, news hot spot tracking, meeting minutes generation | TextBlob, VADER, BERT, LDA, GPT-3.5 / GPT-4 Turbo (short summaries) |
| Text generation | Dialogue generation, machine translation, content creation, code completion | Large model chat, cross-border e-commerce product translation, article outline generation, GitHub Copilot | GPT-4o, Claude 3, T5, BART, DeepL API |

Give it a try: Chinese word segmentation with Jieba

Text preprocessing is the most approachable place to start with NLP. The following example uses Jieba to try out Chinese word segmentation:

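```python
# Install the library first: pip install jieba
import jieba

text = "道满PythonAI是一个专注于自然语言处理和大模型的技术博客"

# 1. Precise mode: the most commonly used mode, suited to text analysis;
#    it tries to cut the sentence as accurately as possible.
seg_exact = jieba.cut(text, cut_all=False)
print("Precise mode: " + "/".join(seg_exact))

# 2. Full mode: scans out every possible word; fast, but the output is redundant.
seg_full = jieba.cut(text, cut_all=True)
print("Full mode: " + "/".join(seg_full))

# 3. Search engine mode: starts from precise mode and splits longer words again
#    to improve recall; suited to search scenarios.
seg_search = jieba.cut_for_search(text)
print("Search engine mode: " + "/".join(seg_search))
```

The output you get will look something like this:

```text
Precise mode: 道满/PythonAI/是/一个/专注于/自然语言处理/和/大模型/的/技术/博客
Full mode: 道满/PythonAI/是/一个/专注/专注于/自然/自然语言/语言/自然语言处理/处理/和/大模型/模型/的/技术/博客
Search engine mode: 道满/PythonAI/是/一个/专注/专注于/自然/语言/处理/自然语言处理/和/模型/大模型/的/技术/博客
```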

As you can see, the same sentence is cut at a different granularity in each mode, so you can pick the mode that fits your task.


Key technical concepts: how large models make sense of human language

You don't need to dig through complex mathematical derivations. Once you understand the following four core concepts, much of the mystery around large models goes away.

1. Word embeddings: give every word a semantic coordinate

Computers only work with numbers and cannot directly process words like "cat" and "dog". A word embedding (word vector) assigns each word a fixed-length list of numbers. You can think of it as pinning each word to a coordinate on a "semantic map".

On this map:

  • Words with similar meanings or usage, such as "Beijing" and "Shanghai", sit very close together;
  • Unrelated words, such as "Beijing" and "banana", sit far apart. (Antonyms such as "good" and "bad" often end up closer than you might expect, because they appear in similar contexts.)
  • Even more striking, word vectors support simple "semantic arithmetic". The classic example is king - man + woman ≈ queen.

Commonly used tools include Word2Vec and GloVe for word vectors, and Sentence-BERT, which maps entire sentences into vectors. All of them are basic building blocks of modern NLP.
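
The "semantic map" idea can be sketched with cosine similarity and vector arithmetic. The four-dimensional vectors below are hand-made for illustration (real embeddings are learned from data and usually have hundreds of dimensions), and the helper names `cosine` and `analogy` are assumptions of this sketch.

```python
import math

# Hand-made toy vectors. Axes (an assumption of this sketch):
# royalty, male, female, food.
vectors = {
    "king":  [0.9, 0.9, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.9, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "apple": [0.0, 0.0, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))  # True
```

Excluding the input words from the candidates is the usual convention for this kind of analogy test; without it, the nearest neighbor is often one of the inputs.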

2. Self-attention: let the model see what matters across the whole sentence at a glance

This is the core innovation of the Transformer and one of the sources of large models' capabilities.

Imagine you read this sentence: "He placed the heavy encyclopedia on the oak table by the window because it contained so much knowledge." Humans can tell almost immediately that "it" refers to "encyclopedia". A traditional RNN/LSTM, however, processes the sentence word by word. By the time it reaches "it", the "encyclopedia" is already a distant memory, so reference errors come easily.

The self-attention mechanism gives the model the ability to read one word while keeping the whole sentence in view. When processing each word, the model looks at all the words in the sentence at the same time and computes how relevant each of them is to the current word. When processing "it", a trained model can learn to give "heavy", "encyclopedia", and "contained so much knowledge" the highest relevance scores, and so link "it" to the right noun.
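
The relevance calculation described above can be sketched as scaled dot-product attention. This is a simplified illustration: the two-dimensional token vectors are hand-made, and the same vectors are used as queries, keys, and values, whereas a real Transformer derives them through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position attends to every position."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of each key to this query, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["encyclopedia", "table", "it"]
vecs = [[2.0, 0.0], [0.0, 2.0], [0.9, 0.1]]  # hand-made toy vectors

_, weights = self_attention(vecs, vecs, vecs)
it_weights = weights[tokens.index("it")]
focus = tokens[it_weights.index(max(it_weights))]
print(focus)  # encyclopedia
```

Because each row of weights is computed independently of the others, all positions can be processed at once, which is where the Transformer's parallelism comes from.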

3. Transformer architecture: a language engine that processes words in parallel instead of in a queue

The Transformer is built around the attention mechanism and drops the sequential constraint of RNN/LSTM, where the next word cannot be computed until the previous one is done. During training it can process all the words of a sentence at the same time, which maps well onto GPUs and makes training dramatically faster. This is a large part of why it is now feasible to train models with hundreds of billions of parameters.

Transformer mainly consists of two parts:

  • Encoder: Responsible for reading input text and extracting meaning;
  • Decoder: Responsible for generating target text based on meaning.

Classic models differ in which of these two parts they keep:

  • BERT: uses only the encoder and specializes in text understanding, such as classification, entity recognition, and extractive question answering;
  • GPT: uses only the decoder and specializes in text generation, such as dialogue, article writing, and translation;
  • T5, BART: use both and are more versatile, treating every NLP task as "take a piece of text in, put a piece of text out".
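
One concrete difference behind this split is which positions each token is allowed to attend to: encoder-style models see the whole sentence in both directions, while decoder-style models only see the current and earlier tokens, which is what lets them generate text left to right. The sketch below prints the two attention masks; the function names are illustrative assumptions.

```python
def attention_mask(num_tokens: int, causal: bool):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

def show(mask) -> str:
    """Render the mask as a grid: x = visible, . = hidden."""
    return "\n".join(" ".join("x" if allowed else "." for allowed in row) for row in mask)

print("Encoder-style (bidirectional, as in BERT):")
print(show(attention_mask(4, causal=False)))
print("Decoder-style (causal, as in GPT):")
print(show(attention_mask(4, causal=True)))
# The causal grid is lower-triangular:
# x . . .
# x x . .
# x x x .
# x x x x
```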

4. Pre-training + fine-tuning: learn general knowledge first, then practice a specialty

The idea closely mirrors how people learn. If you want to write legal documents, you first learn to read and write at school and develop the ability to express yourself fluently, and only then study legal provisions and practice drafting professional documents.

The same goes for large models:

  • Pre-training: the model learns to "speak and read human language" from general text (web pages, books, Wikipedia, etc.) amounting to hundreds of billions or even trillions of tokens. This stage is extremely expensive. For example, one widely cited estimate puts a single GPT-3 training run at around $4.6 million or more;
  • Fine-tuning: a small amount of task-specific data (such as a few thousand product reviews labeled with sentiment polarity) is used to adjust the pre-trained model slightly. This stage is light and cheap, and for small models it can be done on an ordinary computer or on Colab.

Later came techniques such as LoRA (Low-Rank Adaptation), prompt engineering, and RAG (Retrieval-Augmented Generation). LoRA makes fine-tuning itself much cheaper, while prompt engineering and RAG often make it possible to skip fine-tuning altogether and use a pre-trained large model directly for the task.
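
To get a feel for why LoRA is so much lighter than full fine-tuning, the arithmetic below compares trainable parameter counts for a single weight matrix. LoRA freezes the original d × k matrix and trains two small matrices of shapes d × r and r × k instead. The sizes used here (d = k = 4096, r = 8) are assumptions chosen for illustration, not figures from any particular model.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Full fine-tuning updates every entry of the d x k weight matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """LoRA trains B (d x r) and A (r x k); the original matrix stays frozen."""
    return r * (d + k)

d, k, r = 4096, 4096, 8  # illustrative sizes, not tied to a specific model
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)

print(f"Full fine-tuning: {full:,} trainable parameters")  # 16,777,216
print(f"LoRA (r={r}):       {lora:,} trainable parameters")  # 65,536
print(f"LoRA share:       {lora / full:.2%}")                # 0.39%
```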


Modern NLP implementation examples

Use Hugging Face to quickly implement Chinese sentiment analysis

You don't need to build complex neural networks from scratch. Hugging Face's pipeline tool packages pre-trained models into functions that can be run with one line of code. Below we use transformers to do sentiment analysis:

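```python
# Install dependencies first: pip install transformers torch
from transformers import pipeline

# Load a pre-trained Chinese sentiment model: a binary classifier
# fine-tuned on JD.com product reviews.
classifier = pipeline(
    "text-classification",
    model="uer/roberta-base-finetuned-jd-binary-chinese",
)

# Sample reviews (kept in Chinese, since the model is a Chinese one).
comments = [
    "这件衣服质量超级好,版型也很显瘦,客服态度也很棒!",
    "发货太慢了,等了整整一周才到,而且颜色和图片差很多,不推荐购买。",
    "包装还行,就是快递员直接扔在小区门口了,不太满意。",
]

for i, comment in enumerate(comments, start=1):
    # The pipeline returns a list with one dict per input text.
    result = classifier(comment)[0]
    label = result["label"]
    score = result["score"]
    print(f"Review {i}: {comment}")
    print(f"Sentiment: {label}, confidence: {score:.2f}\n")
```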

The output tells you whether each review is positive or negative, and how confident the model is in that judgment. You can use it as the starting point for a simple product reputation monitoring tool.
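
As a rough sketch of that monitoring idea, the helper below turns a batch of classifier results into a single "share of positive reviews" figure. The function name `positive_share`, the 0.6 confidence cutoff, and the sample results are all assumptions made up for illustration. Real label strings depend on the model you load, so check them before relying on the prefix test.

```python
def positive_share(results, min_confidence=0.6):
    """Fraction of confident predictions whose label starts with "positive".

    Returns None when no prediction clears the confidence cutoff.
    """
    confident = [r for r in results if r["score"] >= min_confidence]
    if not confident:
        return None
    positives = sum(1 for r in confident if r["label"].lower().startswith("positive"))
    return positives / len(confident)

# Made-up results in the same {"label", "score"} shape the pipeline returns.
sample_results = [
    {"label": "positive", "score": 0.98},
    {"label": "negative", "score": 0.95},
    {"label": "negative", "score": 0.55},  # below the cutoff, ignored
    {"label": "positive", "score": 0.91},
]

share = positive_share(sample_results)
print(f"Positive share among confident predictions: {share:.0%}")
```

Dropping low-confidence predictions is a design choice: borderline reviews are often mixed in tone, and counting them either way adds noise to the trend.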


Study suggestions

1. **Lay the foundation**: First learn the basics of Python programming, then get started with NLP text preprocessing (word segmentation, cleaning) and try running tools such as Jieba and spaCy.
2. **Understand the principles**: There is no need to read difficult papers at the start. Begin with our basic tutorials such as "Transformer Complete Architecture" and "Word Vector Space" to understand the core concepts.
3. **Practice a lot**: Use Hugging Face's pipeline to quickly build several small projects (text classification, simple chatbots). Once you have some early successes, move on gradually to more advanced techniques such as fine-tuning, prompt engineering, and RAG.