title: Named Entity Recognition (NER): A complete guide to accurately extracting names of people, places, and organizations from text | Daoman PythonAI
description: An in-depth look at named entity recognition: the BIO annotation scheme, sequence labeling, and BERT-based NER implementation. Covers the complete development process for a Chinese NER model, including data preprocessing, model fine-tuning, and practical application.
keywords: [named entity recognition, NER, sequence labeling, BIO annotation, BERT, entity extraction, natural language processing, deep learning, Chinese NER, entity recognition]
Named Entity Recognition (NER): A complete guide to accurately extracting key information from text
NER Overview and Application
Put simply, NER lets machines pick out the key information in a text: who, where, which organization, when, and how much.
Common predefined entity types
These categories can be extended to fit business needs. For example, medical applications add entity types such as "symptom" and "drug", while financial applications use types such as "company", "amount", and "date".
Core application scenarios
NER is often the first step in information extraction, paving the way for more complex applications:
Now that we understand what it can do, let's look at how NER is technically implemented.
Detailed explanation of BIO annotation method
NER is usually modeled as a sequence labeling problem: given a sequence of characters (or words), assign a label to each position. BIO is the most common annotation scheme in this field and the easiest to implement.
Core annotation rules
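For example, when the sentence "Xiao Ming studies at Peking University" is annotated at the character level, the two characters of 小明 (Xiao Ming) are tagged B-PER and I-PER, the four characters of 北京大学 (Peking University) are tagged B-ORG, I-ORG, I-ORG, and I-ORG, and every other character is tagged O.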
**Why is Chinese usually annotated by characters instead of words?** Because word segmentation itself can be wrong. If "Peking University" is mistakenly segmented into "Beijing" and "University", it becomes difficult to align entity boundaries with the segments. Character-level annotation avoids this error propagation and lets the model learn entity boundaries directly from the characters.
In real projects, BIO already covers most scenarios. Consider upgrading to BIOES only when you find that your model frequently gets entity boundaries wrong.
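If you do decide to move to BIOES, existing BIO data can be converted mechanically. The following is a minimal sketch, assuming the usual BIOES convention (E marks the last token of a multi-token entity, S marks a single-token entity) and well-formed BIO input. The function name bio_to_bioes and the sample tags are illustrative, not part of any library.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        # The entity continues only if the next tag is an I- tag of the same type.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{etype}"
        if prefix == "B":
            bioes.append(f"B-{etype}" if continues else f"S-{etype}")
        else:  # prefix == "I"
            bioes.append(f"I-{etype}" if continues else f"E-{etype}")
    return bioes


# Tags for a two-character person name, one O character,
# and a four-character organization name.
bio = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(bio_to_bioes(bio))
```

The explicit E and S tags give the model a direct signal about where an entity ends, which is why BIOES can help when boundary errors dominate.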
Chinese NER data set and preprocessing
Commonly used public Chinese data sets
To train your own NER model, you need labeled data. Here are several high-quality, free Chinese NER datasets:
Choose the data set to match your task: for social media analysis, the Weibo data set is closer to real-world text; for resume parsing, the Resume data set is a good fit.
Core preprocessing: text and entity conversion
After getting the data, the most common preprocessing operation is to convert between original text and BIO tag sequence. The following is a general processing tool class:
This class does two things:
- text_to_bios: given the text and a list of entity intervals (for example [(0,2,"PER")]), generate a label for each character.
- bios_to_entities: re-assemble the entities from the character sequence and the label sequence, which is very useful when restoring results after model prediction.
You can use it to quickly view the labeling effect of training samples, or convert labels into human-readable entities after model inference.
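A minimal sketch of such a helper is shown here. The class name NERDataProcessor, the end-exclusive interpretation of the intervals, and the sample sentence are assumptions made for illustration; only the two method names come from the description above.

```python
class NERDataProcessor:
    """Hypothetical helper that converts between entity spans and BIO tags."""

    @staticmethod
    def text_to_bios(text, entities):
        """entities: list of (start, end, type) with end exclusive."""
        tags = ["O"] * len(text)
        for start, end, etype in entities:
            if not 0 <= start < end <= len(text):
                raise ValueError(f"invalid span: {(start, end, etype)}")
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"
        return tags

    @staticmethod
    def bios_to_entities(text, tags):
        """Return a list of (entity_text, type, start, end) tuples."""
        entities = []
        start, etype = None, None
        # A trailing "O" sentinel flushes an entity that ends the sentence.
        for i, tag in enumerate(list(tags) + ["O"]):
            if tag.startswith("I-") and start is not None and tag[2:] == etype:
                continue  # still inside the current entity
            if start is not None:
                entities.append((text[start:i], etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            # I- tags without a matching B- are ignored in this sketch.
        return entities


text = "小明在北京大学学习"  # assumed sample sentence
tags = NERDataProcessor.text_to_bios(text, [(0, 2, "PER"), (3, 7, "ORG")])
for char, tag in zip(text, tags):
    print(char, tag)
print(NERDataProcessor.bios_to_entities(text, tags))
```

Overlapping or nested spans are not handled: a later span simply overwrites the tags of an earlier one, so flat BIO data is assumed.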
NER implementation based on BERT
The mainstream approach today is to build NER on top of a pre-trained language model such as BERT. The good news is that the community already offers many Chinese models that have been pre-trained and then fine-tuned on NER tasks, so you can use them directly without training anything yourself.
Use ready-made models for rapid inference
The Hugging Face transformers library provides a very convenient pipeline interface:
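A rough sketch of that usage, under stated assumptions: the helper names load_ner_pipeline and extract_entities are illustrative, the model id is left as a placeholder you must fill in with a Chinese NER checkpoint of your choice, and the tokenizer is assumed to be a fast tokenizer so that start and end offsets are available. The pipeline task name "token-classification" and the aggregation_strategy argument are part of the transformers API.

```python
def load_ner_pipeline(model_name):
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; requires transformers

    # aggregation_strategy="simple" merges B-/I- pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",
    )


def extract_entities(ner, text, min_score=0.5):
    """Run an NER callable and keep (surface, type, start, end) above a threshold."""
    entities = []
    for item in ner(text):
        if item["score"] < min_score:
            continue
        # Slice the original text rather than using item["word"], which may
        # contain tokenizer artifacts such as spaces between Chinese characters.
        surface = text[item["start"]:item["end"]]
        entities.append((surface, item["entity_group"], item["start"], item["end"]))
    return entities


# Usage (needs network access and a real model id):
# ner = load_ner_pipeline("<your-chinese-ner-model>")  # placeholder, not a real id
# print(extract_entities(ner, "小明在北京大学学习"))
```

Keeping the post-processing in a separate function makes it easy to test with a stand-in callable before any model is downloaded.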
The output may be similar to:
Model fine-tuning and evaluation
Although it is convenient to use the ready-made model directly, if the entity types in your business are very special (such as internal product code names, industry-specific terminology), fine-tuning is essential.
The core fine-tuning process (simplified)
Fine-tuning BERT to do NER usually involves several key steps:
- Data alignment: Map your character-level BIO tags to BERT's token level. The tokenizer adds special tokens such as [CLS] and [SEP], and it may split some inputs (for example, English words or numbers) into multiple sub-word tokens. In that case the labels need to be handled explicitly, for example by giving the continuation sub-words the I- label of the same entity.
- Model adaptation: Add a classification head on top of the pre-trained BERT that maps each hidden-layer output to num_labels categories.
- Training and evaluation: Train with cross-entropy loss and evaluate with entity-level F1 scores.
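The data alignment step is where most bugs hide, so here is a minimal sketch of it. It assumes you tokenized a list of characters with a fast Hugging Face tokenizer (is_split_into_words=True) and obtained the mapping from tokens to input positions via the encoding's word_ids() method. The function name align_labels and the label2id mapping are illustrative. Besides the I- label strategy, the sketch also supports the common alternative of masking continuation pieces with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def align_labels(word_ids, char_tags, label2id, label_all_subwords=False):
    """Map per-character BIO tags onto tokenizer tokens.

    word_ids: one entry per token; None for special tokens, otherwise the
              index of the input character/word the token came from.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # [CLS], [SEP], and padding carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First token of this character/word keeps the original tag.
            aligned.append(label2id[char_tags[wid]])
        elif label_all_subwords:
            # Continuation piece: use the I- label of the same entity.
            tag = char_tags[wid]
            if tag.startswith("B-"):
                tag = "I-" + tag[2:]
            aligned.append(label2id[tag])
        else:
            # Continuation piece: exclude it from the loss.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3, "I-ORG": 4}
# Hand-written word_ids: [CLS], three inputs (the third split in two), [SEP].
word_ids = [None, 0, 1, 2, 2, None]
tags = ["B-PER", "I-PER", "B-ORG"]
print(align_labels(word_ids, tags, label2id))
print(align_labels(word_ids, tags, label2id, label_all_subwords=True))
```

Whichever strategy you pick, use the same one when decoding predictions, otherwise the entity-level scores will not reflect what the model learned.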
Entity-level evaluation indicators
When evaluating an NER model, you cannot rely on token-level accuracy alone. Because most tokens are 'O' (non-entity), a model that predicts 'O' everywhere still achieves high accuracy without recognizing a single entity. We therefore use the entity-level F1 score, which counts a prediction as correct only when both the boundaries and the type of the entity match exactly.
The seqeval library makes it easy to do entity-level evaluation:
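To make the matching rule visible, the sketch below computes the same kind of entity-level score in plain Python rather than calling seqeval. In practice seqeval.metrics.f1_score and classification_report take the same list-of-tag-lists input. The function names and the sample tags are illustrative assumptions, and seqeval's handling of malformed sequences (such as an I- tag with no preceding B-) may differ from this simplified version.

```python
def get_entities(tags):
    """Extract (type, start, end) spans from a BIO sequence; end is exclusive."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue
        if start is not None:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(y_true, y_pred):
    """Return (precision, recall, f1); a span counts only on an exact match."""
    true_spans, pred_spans = set(), set()
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(idx, *span) for span in get_entities(t)}
        pred_spans |= {(idx, *span) for span in get_entities(p)}
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Assumed sample: nine characters, a PER entity (0-2) and an ORG entity (3-7).
y_true = [["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]]
# The prediction tags the second character of the person name as O.
y_pred = [["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]]

token_acc = sum(t == p for t, p in zip(y_true[0], y_pred[0])) / len(y_true[0])
print(f"token accuracy: {token_acc:.3f}")  # 8 of 9 tags match
print("precision, recall, F1:", entity_f1(y_true, y_pred))
```

A single wrong tag costs one of the two entities, so the entity-level score drops far more than token accuracy does.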
In this example, the prediction mislabels the "明" of "Xiao Ming" (小明) as O, so the entity "Xiao Ming" is not fully recalled and the F1 score drops. classification_report additionally reports the precision, recall, and F1 of each entity category, which helps you locate the problem.
Practical application cases
Slot extraction for e-commerce customer service
In dialogue systems, NER most commonly appears as slot filling. Take an e-commerce customer service query as an example: "Help me check the refund for order number AB123456, paid yesterday for 999 yuan." We need to extract information such as the product, order number, time, and amount.
The following is a simplified implementation framework:
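As a hedged sketch of the slot-filling step only, the code below turns entity predictions into a slot dictionary. It assumes a fine-tuned model whose aggregated pipeline output contains entity_group, score, start, and end. The label names (ORDER_ID, TIME, MONEY, PRODUCT), the slot names, and the mocked model output are hypothetical and stand in for a real model.

```python
# Hypothetical mapping from model labels to backend slot names.
LABEL_TO_SLOT = {
    "ORDER_ID": "order_id",
    "TIME": "time",
    "MONEY": "amount",
    "PRODUCT": "product",
}


def entities_to_slots(text, entities, min_score=0.5):
    """Group confident entity predictions into {slot_name: [values]}."""
    slots = {}
    for ent in entities:
        slot = LABEL_TO_SLOT.get(ent["entity_group"])
        if slot is None or ent["score"] < min_score:
            continue  # unknown label or low confidence
        slots.setdefault(slot, []).append(text[ent["start"]:ent["end"]])
    return slots


query = "Help me check the refund for order number AB123456, paid yesterday for 999 yuan."


def mock_entity(surface, label, score=0.99):
    """Hand-written stand-in for one item of model output."""
    start = query.index(surface)
    return {"entity_group": label, "score": score,
            "start": start, "end": start + len(surface)}


mock_output = [
    mock_entity("AB123456", "ORDER_ID"),
    mock_entity("yesterday", "TIME"),
    mock_entity("999 yuan", "MONEY"),
]
print(entities_to_slots(query, mock_output))
```

Each slot holds a list because a query can mention several values of the same type, such as two order numbers.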
If the model is trained properly, the output will look like:
With these structured slots, the customer service system can query the backend directly for the order and match it against the amount and time, which greatly improves the rate of automated handling.

