title: Named Entity Recognition (NER): A complete guide to accurately extracting names of people, places, and organizations from text | Daoman PythonAI
description: An in-depth look at named entity recognition: the BIO annotation scheme, sequence labeling, and BERT-based NER. Covers the full development process for a Chinese NER model, including data preprocessing, model fine-tuning, and practical application.
keywords: [Named entity recognition, NER, sequence labeling, BIO annotation, BERT, entity extraction, natural language processing, deep learning, Chinese NER, entity recognition]

Named Entity Recognition (NER): A complete guide to accurately extracting key information from text


NER Overview and Application

Named Entity Recognition (NER) is a fundamental task in natural language processing. Given a piece of free text, its goal is to:

  1. Find exactly which words belong to an "entity"
  2. Assign each entity to one of a set of **predefined categories**

In other words, NER allows machines to understand key information such as "who, where, what organization, when, and how much" in the text.

Common predefined entity types

| Label | Full English name | Description | Typical examples |
| --- | --- | --- | --- |
| PER | Person | Person name | Xiao Ming, Lei Jun, Tu Youyou |
| ORG | Organization | Organization name | Tsinghua University, ByteDance, United Nations |
| LOC | Location | Place name | Beijing, Mount Everest, Pacific Ocean |
| TIME | Time | Time expression | 2024, yesterday, 10 a.m. |
| MONEY | Money | Monetary amount | 100 yuan, $999, five million |
| PROD | Product | Product name | iPhone 16, Huawei Mate XT, Tesla Model Y |

These categories can be extended to fit business needs. Medical scenarios, for example, add entities such as "symptom" and "drug", while financial scenarios use "company", "amount", "date", and so on.

Core application scenarios

NER is often the first step in information extraction, paving the way for more complex applications:

| Field | Specific use | Implementation example |
| --- | --- | --- |
| Search engine | Keyword highlighting, entity linking | When searching for "Ma Huateng", affiliated companies such as "Tencent" are also highlighted |
| Intelligent customer service | Intent understanding, slot filling | In "Query the refund progress of order AB123456", identify the order number and operation type |
| Financial risk control | Risk entity identification | Extract gang-related companies and suspicious amounts from transaction texts |
| Healthcare | Medical entity extraction | Extract symptoms, drugs, and diagnostic results from medical records |

Now that we understand what it can do, let's look at how NER is technically implemented.


Detailed explanation of BIO annotation method

NER is usually modeled as a sequence labeling problem: given a sequence of characters (or words), assign a label to each position. BIO is the most common tagging scheme in this field and one of the easiest to implement.

Core annotation rules

| Tag prefix | Meaning | Description |
| --- | --- | --- |
| B- | Beginning of entity | First word/character of an entity |
| I- | Inside the entity | Subsequent word/character of the same entity |
| O | Non-entity | Does not belong to any predefined entity |

For example, if the sentence "Xiao Ming studies at Peking University" (小明在北京大学读书) is annotated at the character level, the result is as follows:

Original text:      小  明  在  北  京  大  学  读  书
Character index:    0  1  2  3  4  5  6  7  8
BIO tags:           B-PER I-PER O B-ORG I-ORG I-ORG I-ORG O O
Recovered entities: 小明 (PER), 北京大学 (ORG)

**Why is Chinese usually annotated by character rather than by word?** Because word segmentation itself can be wrong. If "Peking University" (北京大学) is mistakenly segmented into "Beijing" (北京) and "University" (大学), the entity is split across tokens and its boundaries become hard to align. Character-based annotation avoids this error propagation and lets the model learn entity boundaries directly from the characters.

If finer boundary control is required, **BIOES annotation** can be used. It adds two tags to BIO:

- **S-xxx**: a single-token entity. For example, if "Shenzhen" is treated as one token, it carries the single tag S-LOC.
- **E-xxx**: the end of an entity. For example, the final character "学" of "北京大学" (Peking University) is tagged E-ORG.

In actual projects, BIO already covers most scenarios. Consider upgrading to BIOES only when you find that your model frequently gets entity boundaries wrong.
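Because BIOES carries the same information as BIO, an existing BIO corpus can be converted mechanically rather than re-annotated. The following is a minimal sketch of that conversion, assuming well-formed BIO input; the function name `bio_to_bioes` is illustrative and not part of any library.

```python
def bio_to_bioes(tags):
    """Convert one BIO tag sequence to BIOES (assumes well-formed BIO input)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


# Tags of the character-level example sentence 小明在北京大学读书
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
print(bio_to_bioes(tags))
# ['B-PER', 'E-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O', 'O']
```

The only change is at entity ends: the last position of a multi-token entity becomes E-, and an entity of length one becomes S-.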


Chinese NER data set and preprocessing

Commonly used public Chinese data sets

To train your own NER model, you need labeled data. Here are several high-quality, free Chinese NER datasets:

| Dataset name | Data scale | Entity types | Source field | Download channel |
| --- | --- | --- | --- | --- |
| MSRA | ~50,000 sentences | PER/ORG/LOC | News | CLUE Benchmark |
| OntoNotes 4 | ~16,000 sentences | PER/ORG/LOC/GPE | Multiple fields | CoNLL 2012 |
| Weibo | ~1,900 sentences | PER/ORG/LOC/GPE | Social media | LTP Toolkit |
| Resume | ~4,800 sentences | NAME/ORG/RACE and 5 others (8 categories) | Resumes | Harbin Institute of Technology |

Match the dataset to your domain: for social media analysis, the Weibo dataset is closest to real-world text; for resume parsing, the Resume dataset is a good fit.
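Public NER corpora are often distributed as plain text with one token and its tag per line and a blank line between sentences. The sketch below parses that layout; the exact file format is an assumption here, so check the files of the dataset you actually download. The function name `read_conll` is illustrative.

```python
def read_conll(lines):
    """Parse 'token tag' lines into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the character or word
        tags.append(parts[-1])    # last column: the BIO tag
    if tokens:  # the last sentence may lack a trailing blank line
        sentences.append((tokens, tags))
    return sentences


sample = """小 B-PER
明 I-PER
在 O

北 B-LOC
京 I-LOC
"""
for tokens, tags in read_conll(sample.splitlines()):
    print("".join(tokens), tags)
```

For a real file, pass an open file object instead of `sample.splitlines()`.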

Core preprocessing: text and entity conversion

Once you have the data, the most common preprocessing step is converting between raw text and BIO tag sequences. The following is a general-purpose helper class:

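from typing import Optional


class NERDataProcessor:
    """
    Chinese NER preprocessor: character-level BIO tagging and entity recovery.
    """
    def __init__(self, entity_types: Optional[list] = None):
        if entity_types is None:
            entity_types = ["PER", "ORG", "LOC"]
        self.entity_types = entity_types

    def text_to_bios(self, text: str, entities: list) -> tuple:
        """
        Convert text to character-level BIO tags.
        entities format: [(start_char, end_char, label), ...] with end_char exclusive.
        """
        char_labels = ["O"] * len(text)
        last_end = 0
        # Process spans in order of start position and skip any span that
        # overlaps one already written, so entities never overwrite each other.
        for s, e, l in sorted(entities, key=lambda x: x[0]):
            if s < last_end:
                continue
            if 0 <= s < e <= len(text) and l in self.entity_types:
                char_labels[s] = f"B-{l}"
                for i in range(s + 1, e):
                    char_labels[i] = f"I-{l}"
                last_end = e
        return list(text), char_labels

    def bios_to_entities(self, chars: list, labels: list) -> list:
        """
        Recover complete entities from character-level BIO tags.
        Return format: [(entity_text, label, start, end), ...] with end exclusive.
        """
        entities, curr = [], None
        for i, (c, l) in enumerate(zip(chars, labels)):
            if l.startswith("B-"):
                # A new entity starts: close the one in progress, if any.
                if curr:
                    entities.append(("".join(curr[0]), curr[1], curr[2], i))
                curr = ([c], l[2:], i)
            elif l.startswith("I-") and curr and curr[1] == l[2:]:
                # Continue the current entity of the same type.
                curr[0].append(c)
            else:
                # "O" or an I- tag that does not match: close the current entity.
                if curr:
                    entities.append(("".join(curr[0]), curr[1], curr[2], i))
                curr = None
        if curr:
            entities.append(("".join(curr[0]), curr[1], curr[2], len(chars)))
        return entities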

This class does two things:

  1. `text_to_bios`: given a text and its entity spans (for example, `[(0, 2, "PER")]`), generates a label for each character.
  2. `bios_to_entities`: reassembles entities from the character sequence and the label sequence, which is very useful for restoring results after model prediction.

You can use it to quickly view the labeling effect of training samples, or convert labels into human-readable entities after model inference.
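When inspecting training samples, one quick sanity check is to look for I- tags that do not continue an entity of the same type, since the recovery logic above does not attach such tags to any entity. The helper below is a small sketch of that check; `find_bio_errors` is an illustrative name, not part of the class above.

```python
def find_bio_errors(labels):
    """Return indices where an I- tag does not continue an entity of the same type."""
    errors = []
    prev = "O"
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            # I-X is legal only directly after B-X or I-X with the same type X.
            if prev == "O" or prev[2:] != label[2:]:
                errors.append(i)
        prev = label
    return errors


labels = ["B-PER", "I-PER", "O", "I-ORG", "B-LOC", "I-PER"]
print(find_bio_errors(labels))  # [3, 5]
```

Index 3 is an I-ORG directly after O, and index 5 is an I-PER directly after a B-LOC; both usually point to annotation or conversion mistakes.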


NER implementation based on BERT

The most mainstream approach today is to build NER on a pre-trained language model such as BERT. The good news is that the community already offers many Chinese models that have been pre-trained and then fine-tuned for NER, so you can use them directly without training anything yourself.

Use ready-made models for rapid inference

The Hugging Face `transformers` library provides a very convenient `pipeline` interface:

from transformers import pipeline

# Suggested open-source Chinese model: ckiplab/bert-base-chinese-ner
ner_pipeline = pipeline(
    task="token-classification",
    model="ckiplab/bert-base-chinese-ner",
    aggregation_strategy="simple"  # merge adjacent tokens of the same entity type
)

# Inference example
text = "雷军在北京小米科技园发布了小米15"
results = ner_pipeline(text)

# Formatted output
print("Recognition results:")
for res in results:
    print(f"[{res['entity_group']}] {res['word']} (confidence: {res['score']:.2f})")

The output has the following shape. The labels and scores below are illustrative; the actual tag names depend on the label set the model was trained with, which you can inspect through ner_pipeline.model.config.id2label:

Recognition results:
[PER] 雷军 (confidence: 0.99)
[LOC] 北京 (confidence: 0.98)
[ORG] 小米科技园 (confidence: 0.96)
[PROD] 小米15 (confidence: 0.94)

The model hub also hosts NER models for specific domains such as news, social media, and medicine. If an open-source model does not perform well enough on your data, consider fine-tuning it with your own data.
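When pipeline results are consumed by downstream code, two small post-processing steps are often useful: dropping low-confidence predictions, and recovering the entity text by slicing the original string with the character offsets instead of relying on `res["word"]`, which is rebuilt from tokens and may contain spaces between Chinese characters. The sketch below assumes each result carries `start` and `end` offsets; the function name, the threshold, and the mock data are illustrative only.

```python
def clean_entities(text, results, min_score=0.5):
    """Keep confident predictions and recover surface text from character offsets."""
    entities = []
    for res in results:
        if res["score"] < min_score:
            continue  # drop low-confidence predictions
        start, end = res["start"], res["end"]
        entities.append({
            "text": text[start:end],  # slice the original string, not res["word"]
            "label": res["entity_group"],
            "start": start,
            "end": end,
        })
    return entities


text = "雷军在北京小米科技园发布了小米15"
# Hand-written stand-in for pipeline output, used only to show the expected shape.
mock_results = [
    {"entity_group": "PER", "score": 0.99, "word": "雷 军", "start": 0, "end": 2},
    {"entity_group": "LOC", "score": 0.30, "word": "北 京", "start": 3, "end": 5},
]
print(clean_entities(text, mock_results))
# [{'text': '雷军', 'label': 'PER', 'start': 0, 'end': 2}]
```

The right threshold depends on your data; 0.5 here is only a placeholder.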

Model fine-tuning and evaluation

Although it is convenient to use the ready-made model directly, if the entity types in your business are very special (such as internal product code names, industry-specific terminology), fine-tuning is essential.

Core fine-tuning process (simplified)

Fine-tuning BERT to do NER usually involves several key steps:

  1. Data alignment: Map your character-level BIO tags to BERT's token level. The tokenizer adds special tokens such as [CLS] and [SEP] and may split a word or number into several sub-word tokens. These extra positions need labels too: special tokens are usually masked out of the loss, and trailing sub-word tokens are either masked as well or given the I- tag of the same entity.
  2. Model adaptation: Add a classification head on top of the pre-trained BERT that maps each token's hidden state to `num_labels` categories.
  3. Training and evaluation: Train with cross-entropy loss and evaluate with entity-level F1.
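The data alignment step can be sketched in plain Python. The example assumes you already have a `word_ids` list of the kind returned by a Hugging Face fast tokenizer's `word_ids()` method (one entry per token, `None` for special tokens, otherwise the index of the source character or word). The helper names and the sample `word_ids` are illustrative, not the output of a real tokenizer run.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss


def build_label_maps(entity_types):
    """Build label<->id maps for a BIO scheme: O, B-X, I-X for each type X."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"B-{t}", f"I-{t}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def align_labels(word_ids, char_labels, label2id, label_all_subwords=False):
    """Project per-character (or per-word) BIO labels onto tokenizer tokens."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS], [SEP]
        elif wid != prev:
            aligned.append(label2id[char_labels[wid]])  # first token of a unit
        elif label_all_subwords:
            # Trailing sub-word token: reuse the entity type with an I- prefix.
            label = char_labels[wid]
            if label.startswith("B-"):
                label = "I-" + label[2:]
            aligned.append(label2id[label])
        else:
            aligned.append(IGNORE_INDEX)  # trailing sub-word token, masked out
        prev = wid
    return aligned


label2id, id2label = build_label_maps(["PER", "ORG"])
char_labels = ["B-PER", "I-PER", "B-ORG"]
word_ids = [None, 0, 1, 2, 2, None]  # hypothetical: unit 2 was split into two tokens
print(align_labels(word_ids, char_labels, label2id))
# [-100, 1, 2, 3, -100, -100]
print(align_labels(word_ids, char_labels, label2id, label_all_subwords=True))
# [-100, 1, 2, 3, 4, -100]
```

`len(label2id)` is the value to pass as `num_labels` when adding the classification head. Whether to mask trailing sub-word tokens or label them with I- is a design choice; masking keeps the loss focused on one prediction per original character or word.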

Entity-level evaluation indicators

When evaluating an NER model, you cannot rely on token-level accuracy alone. Because most tokens are 'O' (non-entity), a model that predicts 'O' everywhere still achieves high accuracy without recognizing a single entity. We therefore use the entity-level F1 score, which counts a prediction as correct only when both the boundaries and the type of the entity match exactly.

The `seqeval` library makes entity-level evaluation easy:

from seqeval.metrics import f1_score, classification_report

# Example: gold labels and predicted labels (token-level sequences)
y_true = [["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]]
y_pred = [["B-PER", "O",     "O", "B-ORG", "I-ORG", "I-ORG"]]

# Entity-level evaluation
print(f"Entity-level F1: {f1_score(y_true, y_pred):.2f}")
print("\nClassification report:")
print(classification_report(y_true, y_pred))

In this example, the prediction mislabels the second character of "Xiao Ming" (明) as O, so the PER entity's boundary is wrong and the entity counts as missed. Only one of the two gold entities (the ORG) is matched exactly, giving precision 0.50, recall 0.50, and F1 0.50. `classification_report` also lists precision, recall, and F1 for each entity category to help you locate the problem.
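To make the difference between token accuracy and entity-level F1 concrete, the following is a from-scratch sketch of exact-match span scoring on a single sequence. It is a simplified illustration, not a replacement for `seqeval`: stray I- tags are ignored here, and `seqeval` may handle malformed sequences differently. All function names are illustrative.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in a BIO sequence, end exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((label, start, i))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def entity_f1(y_true, y_pred):
    """Exact-match F1: a span counts only if label, start, and end all agree."""
    true_spans, pred_spans = extract_spans(y_true), extract_spans(y_pred)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["B-PER", "I-PER"] + ["O"] * 8
all_o = ["O"] * 10  # a degenerate model that never predicts an entity
print(token_accuracy(y_true, all_o))  # 0.8
print(entity_f1(y_true, all_o))       # 0.0
```

The all-O prediction is right on 8 of 10 tokens yet finds no entity, which is exactly the gap that entity-level F1 exposes.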


Practical application cases

E-commerce customer service: intent slot extraction

In dialogue systems, NER most commonly appears as slot filling. Take an e-commerce customer service query: "Help me check the refund for order number AB123456, paid yesterday for 999 yuan." We need to extract the product, order number, time, amount, and similar information.

The following is a simplified implementation framework:

from transformers import pipeline


class EcomNER:
    def __init__(self, model_name="your-finetuned-ecom-ner"):
        # Load your own fine-tuned e-commerce NER model.
        # "your-finetuned-ecom-ner" is a placeholder, not a published model.
        self.ner = pipeline(
            "token-classification",
            model=model_name,
            aggregation_strategy="simple"
        )

    def extract_slots(self, query: str) -> dict:
        """
        Extract the core e-commerce slots from a user query.
        Assumes the model's entity labels match the slot names below (case-insensitive).
        """
        slots = {"product": [], "order_id": [], "time": [], "amount": []}
        results = self.ner(query)
        for res in results:
            eg = res["entity_group"].lower()
            if eg in slots:
                slots[eg].append(res["word"])
        return slots

# Demo
ecom_ner = EcomNER()
query = "帮我查下订单号AB123456的退款,昨天付的999元"
print("User query:", query)
print("Extracted slots:", ecom_ner.extract_slots(query))

If the model is trained properly, the output will look like:

User query: 帮我查下订单号AB123456的退款,昨天付的999元
Extracted slots: {'product': [], 'order_id': ['AB123456'], 'time': ['昨天'], 'amount': ['999元']}

With these structured slots, the customer service system can query the backend for the order directly and check the amount and time against it, which greatly improves the automatic handling rate.
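Before the slots reach a backend query, it usually helps to normalize and validate the raw strings. The sketch below shows one way to do that with regular expressions; the order-number format (two uppercase letters followed by six digits) is a hypothetical rule inferred from the example query, and both helper names are illustrative.

```python
import re


def normalize_amount(raw):
    """Parse an amount such as '999元' or '1,299.50元' into a float, or None if no digits."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def is_valid_order_id(raw):
    # Hypothetical format: two uppercase letters followed by six digits.
    return re.fullmatch(r"[A-Z]{2}\d{6}", raw) is not None


slots = {"product": [], "order_id": ["AB123456"], "time": ["昨天"], "amount": ["999元"]}
order_ids = [oid for oid in slots["order_id"] if is_valid_order_id(oid)]
amounts = [normalize_amount(a) for a in slots["amount"]]
print(order_ids, amounts)  # ['AB123456'] [999.0]
```

Amounts written in Chinese numerals, such as "五百万", return `None` here and would need a dedicated converter.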


  1. First understand the **BIO annotation method** and character-level preprocessing, which are the foundation for everything else.
  2. Run inference with an open-source model from the Hugging Face Hub to get an intuitive feel for the task.
  3. Fine-tune on a small amount of labeled data from your own scenario, and compare results before and after fine-tuning.
  4. In real projects, prioritize **data annotation quality**, which matters more than adjusting the model structure.
