Build a search engine using Elasticsearch

Many people first encounter Elasticsearch (ES) as a handy search tool, but at its core it is a distributed, near-real-time document store and analytics engine built on Apache Lucene: storage is only the foundation, and retrieval is its core strength. This article helps you get started quickly and uses the Python client to build a prototype of a basic Chinese full-text search system.


1. Core concepts first: an analogy with relational databases

Beginners are often confused by the terminology in ES. Let's first map it onto familiar database concepts. The mapping is simple but good enough to start with (note that the concepts are not strictly equivalent):

Relational database | Elasticsearch | Brief description
Database | Index | Top-level container for data; the name must be all lowercase
Table | Type | Deprecated in 7.x and removed in 8.x; an index now holds a single kind of document
Row | Document | The core data unit, in JSON format; one document is one record
Column | Field | A key-value pair in the JSON; supports types such as text, numbers, and dates
Primary key | Document ID | Auto-generated or supplied by the application (e.g. an order number or news ID)

Put simply, now that types are gone, you can think of an index as a table that stores JSON documents, with each document being one row. The operations that follow all revolve around indexes and documents.
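To make the correspondence concrete, here is a minimal sketch that turns one relational row into a document ID plus a JSON-ready field dict. The table layout, column names, and the helper `row_to_document` are hypothetical and exist only for illustration; nothing here talks to ES yet.

```python
# Hypothetical row from a relational "news" table: (id, title, category).
columns = ("id", "title", "category")
row = ("20240520_1", "2024年高考报名时间公布", "教育")


def row_to_document(columns, row, id_column="id"):
    """Split a row into (document ID, dict of fields)."""
    record = dict(zip(columns, row))
    doc_id = record.pop(id_column)  # primary key -> document ID
    return doc_id, record           # remaining columns -> fields


doc_id, document = row_to_document(columns, row)
print(doc_id, document)
```

The primary key becomes the document ID, and every other column becomes a field of the JSON document.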


2. Set up a local test environment in 5 minutes

There is no need to wrestle with cluster configuration. A single-node Docker container is the fastest way for beginners to get started.

2.1 Pull and start ES

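# Pull version 8.12.0 (if you use a different version, the IK analyzer installed later must match it)
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.12.0

# Start a single node (cluster discovery off and security disabled for easy testing; never do this in production!)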
docker run -d \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.security.enrollment.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.0

Wait about 30 seconds after startup, then open http://localhost:9200. If a JSON response with cluster information is returned, ES is running.
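Instead of waiting a fixed 30 seconds, you can poll the endpoint until it answers. The following is a rough sketch using only the standard library; the function names, retry count, and delay are assumptions chosen for illustration, and it assumes security is disabled as in the `docker run` command above.

```python
import json
import time
import urllib.request


def fetch_cluster_info(url="http://localhost:9200"):
    """GET the ES root endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_cluster_info, attempts=30, delay=2.0):
    """Call `fetch` until it succeeds; return its result, or None on timeout."""
    for _ in range(attempts):
        try:
            return fetch()
        except (OSError, ValueError):  # connection refused, or body not JSON yet
            time.sleep(delay)
    return None


# Usage (requires the container started above):
# info = wait_until_ready()
# print(info["version"]["number"] if info else "Elasticsearch did not respond in time")
```

Passing `fetch` as a parameter keeps the retry loop independent of the network call, which makes it easy to try out without a running container.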

2.2 Install Python client

The official client, elasticsearch-py, is recommended. Install it with pip, pinning a version that matches your ES server:

pip install elasticsearch==8.12.0  # keep in line with the ES version to avoid compatibility problems

3. Preparation: Index definition (Mapping)

ES can infer field types automatically from the first document you insert. For Chinese full-text retrieval, however, it is better to specify the mapping by hand: for example, assign a Chinese analyzer to the text fields, and set fields that should not be tokenized, such as URLs and status codes, to the keyword type so they are not broken into pieces.

Here we use news search as the scenario. First install the IK analyzer (by default, ES's standard analyzer segments English into words but splits Chinese text into individual characters):

# First find the container ID: docker ps
docker exec -it <your-container-id> bin/elasticsearch-plugin install \
  https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip

# Restart the container after installation
docker restart <your-container-id>

After restarting, create the index using Python:

from elasticsearch import Elasticsearch

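# Connect to the local ES node
es = Elasticsearch("http://localhost:9200")
if not es.ping():
    raise ConnectionError("Cannot connect to Elasticsearch; check that the container is running!")

# Define the index mapping
news_mapping = {
    "mappings": {
        "properties": {
            "title": {  # News title: fine-grained tokens support combined searches such as “中国高考” or “高考政策”
                "type": "text",
                "analyzer": "ik_max_word",       # finest-grained segmentation at index time
                "search_analyzer": "ik_smart"    # coarse-grained segmentation at search time
            },
            "content": {  # News body: same as above
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            },
            "url": {  # News link: no tokenization needed, exact match only
                "type": "keyword"
            },
            "publish_date": {  # Publish time: used for range filtering
                "type": "date"
            },
            "category": {  # Category: exact match
                "type": "keyword"
            }
        }
    }
}

# Create the index; ignoring status 400 skips the error if the index already exists.
# Note: the 8.x client takes this option via .options(); passing ignore=400 directly is deprecated.
response = es.options(ignore_status=400).indices.create(index="chinese_news", body=news_mapping)
print("Index creation result:", response)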

ik_max_word splits the text into the finest-grained tokens (for example, a phrase meaning "Chinese students" may yield tokens such as "China", "junior high school", "students", and "Chinese students"). This makes the index larger but helps recall. ik_smart performs coarse-grained segmentation (for example, "China" and "student"), which makes searches more precise. Combining a fine-grained index analyzer with a coarse-grained search analyzer balances recall and precision.
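To see the difference on your own text, you can ask ES to run an analyzer without indexing anything. The sketch below wraps the client's `indices.analyze` call; the helper names are made up for illustration, and it assumes `es` is the client created above and that the IK plugin is installed.

```python
def analyze_tokens(es, analyzer, text):
    """Return the list of tokens that `analyzer` produces for `text`."""
    resp = es.indices.analyze(analyzer=analyzer, text=text)
    return [t["token"] for t in resp["tokens"]]


def compare_analyzers(es, text, analyzers=("ik_max_word", "ik_smart")):
    """Map each analyzer name to its token list for the same text."""
    return {name: analyze_tokens(es, name, text) for name in analyzers}


# Usage (needs a running node with the IK plugin):
# for name, tokens in compare_analyzers(es, "2024年高考报名时间公布").items():
#     print(name, tokens)
```

Comparing the two token lists for a few real headlines is a quick way to check whether this analyzer pairing suits your data.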


4. Core functions in practice

4.1 Document Operation

ES supports both single-document and batch operations. In production, prefer the bulk API for batch inserts, updates, and deletes; it is often ten or more times faster than operating on documents one at a time.

Single-document operations
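As a rough sketch of batch indexing, the client library ships a `helpers.bulk` function that accepts an iterable of actions. The helper names `to_bulk_actions` and `bulk_index`, and the chunk settings, are assumptions chosen for illustration rather than recommended values.

```python
def to_bulk_actions(docs, index):
    """Yield one bulk 'index' action per (doc_id, source) pair."""
    for doc_id, source in docs:
        yield {
            "_op_type": "index",  # overwrites the document if the ID already exists
            "_index": index,
            "_id": doc_id,
            "_source": source,
        }


def bulk_index(es, docs, index="chinese_news"):
    """Send all docs through helpers.bulk; return (success_count, errors)."""
    from elasticsearch import helpers  # lazy import; requires elasticsearch-py

    return helpers.bulk(
        es,
        to_bulk_actions(docs, index),
        chunk_size=500,                    # documents per request (assumed value)
        max_chunk_bytes=10 * 1024 * 1024,  # about 10 MB per request (assumed value)
        raise_on_error=False,              # collect failures instead of raising
    )


# Usage, assuming `es` is a connected client:
# ok, errors = bulk_index(es, [("20240520_1", {"title": "...", "category": "教育"})])
```

Because the actions come from a generator, the full batch never has to sit in memory at once; the helper splits it into requests according to the chunk settings.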

# 1. Insert with an explicit ID (raises an error if the ID already exists)
doc1 = {
    "title": "2024年高考报名时间公布:多省提前启动",
    "content": "近日,教育部发布通知,2024年全国普通高等学校招生统一考试报名工作将在部分省份提前启动...",
    "url": "https://example.com/news/20240520/1",
    "publish_date": "2024-05-20",
    "category": "教育"
}
res = es.create(index="chinese_news", id="20240520_1", body=doc1)
print("Insert with explicit ID:", res["result"])  # "created" on success

# 2. Insert with an auto-generated ID (not idempotent, so not recommended for core data)
doc2 = doc1.copy()
doc2["url"] = "https://example.com/news/20240520/2"
res = es.index(index="chinese_news", body=doc2)
auto_id = res["_id"]  # keep the generated ID; `res` is reassigned by the update below
print("Insert with auto ID:", res["result"], auto_id)

# 3. Update a document (a partial update only needs the fields being changed)
update_body = {
    "doc": {
        "category": "高考政策"
    }
}
res = es.update(index="chinese_news", id="20240520_1", body=update_body)
print("Partial update result:", res["result"])  # "updated" on success

# 4. Delete the auto-generated document from step 2
res = es.delete(index="chinese_news", id=auto_id)
print("Delete result:", res["result"])

  • create requires an ID and fails if that ID already exists.
  • index overwrites an existing document with the same ID, which is convenient for personal testing but calls for care in production.
  • update supports partial updates: only the specified fields change, and the other data is preserved.
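The difference between these calls can be sketched as two small helpers. This is an illustration under assumptions, not the only way to do it: the function names are hypothetical, `ignore_status=409` is used so that an ID conflict comes back as a response body instead of an exception, and `doc_as_upsert=True` asks ES to create the document when it is missing.

```python
def create_if_absent(es, index, doc_id, document):
    """Create the document; return True if created, False if the ID already exists."""
    resp = es.options(ignore_status=409).create(index=index, id=doc_id, document=document)
    return resp.get("result") == "created"


def upsert_fields(es, index, doc_id, fields):
    """Partially update a document, creating it from `fields` if it is missing."""
    resp = es.update(index=index, id=doc_id, doc=fields, doc_as_upsert=True)
    return resp["result"]  # "created", "updated", or "noop"


# Usage, assuming `es` is a connected client:
# create_if_absent(es, "chinese_news", "20240520_1", {"title": "..."})
# upsert_fields(es, "chinese_news", "20240520_1", {"category": "高考政策"})
```

Helpers like these make repeated imports safe to re-run, since neither one silently replaces a whole document.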

4.2 Search function

Search is the soul of ES. Here are the 3 most commonly used scenarios.

Scenario 1: Simple full-text matching

Search for "2024高考报名" (2024 college entrance exam registration). ES returns documents whose title or content contains related terms, sorted by relevance score:

simple_query = {
    "query": {
        "multi_match": {          # match against several fields
            "query": "2024高考报名",
            "fields": ["title^3", "content"]   # title is weighted 3x relative to content
        }
    },
    "size": 10,                  # return only the top 10 hits (10 is also the default)
    "_source": ["title", "url", "publish_date", "category"]  # return only the fields we need
}

res = es.search(index="chinese_news", body=simple_query)
print("Number of matching documents:", res["hits"]["total"]["value"])
for hit in res["hits"]["hits"]:
    print(f"Title: {hit['_source']['title']} | Relevance score: {hit['_score']}")

multi_match searches several fields at once. ^3 boosts the title's weight to 3x, so documents whose titles match better are ranked higher.
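If the boosts need tuning, it can help to build the `fields` list from a plain dict instead of hand-writing strings like `title^3`. A small sketch follows; the helper names are hypothetical.

```python
def boosted_fields(boosts):
    """Turn {'title': 3, 'content': 1} into ['title^3', 'content']."""
    return [
        f"{name}^{weight}" if weight != 1 else name
        for name, weight in boosts.items()
    ]


def build_multi_match(text, boosts, size=10):
    """Build a search body with a multi_match query over the boosted fields."""
    return {
        "query": {"multi_match": {"query": text, "fields": boosted_fields(boosts)}},
        "size": size,
    }


body = build_multi_match("2024高考报名", {"title": 3, "content": 1})
print(body["query"]["multi_match"]["fields"])
```

The resulting dict can be passed to `es.search` in the same way as `simple_query` above.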


Scenario 2: Boolean query with conditions

Boolean queries are the most flexible query type in ES. They support must (must match; contributes to the score), must_not (must not match; does not affect the score), should (matching adds to the score), and filter (must match; does not contribute to the score, can be cached, and is usually the cheapest clause).

Suppose we want news published in 2023–2024, categorized as "高考政策" (college entrance exam policy) or "教育" (education), whose title or content contains "报名" (registration) but whose title does not contain "成人高考" (adult college entrance exam), sorted by publish date with the newest first:

bool_query = {
    "query": {
        "bool": {
            "must": [
                {"multi_match": {"query": "报名", "fields": ["title^3", "content"]}}
            ],
            "must_not": [
                {"match": {"title": "成人高考"}}
            ],
            "should": [
                {"term": {"category": "高考政策"}},  # term does exact matching on keyword fields
                {"term": {"category": "教育"}}
            ],
            "filter": [
                {"range": {"publish_date": {"gte": "2023-01-01", "lte": "2024-12-31"}}}
            ],
            "minimum_should_match": 1  # at least one should clause must match
        }
    },
    "sort": [
        {"publish_date": {"order": "desc"}},   # newest first
        {"_score": {"order": "desc"}}          # then by relevance score, highest first
    ],
    "size": 10,
    "_source": ["title", "url", "publish_date", "category"]
}

res = es.search(index="chinese_news", body=bool_query)
print("Filtered search results:")
for hit in res["hits"]["hits"]:
    print(f"{hit['_source']['publish_date']} | {hit['_source']['category']} | {hit['_source']['title']}")

filter does not affect a document's score, and ES can cache its results automatically. It suits conditions such as time ranges and fixed status values.
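Since the category condition is a yes/no check, it could also live in filter instead of should. The sketch below is one possible variant, not a drop-in replacement: it swaps the two `term` clauses for a single `terms` filter, so category no longer adds to the score. The function name and parameters are hypothetical.

```python
def build_filtered_query(text, categories, date_from, date_to, exclude_title=None):
    """Build a bool query: scored full-text match plus unscored filters."""
    bool_clause = {
        "must": [
            {"multi_match": {"query": text, "fields": ["title^3", "content"]}}
        ],
        "filter": [
            {"terms": {"category": list(categories)}},  # matches any listed value
            {"range": {"publish_date": {"gte": date_from, "lte": date_to}}},
        ],
    }
    if exclude_title:
        bool_clause["must_not"] = [{"match": {"title": exclude_title}}]
    return {"query": {"bool": bool_clause}}


query = build_filtered_query(
    "报名", ["高考政策", "教育"], "2023-01-01", "2024-12-31", exclude_title="成人高考"
)
print(query)
```

With this shape only the full-text match drives the relevance score, while category and date simply narrow the candidate set.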


5. Three best practices every beginner should read

  1. Plan index design in advance. Once a mapping is created, you can add new fields, but changing a field's type or analyzer requires rebuilding the index. Use an alias to manage indexes: when rebuilding, create the new index in the background and then switch the alias over to it, so the business code does not change.

  2. Performance optimization starts with the details.

  • Avoid very large single documents (over 100 MB); consider splitting them into several smaller documents.
  • For batch operations, keep each bulk request within roughly 5–15 MB.
  • Prefer filter over must wherever scoring is not needed, to make full use of ES's caching.

  3. Security must be enabled in production. This article disables xpack.security for quick testing; a production environment must enable HTTPS, username/password authentication, and RBAC access control so that the data is not left exposed.
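The alias-based rebuild from the first practice can be sketched as follows. Everything named here is hypothetical (the alias, the versioned index names, the helper functions), and the sketch assumes the application reads and writes through the alias rather than a concrete index name.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Actions that move `alias` from old_index to new_index in one request."""
    return [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]


def rebuild_behind_alias(es, alias, old_index, new_index, new_mappings):
    """Create the new index, copy the data over, then repoint the alias."""
    es.indices.create(index=new_index, mappings=new_mappings)
    es.reindex(
        source={"index": old_index},
        dest={"index": new_index},
        wait_for_completion=True,  # block until the copy finishes
    )
    es.indices.update_aliases(actions=alias_swap_actions(alias, old_index, new_index))


# Usage with hypothetical names, assuming `es` is a connected client:
# rebuild_behind_alias(es, "news", "chinese_news_v1", "chinese_news_v2",
#                      news_mapping["mappings"])
```

Both alias actions go out in a single `update_aliases` call, so searches against the alias should not hit a moment where it points at no index.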

6. Summary and expansion

This article has walked through the core concepts of ES, environment setup, index design, document operations, and common search scenarios. With just a few lines of Python, you can run a basic Chinese full-text search prototype. In real projects, you can also add highlighting, aggregations, pagination, and other features to enrich the search experience.

If you want to explore further, we recommend the following resources:

I hope you can use ES to build a fast and accurate search engine!