Build a search engine using Elasticsearch
Many people's first impression of Elasticsearch (ES) is that it is a "search powerhouse", but at its core it is a distributed, near-real-time document store and analytics engine built on Apache Lucene: storage is only the foundation, and retrieval is the core strength. This article helps you get started quickly and uses the Python client to build a basic prototype of a Chinese full-text search system.
1. Core concepts first: an analogy with relational databases
Newcomers are often confused by the pile of ES terminology. Let's first map it onto familiar database concepts. The correspondence is simple but sufficient (note that the two are not strictly equivalent).
To put it simply, you can think of an Index as a table that stores a set of JSON documents, with each Document being one row of data. The operations that follow all revolve around Index and Document.
2. Set up a local test environment in 5 minutes
There is no need to struggle with complex cluster configurations. Single-node Docker containers are the fastest way for newbies to get started.
2.1 Pull and start ES
Wait about 30 seconds after startup, then open http://localhost:9200. If a JSON response with cluster information is returned, ES is up and running.
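The same check can be scripted. The following is a minimal sketch using only the standard library; it assumes the node listens on plain HTTP without authentication (security disabled for local testing), and the helper names `summarize_root_info` and `check_es` are made up for illustration.

```python
import json
from urllib.request import urlopen


def summarize_root_info(payload: str) -> dict:
    """Pick the fields worth checking out of the JSON returned by GET /."""
    info = json.loads(payload)
    return {
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_es(url: str = "http://localhost:9200", timeout: float = 5.0) -> dict:
    """Fetch the root endpoint and return a short summary.

    Raises urllib.error.URLError (an OSError) while the node is still starting.
    """
    with urlopen(url, timeout=timeout) as resp:
        return summarize_root_info(resp.read().decode("utf-8"))
```

Calling `check_es()` once the container is ready should return the cluster name and version number; while ES is still booting, the call raises a connection error and can simply be retried.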
2.2 Install Python client
The officially recommended client is elasticsearch-py. Its 7.x and 8.x releases track the ES 7.x and 8.x series, so install the release whose major version matches your server. You can install it directly with pip:
3. Preparation: Index definition (Mapping)
ES can automatically infer field types from the first document you insert. However, for Chinese full-text retrieval it is best to specify the Mapping manually: for example, attach a Chinese word segmenter to text fields, and set fields that need no segmentation, such as URLs and status codes, to the keyword type so they are not "broken into pieces".
Here we use news search as the scenario. First install the IK word segmenter (by default, ES segments English words well, but splits Chinese text into individual characters):
After restarting, create the index using Python:
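A minimal sketch of such an index definition follows. The index name `news`, the field names, and the shard settings are assumptions chosen for this walkthrough, and the keyword-argument style (`settings=`, `mappings=`) is the one used by elasticsearch-py 8.x; older 7.x releases take a single `body=` argument instead. The `es` parameter is an already-connected `Elasticsearch` client.

```python
def news_index_body() -> dict:
    """Settings and mappings for a hypothetical 'news' index."""
    # Fine-grained analysis when indexing, coarse-grained when searching.
    text_field = {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
    }
    return {
        # One shard and no replicas: enough for a single-node test setup.
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "title": dict(text_field),
                "content": dict(text_field),
                "category": {"type": "keyword"},  # exact match, no segmentation
                "url": {"type": "keyword"},
                "publish_time": {"type": "date"},
            }
        },
    }


def create_news_index(es, index_name: str = "news"):
    """Create the index; raises if an index with this name already exists."""
    body = news_index_body()
    return es.indices.create(
        index=index_name,
        settings=body["settings"],
        mappings=body["mappings"],
    )
```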
ik_max_word splits text into the finest-grained terms (for example, "Chinese students" may be split into overlapping terms such as "China", "junior high school", "students" and "Chinese students"). The index grows larger, but recall improves. ik_smart performs coarse-grained segmentation (for example, splitting the same phrase into "China" and "student"), which makes matching more precise. Using the fine-grained analyzer at index time and the coarse-grained one at search time balances recall and precision.
4. Core functions in practice
4.1 Document Operation
ES supports both single-document and batch operations. In production, prefer the bulk API for batch insert/update/delete: it can be ten or even dozens of times faster than item-by-item operations.
Single operation
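The sketch below runs the three single-document calls, create, index and update, in sequence. The document contents and the function name are illustrative, and the keyword arguments (`document=`, `doc=`) follow elasticsearch-py 8.x; older 7.x releases pass `body=` instead.

```python
def single_document_demo(es, index_name: str = "news") -> dict:
    """Run create, index and update on one document and return its final source."""
    first = {
        "title": "2024 College Entrance Examination registration opens",
        "category": "College Entrance Examination Policy",
        "url": "https://example.com/news/1",  # placeholder URL
    }
    # create: fails with a 409 conflict if a document with this ID exists.
    es.create(index=index_name, id="1", document=first)

    # index: replaces the whole document, so the "url" field is gone afterwards.
    es.index(
        index=index_name,
        id="1",
        document={
            "title": "Registration deadline extended",
            "category": first["category"],
        },
    )

    # update: partial update, only "category" changes and "title" is kept.
    es.update(index=index_name, id="1", doc={"category": "Education"})

    return es.get(index=index_name, id="1")["_source"]
```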
create requires an ID, and the ID must not already exist. index overwrites an existing document, which is more flexible for personal testing but calls for care in production. update supports "partial update": it modifies only the specified fields without losing other data.
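For batch writes, the client ships a `helpers.bulk` function that consumes an iterable of actions. The sketch below is one possible way to feed it; the `id` key on each input document and the function names are assumptions made for this example.

```python
def to_bulk_actions(docs, index_name: str = "news"):
    """Yield one action per document in the shape helpers.bulk expects."""
    for doc in docs:
        yield {
            "_op_type": "index",  # may also be "create", "update" or "delete"
            "_index": index_name,
            "_id": doc["id"],
            # Everything except the ID becomes the stored document.
            "_source": {k: v for k, v in doc.items() if k != "id"},
        }


def bulk_index(es, docs, index_name: str = "news"):
    """Send the documents in batches; returns (success_count, errors)."""
    # Imported here so the action builder works without the client installed.
    from elasticsearch import helpers

    # chunk_size caps the number of actions sent per bulk request.
    return helpers.bulk(es, to_bulk_actions(docs, index_name), chunk_size=500)
```

Because `to_bulk_actions` is a generator, the full document list never has to be held in memory as actions, which helps when loading large data sets.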
4.2 Search function
Search is the soul of ES. Here are the 3 most commonly used scenarios.
Scenario 1: Simple full-text matching
When you search for "2024 College Entrance Examination Registration", ES returns documents whose title or content contains the relevant keywords, sorted by relevance score:
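A minimal sketch of this query, assuming the fields are named `title` and `content` and the index is called `news`; the function names are made up for illustration, and the `query=` keyword follows elasticsearch-py 8.x (older 7.x releases wrap the query in `body=`).

```python
def full_text_query(keywords: str, title_boost: int = 3) -> dict:
    """Build a multi_match query over title and content, boosting the title."""
    return {
        "multi_match": {
            "query": keywords,
            "fields": [f"title^{title_boost}", "content"],
        }
    }


def search_news(es, keywords: str, index_name: str = "news", size: int = 10):
    """Return (score, title) pairs, most relevant first."""
    resp = es.search(index=index_name, query=full_text_query(keywords), size=size)
    return [(hit["_score"], hit["_source"]["title"]) for hit in resp["hits"]["hits"]]
```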
multi_match searches multiple fields in a single query. ^3 triples the weight of the title field, so documents that match in the title rank higher.
Scenario 2: Boolean query with conditions
The Boolean query is the most flexible query type in ES. It supports four clause types:
- must: the condition must be met, and it contributes to the score;
- must_not: the condition must not be met, and it does not affect the score;
- should: matching is optional, but adds to the score when satisfied;
- filter: the condition must be met, it does not contribute to the score, and its results can be cached, which usually gives the best performance.
Suppose we are looking for news released in 2023-2024, categorized as "College Entrance Examination Policy" or "Education", whose title or content contains "Registration" but not "Adult College Entrance Examination", sorted by release time in descending order:
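One way to express these conditions is sketched below. The field names `title`, `content`, `category` and `publish_time` are assumptions, and the English strings stand in for whatever values the real documents hold. The excluded term is matched as a phrase so that documents merely containing some of its words are not dropped.

```python
def policy_news_search_body() -> dict:
    """Build the bool query and sort order for the scenario described above."""
    text_fields = ["title", "content"]
    query = {
        "bool": {
            # Scored: the keyword must appear in the title or the content.
            "must": [
                {"multi_match": {"query": "Registration", "fields": text_fields}}
            ],
            # Excluded: the whole phrase must not appear in either field.
            "must_not": [
                {
                    "multi_match": {
                        "query": "Adult College Entrance Examination",
                        "type": "phrase",
                        "fields": text_fields,
                    }
                }
            ],
            # Not scored and cacheable: date range and category whitelist.
            "filter": [
                {"range": {"publish_time": {"gte": "2023-01-01", "lt": "2025-01-01"}}},
                {"terms": {"category": ["College Entrance Examination Policy", "Education"]}},
            ],
        }
    }
    return {"query": query, "sort": [{"publish_time": {"order": "desc"}}]}


def search_policy_news(es, index_name: str = "news"):
    """Run the search with elasticsearch-py 8.x style keyword arguments."""
    body = policy_news_search_body()
    return es.search(index=index_name, query=body["query"], sort=body["sort"])
```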
filter does not affect a document's score, and ES can cache its results automatically. It is well suited to conditions such as time ranges and fixed statuses.
5. Three best practices every newbie should read
- Plan your index design in advance. Once a Mapping is created, you can still add new fields, but changing an existing field's type or tokenizer requires rebuilding the index. It is recommended to manage indexes through an alias: during a rebuild, create the new index in the background and then switch the alias over to it, with no changes to business code.
- Performance optimization starts with details.
  - Avoid single documents that are too large (>100 MB); consider splitting them into multiple smaller documents.
  - When operating in batches, keep each bulk request within roughly 5–15 MB.
  - Prefer filter over must for conditions that do not need scoring, to make full use of ES's caching mechanism.
- Security must be enabled in production. This article disables xpack.security for quick testing; a production environment must enable HTTPS, username and password authentication, and RBAC access control to avoid leaving data exposed.
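The alias switch from the first practice can be sketched as follows. The names `news`, `news_v1` and `news_v2` are hypothetical: the application queries the alias `news`, while the concrete indexes carry versioned names (an alias cannot share its name with an existing index). The `actions=` keyword follows elasticsearch-py 8.x.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> list:
    """Build the action list that repoints an alias from one index to another."""
    return [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]


def switch_alias(es, alias: str = "news",
                 old_index: str = "news_v1", new_index: str = "news_v2"):
    """Repoint the alias; both actions are sent in a single request."""
    # Run this only after new_index has been created and fully populated.
    return es.indices.update_aliases(
        actions=alias_swap_actions(alias, old_index, new_index)
    )
```

Because the remove and add actions travel in one request, readers of the alias should not observe a moment where it points to no index.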
6. Summary and expansion
This article has walked you through the core concepts of ES, environment setup, index design, document operations and common search scenarios. With just a few lines of Python, you can run a basic Chinese full-text search prototype. In real projects, you can also add highlighting, aggregations, pagination and other features to further enrich the search experience.
If you want to explore further, we recommend the following resources:
I hope you can use ES to build a fast and accurate search engine!

