index
Index
Browsing e-commerce apps to search for niche snacks, logging into the backend to check last week’s orders—behind these high-frequency scenarios, there is a performance “acceleration plug-in” called index. The index of relational-database is like the table of contents of a physical book: if the book does not have a table of contents, you have to turn page by page to find knowledge points; if the database does not have an index, you can only scan the entire table to look up data. Millions of data may be stuck for a few seconds or even longer, which is completely unsuitable for a production environment.
Creating an index does consume disk and memory (just like a catalog takes up a book and needs to be printed in the mind for easy reference), but the account of "exchanging space for query time" is definitely a sure profit in most business scenarios where there is more reading and less writing.
Core index of MySQL InnoDB: B+Tree
MySQL 8.0 InnoDB supports three indexes: B+Tree, full-text, and R-Tree, but 99% of business scenarios rely on B+Tree - the reason is very simple. It is currently the data structure with the efficiency ceiling level in the field of disk mass data storage and sorting.
Basic structure of B+Tree
It is a strictly balanced tree (the path lengths from all leaf nodes to the root node are exactly the same). The tree height is usually only 3~4 layers, but it can hold millions to billions of data. Checking a single record only requires 3~4 disk I/O - an I/O of a mechanical hard disk is at the millisecond level, and 3 times is within 3ms. SSD can even reduce it to the microsecond level, which perfectly meets the business requirements of "millisecond response".
B+Tree is divided into three layers of structure according to roles (only one layer of root nodes when there is very little data):
- Root node: No actual data is stored, only "pointer to the lower node + range boundary value" is stored, which is used to quickly locate the middle/leaf node where the target falls.
- Intermediate node: The function is similar to the root node, and it is also a "boundary value + pointer" to further narrow the search scope.
- Leaf node: The only place where actual data is stored, and all leaf nodes are connected in ascending order of index column (default) using a two-way linked list - this makes range queries (such as checking products with prices between 100 and 200) extremely fast: you no longer need to jump back to the upper node to search repeatedly, just scan the linked list directly.
In addition, the internal data of leaf nodes are also sorted according to the index column, so when checking a single piece of data, you can use binary search, which is much more efficient than sequential scanning.
Clustered index vs non-clustered index
InnoDB is an index-organized table (the data itself is stored in the order of the primary key index), so the index is divided into two categories. This is also the most essential difference between it and heap-organized tables such as MyISAM:
1. Clustered Index
- Definition: The primary key of the table is the clustered index by default (if the primary key is not explicitly specified, InnoDB will automatically find the first non-empty unique index as the clustered index; if it cannot be found, it will generate a 6-byte hidden auto-increment ID).
- Features: The leaf nodes store complete rows of data - just like "the text of the book is completely arranged in directory order". Searching through the clustered index is equivalent to turning directly to the text, no need to return to the table.
- Restrictions: A table** can only have one clustered index**, otherwise the entire row of data must be stored in multiple copies, which wastes space and destroys consistency.
2. Non-clustered index (secondary index/non-primary key index)
- Definition: The indexes we manually create (for example, based on student names and product categories) are all non-clustered indexes.
- Features: The leaf nodes store the index column value + the primary key value - equivalent to "the directory only marks the knowledge point in which chapter (primary key). If you want to read the text, you have to go back to the directory of the clustered index to find the corresponding page number (the primary key locates the entire row)". This action of "turning back the clustered index" is called turning back the table.
- Performance: Usually slower than a clustered index because there is one more table return I/O.
Index practice: see performance changes from EXPLAIN
Just talking about the principle is too abstract. We use the high-frequency scenario of Checking students by name, combined with MySQL's ownEXPLAINcommand (a powerful tool for viewing SQL execution plans) to see how the index is "accelerated".
1. No index: full table scan
hypothesistb_studentThe table** only has the primary key, and no index is added to stuname**:
Output (only core fields kept):
This is a very bad plan - if there are 1 million rows in the table, you have to scan the entire 1 million rows.
2. Add ordinary index: quick positioning
GivestunameBuild a normal non-clustered index:
Execute againEXPLAIN:
The speed takes off directly! The number of scan lines is reduced from 11 to 1,Using whereAlso disappeared.
3. Prefix index: trade-off between space and time
ifstunameyesVARCHAR(200)For such long strings, building a complete index will take up more disk and memory. At this time, you can consider prefix index and only take the first N characters to build an index:
View the execution plan:
Prefix index saves space, but in order to accurately match the remaining strings, it may be necessary to scan a few more lines and triggerUsing wherefilter. The length of the prefix chosen depends on the distribution of business data: if the prefix is too short, the discrimination is low, and the effect is close to that of a full table scan; if the prefix is too long, the space saving is of little significance. This is the classic "time-space irreconcilable contradiction" in the computer world, and trade-offs need to be made based on actual data.
4. Delete indexes that are no longer needed
Unnecessary indexes must be deleted in time, as they will slow down write operations (when adding, deleting, or modifying data, all indexes must be updated simultaneously, just like when the chapters of a book are adjusted, all directories must be reprinted). There are two ways to delete an index:
Core principles of index design (pitfall avoidance guide)
Many people mistakenly believe that "the more indexes, the better". In fact, they are completely wrong - wrong indexes are not only useless, but also slow down the system. Remember the following 5 principles to avoid 80% of index pitfalls:
- Choose the right column to create an index: Give priority to
WHERE、JOIN、ORDER BY、GROUP BYCreate an index on the columns that appear in . - Select the column with a large cardinality: Cardinality = the number of unique values in the index column (gender is only male/female, the cardinality is very small; the cardinality of mobile phone number and ID number is close to the number of table rows). The larger the base, the higher the index distinction and the better the effect.
- Reasonable use of prefix index: Priority is given to prefix index for long string columns, and the prefix length is adjusted according to the business data distribution.
- The more indexes, the better: Generally, the number of single-table indexes is controlled within 5. Too many indexes will: ① occupy a lot of disk and memory; ② seriously slow down addition, deletion and modification operations; ③ make the optimizer "hesitate" when selecting an index, which may lead to the wrong choice.
- Primary keys should be as short and ordered as possible: InnoDB’s non-clustered index leaf nodes will store primary key values. The shorter the primary key (for example, use
INTSelf-increasing, don’t use longUUID), the smaller the non-clustered index, the more data can be cached in the memory, and the faster the query.
"Effective boundary" of InnoDB B+Tree index
Not all SQL with indexed columns can use indexes. In the following cases, the index will invalid (with a high probability of degenerating into a full table scan):
- numeric type: only
=、>、<、>=、<=、BETWEEN...AND...、INCan index normally;NOT IN、<>(Not equal to) usually leads to failure (unless the cardinality is extremely large). - String type: only queries with unambiguous prefix (e.g.
stuname LIKE '林%') will go through the index;LIKE '%林'、LIKE '%林%'All will fail. - Index columns participate in operations or functions: for example
WHERE YEAR(create_time) = 2024、WHERE age + 1 = 18All will fail. should be changed toWHERE create_time BETWEEN '2024-01-01' AND '2024-12-31'、WHERE age = 17。 - Compound index does not follow the leftmost prefix principle: If it is built
idx_name_age_gender, only checkageOr just checkgender, or check firstageCheck againname, the index will be invalid - continuous matching must start from the first column (leftmost prefix) of the index.
Mastering the above core knowledge of MySQL InnoDB B+Tree index is enough to deal with the performance problems of most business scenarios!

