Efficient application of regular expressions
Have you ever experienced these crazy moments?
- The comments you scraped contain mobile phone numbers in dozens of formats, and you copy and paste them by hand until your head spins;
- The course list from the office is separated by a mix of full-width and half-width commas, semicolons, and even spaces, and `.split()` gives up straight away;
- Sensitive words in the chatbot log need to be replaced with `*`, and the match has to ignore case.
If there were a Swiss Army knife for text processing that let you handle all of these complex patterns with one set of rules, would you be tempted? That knife is the regular expression (regex for short). It is a core tool that Daoman Python AI uses frequently in crawlers, log cleaning, and office automation scripts.
We will skip the arcane theory and stick to the ideas that are actually useful.
1. Remember these 8 metacharacters first, and you can handle 80% of the scenarios.
Regular expressions are made up of ordinary characters (such as `a` or `5`) and "magic" metacharacters. Beginners do not need to memorize every symbol. Start with a short cheat sheet of the most common ones and you can already write most practical patterns:
⚠️ Escape reminder: when you want to match `.`, `(`, or `*` literally, remember to put a backslash in front of these special metacharacters, for example `\.` and `\(`. Otherwise they are treated as rules rather than literals.
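To make the difference between a metacharacter and an escaped literal concrete, here is a minimal sketch. The symbols shown (`\d`, `+`, `.`, `[]`, `\(`) are an illustrative selection and an assumption, not necessarily the eight on the cheat sheet, and the sample text is made up.

```python
import re

# Sample text is made up for illustration.
text = "Order A-17 costs 3.50 (approx.), order B-204 costs 12.00"

# \d matches one digit; + means "one or more" of the preceding item.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['17', '3', '50', '204', '12', '00']

# An unescaped . matches any character; \. matches only a literal dot.
prices = re.findall(r"\d+\.\d+", text)
print(prices)  # ['3.50', '12.00']

# [] defines a character set, and - inside it expresses a range.
order_ids = re.findall(r"[A-Z]-\d+", text)
print(order_ids)  # ['A-17', 'B-204']

# \( and \) match literal parentheses instead of opening a group.
in_parens = re.findall(r"\(\w+\.\)", text)
print(in_parens)  # ['(approx.)']
```

Dropping the backslash in the second pattern (`\d+.\d+`) would let the dot match any character, including a hyphen or a space, which is the kind of mistake the escape reminder warns about.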
2. Make good use of the 6 core functions of Python's re module
Python's built-in `re` module is the regex engine, so there is no third-party library to install. For daily use, it is enough to keep these 6 "backbone" functions in mind:
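As a quick tour, the sketch below calls six commonly used `re` functions on one made-up string. Which six the article has in mind is an assumption here: `re.match`, `re.search`, `re.findall`, `re.sub`, `re.split`, and `re.compile`.

```python
import re

text = "tel: 010-1234, fax: 010-5678"  # made-up sample

# re.match: only matches at the very start of the string.
at_start = re.match(r"\d+", text)
print(at_start)  # None, because the string starts with "tel"

# re.search: finds the first match anywhere in the string.
first = re.search(r"\d+-\d+", text)
print(first.group())  # 010-1234

# re.findall: returns every non-overlapping match as a list.
all_numbers = re.findall(r"\d+-\d+", text)
print(all_numbers)  # ['010-1234', '010-5678']

# re.sub: replaces every match with the given replacement.
masked = re.sub(r"\d", "#", text)
print(masked)  # tel: ###-####, fax: ###-####

# re.split: splits the string wherever the pattern matches.
parts = re.split(r",\s*", text)
print(parts)  # ['tel: 010-1234', 'fax: 010-5678']

# re.compile: builds a reusable pattern object with the same methods.
number_re = re.compile(r"\d+-\d+")
print(number_re.findall(text))  # ['010-1234', '010-5678']
```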
3. Three ready-to-use practical cases
Theory alone gets boring fast, so let's go straight to three scenarios you are bound to run into in office work, crawlers, and everyday scripting.
Case 1: One-click, case-insensitive filtering of sensitive words
Use `re.sub` with the `re.IGNORECASE` flag (which can be abbreviated as `re.I`), combined with `[]` character sets to cover homophones and look-alike spellings, and you can redact text quickly:
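A minimal sketch of this idea follows. The "sensitive word" (`spam`), its look-alike spelling, and the log line are all made up for illustration.

```python
import re

# Hypothetical log line; "spam" stands in for a real sensitive word.
log = "This is SPAM, more Sp4m, and plain spam again."

# [a4] accepts either "a" or the look-alike "4"; re.I ignores case.
pattern = r"sp[a4]m"

# Replace each hit with asterisks of the same length.
clean = re.sub(pattern, lambda m: "*" * len(m.group()), log, flags=re.I)
print(clean)  # This is ****, more ****, and plain **** again.
```

Passing `flags=re.I` by keyword is deliberate: the fourth positional argument of `re.sub` is `count`, so a flag passed positionally would be silently misread.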
Case 2: Obtain mobile phone number from chaotic information
Suppose we have a pile of logs mixed with names and need to extract every mobile phone number that follows the domestic (mainland China) format: starts with 1, second digit 3-9, followed by 9 more digits. `re.findall` does it in one step. Do not add `^...$` here; those anchors are for validating an entire string:
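Here is a minimal sketch of the extraction. The log text is fictional, and the lookarounds `(?<!\d)` and `(?!\d)` are an addition of this sketch (not from the article) so that an 11-digit run buried inside a longer number is not mistaken for a phone number.

```python
import re

# Made-up log; the names and numbers are fictional.
log = ("Zhang San 13812345678; Li Si: 15900001111, "
       "invalid 12345678901, order 2024139123456789")

# 1, then 3-9, then nine more digits, not touching any other digit.
phone_re = r"(?<!\d)1[3-9]\d{9}(?!\d)"

phones = re.findall(phone_re, log)
print(phones)  # ['13812345678', '15900001111']

# Validating one whole string is a different job: anchor it with ^...$
# or, equivalently here, use re.fullmatch.
is_valid = bool(re.fullmatch(r"1[3-9]\d{9}", "13812345678"))
print(is_valid)  # True
```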
Case 3: Split all mixed delimiters in one go
When commas, semicolons, vertical bars, their full-width counterparts, and any number of spaces all show up in the same string, the ordinary `str.split()` gives up. `re.split`, on the other hand, covers every delimiter with a single pattern:
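A minimal sketch, assuming a made-up course list and treating whitespace itself as a delimiter (so it would not suit items that contain spaces):

```python
import re

# Made-up course list mixing half-width and full-width delimiters.
courses = "Math, Physics；Chemistry|  Biology，History ; Art"

# One character set lists every delimiter; + merges consecutive ones.
parts = re.split(r"[,;|，；\s]+", courses)
print(parts)  # ['Math', 'Physics', 'Chemistry', 'Biology', 'History', 'Art']
```

If the input may start or end with a delimiter, `re.split` returns empty strings at the edges, so filtering with `[p for p in parts if p]` is a reasonable follow-up.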
4. Greedy matching vs. non-greedy matching: the most common pitfall for novices
Regex quantifiers are greedy by default: they match as much as they possibly can. Add a `?` after the quantifier to make it lazy, so it stops at the first end marker it meets. Let's look at a side-by-side experiment with HTML extraction:
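The comparison can be sketched like this, using a made-up HTML snippet:

```python
import re

html = "<b>first</b> and <b>second</b>"  # made-up snippet

# Greedy: .* runs on to the last </b> it can reach.
greedy = re.findall(r"<b>(.*)</b>", html)
print(greedy)  # ['first</b> and <b>second']

# Non-greedy: .*? stops at the first </b> after each <b>.
lazy = re.findall(r"<b>(.*?)</b>", html)
print(lazy)  # ['first', 'second']
```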
Remember: `.*?` is the golden partner of data extraction. Many requirements that look complicated go wrong simply because this `?` was left out.
Finally: 3 pitfall-avoiding tips you can use right away
- Mark pattern strings as raw with `r''`: Python itself interprets `\n` as a newline, which easily conflicts with a regex that needs to match a literal `\n`. Wrapping the pattern in `r''` keeps the backslashes exactly as written, so you no longer have to worry about escaping confusion.
- Be sure to precompile before looping: if you are going to use the same regular expression thousands of times in a for loop, first save it with `compiled = re.compile(pattern)` and then call `compiled.findall()`. The author's measurements show an efficiency gain of 30% to 50% that becomes more noticeable as the data volume grows; the actual gain depends on the workload, since `re` also caches recently compiled patterns internally.
- Debug online first, then write code: regex101.com is recommended. It visualizes the matching process in color and can generate code for Python, Java, JavaScript, and other languages with one click. Fix the pattern as you test, and say goodbye to the pain of "writing regex by superstition".
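The first two tips can be sketched together. The sample strings and the `id=` log format are made up for illustration, and no timing figures are claimed here.

```python
import re

# Raw strings: "\n" is one newline character, r"\n" is backslash + n.
lengths = (len("\n"), len(r"\n"))
print(lengths)  # (1, 2)

# With the r prefix, \b reaches the regex engine as a word boundary.
# Without it, Python first turns "\b" into a backspace character.
raw_hits = re.findall(r"\bcat\b", "cat concat cat.")
plain_hits = re.findall("\bcat\b", "cat concat cat.")
print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []

# Precompile once, then reuse the pattern object inside the loop.
lines = ["id=17 ok", "id=204 fail", "no id here"]  # made-up data
id_re = re.compile(r"id=(\d+)")

ids = []
for line in lines:
    ids.extend(id_re.findall(line))
print(ids)  # ['17', '204']
```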

