Efficient application of regular expressions

Have you ever had one of these maddening moments?

  • The comments you scraped contain mobile phone numbers in dozens of formats, and you copy and paste them by hand until your head spins;
  • The course list from the office mixes full-width and half-width commas, semicolons, and even spaces as separators, and .split() simply goes on strike;
  • Sensitive words in the chatbot log must be replaced with *, and the match also has to ignore case.

If there were a Swiss Army knife for text processing that let you handle all of these messy patterns with a single set of rules, would you be tempted? That knife is the regular expression (regex for short). It is a core tool that Daoman Python AI uses frequently in crawlers, log cleaning, and office automation scripts.

In what follows we skip the arcane theory and stick to material you can use.


1. Remember these 8 metacharacters first, and you can handle 80% of the scenarios.

Regular expressions are made up of ordinary characters (such as a or 5) and special metacharacters. Beginners do not need to memorize every symbol. Start with the cheat sheet below and you can already write most practical patterns:

Symbol | Definition | Example | Matching results
. | Any single character except a newline | b.t | bat, b1t, b#t
\d | A digit, equivalent to [0-9] for ASCII text | \d{3} | 123, 955
\w | A letter, digit, or underscore, equivalent to [a-zA-Z0-9_] for ASCII text (in Python 3 it also matches Unicode letters such as Chinese characters) | \w+ | python_3, hello
\s | A whitespace character such as a space, tab, or newline | love\syou | love you (the gap can be a space, a tab, or a newline)
^ / $ | Start / end of the string | ^The | strings that start with The
[] | A set of characters; matches any one of them | [aeiou] | a, e, o
* / + | Repeat 0 or more times / at least 1 time | \d+ | 1, 1234
? | Repeat 0 or 1 times (placed after another quantifier, it switches on non-greedy mode) | https? | http, https
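
To see the cheat sheet in action, here is a minimal sketch that runs each example pattern through re.findall. The sample strings and the names samples and results are made up for illustration and are not part of the original table.

```python
import re

# Each entry pairs a cheat-sheet pattern with an illustrative sample string.
samples = [
    (r'b.t', 'bat b1t b#t bt'),
    (r'\d{3}', 'call 123 or 955'),
    (r'\w+', 'python_3 hello'),
    (r'love\syou', 'love you'),
    (r'^The', 'The end'),
    (r'[aeiou]', 'aeo'),
    (r'\d+', '1 and 1234'),
    (r'https?', 'http https'),
]

# Map every pattern to the list of substrings it matches in its sample.
results = {pattern: re.findall(pattern, text) for pattern, text in samples}

for pattern, found in results.items():
    print(f'{pattern!r:14} -> {found}')
```

Note that 'bt' is not matched by b.t, because the dot must consume exactly one character.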

⚠️ Escape reminder: when you want to match a literal . ( or *, put a backslash in front of the metacharacter, for example \. or \(. Otherwise it is treated as a rule rather than a literal character.
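
The difference between an escaped and an unescaped metacharacter is easiest to see side by side. The sketch below uses a made-up sample string; re.escape is a standard-library helper not mentioned above, shown here as one possible way to escape a literal that comes from a variable.

```python
import re

text = 'Price: 3.50 (was 4.00), file a.txt, also matches axtxt?'

# Unescaped dot: . matches ANY character, so "axtxt" is matched as well.
loose = re.findall(r'a.txt', text)
print(loose)   # ['a.txt', 'axtxt']

# Escaped dot: \. matches only a literal period.
strict = re.findall(r'a\.txt', text)
print(strict)  # ['a.txt']

# Escaped parentheses match literal ( and ) instead of opening a group.
in_parens = re.findall(r'\(was \d+\.\d+\)', text)
print(in_parens)  # ['(was 4.00)']

# re.escape() inserts the backslashes for you.
literal = '(was 4.00)'
escaped = re.escape(literal)
found_literal = re.findall(escaped, text)
print(found_literal)  # ['(was 4.00)']
```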


2. Make good use of the 6 core functions of Python's re module

Python's built-in re module is the regex engine, so there is no third-party library to install. For daily use, it is enough to keep these 6 "backbone" functions in mind:

Core function | What it does | Typical scenario
re.compile(pattern) | Precompiles a pattern into a reusable object, which saves overhead when the pattern is used repeatedly | Looping over 100,000 log lines
re.match() | Matches only at the beginning of the string | Checking a format, e.g. whether a string starts with an email address
re.search() | Scans the entire string and returns the first match object | Finding the first keyword in a text
re.findall() | Returns a list of all matches | Extracting mobile phone numbers and emails in batches
re.sub() | Replaces every match | Sensitive-word filtering, unifying time formats
re.split() | Splits a string flexibly according to a pattern | Processing text with several mixed delimiters
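
A short sketch of how these functions behave on the same input may help. The sample text and the pattern id\d+ are invented for illustration. re.fullmatch is not one of the six functions above; it is included only to show that re.match anchors the start of the string but not the end.

```python
import re

text = 'id42 and id7'
pattern = re.compile(r'id\d+')

at_start = pattern.match(text)            # matches: the string starts with "id42"
no_match = pattern.match('see id42')      # None: the string does not START with the pattern
anywhere = pattern.search('see id42')     # first match anywhere in the string
whole = pattern.fullmatch(text)           # None: the WHOLE string would have to match
all_ids = pattern.findall(text)           # every match, as a list of strings
replaced = pattern.sub('ID', text)        # every match replaced
parts = re.split(r'\s+and\s+', text)      # split on a regex delimiter

print(at_start.group(), no_match, anywhere.group(), whole)
print(all_ids, replaced, parts)
```

If the goal is to validate that an entire string has a given format, re.fullmatch (or a pattern ending in $) is usually a closer fit than re.match alone.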

3. Three ready-to-use practical cases

Theory alone is dull. Let's go straight to three scenarios you are bound to run into in office work, crawling, and everyday tasks.

Case 1: One-step, case-insensitive filtering of sensitive words

Use re.sub together with the flag re.IGNORECASE (which can be abbreviated as re.I), and use [] to cover homophones and look-alike characters, to mask text quickly:

import re

raw_comment = "Oh, Shit! 你是傻逼吗? Fuck you. 沙比东西"

# Pattern: join exact words with |, and use [] to cover character variants
sensitive_pattern = r'fuck|shit|[傻煞沙][比笔逼叉缺雕]'

# Replace each match with *, ignoring case
clean_comment = re.sub(sensitive_pattern, '*', raw_comment, flags=re.I)
print(clean_comment)  # Output: Oh, *! 你是*吗? * you. *东西
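
If you would rather keep the length of the masked word visible, re.sub also accepts a function as the replacement. The following is a minimal sketch under assumptions: blocked_words is a hypothetical placeholder list, and mask and censor are illustrative helper names, not part of the example above.

```python
import re

# Hypothetical word list for illustration; a real filter would load its own.
blocked_words = ['darn', 'heck']

# re.escape guards against metacharacters inside the words; re.I ignores case.
pattern = re.compile('|'.join(re.escape(w) for w in blocked_words), flags=re.I)

def mask(match):
    """Return one * per character of the matched word."""
    return '*' * len(match.group())

def censor(text):
    """Replace every blocked word in text with same-length asterisks."""
    return pattern.sub(mask, text)

print(censor('Darn it, what the HECK.'))  # **** it, what the ****.
```

Because the pattern has no word boundaries, it also masks blocked words that appear inside longer words; whether that is desirable depends on the use case.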

Case 2: Extract mobile phone numbers from messy text

Suppose we have a pile of logs with names mixed in and need to extract every mobile phone number that follows the mainland China format (starts with 1, second digit 3-9, followed by 9 more digits). re.findall does it in one step. Do not wrap the pattern in ^...$; those anchors are for validating an entire string:

import re

# Sample log in Chinese: two names, two mobile numbers, and the short number 110
raw_log = "下单人:张三,电话13512345678;投诉人:李四,留的不是110,是15688889999"

phone_pattern = r'1[3-9]\d{9}'
phones = re.findall(phone_pattern, raw_log)
print(phones)  # Output: ['13512345678', '15688889999']
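
One caveat worth sketching: without anchors, the pattern will also pick an 11-digit slice out of a longer run of digits, such as an order number. A possible guard, shown here as an assumption rather than part of the original recipe, is to add the lookarounds (?<!\d) and (?!\d) so that the match may not be preceded or followed by another digit. The sample log is invented.

```python
import re

raw_log = 'order 2023135123456789, phone 13512345678, backup:15688889999.'

# Plain pattern: also grabs 11 digits from the middle of the order number.
loose = re.findall(r'1[3-9]\d{9}', raw_log)
print(loose)   # ['13512345678', '13512345678', '15688889999']

# Lookarounds reject candidates that touch another digit on either side.
strict = re.findall(r'(?<!\d)1[3-9]\d{9}(?!\d)', raw_log)
print(strict)  # ['13512345678', '15688889999']
```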

Case 3: Split all mixed delimiters in one go

When commas, semicolons, vertical bars, full-width punctuation, and any number of spaces all appear in the same string, the ordinary str.split() goes on strike, whereas re.split handles every delimiter with a single pattern:

import re

raw_course = "Python,Java;Go  C++|Rust,TypeScript"
# Delimiter pattern: , ; full-width ; vertical bar, full-width , or whitespace; + merges runs of delimiters
split_pattern = r'[,;;\|\s,]+'
courses = re.split(split_pattern, raw_course)
print(courses)  # Output: ['Python', 'Java', 'Go', 'C++', 'Rust', 'TypeScript']
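
One detail to watch for: when the string starts or ends with a delimiter, re.split returns empty strings at the edges. A minimal sketch of filtering them out, using an invented sample string with stray leading and trailing separators:

```python
import re

raw_course = ' Python,Java;;Go |Rust, '

# Same idea as above: half-width and full-width commas/semicolons, |, whitespace.
parts = re.split(r'[,;,;|\s]+', raw_course)
print(parts)    # ['', 'Python', 'Java', 'Go', 'Rust', '']

# Drop the empty strings produced by the leading and trailing delimiters.
courses = [p for p in parts if p]
print(courses)  # ['Python', 'Java', 'Go', 'Rust']
```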

4. Greedy matching vs. non-greedy matching: the most common pitfall for novices

Regex quantifiers are greedy by default: they match as much as they possibly can. Add a ? after the quantifier and it becomes lazy, stopping at the first end marker it meets. Let's compare the two on an HTML extraction:

import re

html_text = "<div>苹果</div><div>香蕉</div>"  # 苹果 = apple, 香蕉 = banana

# Greedy mode: runs from the first <div> all the way to the last </div>
greedy_result = re.findall(r'<div>.*</div>', html_text)
print(greedy_result)  # Output: ['<div>苹果</div><div>香蕉</div>']

# Non-greedy mode: .*? stops as soon as it meets the first </div>
lazy_result = re.findall(r'<div>.*?</div>', html_text)
print(lazy_result)  # Output: ['<div>苹果</div>', '<div>香蕉</div>']

Remember: .*? is the golden combination in data extraction. Many extraction problems that look complicated come down to a forgotten ?.
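
Two small extensions of the lazy pattern are sketched below, on an invented sample string. Wrapping .*? in parentheses makes re.findall return only the captured inner text, and the re.S (re.DOTALL) flag lets the dot cross line breaks, which it does not do by default.

```python
import re

html_text = '<div>apple</div><div>banana\nsplit</div>'

# Capture group: findall returns only what is inside the parentheses.
# Without re.S the dot cannot cross the newline, so the second div is missed.
inner_default = re.findall(r'<div>(.*?)</div>', html_text)
print(inner_default)  # ['apple']

# With re.S the dot also matches newlines, so both divs are found.
inner_dotall = re.findall(r'<div>(.*?)</div>', html_text, flags=re.S)
print(inner_dotall)   # ['apple', 'banana\nsplit']
```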


Finally: 3 pitfall-avoidance tips you can apply right away

  1. Mark pattern strings as raw with r''. Python itself interprets \n in an ordinary string as a newline, which easily conflicts with a regex that needs the backslash sequence to arrive intact. Wrap the pattern in r'' and the backslashes are kept exactly as written, so you no longer have to worry about escaping confusion.

  2. Precompile before looping. If you are going to use the same regular expression thousands of times in a for loop, first save it with compiled = re.compile(pattern) and then call compiled.findall(). Measured gains were 30% to 50%, and the benefit becomes more noticeable as the data volume grows. Note that re also caches recently used patterns internally, so the exact gain depends on the workload.

  3. Debug online first, then write code. regex101.com is recommended: it visualizes the matching process in color and can generate code for Python, Java, JavaScript, and other languages with one click. Fix the pattern as you test it, and say goodbye to the pain of "writing regexes by superstition".
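
The first two tips can be sketched in a few lines. This is only an illustration under assumptions: the sample data in lines and the helper names with_module_function and with_compiled are made up, and the timings printed will vary by machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

# Tip 1: in an ordinary string, '\b' is a single backspace character;
# in a raw string, r'\b' stays backslash + b, which regex reads as a word boundary.
normal = '\bcat\b'
raw = r'\bcat\b'
normal_hits = re.findall(normal, 'cat concat')  # [] (looks for backspace characters)
raw_hits = re.findall(raw, 'cat concat')        # ['cat'] (whole word only)
print(len(normal), len(raw), normal_hits, raw_hits)

# Tip 2: compare module-level calls with a precompiled pattern on synthetic lines.
lines = [f'user{i} phone 135{i:08d}' for i in range(2000)]
phone_pattern = r'1[3-9]\d{9}'
compiled = re.compile(phone_pattern)

def with_module_function():
    # re.findall looks the pattern up in re's internal cache on every call.
    return [re.findall(phone_pattern, line) for line in lines]

def with_compiled():
    # The compiled object skips that lookup.
    return [compiled.findall(line) for line in lines]

t_module = timeit.timeit(with_module_function, number=20)
t_compiled = timeit.timeit(with_compiled, number=20)
print(f'module-level: {t_module:.3f}s, precompiled: {t_compiled:.3f}s')
```

Both helpers return identical results; only the per-call overhead differs, which is why it pays to measure on your own data before relying on a fixed percentage.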

