A practical introduction to Python regular expressions: the whole process from matching to extraction

Regular expressions are the programmer's Swiss Army knife: a compact syntax for precisely matching, splitting, replacing, and extracting information in text. Python's built-in re module is the handle of that knife. This article skips the theory and follows the route "syntax quick reference -> common methods -> practical scenarios -> pitfalls to avoid" so you can start using regular expressions quickly.


1. Core syntax quick reference: the most-used rules in 3 minutes

1.1 Basic matching characters: how to match a single character

Symbol   Function                                                  Equivalent or example
a        Matches that literal character (escape specials: \. \*)   a matches only the letter a
\d       Matches one digit                                         [0-9]
\w       Matches a letter, digit, or underscore                    [a-zA-Z0-9_]
\s       Matches a whitespace character (space, \t, \n, etc.)      invisible characters
.        Matches any character except a newline                    wildcard

Note: for str patterns, \d and \w also match non-ASCII digits and letters; the bracket forms are exact equivalents only when the re.ASCII flag is set.
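
As a quick check of the table above, the following sketch runs each symbol over a short sample with re.findall. The sample string and variable names are made up for illustration.

```python
import re

sample = "id_7 = 42\nok."

digits = re.findall(r'\d', sample)        # every single digit
word_chars = re.findall(r'\w', sample)    # letters, digits, underscore
spaces = re.findall(r'\s', sample)        # the two spaces and the newline
non_newline = re.findall(r'.', sample)    # every character except '\n'
literal_dot = re.findall(r'\.', sample)   # only the real dot

print(digits)              # ['7', '4', '2']
print(''.join(word_chars)) # id_742ok
print(spaces)              # [' ', ' ', '\n']
print(len(non_newline))    # 12
print(literal_dot)         # ['.']
```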

1.2 Quantifiers: how many times the preceding element repeats

Quantifiers are greedy by default (they match as much as possible). Appending ? switches them to non-greedy mode, which is covered in detail in section 3.1.

Symbol   Meaning
*        0 or more times (may be absent)
+        1 or more times (at least once)
?        0 or 1 time (optional)
{n}      Exactly n times
{n,}     At least n times
{n,m}    At least n and at most m times

# Basic pattern examples
r'00\d'       # "00" followed by one digit, e.g. 007; does not match 00A
r'\d{3}'      # exactly 3 digits, e.g. 010
r'\w\w\d'     # two word characters (letter/digit/underscore) plus one digit, e.g. py3, 123
r'py.'        # "py" followed by any one non-newline character, e.g. pyc, py!, pya
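
To see each quantifier accept or reject an input, here is a minimal sketch built on re.fullmatch. The helper name matches is hypothetical and exists only for this illustration.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'ab*', 'a'))          # True: * allows zero b's
print(matches(r'ab+', 'a'))          # False: + needs at least one b
print(matches(r'colou?r', 'color'))  # True: ? makes the u optional
print(matches(r'\d{3}', '010'))      # True: exactly three digits
print(matches(r'\d{2,}', '7'))       # False: at least two digits required
print(matches(r'\d{2,4}', '12345'))  # False: at most four digits allowed
```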

1.3 Character sets and ranges: choose which characters to match

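Syntax          Matches
[abc]           Any one of a, b, or c
[a-z]           Any lowercase ASCII letter
[0-9a-zA-Z_]    Same as \w (for ASCII text)
[^abc]          Any character except a, b, or c

# Character set examples
r'[0-9a-zA-Z_]+'    # a run of characters that are legal in variable names
r'[a-zA-Z_]\w*'     # a legal (ASCII) Python variable name: must start with a letter or underscore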
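
The character sets above can be exercised as follows. This is a small sketch; the sample inputs and the names identifier, hex_only, and vowels are chosen only for illustration.

```python
import re

identifier = re.compile(r'[a-zA-Z_]\w*')

print(bool(identifier.fullmatch('_count1')))  # True
print(bool(identifier.fullmatch('1count')))   # False: starts with a digit

# [^...] negates the set: remove everything that is not a hex digit
hex_only = re.sub(r'[^0-9a-fA-F]', '', 'ff:0A-3c')
print(hex_only)  # ff0A3c

# A plain set matches any one of the listed characters
vowels = re.findall(r'[aeiou]', 'regular')
print(vowels)    # ['e', 'u', 'a']
```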

1.4 Anchors, alternation, and groups: control position and branching

Symbol        Function
^             Matches the start of the string (the start of each line in multi-line mode)
$             Matches the end of the string (the end of each line in multi-line mode)
\b            Matches a word boundary (the position between a \w and a non-\w character)
A|B           Matches either A or B
(pattern)     Capturing group; the matched content can be extracted separately
(?:pattern)   Non-capturing group; used only for grouping, not extracted separately

# Anchor and group examples
r'^py$'                # the whole string must be exactly "py"; "python" does not match
r'(P|p)ython'          # matches "Python" or "python"
r'^(\d{3})-(\d{3,8})$' # groups for "area code-number", e.g. 010-12345678
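
The sketch below shows how \b, ^ with re.MULTILINE, and the two kinds of group behave on a small sample. The sample text and variable names are assumptions made for the illustration.

```python
import re

text = "python py\npy3 happy"

# \b keeps 'py' from matching inside 'python', 'py3', or 'happy'
whole_words = re.findall(r'\bpy\b', text)
print(whole_words)       # ['py']

# ^ matches only at the start of the string unless re.MULTILINE is set
start_default = re.findall(r'^py', text)
start_multiline = re.findall(r'^py', text, re.MULTILINE)
print(start_default)     # ['py']
print(start_multiline)   # ['py', 'py']

# A capturing group changes what findall returns; (?:...) does not
captured = re.findall(r'(P|p)ython', 'Python python')
uncaptured = re.findall(r'(?:P|p)ython', 'Python python')
print(captured)          # ['P', 'p']
print(uncaptured)        # ['Python', 'python']
```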

2. The Python re module: common methods at a glance

2.1 Always use the r prefix when writing regular expressions

Python strings themselves treat \ as an escape character. A pattern full of backslashes such as \d and \s, written without the r prefix, needs every backslash doubled, which is easy to get wrong.

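# ❌ Error-prone: without the r prefix, every backslash must be doubled
pattern_error = '\\d{3}'
# ✅ Clear: with the r prefix, the pattern is written exactly as the regex engine sees it
pattern_correct = r'\d{3}'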

With the r prefix, Python keeps every \ in the string as-is, so the regex engine receives the pattern exactly as written.
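
A short sketch of why the prefix matters: the doubled and raw spellings are the same string, while an escape that Python does recognize, such as \b (backspace), silently changes the pattern. The sample text is made up for illustration.

```python
import re

# Both spellings produce the same 5-character pattern string
same = '\\d{3}' == r'\d{3}'
print(same, len(r'\d{3}'))    # True 5

# '\b' in a normal string is one backspace character, not a word boundary
print(len('\b'), len(r'\b'))  # 1 2

without_prefix = re.findall('\bcat\b', 'cat concat')
with_prefix = re.findall(r'\bcat\b', 'cat concat')
print(without_prefix)         # []
print(with_prefix)            # ['cat']
```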

2.2 Overview of the most-used functions

Function             Purpose                                        Return value
re.match(p, s)       Tries to match from the start of the string    A Match object on success, None on failure
re.search(p, s)      Scans the whole string for the first match     Same as above
re.findall(p, s)     Scans the whole string for all matches         A list of matched strings; with one group, a list of that group's text; with several groups, a list of tuples
re.split(p, s)       Splits the string on the pattern               A list of the pieces
re.sub(p, repl, s)   Replaces every match                           The new string after replacement
re.compile(p)        Precompiles a pattern for reuse                A Pattern object that offers all the methods above
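
Each function in the table can be tried on one sample string. This is a minimal sketch; the string s and pattern p are invented for the demonstration.

```python
import re

s = 'tel: 010-12345, fax: 021-67890'
p = r'(\d{3})-(\d+)'

print(re.match(p, s))               # None: the string does not start with digits
print(re.search(p, s).group(0))     # 010-12345
print(re.findall(p, s))             # [('010', '12345'), ('021', '67890')]
print(re.findall(r'\d{3}-\d+', s))  # ['010-12345', '021-67890']
print(re.split(r',\s*', s))         # ['tel: 010-12345', 'fax: 021-67890']
print(re.sub(p, r'\2@\1', s))       # tel: 12345@010, fax: 67890@021

# A compiled Pattern exposes the same operations as methods
compiled = re.compile(p)
print(compiled.search(s).groups())  # ('010', '12345')
```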

2.3 Group extraction in practice

Grouping is one of the most practical capabilities of regular expressions. After a successful match, the Match object's group() method extracts the content of each pair of parentheses separately:

import re

# Precompile the grouped phone-number pattern
phone_pattern = re.compile(r'^(\d{3})-(\d{3,8})$')
match_obj = phone_pattern.match('010-1234567')

if match_obj:
    print(match_obj.group(0))  # the whole match: '010-1234567'
    print(match_obj.group(1))  # first parentheses: '010'
    print(match_obj.group(2))  # second parentheses: '1234567'
    print(match_obj.groups())  # tuple of all groups: ('010', '1234567')
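
When a pattern has more than a couple of groups, numeric indexes get hard to follow. As an optional variation on the example above, the re module also supports named groups written (?P<name>...). The group names area and number are chosen here purely for illustration.

```python
import re

phone_pattern = re.compile(r'^(?P<area>\d{3})-(?P<number>\d{3,8})$')
m = phone_pattern.match('010-1234567')

if m:
    print(m.group('area'))  # 010
    print(m.groupdict())    # {'area': '010', 'number': '1234567'}
    print(m.span(2))        # (4, 11): where the second group sits in the input
```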

3. Advanced techniques and practical scenarios

3.1 Non-greedy matching: avoid "biting off more than you can chew"

By default, * and + are greedy: they consume as much as they can match. Appending ? turns them into a minimal match that stops as soon as the rest of the pattern can succeed.

# Scenario: separate the leading digits from the trailing zeros
# Greedy (default): the first group swallows the trailing zeros too
greedy_match = re.match(r'^(\d+)(0*)$', '102300').groups()
print(greedy_match)  # ('102300', '')

# Non-greedy: the first group stops at the last non-zero digit
non_greedy_match = re.match(r'^(\d+?)(0*)$', '102300').groups()
print(non_greedy_match)  # ('1023', '00')
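
The same difference shows up whenever a pattern has an opening and a closing delimiter. The sketch below uses a made-up fragment of markup to compare the two modes. It is only an illustration of greediness, not a recommendation to parse HTML with regular expressions.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)        # runs from the first '<' to the last '>'
lazy = re.findall(r'<.+?>', html)         # stops at the nearest '>'
inner = re.findall(r'<b>(.*?)</b>', html) # text between one pair of tags

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(inner)   # ['bold']
```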

3.2 Everyday scenario 1: handling messy separators

Python's str.split() splits on only one delimiter at a time, so it falls short on dirty data that mixes spaces, semicolons, and commas. With re.split() you can define a whole set of delimiter characters.

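# Scenario: users enter "name, age, address" records with all sorts of separators
messy_input = '张三,20;;  北京市朝阳区-李四;25 上海市徐汇区'

# Every character inside the brackets is a separator; + treats a run of them as one
clean_result = re.split(r'[\s,;-]+', messy_input.strip())
print(clean_result)
# Output: ['张三', '20', '北京市朝阳区', '李四', '25', '上海市徐汇区']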
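
Two follow-up steps often come next: grouping the flat list back into records, and dropping the empty strings that re.split() returns when the input starts or ends with a separator. The sketch below assumes every record has exactly three fields, which the original data happens to satisfy.

```python
import re

messy_input = '张三,20;;  北京市朝阳区-李四;25 上海市徐汇区'
fields = re.split(r'[\s,;-]+', messy_input.strip())

# Group the flat list three fields at a time: (name, age, address)
records = [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]
print(records)

# A leading or trailing separator produces empty strings; filter them out
raw = re.split(r'[\s,;-]+', ';a,b ')
cleaned = [f for f in raw if f]
print(raw)      # ['', 'a', 'b', '']
print(cleaned)  # ['a', 'b']
```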

3.3 Everyday scenario 2: extracting key information from logs

# Scenario: find every IP address in a server log
log_text = '''
2024-05-20 10:00:01 192.168.1.1 request succeeded
2024-05-20 10:00:05 10.0.0.5 request timed out
2024-05-20 10:00:10 172.16.0.3 request succeeded
'''

# Simplified IP pattern (production code needs a stricter one: this also accepts 999.999.999.999)
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
ips = re.findall(ip_pattern, log_text)
print(ips)  # ['192.168.1.1', '10.0.0.5', '172.16.0.3']
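
One possible stricter version limits each octet to 0-255. This is a sketch under stated assumptions, not a complete validator: the names octet and strict_ip are made up here, and the pattern would still pick '1.2.3.4' out of a longer dotted run such as '1.2.3.4.5'.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 (ASCII digits only)
octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# Four octets separated by dots, bounded by word boundaries
strict_ip = re.compile(rf'\b{octet}(?:\.{octet}){{3}}\b')

text = 'ok 192.168.1.1, bad 999.10.0.1, ok 10.0.0.5'
found = strict_ip.findall(text)
print(found)  # ['192.168.1.1', '10.0.0.5']
```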

4. Best practices and pitfalls to avoid

4.1 Best Practices

  1. Precompile frequently used patterns: if the same regular expression is used many times, compile it once with re.compile() and reuse the Pattern object. This skips the per-call pattern lookup and keeps the pattern in one place.
  2. Make complex patterns readable: use re.VERBOSE mode, which allows line breaks, indentation, and comments, turning an unreadable pattern into an instruction manual.
  3. Start simple and test piece by piece: do not write one long pattern up front. Test a small part first and assemble the pieces once each is correct.
  4. Test the edge cases: empty strings, the longest expected input, mixed special symbols, and so on.

# A commented, maintainable regex (validates a simple email address)
readable_email = re.compile(r"""
    ^                   # start of string
    [a-zA-Z0-9._%+-]+   # user name
    @                   # the @ sign
    [a-zA-Z0-9.-]+      # domain name
    \.                  # a literal dot (escaped)
    [a-zA-Z]{2,}        # top-level domain, at least two letters
    $                   # end of string
""", re.VERBOSE)

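Practices 1, 2, and 4 can be combined in a few lines: compile once, keep the verbose form, and run a handful of edge cases through it. The sketch below also checks the verbose pattern against a compact one-line spelling. The test addresses are invented, and the pattern remains a simplified check rather than a full email validator.

```python
import re

readable_email = re.compile(r"""
    ^[a-zA-Z0-9._%+-]+   # user name
    @
    [a-zA-Z0-9.-]+       # domain name
    \.[a-zA-Z]{2,}$      # literal dot plus top-level domain
""", re.VERBOSE)

# The same pattern on one line, to confirm both spellings agree
compact_email = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

results = {}
for candidate in ['user@example.com', 'no-at-sign.com', 'a@b.c', '']:
    verdict = readable_email.match(candidate) is not None
    assert verdict == (compact_email.match(candidate) is not None)
    results[candidate] = verdict

print(results)
# {'user@example.com': True, 'no-at-sign.com': False, 'a@b.c': False, '': False}
```
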
4.2 Pitfalls to avoid

  • Forgetting to escape special symbols: . is a wildcard, so to match a literal dot you must write \.. Likewise, *, +, and ( must be escaped when you want them as ordinary characters.
  • Confusing match and search: match only looks at the beginning of the string, while search finds the first match anywhere in the text. In most scenarios, what you actually need is search or findall.
  • Using a sledgehammer to crack a nut: if you only need to know whether a substring occurs in a string, "py" in s is enough, and it is simpler and more direct than a regex.
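
The three pitfalls above can be reproduced in a few lines. This is a small sketch with made-up sample strings; re.escape() is the standard helper for escaping a literal before using it in a pattern.

```python
import re

s = 'price: 3.50 (approx)'

# Pitfall 1: an unescaped dot matches any character
loose = re.findall(r'3.5', '3.5 345 3x5')
exact = re.findall(r'3\.5', '3.5 345 3x5')
print(loose)    # ['3.5', '345', '3x5']
print(exact)    # ['3.5']

# re.escape escapes the special characters in a literal for you
escaped = re.escape('(approx)')
print(escaped)  # \(approx\)
print(re.search(escaped, s) is not None)  # True

# Pitfall 2: match only tries position 0, search scans the whole string
print(re.match(r'\d+', s))           # None
print(re.search(r'\d+', s).group())  # 3

# Pitfall 3: a plain substring test needs no regex
print('price' in s)                  # True
```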

Regular expressions are powerful and quick to write, but hard to read. This article covers most everyday usage scenarios. If you need more advanced features, such as Unicode property matching or recursive patterns, consider the third-party regex library. With the techniques in this article you will be able to handle most text-processing problems comfortably.