A comprehensive guide to Python regular expressions

Scraping real-time prices from e-commerce detail pages, parsing server logs to extract high-risk IPs, and batch-cleaning garbled tags out of text are all frequent, fiddly text-processing tasks, and regular expressions (regex) are the efficiency tool Python developers reach for in each of them. This article focuses on Python's built-in re module and works through its usage systematically, from basic syntax to advanced techniques to practical crawler work.


1. Basics of regular expressions

1.1 What is a regular expression

A regular expression (regex) is a syntax designed specifically for describing string patterns. You can think of it as a sieve template for text: draw a particular combination of symbols on the template, and the text that fits the rule is precisely sifted out; the parts that are sifted out can also be replaced.

1.2 Getting Started Tools

Don't rush to write Python code yet: lightweight online tools are the fastest way to verify an idea. Two common Chinese tools are recommended:

Experiment with a piece of test text:

My phone number is 13812345678, my backup email is user.name@mail.co.uk, test site: https://blog.example.com/posts/123#comment

Here are two introductory matching demos (note the limitations of these basic patterns):

  • Match a simple single-domain email address: \w+@\w+\.\w+ → only catches mailboxes shaped like example@domain.com; on the test text it stops short and returns just name@mail.co
  • Match a URL with an optional s after http and no boundary check: https?://[^\s]+ → captures most common URLs, but also swallows any punctuation that directly follows the URL
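As a rough sketch, the two demo patterns can be tried directly in Python. The sample string below is modelled on the test text, with a trailing period added (an assumption made here) so that the punctuation problem becomes visible; the variable names are illustrative only.

```python
import re

# Sample text modelled on the test text; the final period is added on purpose.
text = ("My phone number is 13812345678, my backup email is "
        "user.name@mail.co.uk, test site: "
        "https://blog.example.com/posts/123#comment.")

# Simple email pattern: \w cannot cross the dot in "user.name",
# and the pattern stops after a single dot in the domain.
email_match = re.search(r'\w+@\w+\.\w+', text)
print(email_match.group())  # name@mail.co

# Unbounded URL pattern: it runs until the next whitespace,
# so the sentence-ending period is captured too.
url_match = re.search(r'https?://[^\s]+', text)
print(url_match.group())  # https://blog.example.com/posts/123#comment.
```

Both results show why these basic patterns are only a starting point: real email addresses and URLs need tighter character classes and boundaries.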

2. Core methods of Python's re module

Regular expressions are not part of Python's core syntax; instead, the standard library module re wraps the commonly used operations. Here is a quick reference table first:

| Method | Function | Key return value |
| --- | --- | --- |
| re.match() | Match only at the beginning of the string | A Match object on success, None on failure |
| re.search() | Scan the entire string and return the first match | Same as re.match() |
| re.findall() | Scan the entire string and return all matches | A list on success (a list of tuples when the pattern has several groups), an empty list on failure |
| re.sub() | Replace all matched text | The new string after replacement |
| re.compile() | Precompile a regular expression (for performance optimization) | A Pattern object that can be reused for matching and replacement |
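To make the return values in the table concrete, here is a minimal sketch that calls each method on one made-up sentence (the text and variable names are assumptions for illustration):

```python
import re

text = 'order 1001 costs 25, order 1002 costs 40'

# match() only succeeds at position 0; this text starts with "order", not a digit.
at_start = re.match(r'\d+', text)
print(at_start)  # None

# search() scans forward and returns a Match object for the first hit.
first = re.search(r'\d+', text)
print(first.group())  # 1001

# findall() without groups returns a list of strings ...
numbers = re.findall(r'\d+', text)
print(numbers)  # ['1001', '25', '1002', '40']

# ... and a list of tuples when the pattern has two or more groups.
pairs = re.findall(r'order (\d+) costs (\d+)', text)
print(pairs)  # [('1001', '25'), ('1002', '40')]

# sub() returns a new string; the original is left untouched.
masked = re.sub(r'\d+', '#', text)
print(masked)  # order # costs #, order # costs #

# compile() returns a reusable Pattern object exposing the same methods.
number = re.compile(r'\d+')
print(number.findall('a1b22'))  # ['1', '22']
```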

3. Detailed explanation of common matching methods

3.1 match(): matching from the beginning

This method has one hard requirement: the pattern must match starting from the very first character of the string, otherwise it returns None immediately. The match does not have to reach the end of the string. This makes match() especially suitable for validating strictly formatted prefixes.

Basic usage

import re

# The text must start with Hello, then 3 digits, 4 digits, and 10 letters/digits/underscores
content = 'Hello 123 4567 World_This'
pattern = r'^Hello\s\d{3}\s\d{4}\s\w{10}'  # the r prefix avoids escape conflicts

result = re.match(pattern, content)
if result:
    print(result.group())  # the full match: Hello 123 4567 World_This
    print(result.span())   # start/end indices of the match: (0, 25)
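One detail worth checking by hand is that match() anchors only the start of the string. The sketch below contrasts it with re.fullmatch(), which is not covered in this article but is part of the same standard module; the sample strings are made up.

```python
import re

pattern = r'Hello\s\d{3}'

# match() is anchored at the start only; trailing text is ignored.
prefix_ok = re.match(pattern, 'Hello 123 and more')
print(prefix_ok.group())  # Hello 123

# The same pattern fails when the text does not start with it.
not_at_start = re.match(pattern, 'Say Hello 123')
print(not_at_start)  # None

# fullmatch() requires the whole string to fit the pattern.
whole_fail = re.fullmatch(pattern, 'Hello 123 and more')
whole_ok = re.fullmatch(pattern, 'Hello 123')
print(whole_fail)        # None
print(whole_ok.group())  # Hello 123
```

When a value must be validated end to end (for example a form field), fullmatch() or an explicit $ at the end of the pattern is usually the safer choice.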

Group extraction (core feature!)

If you need to pull a specific part out of the matched text, wrap that part of the pattern in parentheses (): this is grouping. group(0) is the complete match, group(1) is the content of the first pair of parentheses, group(2) the second, and so on.

import re

content = 'Hello 1234567 World_This'
pattern = r'^Hello\s(\d+)\sWorld'

result = re.match(pattern, content)
if result:
    print(result.group(1))  # pull out just the digits: 1234567
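The numbering rule is easier to see with more than one group. A small sketch, reusing the earlier sample text with a pattern that has three groups (the pattern itself is an illustrative assumption):

```python
import re

content = 'Hello 123 4567 World_This'
result = re.match(r'Hello\s(\d{3})\s(\d{4})\s(\w+)', content)

if result:
    print(result.group(0))     # Hello 123 4567 World_This
    print(result.group(1))     # 123
    print(result.group(2))     # 4567
    print(result.group(1, 3))  # ('123', 'World_This')
    print(result.groups())     # ('123', '4567', 'World_This')
```

groups() returns every captured group as a tuple, which is handy for unpacking into variables in one step.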

3.2 search(): scan the whole string for the first match

Unlike match(), search() skips any leading text that does not match and returns the first match found anywhere in the string. In day-to-day work it is far more useful than match().

import re

content = 'Leading filler Extra Hello 1234567 World_This'
pattern = r'Hello\s(\d+)\sWorld'

result = re.search(pattern, content)
if result:
    print(result.group(1))  # still extracts: 1234567
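Running both functions on the same input makes the difference visible. This is a minimal sketch with a made-up sentence:

```python
import re

content = 'Extra text, then Hello 1234567 World_This'
pattern = r'Hello\s(\d+)\sWorld'

# match() fails because the text does not start with "Hello".
from_start = re.match(pattern, content)
print(from_start)  # None

# search() finds the same pattern further into the string.
found = re.search(pattern, content)
print(found.group(1))  # 1234567
print(found.start())   # 17, the index where "Hello" begins
```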

3.3 findall(): scan the whole string for every match

If you need to extract everything that fits the pattern in one go (such as every link in a product list), use findall(). It returns a list directly, so there is no need to loop and test match by match.

import re

# Simulate a simple HTML product-list fragment
html = '''
<ul class="product-list">
    <li><a href="https://shop.example.com/p1">Apple 15 Pro</a></li>
    <li><a href="https://shop.example.com/p2">Huawei Mate 60</a></li>
    <li><a href="https://shop.example.com/p3">Xiaomi 14 Ultra</a></li>
</ul>
'''

# Use groups to extract the link and the product name; re.S lets . match newlines
pattern = r'<li><a href="(.*?)">(.*?)</a></li>'
results = re.findall(pattern, html, re.S)

for url, name in results:
    print(f'Product: {name}, link: {url}')

4. Advanced matching techniques

4.1 Greedy vs. non-greedy matching (a must-know for crawlers!)

This is the easiest pitfall for beginners to fall into. Quantifiers such as * and + (as in .*) are greedy by default: they swallow as many characters as possible. Their lazy forms *? and +? (as in .*?) are non-greedy: they swallow as few characters as possible and stop as soon as the rest of the pattern can match.

import re

content = 'He1234567WoDemo'

# Greedy: .* swallows 123456 and leaves only the final 7 for \d+
pattern_greedy = r'^He.*(\d+).*Demo$'
result_greedy = re.search(pattern_greedy, content)
print(result_greedy.group(1))  # 7

# Non-greedy: .*? stops before the first digit and leaves all of 1234567 for \d+
pattern_lazy = r'^He.*?(\d+).*Demo$'
result_lazy = re.search(pattern_lazy, content)
print(result_lazy.group(1))  # 1234567
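The same effect shows up constantly when scraping tags. Here is a small sketch on a made-up HTML fragment, where the greedy version merges two tags into one match:

```python
import re

html = '<b>first</b> and <b>second</b>'

# Greedy: .* runs to the last </b>, so the two tags collapse into one match.
greedy = re.findall(r'<b>(.*)</b>', html)
print(greedy)  # ['first</b> and <b>second']

# Non-greedy: .*? stops at the nearest </b>, giving one match per tag.
lazy = re.findall(r'<b>(.*?)</b>', html)
print(lazy)  # ['first', 'second']
```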

4.2 Modifiers (making regular expressions more flexible)

Modifiers (called flags in Python) change the default matching behaviour of a regular expression. Three are used most often:

| Modifier | Core role | Crawler scenario |
| --- | --- | --- |
| re.I | Ignore case | Match image extensions regardless of case, .jpg as well as .JPG |
| re.S | Let . match newlines | Parsing HTML tags, which often span multiple lines |
| re.M | Multi-line mode (^ and $ match the beginning/end of each line) | Parsing multi-line log files |
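A short sketch of the three flags in action; the file names, HTML fragment, and log lines are made up for illustration:

```python
import re

# re.I: match the extension regardless of case.
files = ['a.jpg', 'b.JPG', 'c.png']
images = [f for f in files if re.search(r'\.jpg$', f, re.I)]
print(images)  # ['a.jpg', 'b.JPG']

# re.S: let . cross the line break inside a tag body.
html = '<div>line one\nline two</div>'
without_s = re.search(r'<div>(.*?)</div>', html)
with_s = re.search(r'<div>(.*?)</div>', html, re.S)
print(without_s)             # None
print(repr(with_s.group(1))) # 'line one\nline two'

# re.M: ^ and $ match at every line boundary, not only at the string ends.
log = 'INFO start\nERROR disk full\nERROR timeout'
errors = re.findall(r'^ERROR (.*)$', log, re.M)
print(errors)  # ['disk full', 'timeout']
```

Flags can be combined with the | operator, for example re.S | re.I.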

4.3 Escape matching (processing special characters)

Regular expressions contain many special symbols (such as . * + ( ) [ ] ?). To match one of these symbols literally, put a backslash \ in front of it to escape it.

import re

content = '(CSDN)https://blog.csdn.net'
pattern = r'\(CSDN\)https://blog\.csdn\.net'

result = re.search(pattern, content)
if result:
    print('Match succeeded')
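When the literal text comes from a variable, escaping by hand is error-prone. The standard re.escape() function, which is not covered elsewhere in this article, adds the backslashes automatically. A minimal sketch with made-up input strings:

```python
import re

keyword = '(CSDN)https://blog.csdn.net'  # literal text containing regex metacharacters

# re.escape() inserts the backslashes for us.
pattern = re.escape(keyword)
hit = re.search(pattern, 'visit (CSDN)https://blog.csdn.net today')
print(hit is not None)  # True

# Without escaping, each dot matches any character, so a wrong host slips through.
loose = re.search(r'blog.csdn.net', 'blogXcsdnYnet')
strict = re.search(r'blog\.csdn\.net', 'blogXcsdnYnet')
print(loose is not None)   # True
print(strict is not None)  # False
```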

5. Practical auxiliary methods

5.1 sub(): Batch replacement

This method can replace all matched text at once and is suitable for cleaning garbled characters, tags, sensitive words, etc.

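import re

# Remove every digit from the text, keeping only the letters
content = '54aK54yr5oiR54ix5L2g'
clean_content = re.sub(r'\d+', '', content)
print(clean_content)  # aKyroiRixLg

# Replace HTML <p> tags with newlines
html_p = '<p>First line</p><p>Second line</p>'
clean_html = re.sub(r'</?p>', '\n', html_p).strip()  # strip() removes leading/trailing newlines
print(clean_html)  # the two lines, separated by a blank line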
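sub() can do more than delete or swap fixed text. The sketch below shows three variations that go slightly beyond what this article covers: backreferences in the replacement, the count argument, and a function as the replacement. The phone numbers and sentences are made up.

```python
import re

text = 'Contact: 13812345678 or 13987654321'

# Keep the first 3 and last 4 digits; mask the middle 4 using backreferences \1 and \2.
masked = re.sub(r'(\d{3})\d{4}(\d{4})', r'\1****\2', text)
print(masked)  # Contact: 138****5678 or 139****4321

# count limits how many replacements are made.
limited = re.sub(r'\d', '#', 'a1b2c3', count=2)
print(limited)  # a#b#c3

# The replacement can also be a function that receives each Match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'price 10, qty 3')
print(doubled)  # price 20, qty 6
```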

5.2 compile(): precompiling a regular expression (performance optimization)

If the same pattern will be applied to a large amount of text (for example, running one price-matching rule over 1,000 product pages), compile it first with re.compile() into a Pattern object. The pattern is then parsed once and reused, rather than being looked up again on every call, which improves efficiency and keeps the code tidier.

import re

# Precompile a pattern for a time format (HH:MM)
time_pattern = re.compile(r'\d{2}:\d{2}')

# Reuse it across several texts
texts = ['It is 12:34 now', 'Off work at 23:45 tomorrow', 'Day off the day after']
for text in texts:
    result = time_pattern.search(text)
    if result:
        print(f'Found time: {result.group()}')
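For the product-page scenario mentioned above, a compiled pattern might be reused as in the following sketch. The price format, the page fragments, and the name price_pattern are all assumptions made for illustration, not taken from a real site.

```python
import re

# Hypothetical price format: a currency sign followed by digits with optional decimals.
price_pattern = re.compile(r'[¥$](\d+(?:\.\d{1,2})?)')

pages = [
    '<span class="price">¥5999.00</span>',
    '<span class="price">$12.5</span> <del>$15</del>',
    '<span>sold out</span>',
]

# A Pattern object offers search(), findall(), sub() and friends directly.
all_prices = [[float(p) for p in price_pattern.findall(page)] for page in pages]

for prices in all_prices:
    print(prices)
# [5999.0]
# [12.5, 15.0]
# []
```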

6. Common matching pattern cheat sheet

| Pattern | Description | Common scenarios |
| --- | --- | --- |
| . | Matches any character (except a newline, unless re.S is set) | Placeholder for arbitrary characters |
| \w | Matches letters, digits, and underscores | User names, file-name prefixes |
| \s | Matches whitespace characters (spaces, tabs, newlines) | Gaps in text |
| \d | Matches digits | Prices, mobile phone numbers, IDs |
| ^ | Matches the beginning of the string | Validating how a string starts |
| $ | Matches the end of the string | Validating how a string ends |
| * | Matches the preceding element 0 or more times | Optional repetitions |
| + | Matches the preceding element 1 or more times | Repeated content that must be present |
| ? | Matches the preceding element 0 or 1 times | A single optional item |
| {n} | Matches the preceding element exactly n times | Fixed-length runs of digits/letters |
| {n,} | Matches the preceding element at least n times | Long text fragments |
| [abc] | Matches any one of a, b, c | A limited set of allowed characters |
| [^abc] | Matches any character except a, b, c | A limited set of excluded characters |
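Most rows of the cheat sheet can be checked in a few lines. The sketch below runs them against made-up sample strings:

```python
import re

sample = 'user_01 paid 42 yuan\ton 2024-05-06'

# Character classes
digits = re.findall(r'\d+', sample)     # ['01', '42', '2024', '05', '06']
words = re.findall(r'\w+', sample)[:2]  # ['user_01', 'paid']
spaces = re.findall(r'\s', sample)      # [' ', ' ', ' ', '\t', ' ']

# Anchors
starts_with_user = re.search(r'^user', sample) is not None  # True
ends_with_06 = re.search(r'06$', sample) is not None        # True

# Quantifiers applied to the preceding "b" or \d
star = re.findall(r'ab*', 'a ab abb')       # ['a', 'ab', 'abb']
plus = re.findall(r'ab+', 'a ab abb')       # ['ab', 'abb']
optional = re.findall(r'ab?', 'a ab abb')   # ['a', 'ab', 'ab']
year = re.search(r'\d{4}', sample).group()  # '2024'
long_runs = re.findall(r'\d{3,}', sample)   # ['2024']

# Character sets
vowels = re.findall(r'[aeiou]', 'regex')    # ['e', 'e']
others = re.findall(r'[^aeiou]', 'regex')   # ['r', 'g', 'x']

print(digits, words, spaces)
print(star, plus, optional, year, long_runs)
print(vowels, others)
```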

7. Tips for performance optimization

Regular expressions are powerful, but misuse can cause performance bottlenecks or even hang a program (for example, the "backtracking explosion" triggered by complex greedy patterns). Keeping the following points in mind avoids most problems:

  1. Prefer non-greedy matching (.*?) to reduce backtracking
  2. Precompile with re.compile() when the same pattern is matched many times
  3. If a built-in string method (such as find() or replace()) can do the job, don't use a regular expression
  4. Set boundary characters sensibly, for example use [^"]* instead of .*? to match the content inside double quotes
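Tips 3 and 4 can be sketched as follows, using a made-up attribute string. Both regex variants return the same result here; the negated character class simply cannot run past a quote, so it has nothing to backtrack over.

```python
import re

line = 'name="alice" role="admin"'

# Tip 4: a lazy wildcard versus an explicit boundary character.
lazy = re.findall(r'"(.*?)"', line)
bounded = re.findall(r'"([^"]*)"', line)
print(lazy)     # ['alice', 'admin']
print(bounded)  # ['alice', 'admin']

# Tip 3: for fixed text, plain string methods are enough and easier to read.
has_role = 'role=' in line
position = line.find('role=')
replaced = line.replace('admin', 'guest')
print(has_role)  # True
print(position)  # 13
print(replaced)  # name="alice" role="guest"
```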

Regular expressions are an essential skill for getting started with crawler development, and the Swiss Army knife of text processing. Once you have the basic syntax down, practice on real HTML and log text and gradually build up your own library of commonly used patterns.