title: Web page parsing tool XPath description: Python crawler tutorial: Detailed explanation of XPath parsing technology

Python crawler tutorial: Detailed explanation of XPath parsing technology

Is the ever-changing structure of HTML a headache when you scrape web pages? Are regular expressions too fragile, breaking as soon as the indentation or the order of tags changes slightly? XPath is the "precision scalpel" for navigating web pages that you have been looking for. This article uses Python and lxml to walk through its core usage so that data extraction becomes easier and more efficient.


1. Quickly understand XPath

XPath (XML Path Language) is a path language for locating nodes in a document tree. It was originally designed for XML, but it works equally well on HTML once the page has been parsed into a tree.

Why choose XPath?

  • ✅ Path syntax that is intuitive and reads much like a file-system path
  • ✅ Rich built-in functions for filtering nodes
  • ✅ Navigation in every direction: up, down, and between siblings
  • ✅ A W3C standard, with mature libraries available in all mainstream languages

2. Environment preparation: Python + lxml

In the Python ecosystem, lxml is one of the most popular libraries for working with XPath. It is built on the C libraries libxml2 and libxslt, so parsing is fast, and it is fault tolerant: even non-standard HTML is automatically repaired into a queryable tree structure.

Install dependencies

pip install lxml

Verify successful installation

from lxml import etree

print(f"lxml version: {etree.__version__}")

If the version number prints normally after running (for example, lxml version: 5.3.0), the environment is ready.


3. Step one: Turn HTML into a queryable "tree"

XPath works on the document tree model, so the first step is to convert the HTML string or local file into an lxml tree object (an Element or an ElementTree).

Parse HTML string

from lxml import etree

sample_html = """
<html>
    <body>
        <div class="container">
            <ul>
                <li class="item-0"><a href="link1.html">first item</a></li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </body>
</html>
"""

# Initialize the HTML parser (it automatically repairs non-standard HTML)
html_parser = etree.HTMLParser()
tree = etree.fromstring(sample_html, html_parser)

# Print the repaired, formatted HTML to confirm the structure
print(etree.tostring(tree, pretty_print=True, method="html").decode("utf-8"))
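
To see the automatic repair at work, here is a minimal sketch that feeds the parser a deliberately broken fragment. The fragment and the names broken_html and root are made up for illustration and are not part of the original example.

```python
from lxml import etree

# A deliberately malformed fragment: no <html>/<body>, and the <li> tags are never closed.
broken_html = "<ul><li>one<li>two</ul>"

root = etree.fromstring(broken_html, etree.HTMLParser())

# The parser wraps the fragment in <html><body> and closes the open <li> tags,
# so ordinary XPath queries work on the result.
print(root.tag)
print(root.xpath("//li/text()"))
print(etree.tostring(root, method="html").decode("utf-8"))
```

Printing the serialized tree is a quick way to check how lxml interpreted a page before writing expressions against it.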

Load from local HTML file

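# Parse the local file test.html
tree = etree.parse("test.html", etree.HTMLParser())

The tree objects obtained in these two ways are queried in exactly the same way (fromstring() returns the root Element and parse() returns an ElementTree, and both provide .xpath()), so the rest of this article concentrates on the queries themselves.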
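
The following sketch compares the two entry points side by side. It assumes an in-memory io.StringIO object as a stand-in for a file on disk so that it runs without test.html; the names html_text, root, and doc are illustrative.

```python
import io
from lxml import etree

html_text = "<html><body><p>hello</p></body></html>"

# fromstring() returns the root Element ...
root = etree.fromstring(html_text, etree.HTMLParser())

# ... while parse() returns an ElementTree. StringIO stands in for a real file here.
doc = etree.parse(io.StringIO(html_text), etree.HTMLParser())

# Both objects accept the same XPath expressions.
print(type(root).__name__, root.xpath("//p/text()"))
print(type(doc).__name__, doc.xpath("//p/text()"))

# getroot() converts an ElementTree into its root Element when one is needed.
print(doc.getroot().tag)
```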


4. Core node selection: path, attribute, text

Once you have the tree object, call its .xpath() method and pass in an expression to run a query. The result is usually a list of matching nodes, or a list of attribute values or text strings.

Quick reference for basic path symbols

| Syntax | Function | Example | Meaning |
|---|---|---|---|
| / | Start at the root node, or select direct child nodes | tree.xpath('/html/body/div/ul') | Selects the only <ul> in the document |
| // | Select all descendant nodes from anywhere in the document | tree.xpath('//li') | Selects all 5 <li> elements |
| * | Wildcard, matches any element name | tree.xpath('//li/*') | Selects every direct child element of each <li> (i.e. the <a> elements) |
| . | Current node (commonly used for relative positioning within loops) | See the practical case in section 6 | |
| .. | Parent node | tree.xpath('//a[@href="link4.html"]/../@class') | Returns ['item-1'] |
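
The symbols in the table can be tried out together in one short script. This is a sketch that rebuilds a compact copy of the sample list from section 3 so that it runs on its own; the variable names are illustrative.

```python
from lxml import etree

html = """
<div class="container">
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""
# The HTML parser wraps the fragment in <html><body>, so /html/body/... paths work.
tree = etree.fromstring(html, etree.HTMLParser())

uls = tree.xpath('/html/body/div/ul')   # absolute path from the root
lis = tree.xpath('//li')                # every <li>, wherever it is
children = tree.xpath('//li/*')         # any element directly under an <li>
parent_class = tree.xpath('//a[@href="link4.html"]/../@class')  # step up to the parent

print(len(uls), len(lis))
print([el.tag for el in children])
print(parent_class)
```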

Attribute filtering and acquisition

XPath uses a predicate (a filter condition in square brackets) of the form [@attribute="value"] for exact matching:

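# Select every li whose class attribute equals "item-0"
filtered_li = tree.xpath('//li[@class="item-0"]')
print(len(filtered_li))   # Output: 2

# Combine several conditions with and / or
multi_attr_li = tree.xpath('//li[contains(@class, "item") and position()<3]')
# contains(): substring match on the attribute value, especially handy for multi-valued class attributes such as "item active"
# position(): the node's 1-based position among the li children of the same parent

# Get attribute values directly: append /@attribute to the path
all_hrefs = tree.xpath('//li/a/@href')
print(all_hrefs)   # ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']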
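
Because contains() is a plain substring test, it can match more than intended. The sketch below uses a made-up three-item list to compare an exact match, contains(), and a commonly used whole-token idiom built from concat() and normalize-space(); all three are standard XPath 1.0 functions, and the sample data is hypothetical.

```python
from lxml import etree

html = """
<ul>
  <li class="item">plain</li>
  <li class="item active">active</li>
  <li class="subitem">sub</li>
</ul>
"""
tree = etree.fromstring(html, etree.HTMLParser())

# Exact match: the whole attribute value must equal "item".
exact = tree.xpath('//li[@class="item"]/text()')

# Substring match: also catches "subitem".
loose = tree.xpath('//li[contains(@class, "item")]/text()')

# Whole-token match: pad the value with spaces and look for " item ".
token = tree.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " item ")]/text()'
)

print(exact)  # ['plain']
print(loose)  # ['plain', 'active', 'sub']
print(token)  # ['plain', 'active']
```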

Extract text content

  • /text(): returns only the text nodes that are direct children of the selected element
  • //text(): recursively returns the text of all descendant nodes (may contain newlines and extra whitespace)

# Get the direct text of every a tag
a_texts = tree.xpath('//li/a/text()')
print(a_texts)   # ['first item', 'second item', 'third item', 'fourth item', 'fifth item']

Tip: //text() may return more than you expect. Prefer /text() for precise extraction, and clean up the result with Python's .strip() when necessary.
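
The difference between the two forms is easiest to see on an element with mixed content. The following sketch uses a made-up paragraph; normalize-space() is a standard XPath function shown here as one possible alternative to cleaning up in Python.

```python
from lxml import etree

html = "<div><p>  Hello <b>world</b>!\n</p></div>"
tree = etree.fromstring(html, etree.HTMLParser())

# /text() returns only the text nodes directly under <p>; "world" sits inside <b>.
direct = tree.xpath('//p/text()')

# //text() also descends into <b>, returning every text node in order.
everything = tree.xpath('//p//text()')

# Clean up in Python: join the pieces, then strip the outer whitespace.
joined = "".join(everything).strip()

# Or let XPath do it: normalize-space() trims and collapses whitespace.
normalized = tree.xpath('normalize-space(//p)')

print(direct)
print(everything)
print(joined)
print(normalized)
```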


5. Advanced techniques: Positional selection and axis navigation

Select nodes by position

XPath has several very practical built-in position functions. Note that positions start from 1, not from 0 as is common in programming.

# The first li
first_li = tree.xpath('//li[1]/a/text()')
print(first_li)   # ['first item']

# The last li
last_li = tree.xpath('//li[last()]/a/text()')
print(last_li)    # ['fifth item']

# The third li from the end
last_3rd = tree.xpath('//li[last()-2]/a/text()')
print(last_3rd)   # ['third item']
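
One detail is easy to miss: a position predicate is evaluated separately for each parent, so //li[1] means "the first li of every list", not "the first li in the document". The sketch below, which assumes a made-up page with two lists, shows the difference and how parentheses change the scope.

```python
from lxml import etree

html = """
<div>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
"""
tree = etree.fromstring(html, etree.HTMLParser())

# The predicate applies within each <ul>, so one <li> per list is returned.
per_parent = tree.xpath('//li[1]/text()')

# Parentheses collect all <li> nodes first, then pick by position in the whole set.
overall_first = tree.xpath('(//li)[1]/text()')
overall_last = tree.xpath('(//li)[last()]/text()')

print(per_parent)     # ['a1', 'b1']
print(overall_first)  # ['a1']
print(overall_last)   # ['b2']
```

With the single list used in this article the two forms give the same result, which is why the examples above work as shown.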

Axis selection: navigation in every direction

An axis defines the relationship between the current node and the target nodes, which lets you move flexibly through complex document structures. The commonly used axes are listed below:

| Axis name | Function |
|---|---|
| ancestor:: | Selects all ancestors of the current node (parent, grandparent, and so on up to the root) |
| attribute:: | Selects the attributes of the current node (abbreviated as @, so attribute::* is written @*) |
| child:: | Selects the direct children of the current node (this is the default axis, so child::li is usually written simply as li) |
| descendant:: | Selects all descendants of the current node (the related shorthand // expands to /descendant-or-self::node()/) |
| following-sibling:: | Selects all sibling nodes after the current node |
| preceding-sibling:: | Selects all sibling nodes before the current node |

For example, to get the <li> sibling that immediately follows the current <li>, you can use following-sibling::li[1]. This is very convenient when working with tabular or list data.
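
Here is a short sketch of these axes applied to a compact copy of the sample list from section 3, starting from the third item. The variable names are illustrative.

```python
from lxml import etree

html = """
<div class="container">
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""
tree = etree.fromstring(html, etree.HTMLParser())
current = tree.xpath('//li[@class="item-inactive"]')[0]

# [1] on a sibling axis means "nearest to the current node" in that direction.
next_li = current.xpath('following-sibling::li[1]/a/text()')
prev_li = current.xpath('preceding-sibling::li[1]/a/text()')

# Without a position predicate, every matching sibling is returned.
later = current.xpath('following-sibling::li/a/text()')

# ancestor::* walks up through <ul>, <div>, <body>, and <html>.
ancestor_tags = {el.tag for el in current.xpath('ancestor::*')}

# attribute::* is the long form of @*.
attrs = current.xpath('attribute::*')

print(next_li)  # ['fourth item']
print(prev_li)  # ['second item']
print(later)    # ['fourth item', 'fifth item']
print(ancestor_tags)
print(attrs)    # ['item-inactive']
```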


6. A small practical case: extracting an e-commerce product list

When you loop over several nodes of the same type, be sure to use relative paths (beginning with .//) inside the loop. An expression that begins with // searches from the document root again on every iteration, which is not only inefficient but also likely to return the wrong data because the loop's context is ignored.

from lxml import etree

product_html = """
<div class="product-list">
    <div class="product">
        <h3><a href="/product/1">Retro Bluetooth Speaker</a></h3>
        <span class="price">¥199.00</span>
        <span class="sales">2300 sold</span>
    </div>
    <div class="product">
        <h3><a href="/product/2">Mechanical Keyboard (Blue Switches)</a></h3>
        <span class="price">¥349.00</span>
        <span class="sales">8700 sold</span>
    </div>
</div>
"""

# Use the HTML parser, as in section 3, so that imperfect real-world markup is tolerated
tree = etree.fromstring(product_html, etree.HTMLParser())
products = []

for p_node in tree.xpath('//div[@class="product"]'):
    # Key point: the relative path .// starts the search at the current p_node
    name = p_node.xpath('.//h3/a/text()')[0]
    price = p_node.xpath('.//span[@class="price"]/text()')[0]
    sales = p_node.xpath('.//span[@class="sales"]/text()')[0]
    products.append({
        "name": name,
        "price": price,
        "sales": sales
    })

print(products)

Output result (reformatted here for readability):

[
    {'name': 'Retro Bluetooth Speaker', 'price': '¥199.00', 'sales': '2300 sold'},
    {'name': 'Mechanical Keyboard (Blue Switches)', 'price': '¥349.00', 'sales': '8700 sold'}
]
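
To make the relative-path warning concrete, the following sketch runs both forms of the expression inside the same loop. It assumes a trimmed-down, hypothetical product list that contains only prices.

```python
from lxml import etree

html = """
<div class="product-list">
  <div class="product"><span class="price">199.00</span></div>
  <div class="product"><span class="price">349.00</span></div>
</div>
"""
tree = etree.fromstring(html, etree.HTMLParser())

wrong, right = [], []
for node in tree.xpath('//div[@class="product"]'):
    # '//' restarts from the document root, so [0] is the first price every time.
    wrong.append(node.xpath('//span[@class="price"]/text()')[0])
    # './/' searches only inside the current product node.
    right.append(node.xpath('.//span[@class="price"]/text()')[0])

print(wrong)  # ['199.00', '199.00']
print(right)  # ['199.00', '349.00']
```

The mistake raises no error, which is why it is worth checking for whenever every record in a result set looks the same.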

7. Pitfalls and handy tools

Troubleshooting common problems

  • Garbled text (wrong encoding): after fetching the page with requests, set response.encoding = response.apparent_encoding so that the encoding is detected automatically, and then pass the decoded text to lxml for parsing.
  • Dynamic content is missing: XPath only sees the initial HTML returned by the server. If the data is generated dynamically by JavaScript, use a tool such as Selenium or Playwright, or analyze the Ajax interface directly.
  • An expression does not work in code: paste it into the browser console and test it live with $x('//your-expression') to locate the problem quickly.
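
For the encoding problem, the sketch below shows two ways to hand correctly decoded content to lxml when you hold the raw bytes and already know (or have detected) the encoding. It simulates a GBK response body with a made-up byte string rather than making a network request; the encoding name and sample text are assumptions for illustration.

```python
from lxml import etree

# Simulated raw response body: GBK-encoded bytes with no charset declaration.
raw_bytes = "<html><body><p>你好世界</p></body></html>".encode("gbk")

# Option 1: decode in Python first, then give lxml a str.
tree_from_text = etree.fromstring(raw_bytes.decode("gbk"), etree.HTMLParser())

# Option 2: pass the bytes and tell the parser which encoding to use.
tree_from_bytes = etree.fromstring(raw_bytes, etree.HTMLParser(encoding="gbk"))

print(tree_from_text.xpath("//p/text()"))
print(tree_from_bytes.xpath("//p/text()"))
```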

Tips for performance optimization

  • Locate nodes with specific tags plus attributes where possible, and avoid overusing the global wildcard //*.
  • Start from a node with a unique identifier (e.g. id="header") and search downward level by level to reduce global scans.
  • Inside loop bodies, always use relative paths beginning with .//.
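
As one possible way to combine these tips, the sketch below anchors the search at a node with a unique id and reuses a precompiled expression. etree.XPath is lxml's class for compiling an expression once and calling it on many nodes; it goes slightly beyond the tips above, and the sample markup and names are hypothetical.

```python
from lxml import etree

html = """
<div id="header"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="footer"><a href="/contact">Contact</a></div>
"""
tree = etree.fromstring(html, etree.HTMLParser())

# Compile the relative expression once; the result is callable on any node.
find_hrefs = etree.XPath('.//a/@href')

# Start from the uniquely identified node instead of scanning the whole document.
header = tree.xpath('//div[@id="header"]')[0]
header_links = find_hrefs(header)

# The same compiled expression can be reused on other nodes.
all_links = find_hrefs(tree)

print(header_links)  # ['/home', '/about']
print(all_links)     # ['/home', '/about', '/contact']
```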

8. Expand learning resources


This tutorial covers the core scenarios and most frequently used features of XPath in Python crawlers. Find a small website and give it a try! If you run into any problems in practice, leave a message in the comment area.