title: Web page parsing tool XPath description: Python crawler tutorial: Detailed explanation of XPath parsing technology
Python crawler tutorial: Detailed explanation of XPath parsing technology
When crawling web page data, is the ever-changing HTML structure a headache? Regular expressions are too fragile and break down when the indentation or order of tags changes slightly? Don't worry, XPath is definitely the "precision scalpel for web navigation" you are looking for. This article will help you quickly master the core usage using Python + lxml, making data extraction easy and efficient!
1. Quickly understand XPath
XPath (XML Path Language) is a path language for locating nodes in a document tree. It was originally designed for XML, but parses HTML equally well.
Why choose XPath?
- ✅ Path syntax, intuitive and easy to understand, just like operating a file system
- ✅ Rich built-in filtering functions, filter nodes as you like
- ✅ Supports all-round navigation up, down, and at the same level, leaving no corner untouched
- ✅ W3C official standard, all major mainstream languages are supported by mature parsing libraries
2. Environment preparation: Python + lxml
In the Python ecosystem, lxml is the most popular library for implementing XPath. Its bottom layer is based on C language, with fast parsing speed and strong fault tolerance - even if it encounters non-standard HTML, it can automatically repair it into a queryable tree structure.
Install dependencies
Verify successful installation
If the version number can be output normally after running (such aslxml XPath解析库版本:5.3.0), indicating that the environment is ready.
3. Step one: Turn HTML into a queryable "tree"
XPath works based on the document tree model, so we have to first convert the HTML string or local file into lxmlElementTreeobject.
Parse HTML string
Load from local HTML file
Obtained in two waystreeThe object usage is exactly the same, let’s focus on playing with it next.
4. Core node selection: path, attribute, text
gettreeobject, call its.xpath()Method and pass in an expression to query easily. The return result is usually a list of matching nodes, or a list of attribute values or text.
Quick check on basic path symbols
Attribute filtering and acquisition
XPath uses[@属性名="值"]This predicate (filter condition in square brackets) to match exactly:
Extract text content
/text(): Get only the plain text of direct child nodes//text(): Recursively get the text of all descendant nodes (may contain newlines and extra whitespace)
Tips: Use
//text()Please note that the extracted content may exceed expectations. It is recommended to use/text()Be precise and use Python when necessary.strip()Clean up.
5. Advanced techniques: Sequential selection and axis navigation
Select nodes in order
XPath has several built-in very practical position functions. It should be noted that position numbers start from 1, not 0 as is common in programming.
Axis selection: omnidirectional navigation
Axis defines the spatial relationship between the current node and the target node, allowing you to flexibly navigate complex document structures. The following is a list of commonly used axes:
For example: want to get the current<li>The brother following closely behind<li>, you can usefollowing-sibling::li[1]. This is very convenient when working with tabular or list data.
6. Practical small case: e-commerce product list extraction
When you need to loop through multiple nodes of the same type, be sure to use relative paths (starting with.//beginning)**. Otherwise, the entire document root node will be searched again every time, which is not only inefficient, but also prone to obtaining erroneous data due to confusing context.
Output result:
7. Pitfall avoidance guides and gadgets
FAQ Troubleshooting
- Encoding Garbled: After crawling the web page, use
response.encoding = response.apparent_encodingAutomatically detect the correct encoding and then pass it to lxml for parsing. - Dynamic content cannot be caught: XPath can only parse the initial HTML returned by the server. If the data is dynamically generated through JavaScript, you need to use tools such as Selenium and Playwright, or directly analyze the Ajax interface.
- Expression does not take effect in code: Paste the expression into the browser console and use
$x('//你的表达式')Real-time testing to quickly locate problems.
Tips for performance optimization
- Try to use specific tags + attributes for positioning to avoid abuse
//*Global wildcard. - From a node with a unique identifier (e.g.
id="header") Search down layer by layer to reduce global scanning. - Be sure to use ** to circulate the body
.//Relative path at the beginning**.
8. Expand learning resources
This tutorial covers the core scenarios and high-frequency usage of XPath in Python crawlers. Find a small website and give it a try! If you encounter any problems in practice, please leave a message in the comment area to communicate ~

