parsel tutorial: a parsing library for Python crawlers
If BeautifulSoup feels too slow when you write a Python crawler, and you don't want to set up the whole Scrapy framework just to get its efficient Selector, then the library introduced today is your "savior": parsel. It was extracted from Scrapy and keeps the powerful selector capabilities while staying minimalist and lightweight, so you can get started in a few minutes and write clear, efficient parsing code.
1. What is parsel?
parsel is not a new project. It is the core parsing library officially split out of Scrapy, where it originally lived as the framework's built-in Selector.
It has all the advantages of Scrapy Selector, but can be installed and used independently. It is very suitable for:
- Writing small crawlers
- Data cleaning
- Scenarios where you don’t want to introduce the entire Scrapy but need high-performance parsing
The core highlights of parsel:
- ✅ Dual-engine parsing: built on lxml, it supports XPath 1.0 and CSS selectors (CSS selectors are even converted to XPath automatically)
- ✅ Three extraction modes: pure CSS, pure XPath, or chained calls that mix CSS and XPath, whichever suits the page
- ✅ Minimalist and safe API: get(), getall(), and get(default=…) replace the cumbersome error handling of the past
- ✅ Built-in regex support: call .re() or .re_first() directly on the selector, with no need to extract the content first and apply a regex separately
- ✅ Seamless integration with Scrapy: practice code can be moved straight into a Scrapy parse method with zero changes
2. Installation
A single pip command is all you need: pip install parsel. The lxml dependency is installed automatically.
Once installed, you can use it in any Python script.
3. Get started quickly
We use a piece of mock HTML to demonstrate parsel's most common extraction methods.
3.1 Create Selector object
Once you have the HTML text, just wrap it with parsel.Selector(text=…). Under the hood, parsel relies on lxml to handle issues such as unclosed tags and encoding automatically.
3.2 Extract with CSS selector
If you have a front-end background, CSS selectors are the friendliest option: the syntax is almost the same as what you write in style sheets.
3.3 Extract using XPath
XPath is more flexible when you face complex nesting or need to locate sibling or ancestor nodes.
4. Core extraction method
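Whether you locate elements with CSS or XPath, the final extraction relies on three easy-to-remember methods: get() returns the first match (or None when nothing matches), getall() returns every match as a list, and get(default=…) returns a fallback value when nothing matches.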
4.1 Details of extracting text
- CSS requires the extended syntax: ::text gets the text, and ::attr(attribute name) gets an attribute value;
- XPath uses /text() to get the direct text of the current node, and //text() to get the text fragments of the current node and all its descendants (returns a list).
4.2 Several ways to extract attributes
parsel supports several styles of attribute extraction, so you can pick whichever matches your habits.
5. Built-in regex: extract complex content in one step
When you want to pull content in a specific format (a mobile phone number, a price, a serial number, and so on) out of text or attributes, you can call .re() or .re_first() directly on the selector instead of writing a pile of post-processing logic yourself.
6. Practical Tips
6.1 Chained calls: mixing CSS and XPath
First use concise CSS to locate the broad region, then use precise XPath to handle the details inside it. The code stays both readable and efficient.
6.2 Handle missing values safely
parsel's .get(default=…) and .re_first(default=…) let you say goodbye to try-except boilerplate: the crawler is not interrupted even when an element does not exist.
6.3 XPath axis operation: locating sibling/ancestor nodes
On complex pages we often need to find the adjacent sibling or the parent container of a node, and XPath axes make this easy.
7. Seamless migration to Scrapy
Scrapy's Selector is built on parsel, so the two interfaces are consistent. Parsing code written while practicing can be copied directly into a Scrapy spider.
8. Summary
parsel is a lightweight, fast, and powerful HTML/XML parsing library that is especially well suited to crawler development. After reading this article, you only need to remember a few key points to get started right away:
- Use CSS for simple element location, and XPath axes when necessary;
- Remember the three extraction methods get(), get(default=…), and getall();
- Make good use of the built-in regex methods .re() and .re_first() to extract complex content;
- Move your practice code directly into Scrapy to plug it into a production project.
If you want to know more details, check the official parsel documentation.

