parsel tutorial: a parsing library for Python crawlers

If BeautifulSoup feels too slow for your Python crawler, and you don't want to set up the whole Scrapy framework just to get its efficient Selector, the library introduced today is for you: parsel. It was extracted from Scrapy and keeps the same selector capabilities while staying small and lightweight, so you can get started in a few minutes and write clear, efficient parsing code.

1. What is parsel?

parsel is not a new project. It is the core parsing library that was officially split out of Scrapy, where it originally lived as the framework's Selector. It keeps all the advantages of the Scrapy Selector but can be installed and used on its own. It is a good fit for:

  • Writing a small crawler
  • Data cleaning
  • Scenarios where you don't want to pull in all of Scrapy but still need high-performance parsing

The core highlights of parsel:

  • Two query syntaxes on one engine: built on lxml, parsel supports XPath 1.0 and CSS selectors (CSS queries are translated to XPath automatically)
  • Three extraction modes: pure CSS, pure XPath, or chained calls that mix CSS and XPath, whichever suits the page
  • Minimal, safe API: get(), getall() and get(default=…) replace the cumbersome error handling of the past
  • Built-in regex support: no need to extract the content first and run a regex separately; call .re() or .re_first() directly on the selector
  • Seamless integration with Scrapy: practice code can be moved into a Scrapy parse method essentially unchanged

2. Installation

A single pip command is enough; pip installs parsel's lxml dependency automatically:

pip install parsel

Once installed, you can use it in any Python script.

3. Get started quickly

We will use a piece of mock HTML to demonstrate parsel's most commonly used extraction methods.

3.1 Create Selector object

After getting the HTML text, just wrap it with parsel.Selector(text=…). Under the hood, parsel relies on lxml to handle issues such as unclosed tags and encoding.

from parsel import Selector

demo_html = """
<html>
    <body>
        <h1 id="main-title">Parsel lightweight parsing in practice</h1>
        <ul id="content-list">
            <li class="content-item item-0">first item</li>
            <li class="content-item item-1"><a href="link2.html">second item</a></li>
            <li class="content-item item-0 active"><a href="link3.html"><span>third item</span></a></li>
            <li class="content-item item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="content-item item-0">fifth item</li>
        </ul>
        <div data-id="12345" class="dynamic-tag">
            <span>The hidden product ID is:</span>12345
        </div>
    </body>
</html>
"""

sel = Selector(text=demo_html)

3.2 Extract with CSS selector

If you have a front-end background, CSS selectors are the friendliest option: the syntax is almost the same as what you write in stylesheets.

# Select every li that has the active class
active_items = sel.css('.content-item.active')
# Take the first match (get() is parsel's convenience method for this)
print(active_items.get())

# Extract the title text; ::text is parsel's extension to CSS for selecting the text inside a node
title = sel.css('#main-title::text').get()
print("Title:", title)

3.3 Extract using XPath

XPath is more flexible when you face complex nesting or need to locate sibling or ancestor nodes.

# Get the href of the a tag under the li whose class contains both active and item-0
href_xpath = sel.xpath(
    '//ul[@id="content-list"]/li[contains(@class, "active") and contains(@class, "item-0")]/a/@href'
).get()
print("Link extracted with XPath:", href_xpath)

# Extract every text node under the list, however deeply nested (whitespace-only nodes included)
all_raw_text = sel.xpath('//ul[@id="content-list"]//text()').getall()
print("All raw text nodes:", all_raw_text)

4. Core extraction methods

Whether you locate nodes with CSS or with XPath, the final extraction step relies on the following three methods, which are easy to remember:

Method               Function
.get()               Returns the first match, or None if nothing matches
.get(default=xxx)    Returns the first match, or the given default value if nothing matches
.getall()            Returns all matches as a list, or an empty list [] if nothing matches

4.1 Details of extracting text

  • CSS uses extended syntax: ::text gets text, and ::attr(attribute-name) gets an attribute value;
  • XPath uses /text() to get the direct text of the current node, and //text() to get the text fragments of the current node and all of its descendants (with getall(), this comes back as a list).

# CSS: get the text of a direct child node
direct_text = sel.css('.dynamic-tag > span::text').get()
print("Direct child text:", direct_text)

# Get all inner text nodes (as a list), then join them into one clean string
raw_list = sel.css('.dynamic-tag ::text').getall()
clean_text = ''.join(raw_list).strip()
print("Joined clean text:", clean_text)

4.2 Different ways to extract attributes

parsel supports several styles of attribute extraction, so you can pick whichever you are used to.

# CSS style
data_id_css = sel.css('.dynamic-tag::attr(data-id)').get()

# XPath style
data_id_xpath = sel.xpath('//div[@class="dynamic-tag"]/@data-id').get()

# Mixed chain: locate with CSS first, then read the attribute with XPath
data_id_mix = sel.css('.dynamic-tag').xpath('./@data-id').get()

print("All three results agree:", data_id_css == data_id_xpath == data_id_mix)

5. Built-in regex: extract complex content in one step

When you want to pull content in a specific format, such as a phone number, a price, or a serial number, out of text or attributes, you can call .re() or .re_first() directly on the selector instead of writing separate post-processing logic.

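# Extract the digit after "item-" from every li (the regex runs on each li's serialized HTML)
all_item_nums = sel.css('.content-item').re(r'item-(\d)')
print("All list item numbers:", all_item_nums)

# Extract the digit after "item-" from the first active li; return a default when nothing matches
first_active_num = sel.css('.content-item.active').re_first(
    r'item-(\d)', default='no match'
)
print("Number of the first active item:", first_active_num)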

6. Practical Tips

6.1 Chained calls: mixing CSS and XPath

Use concise CSS to locate the broad region first, then use precise XPath for the details inside it. The resulting code is easy to read and stays efficient.

# Inside #content-list, find every li with the active class, then take the href of the a tag inside
hrefs = sel.css('#content-list li.active').xpath('.//a/@href').getall()
print("All links of active items:", hrefs)

6.2 Handle missing values safely

parsel's .get(default=…) and .re_first(default=…) spare you most try-except boilerplate: the crawler keeps running even when an element does not exist.

# Look up the href of an li that does not exist and fall back to a preset default link
missing_href = sel.css('.content-item.missing::attr(href)').get(
    default='https://example.com'
)
print("Safe default link:", missing_href)

6.3 XPath axes: locating sibling and ancestor nodes

On complex pages you often need to find a neighbouring sibling or an enclosing parent container, and XPath axes make this easy.

# All sibling li elements that follow the first li
following_li = sel.css('#content-list li:first-child').xpath(
    './following-sibling::li'
).getall()
print("Number of siblings after the first li:", len(following_li))

# The nearest li ancestor of the span (that is, the li whose a tag contains a span)
parent_li = sel.css('#content-list li span').xpath(
    './ancestor::li[1]'
).get()
print("The li that contains the span:", parent_li)

7. Seamless migration to Scrapy

parsel's interface matches Scrapy's Selector interface, so the parsing code you write while practicing can be copied straight into a Scrapy spider.

# parsel code written while practicing
# from parsel import Selector
# sel = Selector(text=demo_html)
# hrefs = sel.css('#content-list li.active').xpath('.//a/@href').getall()

# Inside a Scrapy Spider (response has its own selector, so there is no need to create one)
# parse_detail is another callback on the same spider, not shown here
def parse(self, response):
    hrefs = response.css('#content-list li.active').xpath('.//a/@href').getall()
    for href in hrefs:
        yield response.follow(href, callback=self.parse_detail)

8. Summary

parsel is a lightweight, fast, and powerful HTML/XML parsing library that is especially well suited to crawler development. After reading this article, you only need to remember a few key points to get started:

  1. Use CSS for simple positioning, and switch to XPath axes when necessary;
  2. Remember the three extraction methods: get(), get(default=…) and getall();
  3. Make good use of the built-in regex methods .re() and .re_first() to extract complex content;
  4. Move your practice code straight into Scrapy when it grows into a real project.

If you want more details, see the official parsel documentation.