Complete Guide to Scrapy Selectors: CSS and XPath Data Extraction Explained
📂 Stage: Stage 1 - Getting Started (core framework)
🔗 Related chapters: Spider in Practice · Item and Item Loader
Selector basic concept
In a Scrapy crawler, the Selector is responsible for locating the data you want in a web page. It supports two expression languages, CSS selectors and XPath expressions, which together cover almost every need for locating nodes in HTML/XML documents.
How does Selector work?
The whole process can be broken down into several steps:
- Receive HTML or XML content
- Build a DOM tree internally
- Use the selector you wrote to match nodes in the tree
- Get the matched content
- Return the result (text, attribute or node itself)
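The steps above can be sketched with Python's standard-library `xml.etree.ElementTree` as a stand-in. This is an assumption made so the example runs without Scrapy installed: Scrapy's own Selector is built on the parsel library and lxml, supports full XPath 1.0 plus CSS, and tolerates malformed HTML, none of which ElementTree does. The markup and the `extract_titles_and_links` name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup (ElementTree cannot parse sloppy HTML).
PAGE = """
<html>
  <body>
    <h1>Product list</h1>
    <div class="product"><a href="/p/1">Keyboard</a></div>
    <div class="product"><a href="/p/2">Mouse</a></div>
  </body>
</html>
"""


def extract_titles_and_links(markup):
    # Steps 1-2: receive the markup and build a tree from it.
    root = ET.fromstring(markup)
    # Step 3: match nodes with a path expression (ElementTree's XPath subset).
    heading = root.find(".//h1")
    anchors = root.findall(".//div[@class='product']/a")
    # Steps 4-5: pull text or attributes out of the matched nodes and return them.
    return {
        "title": heading.text if heading is not None else None,
        "names": [a.text for a in anchors],
        "links": [a.get("href") for a in anchors],
    }


print(extract_titles_and_links(PAGE))
```

In a spider the same three moves (parse, match, extract) are hidden behind `response.css(...)` or `response.xpath(...)` followed by `get()` or `getall()`.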
Get Selector from Response
In Scrapy, the Response object has already prepared the Selector for you, so there is no need to create a new one manually. There are two commonly used methods:
def relationship_with_response(response):
    """
    Demonstrate the relationship between Response and Selector.
    """
    # Option 1 (recommended): call response.css() or response.xpath() directly
    title1 = response.css('h1::text').get()
    title2 = response.xpath('//h1/text()').get()
    # Option 2: go through response.selector
    selector = response.selector
    title3 = selector.css('h1::text').get()
    title4 = selector.xpath('//h1/text()').get()
    # Both options give the same result
    return {
        'title': title1,
        'equivalent': title1 == title3 and title2 == title4
    }
Prefer the first form, response.css() and response.xpath(): it is more concise.
Detailed explanation of CSS selector
CSS selectors are familiar to front-end developers and just as easy to use in Scrapy. Their advantage is an intuitive, highly readable syntax.
Commonly used CSS selector types
1. Basic selector
2. Combination selector
3. Pseudo-class selector
Commonly used pseudo-classes can help you filter elements at specific positions:
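# A .product that is the first child of its parent
response.css('.product:first-child .name::text').get()
# A .product that is the last child of its parent
response.css('.product:last-child .name::text').get()
# A .product that is the nth child of its parent (here the 3rd)
response.css('.product:nth-child(3) .name::text').get()
# Exclude elements that carry a given class
response.css('.product:not(.sold-out) .name::text').getall()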
CSS selector practical example
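def css_selectors_practical(response):
    """
    Practical CSS selector usage.
    """
    results = {}
    # Text of several heading levels; ::text must be attached to every
    # selector in the group, otherwise it applies to the last one only
    results['titles'] = response.css('h1::text, h2::text, h3::text').getall()
    # href attribute of every link
    results['links'] = response.css('a::attr(href)').getall()
    # Only external links (href starting with http:// or https://)
    results['external_links'] = response.css(
        'a[href^="http://"]::attr(href), a[href^="https://"]::attr(href)'
    ).getall()
    # Name of the first product
    results['first_product'] = response.css(
        '.product:first-child .name::text'
    ).get()
    return results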
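In Scrapy's CSS dialect, a pseudo-element such as `::text` or `::attr(href)` belongs only to the selector it is written on, so every member of a comma-separated group needs its own suffix. A minimal sketch of a helper that builds such a group; `with_suffix` is a hypothetical name, not part of Scrapy.

```python
def with_suffix(selectors, suffix):
    """Append the same pseudo-element to every selector in a group."""
    return ", ".join(f"{sel.strip()}{suffix}" for sel in selectors)


headings = with_suffix(["h1", "h2", "h3"], "::text")
links = with_suffix(['a[href^="http://"]', 'a[href^="https://"]'], "::attr(href)")
print(headings)  # h1::text, h2::text, h3::text
print(links)
```

The resulting string can be passed straight to `response.css(...)`.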
Detailed explanation of XPath selector
XPath is a query language designed for XML. It is more powerful than CSS when processing complex HTML - especially when filtering based on text content, attributes or node relationships is required.
XPath basic syntax
A few common path examples:
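# All div elements in the document
//div
# A div that is the document's root element
/div
# div whose class attribute is exactly "product"
//div[@class="product"]
# Every div that is the first div child of its parent
# (use (//div)[1] for the first div in the whole document)
//div[1]
# Every div that is the last div child of its parent
//div[last()]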
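A positional predicate such as `[1]` is evaluated relative to each parent, so `//div[1]` can match several nodes. The sketch below shows this with the standard library's `xml.etree.ElementTree`, used here only as a stand-in that implements a subset of XPath with the same per-parent rule; the markup is made up. ElementTree has no `(//div)[1]` syntax, so `find()` plays the role of "first match in document order".

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<root>"
    "<section><div id='a'/><div id='b'/></section>"
    "<section><div id='c'/></section>"
    "</root>"
)

# [1] means "first div child of its parent", so each section contributes one.
first_per_parent = [d.get("id") for d in DOC.findall(".//div[1]")]
# [last()] is likewise evaluated per parent.
last_per_parent = [d.get("id") for d in DOC.findall(".//div[last()]")]
# find() returns only the first match in document order.
first_in_document = DOC.find(".//div").get("id")

print(first_per_parent, last_per_parent, first_in_document)
```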
XPath axis selection
Axes allow you to jump flexibly between nodes:
Actual usage:
# Parent node of each h2
response.xpath('//h2/parent::*')
# First sibling p that follows each h2
response.xpath('//h2/following-sibling::p[1]/text()')
XPath common functions
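# Attribute contains a substring
//div[contains(@class, "product")]
# Attribute starts with a given string
//a[starts-with(@href, "http://")]
# Trim and collapse whitespace
normalize-space(//h1/text())
# Element position
//div[position()=2]
# Count matching nodes
count(//div[@class="item"])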
XPath practical example
def xpath_selectors_practical(response):
    """
    Practical XPath selector usage.
    """
    results = {}
    # Merge several heading levels
    results['headings'] = response.xpath(
        '//h1/text() | //h2/text() | //h3/text()'
    ).getall()
    # All external links
    results['external_links'] = response.xpath(
        '//a[starts-with(@href, "http://") or starts-with(@href, "https://")]/@href'
    ).getall()
    # Text of the first paragraph that follows each h2
    results['next_p'] = response.xpath(
        '//h2/following-sibling::p[1]/text()'
    ).getall()
    # Names of products that carry a discount label
    results['discount_products'] = response.xpath(
        '//div[@class="product" and .//span[@class="discount"]]//h3/text()'
    ).getall()
    return results
get and getall methods
When extracting data, the two most common methods are get() and getall(), and they behave quite differently.
get() method
get() returns only the first matching result. If nothing matches, it returns None; you can also supply a default value.
def get_method_example(response):
    """
    get() method example.
    """
    # Text of the first h1; may be None
    title = response.css('h1::text').get()
    # First link; returns a custom string when nothing matches
    first_link = response.css('a::attr(href)').get(default='No link found')
    return {
        'title': title,
        'first_link': first_link
    }
getall() method
getall() returns a list of all matching results. If nothing matches, it returns an empty list [].
def getall_method_example(response):
    """
    getall() method example.
    """
    # href of every link
    all_links = response.css('a::attr(href)').getall()
    # Returns [] when nothing matches
    no_match = response.css('.nonexistent::text').getall()
    return {
        'all_links': all_links,
        'no_match': no_match
    }
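get() is usually somewhat faster than getall(). The selector query itself runs in full before either method is called, so get() does not stop the search early; it only serializes the first match instead of all of them. The gap grows with the number of matches, which makes it most visible in large documents.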
import time


def performance_test(response):
    """
    Performance comparison: get() vs getall().
    """
    # Time get(); perf_counter() is better suited to short intervals than time()
    start = time.perf_counter()
    for _ in range(1000):
        response.css('div.item h2::text').get()
    get_time = time.perf_counter() - start
    # Time getall()
    start = time.perf_counter()
    for _ in range(1000):
        response.css('div.item h2::text').getall()
    getall_time = time.perf_counter() - start
    return {
        'get_time': get_time,
        'getall_time': getall_time,
        'get_is_faster': get_time < getall_time
    }
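A single timed loop is easily skewed by system noise. A minimal sketch of a more repeatable measurement, using the standard library's `timeit.repeat` and keeping the best of several runs; `best_time` and `compare` are hypothetical helper names, and the callables below are stand-in workloads rather than real selector calls.

```python
import timeit


def best_time(func, number=1000, repeat=5):
    """Best total time (seconds) for `number` calls of func, over `repeat` runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat))


def compare(func_a, func_b, number=1000, repeat=5):
    """Time two zero-argument callables under identical settings."""
    time_a = best_time(func_a, number, repeat)
    time_b = best_time(func_b, number, repeat)
    return {"a": time_a, "b": time_b, "a_is_faster": time_a < time_b}


# Stand-in workloads; in a spider these could be
# lambda: response.css('div.item h2::text').get() and the getall() variant.
report = compare(lambda: sorted(range(50))[0], lambda: sorted(range(50)))
print(report)
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, because slower runs mostly reflect interference from other processes.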
Advanced selector skills
Mixing CSS and XPath
You can first use CSS to quickly lock the area, and then use XPath for detailed extraction, and vice versa.
def mixed_selectors(response):
    """
    Mixing CSS and XPath.
    """
    results = {}
    # Locate product blocks with CSS, then take child elements with a relative XPath
    results['css_then_xpath'] = response.css('.product').xpath('./h2/text()').getall()
    # Locate with XPath, then take the price with CSS
    results['xpath_then_css'] = response.xpath('//div[@class="item"]').css('.price::text').getall()
    return results
Nested selector processing list
For tabular data, the most robust approach is to select all "rows" first, and then extract fields in each "row".
def nested_extraction(response):
    """
    Nested extraction example.
    """
    products = []
    # Select every product container
    for product in response.css('.product'):
        item = {
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get(),
            'url': product.css('a::attr(href)').get()
        }
        products.append(item)
    return products
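The `href` values collected this way are often relative. Inside a spider, `response.urljoin(href)` resolves them against the page URL; the sketch below shows the same resolution with the standard library's `urllib.parse.urljoin` so that it runs on its own. The base URL, the item data, and the `absolutize` helper are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com/catalog/page-2.html"  # hypothetical page URL

items = [
    {"name": "Keyboard", "url": "/p/1"},
    {"name": "Mouse", "url": "detail/2.html"},
    {"name": "Monitor", "url": "https://cdn.example.com/p/3"},
    {"name": "No link", "url": None},
]


def absolutize(item, base_url):
    """Return a copy of item whose 'url' is absolute (None stays None)."""
    url = item.get("url")
    return {**item, "url": urljoin(base_url, url) if url else None}


resolved = [absolutize(item, BASE_URL) for item in items]
for item in resolved:
    print(item["name"], item["url"])
```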
Robust selector strategy
The structure of web pages changes frequently, and preparing multiple alternative selectors can improve the survival rate of crawlers.
def robust_extraction(response):
    """
    Robust extraction strategy.
    """
    def extract_with_fallbacks(selectors):
        """Try each selector in turn and return the first non-empty result."""
        for sel in selectors:
            try:
                if sel.startswith('xpath:'):
                    result = response.xpath(sel[len('xpath:'):]).get()
                else:
                    result = response.css(sel).get()
                if result and result.strip():
                    return result.strip()
            except Exception:
                # An invalid selector should not abort the whole fallback chain
                continue
        return None
    # Several possible paths for the title
    title_selectors = [
        'h1.product-title::text',
        'h1::text',
        'title::text',
        'xpath://h1/text()',
        'xpath://title/text()'
    ]
    return {
        'title': extract_with_fallbacks(title_selectors)
    }
Performance Optimization Strategy
The more specific the selector, the faster it will be
Broad selectors (such as *) force the engine to scan a large number of nodes. Narrow the match with a tag name, class name, or other qualifiers wherever possible.
def optimized_selectors(response):
    """
    Optimizing selector performance.
    """
    # Recommended: a specific selector
    good = response.css('div.product.highlighted .name::text').get()
    # Not recommended: too broad
    # bad = response.css('*[class*="product"] *::text').get()
    return good
Batch processing reduces DOM traversal
First select the parent container at once, and then extract the subfields inside it to avoid repeatedly scanning the entire tree.
def batch_processing(response):
    """
    Batch processing example.
    """
    # Efficient: select all product divs once, then extract inside the loop
    products = response.css('div.product')
    data = []
    for product in products:
        data.append({
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get()
        })
    # Less efficient: two separate scans of the whole document
    # names = response.css('div.product .name::text').getall()
    # prices = response.css('div.product .price::text').getall()
    return data
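Per-container extraction also keeps fields aligned. If names and prices are collected as two separate page-wide lists, one product without a price shifts every later pair. A minimal sketch, with plain lists standing in for the two `getall()` results (the data is made up):

```python
# Page-wide lists, as two separate getall() calls would return them.
# The second product ("Mouse") has no price element on the page.
names = ["Keyboard", "Mouse", "Monitor"]
prices = ["49.00", "199.00"]

# zip() pairs by position, so "Mouse" silently gets the monitor's price
# and "Monitor" is dropped.
misaligned = [{"name": n, "price": p} for n, p in zip(names, prices)]

# Per-product records, as a loop over product containers would build them:
# a missing field becomes None for that product only.
per_product = [
    {"name": "Keyboard", "price": "49.00"},
    {"name": "Mouse", "price": None},
    {"name": "Monitor", "price": "199.00"},
]

print(misaligned)
print(per_product)
```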
Practical application scenario
Below is a configurable extractor that automatically tries multiple alternative selectors and handles the specification parameters individually.
class ProductExtractor:
    """E-commerce product data extractor."""

    def __init__(self):
        # Several candidate selectors per field, tried in order
        self.selectors = {
            'name': ['h1.product-title::text', 'h1::text'],
            'price': ['.price::text', '.current-price::text'],
            'description': ['.product-detail::text', '.description::text'],
            'images': ['.gallery img::attr(src)', '.product-image::attr(src)']
        }

    def extract(self, response):
        """Extract product data."""
        product = {}
        for field, sels in self.selectors.items():
            product[field] = self._extract_with_fallbacks(response, sels)
        # Special case: extract the specification table
        product['specs'] = self._extract_specs(response)
        return product

    def _extract_with_fallbacks(self, response, selectors):
        """Try each selector; return the first non-empty match (a single value)."""
        for sel in selectors:
            result = response.css(sel).get()
            if result and result.strip():
                return result.strip()
        return None

    def _extract_specs(self, response):
        """Extract the specification table as a dict."""
        specs = {}
        for row in response.css('.spec-table tr'):
            key = row.css('td:first-child::text').get()
            value = row.css('td:last-child::text').get()
            if key and value:
                specs[key.strip()] = value.strip()
        return specs
News article content extraction
News pages have diverse structures, and the "container + fallback" strategy can be adapted to most sites.
class NewsExtractor:
    """News article content extractor."""

    def extract(self, response):
        """Extract the news content."""
        return {
            'title': self._extract_title(response),
            'author': self._extract_author(response),
            'date': self._extract_date(response),
            'content': self._extract_content(response),
            'tags': self._extract_tags(response)
        }

    def _extract_title(self, response):
        selectors = ['h1.article-title::text', 'h1::text', 'title::text']
        return self._try_selectors(response, selectors)

    def _extract_author(self, response):
        selectors = ['.author::text', '.byline::text']
        author = self._try_selectors(response, selectors)
        if author:
            # Remove common byline prefixes (Chinese and English)
            author = author.replace('作者:', '').replace('By ', '')
        return author

    def _extract_date(self, response):
        selectors = ['.publish-date::text', 'time::text']
        return self._try_selectors(response, selectors)

    def _extract_content(self, response):
        """Extract the article body."""
        containers = ['.article-content', '.content', '.post-content']
        for container in containers:
            paragraphs = response.css(f'{container} p::text').getall()
            # Accept a container only if it holds more than two paragraphs
            if len(paragraphs) > 2:
                return '\n'.join(p.strip() for p in paragraphs if p.strip())
        return None

    def _extract_tags(self, response):
        """Extract tags and remove duplicates (order is not preserved)."""
        tags = response.css('.tag::text, .tags a::text').getall()
        return list(set(tag.strip() for tag in tags if tag.strip()))

    def _try_selectors(self, response, selectors):
        """Try each selector in turn; return the first non-empty result."""
        for sel in selectors:
            result = response.css(sel).get()
            if result and result.strip():
                return result.strip()
        return None
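`set()` removes duplicates but does not keep the order in which tags appeared on the page. If order matters, `dict.fromkeys` is one alternative, since dictionaries preserve insertion order. A minimal sketch; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(values):
    """Strip each value, drop empties, and de-duplicate keeping first-seen order."""
    cleaned = (v.strip() for v in values if v and v.strip())
    return list(dict.fromkeys(cleaned))


raw_tags = [" python ", "scrapy", "python", "", "  ", "xpath", "scrapy"]
print(unique_in_order(raw_tags))  # ['python', 'scrapy', 'xpath']
```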
Frequently Asked Questions and Solutions
Problem 1: The extracted text has a lot of whitespace before and after it
- Method 1: Use Python's strip()
- Method 2: Use XPath's normalize-space()
# Clean up in Python
text = response.css('h1::text').get()
clean_text = text.strip() if text else ''
# Do it in one step with XPath
clean_text = response.xpath('normalize-space(//h1/text())').get()
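`strip()` only trims the ends, while `normalize-space()` also collapses runs of whitespace inside the string. A rough Python-side equivalent, as a sketch: `normalize_ws` is a hypothetical helper, and `str.split()` treats a few more characters as whitespace (for example the non-breaking space) than XPath does.

```python
def normalize_ws(text, default=""):
    """Trim the ends and collapse internal whitespace runs to single spaces."""
    if text is None:
        return default
    return " ".join(text.split())


print(repr(normalize_ws("  Wireless \n   Keyboard\t(black)  ")))
print(repr(normalize_ws(None)))
```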
Problem 2: You need to keep the HTML tags in the content
In some scenarios, what you need to get is not plain text, but a fragment containing HTML.
# Get the element's HTML, including its own opening and closing tags
html_content = response.css('.content').get()
# Conversely, get plain text only (all tags removed)
plain_text = response.xpath('string(//div[@class="content"])').get()
Problem 3: The selector cannot find the element at all
There are three likely causes:
- Dynamic rendering: the data is loaded by JavaScript after the initial page load. Check the raw page source to confirm.
- Wrong selector: debug it interactively in the Scrapy shell.
- Page structure has changed: review and maintain the selector list regularly.
For dynamic content, be prepared to switch to a browser-based tool (such as Selenium or Playwright) or to call the underlying data interface directly.
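When the data arrives through a JSON endpoint, the response body can be parsed directly instead of being rendered. A minimal sketch using the standard library's `json` module; the payload shape and the field names (`items`, `name`, `price`) are assumptions for illustration, since every site's interface differs.

```python
import json

# Hypothetical body of an XHR/fetch response found in the browser's network tab.
payload = '{"items": [{"name": "Keyboard", "price": 49.0}, {"name": "Mouse"}]}'


def parse_products(body):
    """Turn a JSON response body into a list of product dicts."""
    data = json.loads(body)
    return [
        {"name": entry.get("name"), "price": entry.get("price")}
        for entry in data.get("items", [])
    ]


print(parse_products(payload))
```

In a spider callback, the same function could be given `response.text`.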
Best Practice Suggestions
Principles for writing selectors
- The more specific the better: avoid * or overly broad paths.
- Prepare several alternative selectors for key fields, to guard against site redesigns.
- Prefer CSS for simple cases and XPath for complex logic.
- Manage selectors centrally: keep them in a configuration file or at the top of the class so they are easy to change.
Error handling strategy
- Always assume that an extraction result could be None.
- Make good use of the default parameter of get().
- Record how many extractions fail and where, to make troubleshooting easier.
- Implement a fallback scheme: when the primary selector fails, switch to a backup selector automatically.
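One way to combine the fallback scheme and the failure record is to count, per field, how often every selector came back empty. A minimal sketch under stated assumptions: `FallbackExtractor` is a hypothetical class, and `query` is any callable that maps a selector string to a string or `None` (in a spider, something like `lambda sel: response.css(sel).get()`).

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


class FallbackExtractor:
    """Try selectors in order and keep per-field failure counts."""

    def __init__(self):
        self.failures = Counter()

    def extract(self, field, selectors, query, default=None):
        for index, selector in enumerate(selectors):
            value = query(selector)
            if value and value.strip():
                if index > 0:
                    # The primary selector failed; a backup selector was used.
                    logger.info("%s: fell back to %r", field, selector)
                return value.strip()
        # Every selector failed: record it and return the default.
        self.failures[field] += 1
        logger.warning("%s: no selector matched", field)
        return default


# Stand-in for a page: maps selector strings to extracted text.
fake_page = {"h1::text": "  Wireless Keyboard  "}
extractor = FallbackExtractor()
title = extractor.extract("title", ["h1.product-title::text", "h1::text"], fake_page.get)
price = extractor.extract("price", [".price::text"], fake_page.get, default="N/A")
print(title, price, dict(extractor.failures))
```

Keeping the counts on the extractor lets a spider report them once at shutdown, rather than scanning logs for individual misses.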
💡 Key takeaway: Selectors are the foundation of data extraction in Scrapy. Once you are comfortable with both CSS and XPath, your crawler becomes more accurate and stable, and well-prepared fallback selectors greatly reduce the maintenance workload.