Complete Guide to Selector - Detailed Explanation of CSS and XPath Data Extraction Technology


Basic concepts of Selector

In a Scrapy crawler, the Selector is responsible for locating the data you want in a web page. It supports two expression languages, CSS selectors and XPath expressions, which together cover almost every need for locating nodes in HTML and XML documents.

How does Selector work?

The whole process can be broken down into several steps:

  1. Receive HTML or XML content
  2. Build a DOM tree internally
  3. Use the selector you wrote to match nodes in the tree
  4. Get the matched content
  5. Return the result (text, attribute or node itself)

Get Selector from Response

In Scrapy, the Response object has already prepared the Selector for you, so there is no need to create a new one manually. There are two commonly used methods:

def relationship_with_response(response):
    """
    Show how Response relates to Selector.
    """
    # Option 1 (recommended): call response.css() or response.xpath() directly
    title1 = response.css('h1::text').get()
    title2 = response.xpath('//h1/text()').get()

    # Option 2: go through response.selector
    selector = response.selector
    title3 = selector.css('h1::text').get()
    title4 = selector.xpath('//h1/text()').get()

    # Both options return the same results
    return {
        'title': title1,
        'equivalent': title1 == title3
    }

Prefer response.css() and response.xpath(): they are shortcuts to the same selector and are more concise to write.

Detailed explanation of CSS selector

CSS selectors are familiar to front-end developers and just as easy to use in Scrapy. Their advantages are intuitive syntax and high readability.

Commonly used CSS selector types

1. Basic selector

| Selector type | Example | Description |
| --- | --- | --- |
| Element selector | div | Selects all <div> elements |
| Class selector | .product | Elements whose class list includes product |
| Multiple-class selector | .hot.item | Elements that have both the hot and item classes |
| ID selector | #main | The element whose id is main |
| Attribute selector | [href] | All elements that have an href attribute |
| Attribute value match | [href="https://example.com"] | href is exactly equal to the given address |
| Attribute substring match | [class*="product"] | The class attribute value contains the substring product |

2. Combination selector

| Relationship | Syntax | Description |
| --- | --- | --- |
| Descendant | div p | p elements at any depth inside a div |
| Child | div > p | p elements that are direct children of a div |
| Adjacent sibling | h1 + p | The p immediately following an h1 |
| General sibling | h1 ~ p | All sibling p elements that come after an h1 |

3. Pseudo-class selector

Commonly used pseudo-classes can help you filter elements at specific positions:

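# .product elements that are the first child of their parent
response.css('.product:first-child .name::text').get()
# .product elements that are the last child of their parent
response.css('.product:last-child .name::text').get()
# .product elements that are the n-th child of their parent (here the 3rd)
response.css('.product:nth-child(3) .name::text').get()
# Exclude elements that have a given class
response.css('.product:not(.sold-out) .name::text').getall()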

CSS selector practical example

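def css_selectors_practical(response):
    """
    Practical CSS selector usage.
    """
    results = {}

    # Extract the text of several heading levels.
    # ::text must be attached to every selector in the group,
    # otherwise h1 and h2 would be returned as whole elements.
    results['titles'] = response.css('h1::text, h2::text, h3::text').getall()

    # Extract the href attribute of every link
    results['links'] = response.css('a::attr(href)').getall()

    # Extract only external links (starting with http:// or https://);
    # ::attr(href) is likewise repeated for each selector in the group
    results['external_links'] = response.css(
        'a[href^="http://"]::attr(href), a[href^="https://"]::attr(href)'
    ).getall()

    # Extract the name of the first product
    results['first_product'] = response.css(
        '.product:first-child .name::text'
    ).get()

    return results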

Detailed explanation of XPath selector

XPath is a query language designed for XML. It is more powerful than CSS when processing complex HTML, especially when you need to filter by text content, attributes, or node relationships.

XPath basic syntax

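| Syntax | Meaning |
| --- | --- |
| / | Selects from the root node |
| // | Selects matching nodes anywhere in the document, at any depth |
| . | The current node |
| .. | The parent node |
| @ | Selects an attribute |
| text() | The text content of a node |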

A few common path examples:

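# All div elements in the document
//div

# A div that is the root element
/div

# div elements whose class attribute is exactly "product"
//div[@class="product"]

# Every div that is the first div child of its parent;
# use (//div)[1] for the first div in the whole document
//div[1]

# Every div that is the last div child of its parent;
# use (//div)[last()] for the last div in the whole document
//div[last()]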

XPath axis selection

Axes allow you to jump flexibly between nodes:

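| Axis | Selects |
| --- | --- |
| parent::* | The parent element |
| child::* | Child elements |
| ancestor::* | All ancestors |
| descendant::* | All descendants |
| following-sibling::* | Sibling nodes that come after the current node |
| preceding-sibling::* | Sibling nodes that come before the current node |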

Actual usage:

# Parent element of each h2
response.xpath('//h2/parent::*')

# Text of the first sibling p that follows each h2
response.xpath('//h2/following-sibling::p[1]/text()')

XPath common functions

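# Attribute value contains a substring
//div[contains(@class, "product")]

# Attribute value starts with a given string
//a[starts-with(@href, "http://")]

# Trim leading/trailing whitespace and collapse internal runs
normalize-space(//h1/text())

# Position of the element among the matching siblings
//div[position()=2]

# Count the matching nodes
count(//div[@class="item"])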

XPath practical example

def xpath_selectors_practical(response):
    """
    Practical XPath selector usage.
    """
    results = {}

    # Combine several heading levels with the union operator
    results['headings'] = response.xpath(
        '//h1/text() | //h2/text() | //h3/text()'
    ).getall()

    # Extract all external links
    results['external_links'] = response.xpath(
        '//a[starts-with(@href, "http://") or starts-with(@href, "https://")]/@href'
    ).getall()

    # Extract the text of the first sibling paragraph after each h2
    results['next_p'] = response.xpath(
        '//h2/following-sibling::p[1]/text()'
    ).getall()

    # Extract the names of products that carry a discount label
    results['discount_products'] = response.xpath(
        '//div[@class="product" and .//span[@class="discount"]]//h3/text()'
    ).getall()

    return results

get and getall methods

When extracting data, the two most common methods are get() and getall(), and they behave quite differently.

get() method

get() returns only the first matching result. If nothing matches, it returns None; you can also supply a default value to return instead.

def get_method_example(response):
    """
    get() method example.
    """
    # Text of the first h1; may be None
    title = response.css('h1::text').get()

    # First link; returns a custom string when nothing matches
    first_link = response.css('a::attr(href)').get(default='No link found')

    return {
        'title': title,
        'first_link': first_link
    }

getall() method

getall() returns a list of all matching results. If nothing is found, it returns an empty list [].

def getall_method_example(response):
    """
    getall() method example.
    """
    # The href of every link
    all_links = response.css('a::attr(href)').getall()

    # Returns [] when nothing matches
    no_match = response.css('.nonexistent::text').getall()

    return {
        'all_links': all_links,
        'no_match': no_match
    }

Performance comparison

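The selector query itself does the same work in both cases: response.css() and response.xpath() find every matching node before get() or getall() is called. get() then extracts only the first match while getall() extracts all of them, so get() is usually slightly faster, and the gap grows with the number of matches.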

import time

def performance_test(response):
    """
    Compare the running time of get() and getall().
    """
    # Time get(); perf_counter() is the clock intended for measuring intervals
    start = time.perf_counter()
    for _ in range(1000):
        response.css('div.item h2::text').get()
    get_time = time.perf_counter() - start

    # Time getall()
    start = time.perf_counter()
    for _ in range(1000):
        response.css('div.item h2::text').getall()
    getall_time = time.perf_counter() - start

    return {
        'get_time': get_time,
        'getall_time': getall_time,
        'get_is_faster': get_time < getall_time
    }

Advanced selector skills

Mixing CSS and XPath

You can first use CSS to narrow down to a region and then use XPath for fine-grained extraction, or the other way around.

def mixed_selectors(response):
    """
    Combine CSS and XPath.
    """
    results = {}

    # Locate the product blocks with CSS, then use a relative XPath for the child element
    results['css_then_xpath'] = response.css('.product').xpath('./h2/text()').getall()

    # Locate with XPath, then use CSS to get the price
    results['xpath_then_css'] = response.xpath('//div[@class="item"]').css('.price::text').getall()

    return results

Processing lists with nested selectors

For list-like or tabular data, the most robust approach is to select every "row" first and then extract the fields within each row.

def nested_extraction(response):
    """
    Nested extraction example.
    """
    products = []

    # Select every product container
    for product in response.css('.product'):
        # Each query below runs relative to the current container
        item = {
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get(),
            'url': product.css('a::attr(href)').get()
        }
        products.append(item)

    return products

Robust selector strategy

Page structures change often. Preparing several fallback selectors for each field helps a crawler keep working after a redesign.

def robust_extraction(response):
    """
    Extraction with fallback selectors.
    """
    def extract_with_fallbacks(selectors):
        """Try each selector in turn and return the first non-empty result."""
        for sel in selectors:
            try:
                if sel.startswith('xpath:'):
                    result = response.xpath(sel[len('xpath:'):]).get()
                else:
                    result = response.css(sel).get()

                if result and result.strip():
                    return result.strip()
            except Exception:
                # Skip a selector that fails (for example, invalid syntax)
                # without swallowing KeyboardInterrupt or SystemExit
                continue
        return None

    # Candidate selectors for the title, from most to least specific
    title_selectors = [
        'h1.product-title::text',
        'h1::text',
        'title::text',
        'xpath://h1/text()',
        'xpath://title/text()'
    ]

    return {
        'title': extract_with_fallbacks(title_selectors)
    }

Performance Optimization Strategy

The more specific the selector, the faster it will be

Broad selectors (such as *) force the engine to scan a large number of nodes. Qualify the selector with a tag name, class name, or other constraints wherever possible.

def optimized_selectors(response):
    """
    Optimize selector performance.
    """
    # Recommended: a specific selector
    good = response.css('div.product.highlighted .name::text').get()

    # Not recommended: too broad
    # bad = response.css('*[class*="product"] *::text').get()

    return good

Batch processing reduces DOM traversal

Select all parent containers in one query, then extract the sub-fields relative to each container. Each sub-query searches only that container's subtree instead of the whole tree, and the fields of one record stay together.

def batch_processing(response):
    """
    Batch processing example.
    """
    # Select every product div once, then extract fields relative to each one
    products = response.css('div.product')
    data = []
    for product in products:
        data.append({
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get()
        })

    # Less robust: two separate page-wide queries return parallel lists,
    # which fall out of step if any product lacks a name or a price
    # names = response.css('div.product .name::text').getall()
    # prices = response.css('div.product .price::text').getall()

    return data

Practical application scenario

E-commerce product data extraction

Below is a configurable extractor that tries several alternative selectors for each field and handles the specification table separately.

class ProductExtractor:
    """E-commerce product data extractor."""

    def __init__(self):
        # Several candidate selectors for each field
        self.selectors = {
            'name': ['h1.product-title::text', 'h1::text'],
            'price': ['.price::text', '.current-price::text'],
            'description': ['.product-detail::text', '.description::text'],
            'images': ['.gallery img::attr(src)', '.product-image::attr(src)']
        }

    def extract(self, response):
        """Extract product data."""
        product = {}

        for field, sels in self.selectors.items():
            product[field] = self._extract_with_fallbacks(response, sels)

        # Special case: extract the specification table
        product['specs'] = self._extract_specs(response)

        return product

    def _extract_with_fallbacks(self, response, selectors):
        """Try each selector in turn and return the first non-empty match.

        The result is a single string, so 'images' yields only the first image URL.
        """
        for sel in selectors:
            result = response.css(sel).get()
            if result and result.strip():
                return result.strip()
        return None

    def _extract_specs(self, response):
        """Extract the specification table as a dict."""
        specs = {}
        for row in response.css('.spec-table tr'):
            key = row.css('td:first-child::text').get()
            value = row.css('td:last-child::text').get()
            if key and value:
                specs[key.strip()] = value.strip()
        return specs

News article content extraction

News pages have diverse structures, and the "container + fallback" strategy can be adapted to most sites.

class NewsExtractor:
    """News article content extractor."""

    def extract(self, response):
        """Extract the news content."""
        return {
            'title': self._extract_title(response),
            'author': self._extract_author(response),
            'date': self._extract_date(response),
            'content': self._extract_content(response),
            'tags': self._extract_tags(response)
        }

    def _extract_title(self, response):
        selectors = ['h1.article-title::text', 'h1::text', 'title::text']
        return self._try_selectors(response, selectors)

    def _extract_author(self, response):
        selectors = ['.author::text', '.byline::text']
        author = self._try_selectors(response, selectors)
        if author:
            # Strip a "作者:" ("Author:") or "By " label
            author = author.replace('作者:', '').replace('By ', '')
        return author

    def _extract_date(self, response):
        selectors = ['.publish-date::text', 'time::text']
        return self._try_selectors(response, selectors)

    def _extract_content(self, response):
        """Extract the body text."""
        containers = ['.article-content', '.content', '.post-content']
        for container in containers:
            paragraphs = response.css(f'{container} p::text').getall()
            # Accept a container only if it holds more than two text fragments
            if len(paragraphs) > 2:
                return '\n'.join(p.strip() for p in paragraphs if p.strip())
        return None

    def _extract_tags(self, response):
        """Extract tags and remove duplicates, keeping first-seen order."""
        tags = response.css('.tag::text, .tags a::text').getall()
        return list(dict.fromkeys(tag.strip() for tag in tags if tag.strip()))

    def _try_selectors(self, response, selectors):
        """Try each selector in turn and return the first non-empty result."""
        for sel in selectors:
            result = response.css(sel).get()
            if result and result.strip():
                return result.strip()
        return None

Frequently Asked Questions and Solutions

Problem 1: The extracted text has a lot of leading and trailing whitespace

  • Method 1: Use Python's strip()
  • Method 2: Use XPath's normalize-space()

# Clean up in Python
text = response.css('h1::text').get()
clean_text = text.strip() if text else ''

# Do it in one step with XPath
clean_text = response.xpath('normalize-space(//h1/text())').get()
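
Note that strip() only trims the two ends; line breaks and runs of spaces inside the text survive. The following pure-Python sketch collapses internal whitespace in a way similar to normalize-space(). The helper name normalize_space is hypothetical and not part of Scrapy.

```python
def normalize_space(text):
    """Trim both ends and collapse internal whitespace runs to one space."""
    if text is None:
        return ""
    # split() with no arguments splits on any run of whitespace
    return " ".join(text.split())

raw = "\n   Scrapy \t  Selector\n  Guide  "

stripped = raw.strip()            # inner tab and newline are still there
collapsed = normalize_space(raw)  # single spaces throughout

# Also handy for joining several text nodes returned by getall()
joined = normalize_space(" ".join(["  Price:", "\n 9.99 "]))

print(repr(stripped), repr(collapsed), repr(joined))
```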

Problem 2: The HTML markup needs to be preserved

In some scenarios you need a fragment that still contains its HTML tags rather than plain text.

# Get the element's HTML, including its own opening and closing tags
html_content = response.css('.content').get()

# Conversely, get only the plain text (all tags removed)
plain_text = response.xpath('string(//div[@class="content"])').get()

Problem 3: The selector cannot find the element at all

There are three possible reasons:

  1. Dynamic rendering: the data is loaded by JavaScript after the page loads. View the raw page source to confirm whether it is present.
  2. Wrong selector: debug it interactively in the Scrapy shell.
  3. The page structure has changed: review and maintain the selector list regularly.

For dynamically rendered content, be prepared to switch to a browser-based tool (such as Selenium or Playwright) or to request the underlying data API directly.
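
Both checks can be sketched with the standard library alone. In the example below, the raw HTML, the JSON body, and its {"items": [...]} shape are all invented for illustration; a real site's API will look different. In a spider, response.text would supply both strings.

```python
import json

def appears_in_source(html, expected_text):
    """Return True when the text is already present in the raw HTML."""
    return expected_text in html

def parse_items(body):
    """Read name and price from a JSON body shaped like {"items": [...]}."""
    payload = json.loads(body)
    return [
        {"name": item.get("name"), "price": item.get("price")}
        for item in payload.get("items", [])
    ]

# Hypothetical page that only ships an empty container and a script
raw_html = '<div id="app"></div><script src="/bundle.js"></script>'

# Hypothetical response of the data API that the script would call
api_body = '{"items": [{"name": "Pen", "price": 1.5}, {"name": "Ink"}]}'

# The product name is absent from the source, so it is rendered by JavaScript
rendered_by_js = not appears_in_source(raw_html, "Pen")
items = parse_items(api_body)

print(rendered_by_js, items)
```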

Best Practice Suggestions

Principles for writing selectors

  • The more specific the better: avoid * and overly broad paths.
  • Prepare several alternative selectors for key fields, in case the site is redesigned.
  • Prefer CSS for simple cases and use XPath for complex logic.
  • Manage selectors centrally: keep them in a configuration file or at the top of the class so they are easy to update.

Error handling strategy

  • Always assume that an extraction result could be None.
  • Make good use of the default parameter of get().
  • Record how many extractions fail and where, to make troubleshooting easier.
  • Implement a fallback scheme: when the primary selector fails, switch automatically to a backup selector.
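
The points above can be combined in one small helper. This is a minimal sketch, not a Scrapy API: extract_field and the failures counter are hypothetical names, and each candidate is a zero-argument callable so the sketch runs without a response object. In a spider a candidate would look like lambda: response.css('h1::text').get().

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)

# Number of failed extractions per field (hypothetical module-level counter)
failures = Counter()

def extract_field(field, candidates, url="", default=None):
    """Return the first non-empty candidate result; count and log a miss."""
    for candidate in candidates:
        value = candidate()
        if value and value.strip():
            return value.strip()
    # Every candidate failed: record which field failed and on which page
    failures[field] += 1
    logger.warning("no selector matched field %r on %s", field, url)
    return default

# Stand-in candidates: the primary "selector" fails, the backup succeeds
title = extract_field(
    "title", [lambda: None, lambda: "  Hello  "], url="https://example.com/a"
)

# Every candidate is empty, so the default is returned and the miss is counted
price = extract_field(
    "price", [lambda: None, lambda: "   "], url="https://example.com/a"
)

print(title, price, dict(failures))
```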

💡 Core Points: Selector is the foundation of Scrapy data extraction. If you are familiar with CSS and XPath and use them well, your crawler will become accurate and stable. Reasonable preparation of alternative plans can greatly reduce the maintenance workload.

