Spider Practical Guide - Request, Response, yield in-depth analysis and crawler logic implementation

📂 Stage: Stage 1 - Fledgling (core framework) 🔗 Related chapters: Creating Your First Project · Selectors

Spider is the soul of the Scrapy framework. It determines where the crawler goes, how it parses, and what data it extracts. If you compare Scrapy to a car, Spider is the driver - controlling the steering wheel, accelerator and brakes. This article will provide an in-depth analysis of several core elements of Spider to help you write a more efficient and robust crawler.


Spider infrastructure

Every spider is a subclass of scrapy.Spider and comes with a complete life cycle. Below is a minimal template that includes all the essentials.

Basic Spider template

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                     # Spider name, used when launching the crawl
    allowed_domains = ['example.com']    # Domains the spider may crawl
    start_urls = ['http://example.com']  # List of start URLs

    def parse(self, response):
        # Step 1: extract the data we want
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'price': item.css('span.price::text').get(),
                'url': response.url
            }

        # Step 2: find the "next page" link and keep crawling
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Key points:

  • name: The spider's unique ID. Pass it on the command line (scrapy crawl example) to tell Scrapy which spider to run.
  • allowed_domains: Once set, Scrapy automatically filters out requests to other domains, which keeps the crawler from wandering off.
  • start_urls: The first pages requested at startup. Scrapy wraps each URL in a Request, sends it, and hands the response to the parse method by default.
  • custom_settings: A class-level dictionary for per-spider settings such as concurrency and download delay. It takes priority over the project-wide settings.

Request in detail

A Request object represents an HTTP request. We need it at startup, and we also use it to "place orders" whenever new links are discovered during the crawl.

Basic usage of Request

import scrapy

class RequestSpider(scrapy.Spider):
    name = 'request_example'

    def start_requests(self):
        # 1. Plain GET request
        yield scrapy.Request(
            url='http://example.com/api/data',
            callback=self.parse_get_data,
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Authorization': 'Bearer token123'
            },
            meta={'request_type': 'api_call'}   # meta is like a sticky note carried along: it passes custom data forward
        )

        # 2. POST request
        yield scrapy.Request(
            url='http://example.com/login',
            callback=self.parse_login,
            method='POST',
            headers={'Content-Type': 'application/json'},
            body='{"username": "user", "password": "pass"}'
        )

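    def parse_get_data(self, response):
        # Decide how to read the body based on Content-Type
        content_type = response.headers.get('Content-Type', b'').decode()
        data = response.json() if 'application/json' in content_type else response.text
        yield {'api_data': data}

    def parse_login(self, response):
        # Callback for the POST request above; here it only logs the status
        self.logger.info('Login response status: %s', response.status)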

Core parameters at a glance:

  • url: The address of the page to request (required).
  • callback: The function that receives the response once the download completes. If omitted, parse is called by default.
  • method: 'GET' (default) or 'POST'; 'PUT', 'DELETE', and other HTTP methods work as well.
  • headers: Custom request headers, such as User-Agent, Cookie, or a token.
  • body: The request body, typically used with POST; accepts a string or bytes.
  • meta: A dictionary that travels from the request to its response. It is well suited to passing context such as page numbers or category names between callbacks.
  • dont_filter=True: Tells Scrapy to skip the duplicate check for this request. Often used for endpoints that must be requested repeatedly.

Response in detail

When a request has been downloaded successfully, Scrapy builds a Response object and hands it to our callback. This object is where most of the parsing work happens.

Response common properties and methods

def parse_response_attributes(self, response):
    # ---- Basic attributes ----
    url = response.url                # Final URL of this response (may differ after redirects)
    status = response.status          # HTTP status code; 200 means success
    headers = response.headers        # Response headers, dict-like
    text = response.text              # Response body as a string
    meta = response.meta              # meta carried over from the originating Request

    # ---- Data extraction ----
    titles = response.css('h1::text').getall()        # CSS selector
    links = response.xpath('//a/@href').getall()      # XPath selector

    # ---- Link handling ----
    next_page = response.follow('next.html', callback=self.parse_next)  # Builds a Request; no manual URL joining
    absolute_url = response.urljoin('/relative/path')  # Manually turn a relative path into an absolute URL

    # ---- JSON responses ----
    content_type = response.headers.get('Content-Type', b'').decode()
    if 'application/json' in content_type:
        json_data = response.json()

Tips:

  • A selector call returns a SelectorList; call .get() (first match) or .getall() (all matches) to obtain the actual text.
  • When a login endpoint or an API returns JSON, call response.json() first to convert the body into Python objects (a dict or list), which are much easier to work with.
  • response.urljoin() and response.follow() both resolve relative paths against the current page URL. The latter is more concise and is the recommended first choice.
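
To see how relative links resolve, the sketch below uses urllib.parse.urljoin from the standard library, which follows the same resolution rules that response.urljoin() builds on. The base URL is a made-up page address standing in for response.url, and the page is assumed to have no <base> tag.

```python
from urllib.parse import urljoin

# Hypothetical address of the page being parsed (stands in for response.url)
base = "http://example.com/catalog/page-1.html"

resolved = {
    # Sibling file: replaces the last path segment
    "page-2.html": urljoin(base, "page-2.html"),
    # Leading slash: resolved from the site root
    "/relative/path": urljoin(base, "/relative/path"),
    # Parent directory reference
    "../about.html": urljoin(base, "../about.html"),
    # Already absolute: returned unchanged
    "http://other.example.org/x": urljoin(base, "http://other.example.org/x"),
}

for href, absolute in resolved.items():
    print(f"{href!r:32} -> {absolute}")
```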

Tips for using yield

In Scrapy, yield is an essential keyword. It turns the callback into a generator, so data and requests are emitted one at a time instead of being returned all at once. This keeps memory usage low and fits Scrapy's asynchronous processing.

Three major uses of yield

def parse_with_yield_examples(self, response):
    # ① Yield data (a dict or an Item object)
    yield {
        'title': response.css('h1::text').get(),
        'url': response.url
    }

    # ② Yield a new Request (built by hand)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield scrapy.Request(
            url=response.urljoin(next_page),
            callback=self.parse,
            meta={'page': response.meta.get('page', 1) + 1}
        )

    # ③ Yield the result of response.follow (a more concise Request)
    product_links = response.css('a.product-link::attr(href)').getall()
    for link in product_links:
        yield response.follow(link, callback=self.parse_product)

**Why not use return?** If a page holds many products, return [dict1, dict2, ...] builds one large list up front and holds it all in memory, whereas yield hands over items one at a time, so data flows on to the pipelines or output files like running water. This is one of the reasons Scrapy performs well.
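
The laziness of generators can be shown without Scrapy at all. In the plain-Python sketch below, parse_items is a hypothetical stand-in for a parse callback, and the built list only exists to record how many items were actually constructed. It illustrates generator mechanics, not Scrapy's internal scheduling.

```python
from itertools import islice

built = []  # records every item that has actually been constructed


def parse_items(count):
    """Stand-in for a parse callback: yields one dict at a time."""
    for i in range(count):
        built.append(i)
        yield {"id": i}


# Ask for a million items, but consume only the first three
first_three = list(islice(parse_items(1_000_000), 3))

print(first_three)
print(len(built))
```

Only three items are ever built, because the generator does no work until the consumer asks for the next value. A version that returned a list would have constructed all one million dicts before handing anything over.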

Crawler parsing logic

Real websites often have a hierarchical structure: category page → list page → detail page. Scrapy implements multi-level parsing as a chain of callbacks, with meta carrying context from one level to the next.

Multi-level parsing example: Category → Product List → Product Details

class MultiLevelParsingSpider(scrapy.Spider):
    name = 'multi_level_parsing'

    def start_requests(self):
        yield scrapy.Request('http://example.com/categories',
                             callback=self.parse_categories)

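    def parse_categories(self, response):
        """Level 1: parse the category page and follow every category link."""
        for category in response.css('a.category-link'):
            yield response.follow(
                category,  # follow() accepts an <a> selector and reads its href
                callback=self.parse_products,
                # Use the link text as the category name; the page-level h1
                # would be identical for every category
                meta={'category': category.css('::text').get()}
            )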

    def parse_products(self, response):
        """Level 2: parse the product list page, follow product links, and handle pagination."""
        category = response.meta['category']

        # Product links
        for product_link in response.css('a.product-link::attr(href)').getall():
            yield response.follow(
                product_link,
                callback=self.parse_product_detail,
                meta={'category': category}
            )

        # Pagination: remember to keep passing meta along
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page,
                                  callback=self.parse_products,
                                  meta={'category': category})

    def parse_product_detail(self, response):
        """Level 3: detail page; extract the full record and yield the final item."""
        yield {
            'category': response.meta['category'],
            'url': response.url,
            'name': response.css('h1.product-title::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get()
        }

Important reminder:

  • meta is not forwarded automatically along the callback chain. Whatever you set on a Request is available as response.meta in that request's callback, so every new Request must be given the values it needs.
  • When handling pagination, pass the current meta on to the next-page request unchanged; otherwise the context is lost and the data becomes "disconnected".

Scrapy provides a variety of ways to help us handle new links. Choosing the right method can make the code much cleaner.

response.follow() vs scrapy.Request

def comparison_of_link_following(self, response):
    # ✅ Method 1: response.follow(), resolves relative URLs automatically (recommended)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

    # ✅ Method 2: scrapy.Request, needs a manual urljoin call but is more explicit
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield scrapy.Request(url=response.urljoin(next_page),
                             callback=self.parse)

  • response.follow() resolves relative URLs internally, so the code is cleaner. Treat it as your default choice.
  • scrapy.Request requires an absolute URL and suits cases where you build the URL yourself. Both accept options such as errback, meta, and dont_filter.

When a page is complex and you want to extract links in bulk according to rules, LinkExtractor is the tool for the job.

import scrapy
from scrapy.linkextractors import LinkExtractor

class AdvancedLinkExtractionSpider(scrapy.Spider):
    name = 'advanced_links'

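    def parse(self, response):
        # Define an extractor: keep category links only, skip admin pages
        link_extractor = LinkExtractor(
            allow=r'/category/\w+',        # Regex pattern for URLs to keep
            deny=r'/admin/',               # Pattern to ignore
            restrict_css='.main-content'   # Only look for links inside the main content area
        )

        links = link_extractor.extract_links(response)
        for link in links:
            yield response.follow(link.url, callback=self.parse_category)

    def parse_category(self, response):
        # Minimal callback so the spider runs as written
        yield {'url': response.url}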

Data extraction techniques

The extracted raw data often contains impurities - spaces, garbled characters, meaningless default values... Cleaning them up before output can make subsequent data processing more efficient.

Data cleaning and verification

import re

import scrapy

class DataCleaningSpider(scrapy.Spider):
    name = 'data_cleaning'

    def parse(self, response):
        for product in response.css('div.product'):
            raw_title = product.css('.title::text').get()
            raw_price = product.css('.price::text').get()

            # Clean the data
            cleaned_title = self.clean_text(raw_title)
            cleaned_price = self.clean_price(raw_price)

            # Only yield records that pass validation
            if self.validate_data(cleaned_title, cleaned_price):
                yield {
                    'title': cleaned_title,
                    'price': cleaned_price,
                    'url': response.url
                }

    def clean_text(self, text):
        """Strip surrounding whitespace and collapse runs of whitespace."""
        if not text:
            return ''
        return re.sub(r'\s+', ' ', text.strip())

    def clean_price(self, price_str):
        """Extract the number from a string such as '¥99.00元'."""
        if not price_str:
            return None
        numbers = re.findall(r'\d+\.?\d*', price_str.replace(',', ''))
        return float(numbers[0]) if numbers else None

    def validate_data(self, title, price):
        if not title or len(title) < 2:
            return False
        if price is None or price <= 0:
            return False
        return True

Best Practices:

  • Encapsulate the cleaning logic in separate methods to keep the parse function short and clear.
  • Make it a habit to clean first and validate second. Only emit data that passes validation; discard dirty data or log it.
  • For regex-based cleaning, re.sub and re.findall cover most cases and cope well with messy web data.
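
If discarded records should be logged rather than dropped silently, the validator can report why a record failed instead of returning a bare True or False. The function below is one possible variation, not part of the spider above. Its name and messages are made up, and it applies the same rules as validate_data.

```python
def validation_errors(title, price):
    """Return a list of reasons the record is invalid (empty list means valid)."""
    errors = []
    if not title or len(title) < 2:
        errors.append("title missing or too short")
    if price is None:
        errors.append("price missing")
    elif price <= 0:
        errors.append("price must be positive")
    return errors


# A clean record, a record with nothing usable, and a record with a bad price
print(validation_errors("Laptop", 99.0))
print(validation_errors("", None))
print(validation_errors("Laptop", 0))
```

Inside a spider, a non-empty result could be passed to self.logger.warning together with response.url, so rejected records remain traceable.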

Error handling and exception catching

Network requests never succeed 100% of the time: 404s, 500s, and connection timeouts can occur at any moment. Scrapy provides errback so that failed requests can be handled gracefully.

Put an "airbag" on requests

class ErrorHandlingSpider(scrapy.Spider):
    name = 'error_handling'

    def start_requests(self):
        urls = [
            'http://example.com/valid-page',
            'http://example.com/404-page',
        ]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                errback=self.handle_error      # Failures are routed here
            )

    def parse(self, response):
        if response.status == 200:
            yield {
                'url': response.url,
                'status': response.status,
                'title': response.css('title::text').get(),
                'success': True
            }

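    def handle_error(self, failure):
        """The error handler receives a Failure object."""
        request = failure.request
        # Log the failure so the problem can be traced later
        self.logger.error(f"Request failed: {request.url}, error: {failure.value}")

        # Optionally emit a failure record for later investigation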
        yield {
            'url': request.url,
            'error': str(failure.value),
            'success': False
        }

**What else can you do?**

  • For transient failures, errback can yield the same Request again (note the dont_filter=True setting, so the duplicate filter does not drop it) to implement a simple retry.
  • Combine this with middleware (such as RetryMiddleware) to handle retry strategies more systematically.
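
A manual retry needs an upper bound, otherwise a permanently broken URL is requested forever. The helper below is a minimal sketch of that bookkeeping in plain Python. The function name, the manual_retry_count key, and the default limit of 3 are all assumptions chosen for illustration, not Scrapy conventions.

```python
def next_retry_meta(meta, max_retries=3):
    """Return the meta dict for the next attempt, or None once the limit is reached."""
    attempts = meta.get("manual_retry_count", 0)
    if attempts >= max_retries:
        return None  # give up: the caller should log and move on
    # Copy the existing context and bump the counter; the input is not mutated
    return {**meta, "manual_retry_count": attempts + 1}


# First failure: counter starts at 1 and the context is preserved
print(next_retry_meta({"category": "Books"}))
# Limit reached: no further retry
print(next_retry_meta({"category": "Books", "manual_retry_count": 3}))
```

Inside handle_error, this could be used by computing new_meta = next_retry_meta(failure.request.meta) and, when the result is not None, yielding failure.request.replace(meta=new_meta, dont_filter=True).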

💡 Core points: Spider is the heart of Scrapy. Master Request, Response, and yield, and you hold the lifeblood of the crawler. Add data cleaning, error handling, and multi-level parsing, and you can write a stable, maintainable crawler system. Remember: if the code is well written, the boss gets off work early!
