Scrapy e-commerce full-site crawling practical project - building a multi-level e-commerce data capture system from scratch

📂 Stage: Stage 4 - Practical Exercise (Project Development) 🔗 Related chapters: Spider in Practice · Selectors · Pipelines in Practice · Anti-Crawling in Practice

Project background and goals

E-commerce data is the “fuel” for many business decisions: price comparison, product selection, and competitor monitoring all depend on it. In practice, though, e-commerce websites tend to have complex structures: categories are nested several levels deep, product lists run across many pages, and a range of anti-crawling measures quietly block automated clients. Whole-site crawling means letting the crawler browse like a customer, from the home page all the way down to the details pages, and extracting every valuable piece of information along the way.

This project will lead you to build a production-level data capture system. The core capabilities include:

  • Multi-level category handling: Automatically identify and recursively crawl the website's category tree, from first-level categories down to hundreds of sub-categories.
  • Deep pagination: Follow next-page links or generate URL parameters so that no page is missed and the crawl cannot loop forever.
  • Accurate data extraction: Use CSS selectors and regular expressions to extract key fields such as title, price, specifications, and images from details pages with complex structure.
  • Anti-crawler countermeasures: Combine random User‑Agent, request header camouflage, proxy rotation, and delay strategies to make the crawler behave more like a human visitor.
  • Data quality assurance: Built-in deduplication, cleaning, and validity checks, so the output can be used directly for analysis.

Project technology stack

  • Crawler Framework: Scrapy 2.x, a proven asynchronous crawler engine.
  • Data processing: Pydantic models or Scrapy Item classes define the structured data, and ItemLoader cleans and converts values as they are loaded.
  • Storage: MongoDB, MySQL, or plain CSV/JSON export files.
  • Proxy management: A self-built proxy middleware, simple and controllable; extensions such as Scrapy‑Proxy‑Pool can be plugged in instead.

Once we are ready, we start by analyzing the structure of the target website.

E-commerce website architecture analysis

No matter how fancy a target website looks, its internal structure usually conforms to a classic four-layer model:

  1. Homepage: Top navigation bar, side category menu. Entrances to all subcategories are concentrated here.
  2. Category page: Shows the next level of sub-categories, the product list under this category, or both.
  3. List page: Consists of dozens of product cards, usually with a paginator and sorting/filtering controls.
  4. Details page: All information about a single product, including title, price, attributes, pictures, reviews, etc.

The crawling process unfolds along this structure:

Home page → collect category links → open category page → extract product links → open details page → scrape data

The real difficulty is often not in the logic itself, but in:

  • The depth of the category tree is not fixed (it may be 2 levels or 4), so a general recursive parser is needed.
  • Pagination is implemented in different ways: as a ?page=2 query parameter, as a /page/3/ path segment, or even as a "click to load more" button.
  • The website quietly detects crawlers: common anti-crawling measures include checking the User-Agent, limiting request frequency, and blocking IPs.
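The first difficulty, a category tree of unknown depth, can be sketched without Scrapy as a recursive walk. Everything here is hypothetical: `SITE` stands in for the real website (each URL maps to the sub-category links found on that page), and `walk_categories` and `max_depth` are illustrative names, not part of the project code.

```python
# Hypothetical stand-in for a site: each URL maps to the sub-category URLs
# found on that page. An empty list means the page is a leaf (a product list).
SITE = {
    '/': ['/electronics', '/books'],
    '/electronics': ['/electronics/phones', '/electronics/laptops'],
    '/electronics/phones': [],
    '/electronics/laptops': ['/electronics/laptops/gaming'],
    '/electronics/laptops/gaming': [],
    '/books': ['/'],  # links back to the home page, to show cycle protection
}


def walk_categories(url, get_children, seen=None, depth=0, max_depth=6):
    """Yield (leaf_url, depth) for every leaf category reachable from url."""
    seen = set() if seen is None else seen
    # Skip pages that were already visited and stop at a safety depth limit.
    if url in seen or depth > max_depth:
        return
    seen.add(url)
    children = get_children(url)
    if not children:
        # No sub-categories: this page is a product list.
        yield url, depth
        return
    for child in children:
        yield from walk_categories(child, get_children, seen, depth + 1, max_depth)


leaves = list(walk_categories('/', SITE.__getitem__))
print(leaves)
```

In a spider the same idea would probably take the form of one callback that re-schedules itself for every sub-category link and switches to list parsing when none are found; the visited set and the depth cap are what keep the recursion from looping.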

Before we start coding, we first plan the overall structure of the project.

Project Architecture Design

Clear directory division will make subsequent development smoother. The following structure is our recommended layout:

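ecommerce_spider/
├── ecommerce_spider/
│   ├── spiders/         # spider files (core logic)
│   ├── items/           # data model definitions
│   ├── pipelines/       # pipelines (cleaning, deduplication, storage)
│   ├── middlewares/     # middlewares (anti-crawling, proxies)
│   ├── utils/           # helpers (pagination handling, category parsing, etc.)
│   └── settings.py      # global settings
├── scrapy.cfg           # project configuration file
└── requirements.txt     # dependency list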

Core component relationships

The flowchart below summarizes how the components work in series:

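graph TD
    A[Start URL] --> B[Category parsing module]
    B --> C[List page parsing module]
    C --> D[Details page parsing module]
    D --> E[Data cleaning pipeline]
    E --> F[Data storage]
    G[Anti-crawling middleware] -.-> B
    H[Proxy middleware] -.-> C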

Note: The solid arrows indicate the flow of data/requests, and the dotted lines indicate that the middleware "intercepts" and processes the requests behind the scenes.

Next, we start with the data model and implement each module step by step.

Data model definition

In Scrapy, an Item class usually describes the data structure we want to crawl. Paired with ItemLoader, it can handle cleaning and format conversion while the fields are being filled.

# ecommerce_spider/items/product_item.py
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags
import re

def clean_price(value):
    """Clean a price string and convert it to a float, e.g. '¥1,299.00' -> 1299.0."""
    if value:
        cleaned = re.sub(r'[^\d.,]', '', value.strip())
        try:
            return float(cleaned.replace(',', ''))
        except ValueError:
            return None
    return None

class ProductItem(scrapy.Item):
    product_id = scrapy.Field(output_processor=TakeFirst())
    title = scrapy.Field(output_processor=TakeFirst())
    brand = scrapy.Field(output_processor=TakeFirst())
    # The category path joins multiple levels with ' > '
    category_path = scrapy.Field(output_processor=Join(' > '))
    current_price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst()
    )
    original_price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst()
    )
    main_image = scrapy.Field(output_processor=TakeFirst())
    gallery_images = scrapy.Field()              # there may be several images
    rating = scrapy.Field(output_processor=TakeFirst())
    review_count = scrapy.Field(output_processor=TakeFirst())
    specifications = scrapy.Field()              # product specs, usually a list of dicts
    url = scrapy.Field(output_processor=TakeFirst())
    crawled_at = scrapy.Field()

Interpretation:

  • The clean_price function turns strings such as "¥1,299.00" into floats that can be used in calculations.
  • input_processor and output_processor define the automatic processing rules applied as data enters and leaves a field. These rules only run when the item is populated through an ItemLoader; assigning item['field'] = value directly bypasses them.
  • For a field like category_path, multiple values added to the same field are automatically joined into a single string, ready for storage.
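To make the behaviour of clean_price concrete, the snippet below copies the function from product_item.py and runs it on a few inputs. The sample strings are made up for illustration; the last one shows a limitation worth knowing about.

```python
import re


def clean_price(value):
    """Same logic as clean_price in product_item.py."""
    if value:
        # Keep only digits, dots and commas, then drop the thousands separators.
        cleaned = re.sub(r'[^\d.,]', '', value.strip())
        try:
            return float(cleaned.replace(',', ''))
        except ValueError:
            return None
    return None


# Illustrative inputs (not taken from a real site)
samples = ['¥1,299.00', ' $ 45 ', 'Sold out', None, '1.299,00 €']
results = [clean_price(sample) for sample in samples]
for sample, result in zip(samples, results):
    print(repr(sample), '->', result)
```

Text without digits and missing values both come back as None, which the cleaning step can later treat as "no price". The European-style string '1.299,00 €' is parsed as 1.299 because the comma is always treated as a thousands separator, so sites that use a decimal comma would need a locale-aware variant.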

With the data blueprint in hand, we can start writing crawler logic.

Core function implementation

Main crawler framework

The main crawler is the "scheduling center" of the entire system. It is responsible for:

  1. Get all category links from the starting URL;
  2. For each category, request its product list page;
  3. Grab the product details link in the list page and hand it over to the details parsing method;
  4. Control pagination so that the crawl never loops forever.

# ecommerce_spider/spiders/ecommerce_spider.py
import re
from datetime import datetime
from urllib.parse import urljoin

import scrapy

from ecommerce_spider.items.product_item import ProductItem

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    custom_settings = {
        'DOWNLOAD_DELAY': 2,       # base download delay; tune per website
        'CONCURRENT_REQUESTS': 8,  # global number of concurrent requests
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
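        # Allow the start URLs (comma-separated) and the page cap to be passed with -a
        start_urls = kwargs.get('start_urls', 'https://example.com/categories')
        self.start_urls = start_urls.split(',') if isinstance(start_urls, str) else list(start_urls)
        self.max_pages = int(kwargs.get('max_pages', 100))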

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_categories)

    def parse_categories(self, response):
        """Parse a category page and extract the links of all sub-categories."""
        category_links = response.css('a.category-link::attr(href)').getall()
        for link in category_links:
            category_url = urljoin(response.url, link)
            yield scrapy.Request(category_url, callback=self.parse_product_list)

    def parse_product_list(self, response):
        """Parse a product list page: extract product links and find the next page."""
        product_links = response.css('a.product-link::attr(href)').getall()
        for link in product_links:
            product_url = urljoin(response.url, link)
            yield scrapy.Request(
                product_url,
                callback=self.parse_product_detail,
                # Optional context, such as the category name, can be passed along
                meta={'category': response.meta.get('category')}
            )

        # Handle pagination
        current_page = response.meta.get('page', 1)
        next_page = response.css('a.next::attr(href)').get()
        if next_page and current_page < self.max_pages:
            next_url = urljoin(response.url, next_page)
            yield scrapy.Request(
                next_url,
                callback=self.parse_product_list,
                meta={'page': current_page + 1}
            )

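    def parse_product_detail(self, response):
        """Parse a product details page and build a ProductItem."""
        item = ProductItem()
        item['product_id'] = self._extract_product_id(response)
        # default='' avoids an AttributeError when the page has no <h1>
        item['title'] = response.css('h1::text').get(default='').strip()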
        item['current_price'] = response.css('.price::text').get()
        item['main_image'] = urljoin(
            response.url,
            response.css('.main-image img::attr(src)').get()
        )
        item['url'] = response.url
        item['crawled_at'] = datetime.now().isoformat()
        yield item

    def _extract_product_id(self, response):
        """Parse the product ID from the URL (extend to page elements if needed)."""
        match = re.search(r'/(\d+)/?$', response.url)
        return match.group(1) if match else 'unknown'

💡 Tip: To pass context (such as the category name) from one step to the next, add the corresponding fields to the request's meta dictionary; the following callback can read them back from response.meta.

Multi-level classification analysis

E-commerce websites often use "breadcrumbs" to show the category level the user is currently at. We can extract the complete category path from these elements for later data analysis.

# ecommerce_spider/utils/category_parser.py
from urllib.parse import urljoin

class CategoryParser:
    def parse_hierarchy(self, response):
        """Parse the category hierarchy from the breadcrumb navigation."""
        breadcrumbs = response.css('.breadcrumb a')
        hierarchy = []
        for i, crumb in enumerate(breadcrumbs):
            # default='' keeps a breadcrumb without text or href from raising
            name = crumb.css('::text').get(default='').strip()
            url = urljoin(response.url, crumb.css('::attr(href)').get(default=''))
            hierarchy.append({
                'name': name,
                'url': url,
                'level': i      # 0 is the top-level category
            })
        return hierarchy

Usage: call this tool in parse_product_detail and store the returned path information in item['category_path'] (or in separate fields) to make later grouping and analysis easier.
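Since category_path is joined with ' > ', it needs plain names rather than the dictionaries that parse_hierarchy returns. A minimal sketch of that conversion follows; `hierarchy_to_names` and the sample breadcrumb data are hypothetical and only illustrate the shape of the data.

```python
def hierarchy_to_names(hierarchy):
    """Return the category names ordered from top level to deepest level."""
    ordered = sorted(hierarchy, key=lambda node: node['level'])
    # Skip breadcrumbs whose text was empty.
    return [node['name'] for node in ordered if node['name']]


# Sample output in the shape produced by CategoryParser.parse_hierarchy
hierarchy = [
    {'name': 'Home', 'url': 'https://example.com/', 'level': 0},
    {'name': 'Electronics', 'url': 'https://example.com/electronics', 'level': 1},
    {'name': 'Phones', 'url': 'https://example.com/electronics/phones', 'level': 2},
]

names = hierarchy_to_names(hierarchy)
path = ' > '.join(names)  # what Join(' > ') produces when an ItemLoader is used
print(path)
```

The list of names can be handed to an ItemLoader for the category_path field, or the joined string can be assigned directly when no loader is involved.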

Intelligent page turning processing

Pagination logic cannot rely solely on a "next page" button. Some websites have no explicit next-page link, so the page number must be parsed from the current URL and the next URL constructed from it. The following utility class handles both situations:

# ecommerce_spider/utils/pagination_handler.py
import re
from urllib.parse import urlparse, parse_qs, urlencode, urljoin

class PaginationHandler:
    def handle(self, response, max_pages):
        """Return the next page URL, or None when the page cap is reached."""
        current_page = self._extract_current_page(response.url)
        if current_page >= max_pages:
            return None

        # Approach 1: look for a rel="next" or .next link in the page
        next_link = response.css('a[rel="next"]::attr(href), a.next::attr(href)').get()
        if next_link:
            return urljoin(response.url, next_link)

        # Approach 2: build the next URL ourselves from the URL parameters
        return self._generate_next_url(response.url, current_page)

    def _extract_current_page(self, url):
        parsed = urlparse(url)
        query = parse_qs(parsed.query)
        if 'page' in query:
            return int(query['page'][0])
        # The page number may also live in the path, e.g. /page/3/
        match = re.search(r'/page/(\d+)', url)
        return int(match.group(1)) if match else 1

    def _generate_next_url(self, url, current_page):
        parsed = urlparse(url)
        query = parse_qs(parsed.query)
        query['page'] = [current_page + 1]
        new_query = urlencode(query, doseq=True)
        return parsed._replace(query=new_query).geturl()

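Integrating this tool into parse_product_list in place of the original simple .next lookup makes pagination considerably more robust. Note that the second approach always returns a URL, so max_pages is the only thing that stops the crawl on sites without a next-page link.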
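One gap in the handler is that _generate_next_url only rewrites the query string, even when the page number was found in the path. The standalone sketch below covers both URL styles with the standard library only; `next_page_url` and `PATH_PAGE` are hypothetical names, and the example.com URLs are placeholders.

```python
import re
from urllib.parse import urlparse, parse_qs, urlencode

# Matches a path segment such as /page/3
PATH_PAGE = re.compile(r'/page/(\d+)')


def next_page_url(url):
    """Return the URL of the following page for ?page=N or /page/N/ styles."""
    parsed = urlparse(url)
    match = PATH_PAGE.search(parsed.path)
    if match:
        # Path style: bump the number inside the path itself.
        next_number = int(match.group(1)) + 1
        new_path = PATH_PAGE.sub(f'/page/{next_number}', parsed.path, count=1)
        return parsed._replace(path=new_path).geturl()
    # Query style: a missing page parameter is treated as page 1.
    query = parse_qs(parsed.query)
    current = int(query.get('page', ['1'])[0])
    query['page'] = [str(current + 1)]
    return parsed._replace(query=urlencode(query, doseq=True)).geturl()


print(next_page_url('https://example.com/list?page=2&sort=price'))
print(next_page_url('https://example.com/list'))
print(next_page_url('https://example.com/c/shoes/page/3/'))
```

A generated URL is only a guess that the next page exists, so it is worth pairing this with a stop condition such as the page cap or an empty product list.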

Anti-crawling countermeasures

All of the logic above only matters if the requests actually succeed. We therefore need to equip our crawler with middleware.

Anti-crawler middleware

This middleware will randomly replace the User-Agent and some common request headers before each request is issued, while adding a random delay to make the crawler behave more like a normal browser.

# ecommerce_spider/middlewares/anti_crawler_middleware.py
import random
import time
from fake_useragent import UserAgent

class AntiCrawlerMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Random User-Agent
        request.headers['User-Agent'] = self.ua.random
        # Mimic a browser's Accept header
        request.headers['Accept'] = (
            'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        )
        request.headers['Accept-Language'] = 'zh-CN,zh;q=0.9,en;q=0.8'
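        # Small random delay. Caution: time.sleep blocks Scrapy's event loop and stalls
        # every request; in production prefer the DOWNLOAD_DELAY setting instead.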
        time.sleep(random.uniform(1, 3))
        return None

Proxy middleware

For websites that block IPs readily, rotating proxies is the most direct response. The following is a minimal implementation that picks a proxy at random for each request:

# ecommerce_spider/middlewares/proxy_middleware.py
import random

class ProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://proxy1:port',
            'http://proxy2:port',
        ]

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
        return None

Notes:

  • In real projects the proxy list may come from a file or a paid proxy API, and self.proxies can be refreshed dynamically.
  • The middleware sets request.meta['proxy'] before the request is sent, and Scrapy then downloads through that proxy automatically.
  • Both middlewares must be enabled through DOWNLOADER_MIDDLEWARES in settings.py.
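Enabling the two middlewares could look like the fragment below. The module paths follow the file names used in this project; the priority numbers are assumptions. The only constraint relied on here is that the proxy middleware runs before Scrapy's built-in HttpProxyMiddleware, whose default priority is 750.

```python
# ecommerce_spider/settings.py (fragment)
# Lower numbers run earlier for process_request.
DOWNLOADER_MIDDLEWARES = {
    # Assumed priority: runs before the built-in HttpProxyMiddleware (750)
    'ecommerce_spider.middlewares.proxy_middleware.ProxyMiddleware': 350,
    # Assumed priority: runs after the built-in default header handling
    'ecommerce_spider.middlewares.anti_crawler_middleware.AntiCrawlerMiddleware': 543,
}
```

Any value works as long as the relative order is kept; setting a middleware's value to None disables it.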

Data deduplication and cleaning

Duplicate records are inevitable during a crawl, and some requests will return blank or invalid information. Pipelines are the place to handle these problems centrally.

Data cleaning pipeline

Using the ItemLoader and ProductItem defined earlier, the pipeline can apply the cleaning rules to fields such as price and title, and discard "junk" items that are missing required fields.

# ecommerce_spider/pipelines/cleaning_pipeline.py
from itemloaders import ItemLoader
from scrapy.exceptions import DropItem

from ecommerce_spider.items.product_item import ProductItem


class CleaningPipeline:
    def process_item(self, item, spider):
        # Run title and price through an ItemLoader so that the processors
        # declared on ProductItem (clean_price, TakeFirst) are applied.
        loader = ItemLoader(item=ProductItem())
        loader.add_value('title', (item.get('title') or '').strip())
        loader.add_value('current_price', item.get('current_price'))
        cleaned = loader.load_item()

        # Validate that the required fields exist and are usable
        if not cleaned.get('title') or not cleaned.get('current_price'):
            # Raising DropItem is how a pipeline discards an item;
            # returning None does not drop it.
            raise DropItem(f"Missing required fields for item: {item.get('url')}")

        # Copy the cleaned values back so the other fields are preserved
        item['title'] = cleaned['title']
        item['current_price'] = cleaned['current_price']
        return item
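The validity check in the pipeline covers only the title and price. A slightly broader check could be sketched as a plain function, as below; the rules and the name `validate_product` are assumptions chosen for illustration, not requirements from the project.

```python
def validate_product(product):
    """Return a list of problems found; an empty list means the record looks valid."""
    problems = []

    title = (product.get('title') or '').strip()
    if not title:
        problems.append('missing title')

    price = product.get('current_price')
    if not isinstance(price, (int, float)) or price <= 0:
        problems.append('missing or non-positive current_price')

    # Assumed rule: a listed original price should not be below the current price.
    original = product.get('original_price')
    if isinstance(original, (int, float)) and isinstance(price, (int, float)):
        if original < price:
            problems.append('original_price is lower than current_price')

    url = product.get('url') or ''
    if not url.startswith(('http://', 'https://')):
        problems.append('url is not absolute')

    return problems


good = {'title': 'Phone', 'current_price': 1299.0,
        'original_price': 1499.0, 'url': 'https://example.com/p/1'}
bad = {'title': '  ', 'current_price': None, 'url': '/p/1'}
print(validate_product(good))
print(validate_product(bad))
```

Returning the list of problems instead of a boolean makes the log message for a dropped item more useful, since it can say exactly what was wrong.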

Deduplication pipeline

The simplest form of deduplication is to keep a set of already-processed identifiers in the pipeline. For products, product_id + url usually identifies a record uniquely.

# ecommerce_spider/pipelines/deduplication_pipeline.py
from scrapy.exceptions import DropItem


class DeduplicationPipeline:
    def __init__(self):
        # Identifiers of items already seen during this run (kept in memory)
        self.seen = set()

    def process_item(self, item, spider):
        identifier = f"{item.get('product_id')}_{item.get('url')}"
        if identifier in self.seen:
            # Raise DropItem to discard the duplicate; returning None does not drop it
            raise DropItem(f"Duplicate item: {item.get('title')}")
        self.seen.add(identifier)
        return item
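Neither pipeline runs until it is registered. A possible registration is sketched below: the module paths follow the file names used above, and the numbers 300 and 400 are arbitrary assumptions that only need to keep cleaning ahead of deduplication.

```python
# ecommerce_spider/settings.py (fragment)
# Pipelines run in ascending order of their number.
ITEM_PIPELINES = {
    # Clean and validate first, so invalid items never reach the dedup set
    'ecommerce_spider.pipelines.cleaning_pipeline.CleaningPipeline': 300,
    # Then drop records that were already seen in this run
    'ecommerce_spider.pipelines.deduplication_pipeline.DeduplicationPipeline': 400,
}
```

A storage pipeline would typically get a higher number still, so that only cleaned, unique items are written.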

Performance Optimization and Deployment

Performance optimization configuration

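In settings.py, tuning concurrency and delay sensibly can raise crawl speed substantially while staying below the thresholds that trigger anti-crawling measures. Keep in mind that the spider's custom_settings (DOWNLOAD_DELAY of 2, CONCURRENT_REQUESTS of 8) take precedence over these project-wide values.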

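# ecommerce_spider/settings.py
# Global number of concurrent requests
CONCURRENT_REQUESTS = 32
# Number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Download delay in seconds; 1 or slightly higher is a reasonable start
DOWNLOAD_DELAY = 1
# Boolean switch: wait a random 0.5x to 1.5x of DOWNLOAD_DELAY between requests
RANDOMIZE_DOWNLOAD_DELAY = True
# Enable AutoThrottle, which adapts the delay to observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0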

Meaning: AutoThrottle adjusts the request rate dynamically based on actual response times, so the crawler can run fast without overwhelming the website.
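A quick back-of-the-envelope calculation helps when choosing these values. The sketch below assumes a single target domain and that requests to it are spaced by DOWNLOAD_DELAY on average, so it gives a rough lower bound on crawl time rather than a prediction; `estimate_crawl_hours` is a hypothetical helper.

```python
def estimate_crawl_hours(pages, download_delay):
    """Rough lower bound on crawl time for one domain, in hours.

    Assumes requests to the domain are spaced by download_delay seconds on
    average (the randomized delay averages out to the configured value).
    """
    return pages * download_delay / 3600


one_second = estimate_crawl_hours(100_000, 1)
two_seconds = estimate_crawl_hours(100_000, 2)
print(f'{one_second:.1f} h at 1 s delay, {two_seconds:.1f} h at 2 s delay')
```

For 100,000 pages this works out to roughly 28 hours at a 1-second delay and about twice that at the 2-second delay set in the spider's custom_settings. AutoThrottle can lengthen the delay when the site responds slowly, but it does not go below DOWNLOAD_DELAY.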

Docker deployment

Packaging the entire project into a Docker image avoids environment setup problems and makes it easier to scale out across servers.

# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["scrapy", "crawl", "ecommerce"]

# docker-compose.yml
version: '3.8'
services:
  spider:
    build: .
    volumes:
      - ./output:/app/output   # mount the output directory on the host
  mongodb:
    image: mongo:4.4
    ports:
      - "27017:27017"
    volumes:
      - mongodb_data:/data/db
volumes:
  mongodb_data:

To deploy, run the following in the project directory:

docker-compose up -d

The crawled data can be stored in MongoDB, or saved to files by adjusting the pipelines.
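For the file option, a small pipeline that writes JSON Lines is often enough. The sketch below is one possible shape: the class name `JsonLinesPipeline` and the default path output/products.jsonl are assumptions (the path is chosen to match the output directory mounted in docker-compose.yml), and it needs nothing beyond the standard library.

```python
import json
from pathlib import Path


class JsonLinesPipeline:
    """Append every item to a .jsonl file, one JSON object per line."""

    def __init__(self, path='output/products.jsonl'):
        self.path = Path(path)
        self.file = None

    def open_spider(self, spider):
        # Create the output directory on first use, then open in append mode.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.file = self.path.open('a', encoding='utf-8')

    def close_spider(self, spider):
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Like the other pipelines, it would have to be registered in ITEM_PIPELINES, with a higher number than cleaning and deduplication. Scrapy's built-in feed exports (the FEEDS setting) are an alternative when no custom logic is needed.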


At this point, you have built a complete, usable e-commerce full-site crawling system. From understanding the website structure, through designing the data model, to writing the crawler logic, countering anti-crawling measures, cleaning the data, and deploying, every step has been worked through in practice. To apply this framework to a different e-commerce website, adjust the CSS selectors and a few rules, and the data collection task can be up and running quickly. Start your crawling journey!