Scrapy e-commerce full-site crawling practical project - building a multi-level e-commerce data capture system from scratch
📂 Stage: Stage 4 - Practical Exercise (Project Development) 🔗 Related chapters: Spider in Practice · Selectors · Pipeline in Practice · Anti-Crawling Countermeasures in Practice
Project background and goals
E-commerce data is the “fuel” for many business decisions: price comparison, product selection, and competitor monitoring all depend on it. In practice, though, e-commerce websites often have complex structures: categories are nested several levels deep, product lists run across many pages, and various anti-crawling measures quietly block automated clients. Whole-site crawling means letting the crawler walk from the home page all the way down to the details pages, as a customer would, and extracting every piece of valuable information along the way.
This project will lead you to build a production-level data capture system. The core capabilities include:
- Multi-level category handling: Automatically identify and recursively crawl the website's category tree, from the first-level categories down to hundreds of sub-categories.
- Deep page turning: Follow next-page links or generate URL parameters so that no pages are missed and no endless loops occur.
- Accurate data extraction: Use CSS selectors and regular expressions to extract key fields such as title, price, specifications, and pictures from details pages with complex structures.
- Anti-crawler countermeasures: Combine random User-Agent headers, request header camouflage, proxy rotation, and delay strategies to make the crawler behave more like a human visitor.
- Data quality assurance: Built-in deduplication, cleaning, and validation, so the output data can be used directly for analysis.
Project technology stack
- Crawler Framework: Scrapy 2.x, a proven asynchronous crawler engine.
- Data processing: Structured data is defined with Pydantic or Item, and ItemLoader cleans and converts it automatically.
- Storage: Choose between MongoDB and MySQL, or simply export CSV/JSON files.
- Proxy management: A self-built proxy middleware that is simple and controllable; extensions such as Scrapy‑Proxy‑Pool can also be plugged in.
Once we are ready, we start by analyzing the structure of the target website.
E-commerce website architecture analysis
No matter how fancy a target website looks, its internal structure usually conforms to a classic four-layer model:
- Homepage: Top navigation bar, side category menu. Entrances to all subcategories are concentrated here.
- Category page: Displays the next level of sub-categories, the product list under this category, or both.
- List page: Consists of dozens of product cards, usually with a paginator and sorting/filtering controls.
- Details page: All information about a single product, including title, price, attributes, pictures, reviews, etc.
The crawling process unfolds along this structure: homepage → category pages → list pages → details pages.
The real difficulty is often not in the logic itself, but in:
- The depth of the category tree is not fixed (maybe 2 levels, maybe 4), so a general recursive parser needs to be designed.
- Page turning is implemented in various ways: it may be a ?page=2 query parameter, a /page/3/ path form, or even a "click to load more" button.
- The website will quietly detect crawlers: common anti-crawling methods include checking the User-Agent, limiting request frequency, and blocking IPs.
Before we start coding, we first plan the overall structure of the project.
Project Architecture Design
Clear directory division will make subsequent development smoother. The following structure is our recommended layout:
Core component relationships
The flowchart below summarizes how the components work in series:
Note: The solid arrows indicate the flow of data/requests, and the dotted lines indicate that the middleware "intercepts" and processes the requests behind the scenes.
Next, we start with the data model and implement each module step by step.
Data model definition
In Scrapy, an Item class usually describes the data structure we want to crawl. Combined with ItemLoader, it makes cleaning and format conversion easy to perform while the data is being filled in.
Interpretation:
- The clean_price function turns strings such as "¥1,299.00" into floating-point numbers that can be used in calculations.
- input_processor and output_processor define the automatic processing rules applied as data enters and leaves a field.
- For fields like category_path, multiple values added under the same field name are automatically joined into a single string for subsequent storage.
With the data blueprint in hand, we can start writing crawler logic.
Core function implementation
Main crawler framework
The main crawler is the "scheduling center" of the entire system. It is responsible for:
- Get all category links from the starting URL;
- For each category, request its product list page;
- Grab the product details link in the list page and hand it over to the details parsing method;
- Control page turning to ensure there is no infinite loop.
💡 Tip: If a step needs to pass context along (such as the category name), add the corresponding fields to the meta dictionary; later callbacks can read them back from response.meta.
Multi-level classification analysis
E-commerce websites often use "breadcrumbs" to display the category level the user is currently at. We can extract the complete classification path from these elements for subsequent data analysis.
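One way to sketch such a breadcrumb helper is shown below. The selector ul.breadcrumb li a::text and the assumption that the first crumb is a "Home" link are illustrative guesses; both depend on the target site's markup.

```python
def extract_category_path(response, selector="ul.breadcrumb li a::text"):
    """Return the category levels found in the page's breadcrumb, top level first.

    `response` is any object with a Scrapy-style .css() method; the default
    selector is an assumption and must be adapted to the real page.
    """
    texts = response.css(selector).getall()
    # Drop whitespace-only text nodes and trim the rest.
    levels = [text.strip() for text in texts if text.strip()]
    # Breadcrumbs usually start with a "Home" link, which is not a category.
    if levels and levels[0].lower() == "home":
        levels = levels[1:]
    return levels
```

Returning a list keeps both options open: the levels can be joined into one path string or stored as separate fields.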
Usage: call this tool in parse_product_detail and store the returned path information in item['category_path'] (or in separate fields) to make later grouped analysis easier.
Intelligent page turning processing
Page turning logic cannot rely solely on a "next page" button. Some websites do not have a clear next page link, so they need to parse the page number from the current URL and construct the next page URL. The following utility class takes care of both situations:
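A minimal sketch of such a utility, using only the standard library for URL handling, might look like this. The class name PaginationHelper, the a.next selector, the page parameter name, and the /page/N/ path pattern are assumptions for illustration.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


class PaginationHelper:
    """Find the next list page from a link, or build it from the current URL."""

    _PATH_RE = re.compile(r"/page/(\d+)(/?)$")  # matches .../page/3/ style paths

    def __init__(self, next_selector="a.next::attr(href)", page_param="page", max_pages=100):
        self.next_selector = next_selector
        self.page_param = page_param
        self.max_pages = max_pages

    def page_number(self, url):
        """Read the current page number from the URL; a missing number means page 1."""
        parts = urlsplit(url)
        match = self._PATH_RE.search(parts.path)
        if match:
            return int(match.group(1))
        values = parse_qs(parts.query).get(self.page_param)
        return int(values[0]) if values and values[0].isdigit() else 1

    def build_url(self, url, page):
        """Return `url` rewritten to point at the given page number."""
        parts = urlsplit(url)
        match = self._PATH_RE.search(parts.path)
        if match:
            path = f"{parts.path[:match.start()]}/page/{page}{match.group(2)}"
            return urlunsplit(parts._replace(path=path))
        query = parse_qs(parts.query)
        query[self.page_param] = [str(page)]
        return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

    def next_url(self, response, has_items=True):
        """Return the next page URL, or None when page turning should stop."""
        current = self.page_number(response.url)
        if current >= self.max_pages:
            return None  # hard cap against endless loops
        href = response.css(self.next_selector).get()
        if href:
            return response.urljoin(href)
        # No explicit link: continue only while the page still lists products.
        return self.build_url(response.url, current + 1) if has_items else None
```

The has_items flag is supplied by the caller, for example bool(response.css("div.product-card")), so that a constructed URL is only requested while the current page still contains products.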
Integrate this tool into parse_product_list in place of the original simple .next lookup to make page turning considerably more robust.
Anti-crawler countermeasures
All the logic mentioned above only makes sense if the request can be returned successfully. Therefore we need to arm our crawler with middleware.
Anti-crawler middleware
This middleware will randomly replace the User-Agent and some common request headers before each request is issued, while adding a random delay to make the crawler behave more like a normal browser.
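A minimal sketch of the header part of such a middleware is shown below. The class name, the User-Agent strings, and the header values are illustrative assumptions. The random delay is deliberately not implemented with a sleep call here, because a blocking sleep inside process_request would stall Scrapy's event loop; this sketch assumes the delay is instead provided by the DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY settings.

```python
import random

# A small illustrative pool; a real project would maintain a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "zh-CN,zh;q=0.9,en;q=0.8"]


class RandomHeadersMiddleware:
    """Downloader middleware that varies browser-like headers on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        request.headers["Accept-Language"] = random.choice(ACCEPT_LANGUAGES)
        request.headers["Accept"] = (
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        )
        # Returning None tells Scrapy to keep processing this request normally.
        return None
```

With RANDOMIZE_DOWNLOAD_DELAY enabled, Scrapy waits between 0.5 and 1.5 times DOWNLOAD_DELAY between requests to the same site, which gives the randomized pacing without blocking.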
Proxy middleware
For websites that block IPs readily, rotating proxies is the most direct response. The following is a minimal round-robin proxy implementation:
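A sketch of round-robin proxy rotation, under a few assumptions: the class name and the PROXY_LIST setting are hypothetical, and the proxy addresses used with it would come from your own pool.

```python
class RoundRobinProxyMiddleware:
    """Downloader middleware that assigns proxies to requests in rotation."""

    def __init__(self, proxies):
        self.proxies = list(proxies)  # may be replaced or extended at runtime
        self._index = 0

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting holding proxy URLs.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if not self.proxies:
            return None  # no proxies configured: send the request directly
        # Index with modulo so the rotation survives changes to self.proxies.
        request.meta["proxy"] = self.proxies[self._index % len(self.proxies)]
        self._index += 1
        return None
```

Using an index instead of a fixed iterator means the list can be refreshed while the crawl is running without restarting the rotation logic.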
Note:
- In real projects, the proxy list may come from a file or a paid proxy API, and you can update self.proxies dynamically.
- This middleware sets request.meta['proxy'] before the request is sent, and Scrapy then automatically uses that proxy for the download.
- These middlewares must be enabled through DOWNLOADER_MIDDLEWARES in settings.py.
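Enabling the middlewares could look like the fragment below. The module path ecommerce_crawler.middlewares and both class names are hypothetical, and the priority numbers are example values.

```python
# settings.py (fragment)
# Middlewares with lower numbers have their process_request called first.
# Both custom values stay below 750, the default priority of Scrapy's
# built-in HttpProxyMiddleware, so the proxy is already set when it runs.
DOWNLOADER_MIDDLEWARES = {
    "ecommerce_crawler.middlewares.RandomHeadersMiddleware": 543,
    "ecommerce_crawler.middlewares.RoundRobinProxyMiddleware": 610,
}
```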
Data deduplication and cleaning
It is inevitable that duplicate data will appear during the crawling process, or that a certain crawl will return blank/invalid information. These problems can be handled centrally through pipelines.
Data cleaning pipeline
Building on the previously defined ItemLoader and ProductItem, the pipeline can automatically apply cleaning rules to fields such as price and title, while discarding "junk" items that have only some of their fields filled in.
Deduplication pipeline
The simplest form of deduplication is to maintain a set of already-processed identifiers in the pipeline. For products, the combination product_id + url usually identifies a record uniquely.
Performance Optimization and Deployment
Performance optimization configuration
In settings.py, sensible tuning of concurrency and delay can speed up crawling considerably while avoiding triggering anti-crawling measures.
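A possible starting point is sketched below. All numeric values are example assumptions and should be tuned against the target site's tolerance; the setting names themselves are standard Scrapy settings.

```python
# settings.py (fragment): example values, to be tuned per target site

# Concurrency: total in-flight requests and the per-domain share.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Base delay between requests to the same site; with randomization on,
# the actual wait varies between 0.5x and 1.5x of this value.
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True

# Fail slow responses and retry transient errors a limited number of times.
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 3

# AutoThrottle adapts the delay to observed latency, never dropping below
# DOWNLOAD_DELAY and never exceeding AUTOTHROTTLE_MAX_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
```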
Note: AUTOTHROTTLE dynamically adjusts the request rate based on actual response times, so that your crawler can run fast without overwhelming the website.
Docker deployment
Packaging the entire project into a Docker image can save you the trouble of environment configuration and facilitate horizontal expansion on the server.
When deploying, just execute in the directory:
The crawled data can be stored in MongoDB, or saved to files by modifying the pipelines.
At this point, you have built a complete, usable e-commerce full-site crawling system. From understanding the website structure, to designing the data model, to writing the crawler logic, countering anti-crawling measures, cleaning the data, and finally deploying, every step has been worked through in practice. To apply this framework to a different e-commerce website, adjust the CSS selectors and a few rules, and the data collection task can be completed quickly. Start your crawling journey!

