E-commerce App Crawler: A Comprehensive Hands-On Project Guide
Important preface: This project is for technical learning and exchange only. Strictly follow robots.txt, the e-commerce platform's rules, and applicable laws and regulations; keep request frequency reasonable; and never obtain sensitive personal information illegally or use the data for commercial profit.
When crawling e-commerce App data, the same practical problems come up again and again: SSL Pinning leaves the packet capture tool blank; reverse-engineering request signatures is expensive and has to be redone whenever the interface changes; multi-device scheduling gets messy; IPs get blocked with no way to appeal; and crawling the same product repeatedly wastes resources. Is there a lightweight, solid way to address all of these pain points at once?
This tutorial walks you through building a lightweight, end-to-end e-commerce App crawler system from scratch. It starts by bypassing SSL Pinning, uses UI automation with packet capture as an aid to fill in the data, and then adds Redis for task scheduling and deduplication, a local proxy pool to avoid blocking, and multi-device collaboration. Even with a single spare phone you can run the whole pipeline; add a few more devices and it scales naturally into a small distributed cluster.
Overview of core technology architecture
The entire system adopts a lightweight layered asynchronous architecture, with loose coupling between modules. A single device can run independently, and it also supports horizontal expansion of multiple devices:
- Control Terminal: Developers issue crawling tasks (such as which category to crawl and how many pages).
- Redis Scheduling Center: Responsible for priority task queue, URL deduplication, and device status management.
- Crawler worker node: A Python script running against an Android device (or emulator) that controls the application through `uiautomator2` and optionally uses packet capture to obtain more complete data.
- Proxy Pool System: Provides rotating proxy IPs for the crawlers to reduce the risk of a single IP being blocked.
- Data processing pipeline: Clean, deduplicate, and format the original data crawled back.
- MySQL Storage: Persist the final structured data.
Below we implement each core module one by one in order from bottom to top.
Implementing the core modules one by one
1. Anti-anti-crawler basics: SSL Pinning one-click bypass
Packet capture is the first step in analyzing App data, but most e-commerce apps use the OkHttp3/OkHttp4 `CertificatePinner` to verify the server certificate, so Charles or Fiddler shows only garbled data or the connection fails outright. Fortunately, this protection follows a fairly fixed pattern. We can use Frida to inject a script dynamically and turn the check into a no-op as the App starts.
Below is a Python wrapper class that lets you target a specific device in a multi-device (group control) setup, attach to the target App automatically, and load the generic bypass script.
When using it, you only need to pass in the package name:
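As a minimal sketch of such a wrapper and its use (not a definitive implementation), the class below spawns the app under Frida and replaces OkHttp's `CertificatePinner.check` with a no-op. The names `SSLPinningBypass`, `start()`, `BYPASS_JS`, and the package name `com.example.shop` are illustrative assumptions. It also assumes the Python `frida` package is installed and frida-server is running on the device. Only the common `(String, List)` overload is hooked, so obfuscated builds or other pinning mechanisms would need additional hooks.

```python
# JavaScript injected into the app: it replaces CertificatePinner.check with a no-op.
# Assumption: the app uses the unobfuscated okhttp3.CertificatePinner class.
BYPASS_JS = """
Java.perform(function () {
    try {
        var Pinner = Java.use('okhttp3.CertificatePinner');
        Pinner.check.overload('java.lang.String', 'java.util.List')
            .implementation = function (hostname, peerCertificates) {
                send('bypassed pinning for ' + hostname);
            };
    } catch (e) {
        send('okhttp3.CertificatePinner not hooked: ' + e);
    }
});
"""


class SSLPinningBypass:
    """Spawns an app under Frida and loads BYPASS_JS before the app resumes."""

    def __init__(self, package_name, device_id=None):
        self.package_name = package_name
        self.device_id = device_id  # adb serial; None means "the USB device"
        self.session = None
        self.script = None

    @staticmethod
    def _on_message(message, data):
        # Frida delivers send() payloads as {'type': 'send', 'payload': ...}
        if message.get("type") == "send":
            print("[frida]", message.get("payload"))
        else:
            print("[frida:error]", message)

    def start(self):
        import frida  # imported lazily so this module loads without Frida installed

        if self.device_id:
            device = frida.get_device(self.device_id)
        else:
            device = frida.get_usb_device()
        pid = device.spawn([self.package_name])  # start the app suspended
        self.session = device.attach(pid)
        self.script = self.session.create_script(BYPASS_JS)
        self.script.on("message", self._on_message)
        self.script.load()
        device.resume(pid)  # let the app run with the hook in place
        return pid


# Usage (placeholder package name and device serial):
#   SSLPinningBypass("com.example.shop").start()
#   SSLPinningBypass("com.example.shop", device_id="emulator-5554").start()
```

Spawning the app, rather than attaching to an already running process, is what lets the hook be installed before the first pinned request is made.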
Frida injects into the App process, after which the plaintext requests become visible in the packet capture tool. The script works for most SSL Pinning implementations built on OkHttp.
2. Lightweight and stable crawling layer: uiautomator2 automation
If you don't have the time to reverse-engineer encrypted request signatures, or the target app updates frequently and its signature algorithm keeps changing, UI automation assisted by packet capture is a stable, low-maintenance compromise.
Here we use the `uiautomator2` library to control the device. It can simulate taps and swipes and read interface elements, which covers most data extraction needs for product lists and detail pages. The `ProductCrawler` class below implements a basic product crawler:
- Random Delay: Simulate pauses in human operations to avoid being recognized by anti-automation mechanisms.
- Element positioning: matches common `resource-id` patterns to get the product title and price.
- Page scrolling: uses `swipe` to simulate scrolling and removes duplicates before each extraction to prevent repeated crawling.
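The three behaviours listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the `resource-id` patterns are guesses that must be adapted to the real app, pairing titles with prices by position is fragile on some layouts, and the method names (`extract_visible`, `scroll_down`, `crawl`) are hypothetical. The device is expected to be a `uiautomator2` device object, for example one returned by `u2.connect(serial)`.

```python
import random
import time


class ProductCrawler:
    """Scrolls a product list and collects title/price pairs from visible elements."""

    # Assumed resource-id patterns; real apps use their own ids.
    TITLE_PATTERN = r".*(title|name).*"
    PRICE_PATTERN = r".*price.*"

    def __init__(self, device, min_delay=1.0, max_delay=3.0):
        self.d = device  # a uiautomator2 device object
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.seen = set()  # (title, price) pairs already collected

    def _pause(self):
        # Random delay to imitate the pauses of a human user
        time.sleep(random.uniform(self.min_delay, self.max_delay))

    def _texts(self, pattern):
        # Read the text of every on-screen element whose resource-id matches
        return [el.get_text() for el in self.d(resourceIdMatches=pattern)]

    def extract_visible(self):
        titles = self._texts(self.TITLE_PATTERN)
        prices = self._texts(self.PRICE_PATTERN)
        new_items = []
        for title, price in zip(titles, prices):
            key = (title, price)
            if title and key not in self.seen:  # skip items seen on an earlier screen
                self.seen.add(key)
                new_items.append({"title": title, "price": price})
        return new_items

    def scroll_down(self):
        width, height = self.d.window_size()
        # Swipe from 80% to 30% of the screen height to reveal the next items
        self.d.swipe(width // 2, int(height * 0.8), width // 2, int(height * 0.3), 0.3)

    def crawl(self, max_scrolls=5):
        items = []
        for _ in range(max_scrolls):
            items.extend(self.extract_visible())
            self.scroll_down()
            self._pause()
        return items


# Usage (requires a connected device):
#   import uiautomator2 as u2
#   items = ProductCrawler(u2.connect()).crawl(max_scrolls=10)
```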
Note: UI automation is sensitive to device performance and network fluctuations, so a retry mechanism and exception recovery logic are recommended. In addition, if the App's page structure changes significantly, you may need to adjust the `resourceIdMatches` regular expressions manually.
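One simple form the recommended retry mechanism could take is a decorator with exponential backoff. The name `retry` and its parameters are illustrative assumptions; a real crawler would likely also restart the app or reconnect the device between attempts.

```python
import functools
import time


def retry(times=3, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Call the wrapped function up to `times` times, waiting longer after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


# Usage: wrap a flaky UI step (the function name is a placeholder)
#   @retry(times=3, delay=2.0)
#   def open_product_list(device): ...
```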
3. Task scheduling + IP management core module
When the number of crawling tasks grows, or multiple devices need to work in parallel, a dispatch center is needed to allocate tasks, remove duplicates, and supply stable proxy IPs so the crawlers are not banned.
3.1 Lightweight Redis priority task queue
Redis is a natural fit for queues. We use Lists to implement the priority queues, a Set for deduplication marks, and Hashes to store task details. The `RedisTaskQueue` class defined here supports:
- Priority: Divided into three levels: high, normal, and low. High-priority tasks will be consumed first.
- Deduplication: an optional `unique_key` (such as a product ID); if the key is already in `seen_set`, the task is skipped.
- Task detail expiration: task data is kept for 7 days to avoid occupying too much memory.
Usage example:
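A minimal sketch of what such a `RedisTaskQueue` could look like, with the intended usage shown in the trailing comments. Everything beyond the behaviour described above is an assumption: the method names `add_task`, `complete`, and `requeue`, the `crawler:` prefix, the key layout, and a redis-py client created with `decode_responses=True`. Moving a task to the processing list after `BRPOP` is not atomic here, so a production version would need a Lua script or similar.

```python
import json
import uuid


class RedisTaskQueue:
    PRIORITIES = ("high", "normal", "low")
    TASK_TTL = 7 * 24 * 3600  # task details expire after 7 days

    def __init__(self, client, prefix="crawler:"):
        self.r = client  # e.g. redis.Redis(decode_responses=True)
        self.prefix = prefix

    def _queue(self, priority):
        return f"{self.prefix}queue:{priority}"

    def add_task(self, payload, priority="normal", unique_key=None):
        if priority not in self.PRIORITIES:
            raise ValueError(f"unknown priority: {priority}")
        # SADD returns 0 when the member already exists, so it doubles as the dedup check
        if unique_key is not None and self.r.sadd(f"{self.prefix}seen_set", unique_key) == 0:
            return None
        task_id = uuid.uuid4().hex
        detail_key = f"{self.prefix}task:{task_id}"
        self.r.hset(detail_key, mapping={"payload": json.dumps(payload), "priority": priority})
        self.r.expire(detail_key, self.TASK_TTL)
        self.r.lpush(self._queue(priority), task_id)
        return task_id

    def get_task(self, timeout=5):
        # BRPOP checks the keys in the order given, so high priority is served first
        popped = self.r.brpop([self._queue(p) for p in self.PRIORITIES], timeout=timeout)
        if popped is None:
            return None
        queue_key, task_id = popped
        self.r.lpush(f"{queue_key}:processing", task_id)
        detail = self.r.hgetall(f"{self.prefix}task:{task_id}")
        payload = json.loads(detail["payload"]) if detail else None  # None if expired
        return {"id": task_id, "priority": queue_key.rsplit(":", 1)[-1], "payload": payload}

    def complete(self, task_id, priority):
        # Confirm completion by removing the id from the processing list
        self.r.lrem(f"{self._queue(priority)}:processing", 0, task_id)

    def requeue(self, task_id, priority):
        # Put a failed task back at the end of its queue
        self.complete(task_id, priority)
        self.r.lpush(self._queue(priority), task_id)


# Usage (assumes a local Redis server and the redis-py package):
#   import redis
#   queue = RedisTaskQueue(redis.Redis(decode_responses=True))
#   queue.add_task({"category": "phones", "pages": 3}, priority="high", unique_key="phones")
#   task = queue.get_task(timeout=5)
#   queue.complete(task["id"], task["priority"])
```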
When idle, a crawler worker node calls `get_task()`, which blocks until a new task arrives, and runs the crawl logic once it has one. After the task finishes, you can confirm completion by deleting its task_id from `{prefix}queue:high:processing`, or, together with an exception retry mechanism, put it back into the queue.
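The consume, confirm, or requeue cycle described here might look like the loop below. It is a sketch under assumptions: the queue object is expected to expose `get_task(timeout)`, `complete(task_id, priority)`, and `requeue(task_id, priority)` (hypothetical method names), tasks are dicts with `id`, `priority`, and `payload` keys, and the attempt counter lives in the worker's memory rather than in Redis.

```python
import logging

log = logging.getLogger("worker")


def run_worker(queue, handler, max_attempts=3, max_idle_polls=3, timeout=5):
    """Consume tasks until the queue stays empty for `max_idle_polls` polls.

    Returns the number of tasks that finished successfully.
    """
    attempts = {}  # task_id -> failures seen by this worker
    finished = 0
    idle_polls = 0
    while idle_polls < max_idle_polls:
        task = queue.get_task(timeout=timeout)
        if task is None:
            idle_polls += 1
            continue
        idle_polls = 0
        task_id, priority = task["id"], task["priority"]
        try:
            handler(task["payload"])  # the actual crawling work
        except Exception:
            attempts[task_id] = attempts.get(task_id, 0) + 1
            if attempts[task_id] < max_attempts:
                log.warning("task %s failed, putting it back in the queue", task_id)
                queue.requeue(task_id, priority)
            else:
                log.error("task %s abandoned after %d attempts", task_id, max_attempts)
                queue.complete(task_id, priority)  # drop it from the processing list
        else:
            queue.complete(task_id, priority)
            finished += 1
    return finished
```

Capping the attempts keeps a task that always fails from circulating forever; a long-running worker would typically loop without the idle limit.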
3.2 Minimalist local proxy pool (based on HTTPbin verification)
IP proxies are key to avoiding blocks. For small projects, maintaining a dynamic proxy pool is costly, so we can start with a predefined proxy list and filter out the usable ones through periodic verification. The `SimpleProxyPool` below verifies every proxy in the list at initialization and keeps the live ones; each later call to `get_random()` returns one of them at random.
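A minimal sketch of such a pool, assuming the `requests` package is available and that a successful call to `https://httpbin.org/ip` through a proxy is a good enough liveness test. The `check_proxy` helper, the `checker` parameter, and the `refresh()` method are illustrative additions rather than part of any library.

```python
import random


def check_proxy(proxy_url, timeout=5):
    """Return True if a request to httpbin.org succeeds through the proxy."""
    import requests  # third-party; imported lazily so the pool can be tested without it

    try:
        resp = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=timeout,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False


class SimpleProxyPool:
    def __init__(self, candidates, checker=check_proxy):
        self.candidates = list(candidates)
        self.checker = checker
        self.alive = []
        self.refresh()  # verify every proxy once at start-up

    def refresh(self):
        """Re-verify all candidates and keep only the ones that respond."""
        self.alive = [p for p in self.candidates if self.checker(p)]
        return len(self.alive)

    def get_random(self):
        if not self.alive:
            raise RuntimeError("no live proxies; add candidates or call refresh()")
        return random.choice(self.alive)


# Usage (placeholder addresses):
#   pool = SimpleProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
#   proxy = pool.get_random()
```

Verification runs sequentially, so start-up time grows with the list; for more than a handful of proxies a thread pool would shorten it.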
In the actual crawler script, HTTP requests made through `requests` or `uiautomator2` can use this pool either by setting the `HTTP_PROXY` / `HTTPS_PROXY` environment variables or by specifying the proxy directly in code. To improve proxy utilization, you can also re-verify the proxy list every few minutes, dropping dead proxies and adding new ones.
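Both ways of applying a proxy can be sketched as below. The helper names (`proxies_for`, `export_proxy_env`, `fetch_via_pool`) are hypothetical, and the pool is only assumed to offer a `get_random()` method that returns a proxy URL.

```python
import os


def proxies_for(proxy_url):
    """Build the mapping that requests accepts in its `proxies=` argument."""
    return {"http": proxy_url, "https": proxy_url}


def export_proxy_env(proxy_url, env=None):
    """Set HTTP_PROXY / HTTPS_PROXY so libraries that read them use this proxy."""
    env = os.environ if env is None else env
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    return env


def fetch_via_pool(url, pool, attempts=3, timeout=10, get=None):
    """Try the request through up to `attempts` randomly chosen proxies."""
    if get is None:
        import requests  # third-party; only needed when no custom `get` is supplied

        get = requests.get
    last_error = None
    for _ in range(attempts):
        proxy = pool.get_random()
        try:
            return get(url, proxies=proxies_for(proxy), timeout=timeout)
        except Exception as exc:  # a dead proxy usually surfaces as a connection error
            last_error = exc
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```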
Precautions for project implementation
- Legal compliance always comes first. Do not crawl users' private data, do not put excessive load on the platform's services, and do not resell captured data commercially. Technology itself is neutral, but misusing it crosses a red line.
- Balance performance and stability. For single-device UI automation, keep the number of concurrent threads at 2 or fewer, otherwise the device may freeze or even crash. Proxies need to be refreshed regularly, and task status in Redis must be paired with a timeout-and-retry mechanism so that tasks do not get stuck.
- Optimize the deduplication mechanism. Besides Redis Set deduplication based on `unique_key`, you can also compute MD5 values over product titles, prices, image links, and so on as auxiliary deduplication identifiers to further improve data quality.
- Logging and monitoring. It is highly recommended to use Python's built-in `logging` module to record runtime information comprehensively, and to store device status (idle, working, abnormal) in a Redis Hash so the overall state is easy to inspect through a simple command-line script or web panel.
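The last two points can be illustrated with a short sketch. The function names `product_fingerprint` and `report_status`, the `crawler:devices` key, and the `status@timestamp` value format are assumptions; `client` stands for a redis-py client.

```python
import hashlib
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("crawler")

VALID_STATUSES = ("idle", "working", "abnormal")


def product_fingerprint(title, price, image_url):
    """MD5 over whitespace-normalized fields, usable as an auxiliary dedup key."""
    normalized = "|".join(" ".join(str(part).split()) for part in (title, price, image_url))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def report_status(client, device_id, status, key="crawler:devices"):
    """Record a device's state in a Redis Hash and log the change."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    # One hash field per device; the timestamp shows how fresh the state is
    client.hset(key, device_id, f"{status}@{int(time.time())}")
    log.info("device %s is now %s", device_id, status)
```

A monitoring script can then read every device's state with a single `HGETALL` on the same key.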
You now have the core skeleton of a complete e-commerce App crawler system. From here, based on the characteristics of your actual target App, you can adjust the element positioning rules, improve exception handling, plug in more proxy sources, or even connect a device group control platform (such as STF or minicap/minitouch) to manage dozens of devices in a uniform way. Good luck on your crawler journey, and remember: use technology for good and stay compliant.

