E-commerce App Crawler Comprehensive Practical Project Guide

Important preface: This project is for technical learning and exchange only. Strictly follow each site's robots.txt rules, the e-commerce platform's terms of service, and all applicable laws and regulations; keep request rates reasonable; and never collect sensitive personal information illegally or use the data for commercial gain.

When crawling e-commerce App data, we often run into several practical problems: SSL Pinning leaves the packet capture tool empty; reverse-engineering request signatures is costly, and the work has to be redone whenever the interface changes; multi-device scheduling gets messy; IPs get blocked with no way to appeal; and crawling the same product repeatedly wastes resources. Is there a lightweight, dependable way to address all of these pain points at once?

This tutorial walks you through building a lightweight, end-to-end e-commerce App crawler system from scratch. It starts by bypassing SSL Pinning, uses UI automation with packet capture to fill in the data, and then adds Redis for task scheduling and deduplication, a local proxy pool to reduce blocking, and multi-device coordination. Even with a single spare phone you can run through the entire process; add a few more devices and the setup scales naturally into a small distributed cluster.


Overview of core technology architecture

The entire system uses a lightweight, layered, asynchronous architecture with loose coupling between modules. A single device can run on its own, and the system can also scale horizontally across multiple devices:

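flowchart LR
    A[Control Terminal] -->|Submit tasks| B[Redis Scheduling Center]
    B -->|Dispatch by priority| C1[Crawler Worker Node 1]
    B -->|Dispatch by priority| C2[Crawler Worker Node 2]
    C1 --> D[Proxy Pool System]
    C2 --> D
    C1 -->|Parsed data| E[Data Processing Pipeline]
    C2 --> E
    E -->|Persist| F[MySQL Storage]
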
  • Control Terminal: Developers issue crawling tasks (such as which category to crawl and how many pages).
  • Redis Scheduling Center: Responsible for priority task queue, URL deduplication, and device status management.
  • Crawler worker node: A Python script that drives an Android device (or emulator) through uiautomator2 to control the App, optionally using packet capture to obtain more complete data.
  • Proxy Pool System: Provides rotating proxy IPs for the crawlers to reduce the risk of a single IP being blocked.
  • Data processing pipeline: Clean, deduplicate, and format the original data crawled back.
  • MySQL Storage: Persist the final structured data.
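
The data processing pipeline is only described at this level, so here is a minimal sketch of what its clean, deduplicate, and format steps could look like. The function names parse_price and clean_products, and the title/price field layout, are assumptions made for illustration rather than part of the system described above.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Extract the first number from a price string such as '¥1,299.00'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None


def clean_products(raw_items: list) -> list:
    """Clean, deduplicate, and format raw records (hypothetical 'title'/'price' fields)."""
    cleaned, seen = [], set()
    for item in raw_items:
        title = " ".join(item.get("title", "").split())  # collapse repeated whitespace
        price = parse_price(item.get("price", ""))
        if not title or price is None:
            continue  # drop incomplete records
        key = (title, price)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append({"title": title, "price": price})
    return cleaned
```

Storing prices as Decimal rather than float avoids rounding surprises when the values are later written to a MySQL DECIMAL column.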

Below, we implement the core modules one by one, from the bottom up.


Implementing the core modules one by one

1. Anti-anti-crawler basics: SSL Pinning one-click bypass

Packet capture is the first step in analyzing an App's data, but most e-commerce Apps use CertificatePinner from OkHttp3/OkHttp4 to verify the server certificate, so Charles or Fiddler shows only garbled data or the connection fails outright. Fortunately, this protection is implemented in a fairly fixed way. We can use Frida to inject a script dynamically and turn the check into a no-op as the App starts.

Below is an encapsulated Python class that can easily specify the device in a group control scenario, automatically attach to the target App and load the universal bypass script.

import frida
import sys

class SSLPinningBypass:
    """One-step, general-purpose SSL Pinning bypass for OkHttp3/OkHttp4."""
    
    def __init__(self, device_serial=None):
        """Optionally select a device by serial (common in group-control setups)."""
        try:
            self.device = frida.get_device(device_serial) if device_serial else frida.get_usb_device(timeout=3)
        except Exception as e:
            print(f"✗ Failed to connect to device: {e}")
            sys.exit(1)
        self.session = None
        self.script = None
    
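    def attach_and_bypass(self, pkg_name: str):
        """Attach to the App and load the bypass script."""
        pid = None
        try:
            # Prefer attaching to the App if it is already running.
            # Note: depending on the Frida version, attach() may match the
            # process name rather than the package name.
            self.session = self.device.attach(pkg_name)
            print(f"✓ Attached to {pkg_name}")
        except frida.ProcessNotFoundError:
            print(f"✓ {pkg_name} is not running; spawning it and injecting the bypass script early...")
            pid = self.device.spawn([pkg_name])
            self.session = self.device.attach(pid)
        
        # Load the hooks first; a spawned App stays suspended until resume()
        self._load_bypass_script()
        if pid is not None:
            self.device.resume(pid)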
    
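    def _load_bypass_script(self):
        """Frida hook script: no-op CertificatePinner.check and install a permissive TrustManager."""
        bypass_js = """
        // Java.use() must run inside Java.perform() so the Java VM is attached
        Java.perform(function () {
            console.log("[√] Frida general SSL Pinning hooks starting...");

            // 1. Turn every OkHttp3/4 CertificatePinner.check overload into a no-op
            try {
                var CertPinner = Java.use("okhttp3.CertificatePinner");
                CertPinner.check.overloads.forEach(function (overload) {
                    overload.implementation = function () { return; };
                });
                console.log("[√] OkHttp3/4 CertificatePinner.check disabled for all overloads");
            } catch (e) {
                console.log("[×] OkHttp3/4 CertificatePinner not hooked: " + e);
            }

            // 2. Fallback: make SSLContext.init use a TrustManager that accepts any certificate
            try {
                var X509TrustManager = Java.use("javax.net.ssl.X509TrustManager");
                var SSLContext = Java.use("javax.net.ssl.SSLContext");
                var EmptyTrustManager = Java.registerClass({
                    name: "com.frida.EmptyTrustManager",
                    implements: [X509TrustManager],
                    methods: {
                        checkClientTrusted: function () {},
                        checkServerTrusted: function () {},
                        getAcceptedIssuers: function () { return []; }
                    }
                });
                var init = SSLContext.init.overload(
                    "[Ljavax.net.ssl.KeyManager;", "[Ljavax.net.ssl.TrustManager;", "java.security.SecureRandom");
                init.implementation = function (keyManagers, trustManagers, secureRandom) {
                    // Registering the class alone has no effect; it must replace the App's TrustManagers
                    init.call(this, keyManagers, [EmptyTrustManager.$new()], secureRandom);
                };
                console.log("[√] Permissive TrustManager installed");
            } catch (e) {
                console.log("[×] Fallback TrustManager setup failed: " + e);
            }
        });
        """
        
        self.script = self.session.create_script(bypass_js)
        self.script.load()
        print("[√] Bypass script loaded; you can now capture traffic with Charles/Fiddler.")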

When using it, you only need to pass in the package name:

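bypass = SSLPinningBypass()
bypass.attach_and_bypass("com.example.shopping")
# Keep this process alive: the hooks are removed when the script exits
input("Hooks are active; press Enter to detach and exit...")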

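Frida injects the script into the App process, after which the packet capture tool shows requests in clear text. Keep the Python process running while you capture, because the hooks are removed when it exits. The script applies to most SSL Pinning implementations built on OkHttp; Apps that pin certificates in native code or through a custom network stack need additional hooks.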


2. Lightweight and stable crawling layer: uiautomator2 automation

If you don't have the capacity to reverse-engineer the encrypted request signature, or the target App updates so often that the signature algorithm keeps changing, UI automation assisted by packet capture is a stable, low-maintenance compromise.

Here we use the uiautomator2 library to control the device. It can simulate taps and swipes and read interface elements, which covers most of the data extraction needs of product list and detail pages. The ProductCrawler class below implements a basic product crawler:

  • Random delay: Simulates pauses in human operation to make the automation less likely to be flagged by anti-automation mechanisms.
  • Element positioning: Matches common resource-id patterns to get the product title and price.
  • Page scrolling: Uses swipe to simulate scrolling and deduplicates after each extraction to prevent repeated crawling.

import uiautomator2 as u2
import time
import random

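class ProductCrawler:
    """Basic single-device crawler for product list pages."""
    
    def __init__(self, device_serial=None, wait_base=2, wait_range=3):
        """Connect to the device and configure random delays that mimic human pauses."""
        self.d = u2.connect(device_serial) if device_serial else u2.connect()
        self.wait_base = wait_base
        self.wait_range = wait_range
    
    def _human_wait(self):
        """Sleep for a random time between wait_base and wait_base + wait_range seconds."""
        time.sleep(random.uniform(self.wait_base, self.wait_base + self.wait_range))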
    
    def _extract_list_page(self) -> list[dict]:
        """Extract the products currently visible, matching common resourceId patterns."""
        products = []
        try:
            # Try common resource-ids for product containers first
            containers = self.d(resourceIdMatches=r".*product.*item|.*item.*product")
            if not containers.exists:
                # Fall back to generic ViewGroup children of the scrollable list
                containers = self.d(scrollable=True).child(className="android.view.ViewGroup")
            # UiObject supports integer indexing but not slicing; cap at 10 items per screen
            for i in range(min(containers.count, 10)):
                c = containers[i]
                title_el = c.child(resourceIdMatches=r".*title|.*name")
                price_el = c.child(resourceIdMatches=r".*price")
                title = title_el.get_text() if title_el.exists else ""
                price = price_el.get_text() if price_el.exists else ""
                if title and price:
                    products.append({"title": title.strip(), "price": price.strip(), "extracted_at": time.strftime("%Y-%m-%d %H:%M:%S")})
        except Exception as e:
            print(f"[×] Failed to extract list page: {e}")
        return products
    
    def crawl_category(self, category_name: str, max_scrolls=5) -> list[dict]:
        """Open a category by its visible text and crawl it (the entry must be on screen)."""
        # Simplified: locate the category entry by text and tap it
        try:
            self.d(text=category_name).click(timeout=10)
            self._human_wait()
        except Exception as e:
            print(f"[×] Category {category_name} not found: {e}")
            return []
        
        all_products = []
        seen_titles = set()
        width, height = self.d.window_size()
        for _ in range(max_scrolls):
            # Extract the current screen and deduplicate by title
            current = self._extract_list_page()
            for p in current:
                if p["title"] not in seen_titles:
                    seen_titles.add(p["title"])
                    all_products.append(p)
            # Swipe up relative to the screen size instead of hard-coded pixel coordinates
            self.d.swipe(width // 2, int(height * 0.8), width // 2, int(height * 0.25), duration=0.3)
            self._human_wait()
        print(f"[√] Category {category_name}: crawled {len(all_products)} unique products")
        return all_products

Note: UI automation is sensitive to device performance and network fluctuations, so it is best paired with a retry mechanism and exception recovery logic. In addition, if the App's page structure changes significantly, you may need to adjust the resourceIdMatches regular expressions by hand.
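
One way to add the retry mechanism mentioned in the note is a small decorator with exponential backoff. This is a sketch under assumptions: the name retry and its parameters (max_attempts, base_delay, exceptions, sleep) are hypothetical and not part of uiautomator2.

```python
import functools
import random
import time


def retry(max_attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function with exponential backoff plus random jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    delay = base_delay * 2 ** (attempt - 1)
                    # Jitter keeps several devices from retrying in lockstep
                    sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator
```

Because crawl_category catches its own exceptions and returns an empty list, the decorator is most useful on lower-level steps that actually raise, such as a helper that taps an element and waits for the next page.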


3. Task scheduling + IP management core module

As the number of crawling tasks grows, or when multiple devices need to work in parallel, a scheduling center is needed to allocate tasks, remove duplicates, and give the crawlers stable proxy IPs so they are less likely to be banned.

3.1 Lightweight Redis priority task queue

Redis is a natural fit for queues. We use one List per priority level to implement the priority queue, a Set for deduplication marks, and String keys with an expiry time to store task details. The RedisTaskQueue class defined here supports:

  • Priority: Three levels (high, normal, and low). High-priority tasks are consumed first.
  • Deduplication: An optional unique_key (such as a product ID). If the key is already in seen_set, the task is skipped.
  • Task detail expiration: Task data is kept for 7 days so it does not occupy memory indefinitely.

import redis
import json
import time
from enum import IntEnum

class TaskPriority(IntEnum):
    LOW = 1
    NORMAL = 2
    HIGH = 3

class RedisTaskQueue:
    """Priority task queue with deduplication, built on Redis Lists and a Set."""
    
    def __init__(self, r_host="localhost", r_port=6379, r_db=0):
        self.r = redis.StrictRedis(host=r_host, port=r_port, db=r_db, decode_responses=True)
        self.prefix = "app_crawler:"
        self.task_key = f"{self.prefix}tasks:"
        self.priority_queue = {
            TaskPriority.HIGH: f"{self.prefix}queue:high",
            TaskPriority.NORMAL: f"{self.prefix}queue:normal",
            TaskPriority.LOW: f"{self.prefix}queue:low"
        }
        self.seen_set = f"{self.prefix}seen_urls"
    
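    def add_task(self, task_id: str, task_data: dict, priority=TaskPriority.NORMAL, unique_key=None) -> bool:
        """Add a task; when unique_key is given, skip tasks whose key was already seen."""
        # SADD returns 0 if the key already exists, so check-and-mark is one atomic step
        if unique_key and not self.r.sadd(self.seen_set, unique_key):
            return False
        # Store the task details (expire after 7 days)
        self.r.setex(f"{self.task_key}{task_id}", 7*24*3600, json.dumps(task_data))
        # Push onto the queue for this priority
        self.r.lpush(self.priority_queue[priority], task_id)
        return True
    
    def get_task(self, timeout=5) -> tuple[str, dict] | None:
        """Fetch a task in priority order, polling for up to `timeout` seconds."""
        deadline = time.monotonic() + timeout
        while True:
            for q in self.priority_queue.values():  # HIGH, then NORMAL, then LOW
                # rpoplpush takes no timeout; it atomically moves the id to a processing list
                task_id = self.r.rpoplpush(q, f"{q}:processing")
                if task_id:
                    raw = self.r.get(f"{self.task_key}{task_id}")
                    if raw is None:
                        # Task details expired: drop the orphaned id and keep looking
                        self.r.lrem(f"{q}:processing", 1, task_id)
                        continue
                    return task_id, json.loads(raw)
            if time.monotonic() >= deadline:
                return None
            time.sleep(0.2)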

Usage example:

queue = RedisTaskQueue()
# Add a high-priority task, using the product ID as unique_key
queue.add_task("task_001", {"url": "https://example.com/product/123"}, 
               priority=TaskPriority.HIGH, unique_key="product:123")
# Fetch a task (get_task returns None when no task is available)
task = queue.get_task()
if task:
    task_id, task_data = task

When idle, a crawler worker node calls get_task() to wait for a new task and runs the crawling logic once it receives one. After the task finishes, confirm completion by removing its task_id from the matching processing list (for example {prefix}queue:high:processing), or combine this with an exception retry mechanism that puts the task back on the queue.
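
The confirm-or-requeue step described above could look roughly like the sketch below. The helper names ack_task and requeue_task are hypothetical; r is assumed to be a redis-py client (only its lrem and lpush commands are used), and queue_keys is the list of queue names whose processing lists follow the "<queue>:processing" naming.

```python
def ack_task(r, queue_keys, task_id: str) -> bool:
    """Remove a finished task from whichever processing list holds it."""
    for key in queue_keys:
        # LREM with count=1 removes at most one matching entry and returns how many were removed
        if r.lrem(f"{key}:processing", 1, task_id):
            return True
    return False


def requeue_task(r, queue_keys, task_id: str) -> bool:
    """Move a failed task from its processing list back onto the same queue."""
    for key in queue_keys:
        if r.lrem(f"{key}:processing", 1, task_id):
            r.lpush(key, task_id)
            return True
    return False  # not in flight: already acknowledged or never claimed
```

With the class above, a worker could call ack_task(queue.r, queue.priority_queue.values(), task_id) on success and requeue_task(...) on failure. The remove and push in requeue_task are two separate commands, so a crash between them would lose the task; wrapping both in a Redis transaction would close that gap.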

3.2 Minimalist local proxy pool (based on HTTPbin verification)

IP proxies are key to avoiding blocks. For small projects, maintaining a dynamic proxy pool is costly, so we can start with a predefined proxy list and filter out the working proxies through periodic verification. The SimpleProxyPool class below verifies every proxy in the list during initialization and stores the live ones in a list; each subsequent call to get_random() returns one of them at random.

import requests
import random
import time

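class SimpleProxyPool:
    """Local proxy pool: a predefined proxy list verified against HTTPbin."""
    
    def __init__(self, proxy_list: list[str], check_url="http://httpbin.org/ip", timeout=3):
        """proxy_list format: ['http://user:pass@ip:port', 'socks5://ip:port']

        socks5:// proxies require the optional SOCKS support (pip install requests[socks]).
        """
        self.raw_list = proxy_list
        self.check_url = check_url
        self.timeout = timeout
        self.alive_proxies = []
        self._init_proxies()
    
    def _check_one(self, proxy: str) -> bool:
        """Return True if a request through the proxy succeeds."""
        try:
            resp = requests.get(self.check_url, proxies={"http": proxy, "https": proxy}, timeout=self.timeout)
            return resp.status_code == 200
        except requests.RequestException:
            # Catch only request failures so Ctrl+C and programming errors still surface
            return False
    
    def _init_proxies(self):
        for p in self.raw_list:
            if self._check_one(p):
                self.alive_proxies.append(p)
                # Print only the host part so credentials stay out of the log
                print(f"[√] Proxy {p.split('@')[-1]} passed verification")
        print(f"[√] Proxy pool initialized with {len(self.alive_proxies)} working proxies")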
    
    def get_random(self) -> str | None:
        if not self.alive_proxies:
            return None
        return random.choice(self.alive_proxies)

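In the actual crawler script, HTTP requests made with requests can use this pool either through the HTTP_PROXY / HTTPS_PROXY environment variables or by passing the proxy directly in code. Note that these settings only affect requests sent by the Python process itself; traffic generated by the App follows the device's own network settings, so the proxy has to be configured on the device (for example, in its Wi-Fi settings). To keep the pool healthy, you can also re-verify the proxy list every few minutes, drop proxies that have stopped working, and add new ones.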
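
The periodic re-verification mentioned above can be sketched with a background thread. The names RefreshingProxyList and start_periodic_refresh are hypothetical, and the check callable is an assumption: any function that takes a proxy URL and returns True when the proxy works.

```python
import threading


class RefreshingProxyList:
    """Thread-safe list of live proxies that can be re-verified on demand."""

    def __init__(self, candidates, check):
        self._candidates = list(candidates)
        self._check = check              # callable: proxy URL -> bool
        self._alive = []
        self._lock = threading.Lock()

    def refresh(self) -> int:
        """Re-check every candidate and swap in the new live list."""
        alive = [p for p in self._candidates if self._check(p)]
        with self._lock:
            self._alive = alive
        return len(alive)

    def snapshot(self) -> list:
        """Return a copy of the current live proxies."""
        with self._lock:
            return list(self._alive)


def start_periodic_refresh(pool, interval_seconds, stop_event):
    """Refresh the pool in a daemon thread until stop_event is set."""
    def loop():
        while not stop_event.is_set():
            pool.refresh()
            stop_event.wait(interval_seconds)  # sleeps, but wakes early on stop
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

With the SimpleProxyPool above, its _check_one method could serve as the check callable, and an interval of a few minutes (for example 300 seconds) matches the suggestion in the text.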


Precautions for project implementation

  1. Legal compliance always comes first. Do not crawl users' private data, do not put excessive load on the platform's services, and do not resell captured data. The technology itself is neutral, but using it in the wrong place crosses a red line.

  2. Balance performance and stability. For single-device UI automation, keep concurrency to at most 2 threads; otherwise the device may freeze or even crash. Proxies need to be refreshed regularly, and the task status in Redis should be paired with a timeout-and-retry mechanism so tasks do not get stuck.

  3. Optimize the deduplication mechanism. In addition to Redis Set deduplication based on unique_key, you can compute an MD5 value over the product title, price, image link, and similar fields as an auxiliary deduplication identifier to further improve data quality.

  4. Logs and monitoring. It is highly recommended to use Python's built-in logging module to record runtime information thoroughly, and to store each device's status (idle, working, abnormal) in a Redis Hash so the overall state is easy to inspect with a simple command-line script or web panel.
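
For item 3, the auxiliary MD5 identifier could be computed as in the sketch below. The function name product_fingerprint and the normalization choices (collapsing whitespace, lowercasing the title) are assumptions for illustration.

```python
import hashlib


def product_fingerprint(title: str, price: str, image_url: str = "") -> str:
    """Return an MD5 hex digest over normalized product fields."""
    parts = [
        " ".join(title.split()).lower(),  # collapse whitespace, ignore case
        price.strip(),
        image_url.strip(),
    ]
    # Join with a control character that is unlikely to occur in the fields,
    # so that ("ab", "c") and ("a", "bc") do not produce the same payload
    payload = "\x1f".join(parts)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

The digest can be passed as unique_key when adding a task, or checked before a record is stored. MD5 serves here only as a compact fingerprint, not as a security measure.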

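For item 4, a minimal sketch of logging setup plus device status reporting might look like this. The names setup_logger and report_device_status, the hash key app_crawler:devices, and the "state|timestamp" value format are all assumptions; r is assumed to be a redis-py client, of which only hset is used.

```python
import logging
import time

VALID_STATES = {"idle", "working", "abnormal"}


def setup_logger(name: str = "app_crawler", level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped records to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def report_device_status(r, device_serial: str, state: str,
                         hash_key: str = "app_crawler:devices") -> None:
    """Record a device's state and update time in a Redis Hash (one field per device)."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    # Value format "state|unix_timestamp" is an assumption made for this sketch
    r.hset(hash_key, device_serial, f"{state}|{int(time.time())}")
```

A monitoring script could then read the whole hash with a single HGETALL and flag devices whose timestamp has not changed for a while.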

You now have the core skeleton of a complete e-commerce App crawler system. From here, depending on the characteristics of the actual target App, you can adjust the element positioning rules, improve exception handling, connect more proxy sources, and even bring in mobile group-control tooling (such as STF or minicap/minitouch) to manage dozens of devices in one place. Good luck on your crawler journey, and remember: use technology for good and stay within the rules.