E-commerce App Product Scroll-Capture Project

This article presents a lightweight, easy-to-deploy solution for collecting product data from Android e-commerce apps by scrolling through their UI, with no root access required. It is built on uiautomator2, a Python wrapper around Google's UiAutomator UI testing framework. Compared with Appium, it is faster to configure and less likely to trigger risk control. An adapter pattern makes it quick to support multiple platforms, and the project includes SQLite persistent storage plus basic data analysis based on pandas/matplotlib. The core code can be run directly.

⚠️ Legal and Compliance Statement: This project is intended only for personal technical learning and research. Batch capture of data from platforms without authorization is strictly prohibited. Be sure to abide by the "User Agreement", "Privacy Policy", and relevant laws and regulations of each platform, and keep the collection frequency and the per-session and total collection volume within reasonable limits!


1. Core architecture: a three-module scroll-and-collect engine

We split the system into three loosely coupled modules: configuration management, lightweight SQLite storage, and UI interaction and extraction. This separation makes the system easy to iterate on and extend.
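One way to picture the extension point that this modular split leaves open is a small per-platform adapter table. The sketch below is illustrative only: PlatformAdapter, ADAPTERS and get_adapter are hypothetical names that do not appear in core_scraper.py, and the resource IDs are the ones used in the search-box lookup of the core code, so they may differ between app versions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PlatformAdapter:
    """Per-platform settings (hypothetical helper, not part of core_scraper.py)."""
    package: str         # Android package name of the target app
    search_box_id: str   # resource ID of the home-page search box


# Resource IDs taken from the search-box list in the core code;
# treat them as assumptions that may change with app updates.
ADAPTERS: Dict[str, PlatformAdapter] = {
    adapter.package: adapter
    for adapter in (
        PlatformAdapter("com.taobao.taobao", "com.taobao.taobao:id/searchbar_hint_view"),
        PlatformAdapter("com.jingdong.app.mall", "com.jingdong.app.mall:id/search_widget_text"),
        PlatformAdapter("com.xunmeng.pinduoduo", "com.xunmeng.pinduoduo:id/tv_search"),
    )
}


def get_adapter(package: str) -> Optional[PlatformAdapter]:
    """Return the adapter for a package, or None if the platform is unknown."""
    return ADAPTERS.get(package)
```

With a table like this, supporting another platform would mean adding one entry rather than editing the search logic.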

Why choose uiautomator2 first?

The following comparison table can help you quickly understand the characteristics of different solutions:

| Solution comparison | API collection | Appium | uiautomator2 |
|---|---|---|---|
| Configuration complexity | Medium (requires reverse engineering or obtaining legal token) | High (requires Node.js + Server) | Low (one-line pip installation + init) |
| Risk control trigger probability | High (interface has strict encryption/signature verification) | Medium (automation features are obvious) | Low (simulate native clicks and slides) |
| Versatility | Weak (needs to be rewritten after platform API changes) | Medium (compatible with multiple systems but slower) | Medium (Android only, but UI versatility is strong) |
| Sliding response speed | Fast (no UI rendering) | Slow (cross-process call) | Fast (directly drives the Android system UI) |

Core code (lite version)

We have removed redundant fields (such as coupons and user reviews) and kept the core logic that applies to the basic product-list screens of most e-commerce apps. The code already randomizes the swipe coordinates, the swipe duration, and the waits in the search flow so that the automation is less conspicuous.

# core_scraper.py
import uiautomator2 as u2
import time
import random
import json
import sqlite3
import re
import logging
from dataclasses import dataclass
from typing import Optional, Dict, List
from datetime import datetime

# ---------------------------
# 1. Logging and configuration
# ---------------------------
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler("scraper.log", encoding="utf-8"), logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

@dataclass
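class ScrapingConfig:
    """Scraping configuration: app package name, keywords, collection limits, etc."""
    app_package: str = "com.taobao.taobao"
    category_keywords: Optional[List[str]] = None
    max_products_per_category: int = 20
    scroll_interval: float = 2.8       # Pause between scrolls to mimic a real user (randomized by ±0.5 s at runtime)
    max_retry_no_new: int = 5          # Stop a category after N consecutive scrolls with no new products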
    
    def __post_init__(self):
        if not self.category_keywords:
            self.category_keywords = ["平价手机壳", "入门机械键盘"]

# ---------------------------
# 2. SQLite local storage
# ---------------------------
class EcommerceDB:
    """Lightweight local database that stores product data and crawl logs."""
    def __init__(self, path: str = "ecommerce.db"):
        self.path = path
        self._init_tables()
    
    def _init_tables(self):
        with sqlite3.connect(self.path) as conn:
            cursor = conn.cursor()
            # Products table (de-duplicated by temp_id; core fields only)
            cursor.execute('''
                CREATE TABLE IF NOT EXISTS products (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    temp_id TEXT UNIQUE,
                    title TEXT,
                    price REAL,
                    sales_count INTEGER,
                    shop_name TEXT,
                    category TEXT,
                    crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            ''')
            # Crawl session log, useful for measuring the efficiency of each run
            cursor.execute('''
                CREATE TABLE IF NOT EXISTS crawl_logs (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    category TEXT,
                    products_crawled INTEGER,
                    duration_seconds INTEGER,
                    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            ''')
            conn.commit()
        logger.info("✅ Local database initialized")
    
    def save_product(self, p: Dict):
        """Save one product record; rows with a duplicate temp_id are ignored."""
        with sqlite3.connect(self.path) as conn:
            try:
                cursor = conn.cursor()
                cursor.execute('''
                    INSERT OR IGNORE INTO products
                    (temp_id, title, price, sales_count, shop_name, category)
                    VALUES (?, ?, ?, ?, ?, ?)
                ''', (
                    p['temp_id'], p['title'][:120], p['price'], 
                    p['sales_count'], p['shop_name'][:60], p['category']
                ))
                conn.commit()
            except Exception as e:
                logger.warning(f"⚠️ Failed to save product: {e}")
    
    def save_session_log(self, log: Dict):
        """Record the result of collecting one keyword."""
        with sqlite3.connect(self.path) as conn:
            cursor = conn.cursor()
            cursor.execute('''
                INSERT INTO crawl_logs
                (category, products_crawled, duration_seconds)
                VALUES (?, ?, ?)
            ''', (log['category'], log['count'], log['duration']))
            conn.commit()

# ---------------------------
# 3. UI interaction and extraction
# ---------------------------
class EcommerceScraper:
    """Core scraper: connects to the device, runs searches, scrolls, and extracts product data."""
    def __init__(self, config: ScrapingConfig = None):
        self.config = config or ScrapingConfig()
        self.db = EcommerceDB()
        self.d = None
        self._connect_device()
    
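    def _connect_device(self):
        """Connect to an Android device that has USB debugging enabled."""
        try:
            self.d = u2.connect()
            logger.info(f"✅ Device connected: serial {self.d.serial}")
            # Restart the ATX helper app to make sure the service is running
            self.d.app_start("com.github.uiautomator", stop=True)
            time.sleep(3)
        except Exception as e:
            logger.error(f"❌ Device connection failed: {e}\nCheck USB debugging, authorization status, and drivers")
            # The exit() builtin is meant for interactive use; raise SystemExit in scripts
            raise SystemExit(1)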
    
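    def launch_app_and_search(self, keyword: str) -> bool:
        """Launch the target app and search for the given keyword."""
        try:
            self.d.app_start(self.config.app_package, stop=True)
            logger.info("⏳ Waiting for the app to finish launching...")
            time.sleep(7 + random.uniform(0, 2))
            
            # Locate the search box on mainstream e-commerce apps (resource ID first, then description/text)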
            search_box = None
            for rid in [
                "com.taobao.taobao:id/searchbar_hint_view",
                "com.jingdong.app.mall:id/search_widget_text",
                "com.xunmeng.pinduoduo:id/tv_search"
            ]:
                if self.d(resourceId=rid).exists(timeout=2):
                    search_box = self.d(resourceId=rid)
                    break
            if not search_box:
                search_box = self.d(descriptionMatches=r'^搜索.*$|^Search.*$')
            if not search_box:
                search_box = self.d(textMatches=r'^搜索.*$|^Search.*$')
            if not search_box or not search_box.click_exists(timeout=2):
                logger.error("❌ No usable search box found")
                return False
            
            # Clear the field and type the keyword
            time.sleep(1.2 + random.uniform(0, 1))
            self.d.clear_text()
            time.sleep(0.5)
            self.d.send_keys(keyword, clear=False)
            time.sleep(0.8)
            
            # Trigger the search (on-screen button first, then the system search key)
            if not self.d(textMatches=r'^搜索$|^Search$').click_exists(timeout=2):
                self.d.press("search")
            time.sleep(5 + random.uniform(0, 2))
            logger.info(f"✅ Search succeeded: {keyword}")
            return True
        except Exception as e:
            logger.error(f"❌ Launch or search failed: {e}")
            return False
    
    def _simulate_scroll_down(self) -> bool:
        """Simulate a real user's upward swipe (random offsets make the gesture look less automated)."""
        try:
            w, h = self.d.window_size()
            start_x = w // 2 + random.randint(-30, 30)
            start_y = int(h * 0.78 + random.randint(-20, 20))
            end_x = w // 2 + random.randint(-30, 30)
            end_y = int(h * 0.22 + random.randint(-20, 20))
            duration = 0.6 + random.uniform(0, 0.3)
            self.d.swipe(start_x, start_y, end_x, end_y, duration)
            time.sleep(self.config.scroll_interval + random.uniform(-0.5, 0.5))
            return True
        except Exception as e:
            logger.warning(f"⚠️ Simulated swipe failed: {e}")
            return False
    
    def _extract_single_product(self, container, category: str) -> Optional[Dict]:
        """Extract core product fields (title, price, sales, shop) from a single UI container."""
        try:
            # Temporary unique ID used as the database de-duplication key
            temp_id = f"{category}_{int(time.time() * 1000)}_{random.randint(1000, 9999)}"
            title = ""
            price = 0.0
            sales = 0
            shop = ""
            
            # Collect the text of every TextView inside the container.
            # A UiObject is not callable and has no dump_hierarchy(), so use child().
            texts = []
            for tv in container.child(className="android.widget.TextView"):
                text = (tv.get_text() or "").strip()
                if text:
                    texts.append(text)
            
            # Title: the first reasonably long text that does not start with a currency symbol
            for text in texts:
                if len(text) > 8 and not text.startswith(("¥", "¥", "$", "€")):
                    title = text
                    break
            
            # Price, sales and shop name are matched with regexes over the joined text (generic fallback)
            hierarchy = "\n".join(texts)
            price_match = re.search(r'[¥¥]\s*(\d{1,6}(?:\.\d{1,2})?)', hierarchy)
            if price_match:
                price = float(price_match.group(1))
            
            # Optional unit: 万 = 10,000, 千 = 1,000 (e.g. "1.2万+人付款")
            sales_match = re.search(r'(\d+(?:\.\d+)?)(万|千)?\+?(?:人付款|销量|已拼)', hierarchy)
            if sales_match:
                num = float(sales_match.group(1))
                unit = sales_match.group(2)
                if unit == "万":
                    sales = int(round(num * 10000))
                elif unit == "千":
                    sales = int(round(num * 1000))
                else:
                    sales = int(num)
            
            shop_match = re.search(r'([^\n]{2,40}?(?:旗舰店|专卖店|专营店|自营|官方))', hierarchy)
            if shop_match:
                shop = shop_match.group(1).strip()
            
            # A zero price means the container is a placeholder, not a product; skip it
            if price == 0.0:
                return None
            
            return {
                "temp_id": temp_id,
                "title": title,
                "price": price,
                "sales_count": sales,
                "shop_name": shop,
                "category": category
            }
        except Exception as e:
            logger.debug(f"🔍 Failed to extract product details: {e}")
            return None
    
    def scrape_single_category(self, category: str) -> int:
        """Collect products for one category (keyword) and return the number collected."""
        start_time = time.time()
        if not self.launch_app_and_search(category):
            return 0
        
        count = 0
        retry_no_new = 0
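        seen_bounds = set()   # Bounds already processed, used to skip duplicate containers
        
        while count < self.config.max_products_per_category and retry_no_new < self.config.max_retry_no_new:
            # Gather every candidate product container on the current screen.
            # A UiObject has no all() method; list() iterates over the matched elements.
            containers = (
                list(self.d(className="android.widget.RelativeLayout"))
                + list(self.d(className="android.widget.LinearLayout"))
                + list(self.d(className="androidx.recyclerview.widget.RecyclerView").child())
            )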
            new_found = False
            
            for c in containers:
                if count >= self.config.max_products_per_category:
                    break
                try:
                    # De-duplicate by container bounds and skip controls that are too short.
                    # bounds() returns a (left, top, right, bottom) tuple.
                    b_key = c.bounds()
                    left, top, right, bottom = b_key
                    if b_key in seen_bounds or bottom - top < 80:
                        continue
                    seen_bounds.add(b_key)
                    
                    product = self._extract_single_product(c, category)
                    if product:
                        self.db.save_product(product)
                        count += 1
                        new_found = True
                        logger.info(f"📦 Collected {count}/{self.config.max_products_per_category}: {product['title'][:20]}...")
                except Exception as e:
                    logger.debug(f"🔄 Failed to process UI container: {e}")
            
            if not new_found:
                retry_no_new += 1
                logger.warning(f"⚠️ No new products found, retries left: {self.config.max_retry_no_new - retry_no_new}")
                time.sleep(1.5)
            else:
                retry_no_new = 0
            
            # Swipe up once if more products are still needed
            if count < self.config.max_products_per_category:
                self._simulate_scroll_down()
        
        duration = int(time.time() - start_time)
        self.db.save_session_log({"category": category, "count": count, "duration": duration})
        logger.info(f"🏁 Category {category} finished: {count} products in {duration} s")
        return count
    
    def run_full_session(self):
        """Run a complete multi-category collection session."""
        logger.info("🚀 Starting multi-category collection session")
        total = 0
        for i, cat in enumerate(self.config.category_keywords):
            total += self.scrape_single_category(cat)
            # Rest for a random interval between keywords to avoid triggering platform risk control
            if i < len(self.config.category_keywords) - 1:
                rest_time = random.uniform(18, 35)
                logger.info(f"😴 Resting for {rest_time:.1f} s to avoid overly frequent operations...")
                time.sleep(rest_time)
        logger.info(f"🎉 Session finished: {total} products collected in total")

if __name__ == "__main__":
    # Edit the configuration below, then run the script
    custom_config = ScrapingConfig(
        category_keywords=["便携保温杯", "百元蓝牙耳机"],
        max_products_per_category=12
    )
    scraper = EcommerceScraper(custom_config)
    scraper.run_full_session()
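
Two pieces of the extraction logic can be exercised without a device. The listing builds temp_id from a timestamp and a random number, so the UNIQUE constraint will rarely reject a product that is seen twice; deriving the ID from the product's content is one possible alternative. The sketch below is a hedged illustration, not part of core_scraper.py: parse_sales_count and stable_product_id are hypothetical helper names, and the parser assumes the count appears before the label (for example "1.2万+人付款"), optionally followed by a 万 (10,000) or 千 (1,000) unit.

```python
import hashlib
import re

# Assumed text shape: "<number>[万|千][+]<label>", e.g. "1.2万+人付款"
_SALES_RE = re.compile(r'(\d+(?:\.\d+)?)(万|千)?\+?(?:人付款|销量|已拼)')
_UNIT_FACTOR = {"万": 10_000, "千": 1_000}


def parse_sales_count(text: str) -> int:
    """Return the sales count found in text, or 0 if nothing matches."""
    match = _SALES_RE.search(text)
    if not match:
        return 0
    factor = _UNIT_FACTOR.get(match.group(2), 1)
    return int(round(float(match.group(1)) * factor))


def stable_product_id(title: str, price: float, shop_name: str) -> str:
    """Build a repeatable ID from product content (hypothetical replacement for temp_id)."""
    key = f"{title.strip()}|{price:.2f}|{shop_name.strip()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on title, price and shop name, the same product seen on two consecutive screens would map to the same key, and INSERT OR IGNORE could then drop the second copy. Two different listings that share all three fields would also collide, which may or may not be acceptable for a given use.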

2. Adding value to the data: view analysis results in one minute

After collection finishes, you can use the following script to summarize and visualize the data stored in SQLite. The script sets a Chinese-capable font so that Chinese text in the charts is not rendered as garbled characters.

# quick_analytics.py
import json
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict

# ---------------------------
# Global configuration (prevents garbled Chinese text)
# ---------------------------
plt.rcParams['font.sans-serif'] = ['SimHei']          # Windows
# plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # macOS
# plt.rcParams['font.sans-serif'] = ['WenQuanYi Micro Hei']  # Linux
plt.rcParams['axes.unicode_minus'] = False

class QuickAnalytics:
    def __init__(self, db_path: str = "ecommerce.db"):
        self.db_path = db_path
    
    def _load_products(self) -> pd.DataFrame:
        """Load product data from SQLite."""
        with sqlite3.connect(self.db_path) as conn:
            df = pd.read_sql_query("SELECT * FROM products", conn)
        return df
    
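    def get_basic_stats(self) -> Dict:
        """Return basic summary statistics."""
        df = self._load_products()
        if df.empty:
            return {"msg": "⚠️ No product data in the database yet"}
        counts = df['category'].value_counts()
        return {
            "Total products collected": len(df),
            "Average price (CNY)": round(float(df['price'].mean()), 2),
            "Highest price (CNY)": round(float(df['price'].max()), 2),
            "Lowest price (CNY)": round(float(df['price'].min()), 2),
            "Category with the most products": counts.idxmax(),
            # Cast to int so json.dumps never receives NumPy integer types
            "Products per category": {k: int(v) for k, v in counts.items()}
        }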
    
    def plot_price_by_category(self):
        """Draw a box plot of prices per category and save it as an image."""
        df = self._load_products()
        if df.empty:
            return
        plt.figure(figsize=(10, 6))
        sns.boxplot(x='category', y='price', data=df, palette='pastel')
        plt.title('Price distribution by category (box plot)')
        plt.xlabel('Category')
        plt.ylabel('Price (CNY)')
        plt.tight_layout()
        plt.savefig('price_by_category.png', dpi=300)
        plt.show()
        print("📈 Price distribution chart saved as price_by_category.png")

if __name__ == "__main__":
    analytics = QuickAnalytics()
    stats = analytics.get_basic_stats()
    print("📊 Quick statistics report:\n", json.dumps(stats, ensure_ascii=False, indent=4))
    analytics.plot_price_by_category()
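
The crawl_logs table written by save_session_log can be summarized in the same spirit. The following is a minimal sketch that uses only the standard library; crawl_efficiency is a hypothetical helper name, and the products-per-minute figure is simply one convenient way to express efficiency, not a metric defined by the original scripts.

```python
import sqlite3
from typing import List, Tuple


def crawl_efficiency(db_path: str = "ecommerce.db") -> List[Tuple[str, int, int, float]]:
    """Return (category, products, seconds, products_per_minute) for each logged session."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT category, products_crawled, duration_seconds "
            "FROM crawl_logs ORDER BY id"
        ).fetchall()
    finally:
        conn.close()

    report = []
    for category, count, seconds in rows:
        # Guard against zero-length sessions (for example, a failed search)
        rate = round(count * 60 / seconds, 2) if seconds else 0.0
        report.append((category, count, seconds, rate))
    return report
```

Calling crawl_efficiency() after a run gives one tuple per keyword session, which makes it easy to spot keywords that took unusually long for the number of products they returned.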

💡 Tips: If you encounter font problems, install a suitable Chinese font, or change plt.rcParams['font.sans-serif'] to the name of a Chinese font that already exists on your system.
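
Instead of editing the font name by hand, the script could pick the first candidate that is actually installed. This is a sketch under assumptions: pick_first_available and configure_chinese_font are hypothetical helpers, the candidate list simply reuses the three font names from quick_analytics.py, and the installed fonts are read from matplotlib's font_manager.fontManager.ttflist.

```python
from typing import Iterable, Optional

# The three candidates already listed in quick_analytics.py
CANDIDATE_FONTS = ["SimHei", "Arial Unicode MS", "WenQuanYi Micro Hei"]


def pick_first_available(candidates: Iterable[str], installed: Iterable[str]) -> Optional[str]:
    """Return the first candidate font name that is installed, or None."""
    installed_set = set(installed)
    for name in candidates:
        if name in installed_set:
            return name
    return None


def configure_chinese_font() -> Optional[str]:
    """Point matplotlib at an installed Chinese font, if any candidate is found."""
    # Imported lazily so pick_first_available stays usable without matplotlib
    import matplotlib.pyplot as plt
    from matplotlib import font_manager

    installed = {font.name for font in font_manager.fontManager.ttflist}
    chosen = pick_first_available(CANDIDATE_FONTS, installed)
    if chosen:
        plt.rcParams['font.sans-serif'] = [chosen]
        plt.rcParams['axes.unicode_minus'] = False
    return chosen
```

If configure_chinese_font() returns None, none of the candidates is installed and a Chinese font still needs to be added to the system.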


3. Quick Deployment Guide

Environment preparation

  • Hardware/Software: A Windows/macOS/Linux computer, an Android phone or emulator with USB debugging turned on (turned on in Developer Options).
  • Python environment: Python 3.8 and above (3.9 - 3.11 recommended for better compatibility).
  • Install dependencies:
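# Install the core dependencies (pip pulls in uiautomator2's own dependencies automatically)
pip install uiautomator2 pandas matplotlib seaborn
# On first use, install the ATX helper service on the phone (only needs to be done once)
python -m uiautomator2 init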

After you run python -m uiautomator2 init, an application called "ATX" is installed on the phone to listen for automation commands.

Run steps

  1. Connect the device: Connect the phone to the computer with a USB data cable. When "Allow USB debugging" pops up on the phone, tap Allow. If adb is installed, you can run adb devices in a terminal to confirm that the device is recognized.
  2. Modify the configuration: Open core_scraper.py, find custom_config at the bottom of the file, and adjust it to your needs:
    • app_package: The package name of the target e-commerce app (for example, com.taobao.taobao for Taobao)
    • category_keywords: The list of product keywords you want to collect
    • max_products_per_category: The maximum number of products collected for each keyword
  3. Run the collection:
python core_scraper.py
  4. View the analysis results: After the collection is completed, run the following command to generate the statistical report and the price distribution chart.
python quick_analytics.py
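
The device check in step 1 can also be scripted. The sketch below assumes adb is on the PATH and that adb devices prints one "serial<TAB>state" line per device; parse_adb_devices and list_ready_devices are hypothetical helper names that are not part of the project scripts.

```python
import subprocess
from typing import List


def parse_adb_devices(output: str) -> List[str]:
    """Return serials of devices whose state is 'device' (connected and authorized)."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Header and daemon messages never have "device" as their second token
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def list_ready_devices() -> List[str]:
    """Run `adb devices` and return the serials that are ready for automation."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)
```

A device listed as unauthorized is filtered out, which usually means the "Allow USB debugging" prompt on the phone has not been accepted yet.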

4. Simple anti-automation tips (optional)

If you want to further reduce the risk of being detected by the platform, you can try these lightweight optimizations (some of which are already implemented in the code):

  1. Randomized sliding parameters: Random offsets are added to the sliding starting point, end point, and interval time to make the operation more like a real person.
  2. Occasional simulated pauses or "wrong clicks": Between calls to _simulate_scroll_down, you can randomly add an extra pause of 0.5 to 1.5 seconds.
  3. Modify ATX service characteristics: Some platforms detect ATX-related processes or package names. Advanced users can repackage com.github.uiautomator or modify its resources.
  4. Control the total collection duration and period: A single continuous collection is not recommended to exceed 1 hour. It is best to divide it into multiple time periods, with a long enough random rest interval.
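
The extra pause from tip 2 could look roughly like the sketch below. It is only an illustration: extra_pause_seconds and pause_like_a_reader are hypothetical names, and the 20% default probability is an arbitrary assumption; only the 0.5 to 1.5 second range comes from the tip itself.

```python
import random
import time
from typing import Optional


def extra_pause_seconds(probability: float = 0.2,
                        rng: Optional[random.Random] = None) -> float:
    """Return an extra pause of 0.5 to 1.5 s with the given probability, else 0.0."""
    source = rng if rng is not None else random
    if source.random() < probability:
        return source.uniform(0.5, 1.5)
    return 0.0


def pause_like_a_reader(probability: float = 0.2,
                        rng: Optional[random.Random] = None) -> float:
    """Sleep for the extra pause (if any) and return how long the pause was."""
    delay = extra_pause_seconds(probability, rng)
    if delay:
        time.sleep(delay)
    return delay
```

Calling pause_like_a_reader() right after each _simulate_scroll_down() would add the occasional longer gap, and it also lowers the overall request rate, which fits tip 4.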

⚙️ Note: The solution in this article is only suitable for legal scenarios such as learning and research. Excessively frequent automated operations may still violate platform regulations. Please be sure to control collection behavior and respect platform rules.