Douyin App Packet Capture and Analysis in Practice

Whether you are doing content analysis or user-behavior research, data from top short-video apps such as Douyin is almost unavoidable. Unfortunately, the web version is either feature-stripped or protected by anti-crawling mechanisms that change almost daily. The app side is closer to the real data, but it comes with its own hurdles: SSL pinning, complex request signatures, and device binding.

Today we start with an entry-level, minimalist approach and walk through the complete workflow: capture traffic → identify the core APIs → parse the captured responses → collect data without touching the signature. The article also includes a Python parsing tool that runs directly on captured data, so you can verify the data structure and the extraction logic.


1. Preparation: first get readable plaintext traffic

Packet capture is the first step of any API analysis or crawler development. If this step fails, everything after it is wasted effort.

Why choose the "old-version combination"?

Mainly to get around Douyin's two most basic anti-crawling hurdles:

  1. System certificate trust restrictions. Since Android 7.0 (API level 24), apps no longer trust user-installed CA certificates by default; the certificate has to be placed in the system certificate directory, which requires root, and newer Android versions make that step progressively harder. iOS is similarly hard to work with unless the device is jailbroken. An old emulator or old physical device running Android 9 or below, with root/Frida available, is therefore the least troublesome choice.

  2. SSL pinning (certificate pinning). Douyin versions released after mid-2024 have certificate pinning enabled almost across the board: the app only accepts its built-in ByteDance root certificate, so installing the capture tool's CA has no effect. Prefer older APKs in the 23.5.0 to 23.9.0 range, which avoid most pinning restrictions.

Minimalist entry-level setup: Thunderbolt 9 emulator (LDPlayer 9) + Mitmproxy + an old version of Douyin

This combination has the lowest barrier to entry; beginners can usually get readable traffic within 15 minutes.

1. Tool installation and basic configuration

| Role | Recommended tool | Installation / configuration essentials |
| --- | --- | --- |
| Traffic capture + HTTPS decryption | Mitmproxy / Mitmweb | Download the latest version of Mitmproxy and unzip it, then run mitmweb -p 8080 in a terminal / CMD (the web monitoring page opens automatically on port 8081). |
| Emulator WiFi proxy settings | Thunderbolt Simulator 9 (LDPlayer 9, Android 9) | "Settings" in the upper right corner → "WiFi" → long-press the default LeidianWiFi → "Modify Network" → "Advanced Options" → set the proxy to "Manual", enter your host machine's LAN IPv4 address as the proxy hostname (Windows: run ipconfig in cmd → find the WLAN adapter's "IPv4 Address"), and enter 8080 as the port. |
| Install the CA certificate in the emulator | Mitmproxy's built-in CA | Open mitm.it in the emulator's browser → choose Android and download the certificate → drag it into the emulator's "File Transfer" → locate it and install it. The name can be anything, but remember that it is a "root certificate". |

💡 Common pitfalls:

  • Do not enter 127.0.0.1 as the proxy IP: that address points to the emulator itself. Enter the LAN IP of your host machine instead.
  • When installing the certificate, the system may ask you to set a lock screen password. Just follow the prompts to set it.
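
If reading the ipconfig output is inconvenient, the host's LAN address can also be guessed from Python. The following is a minimal sketch, not part of the original toolchain: guess_lan_ip is a hypothetical helper name, and on machines with several adapters (VPN, virtual network cards) the result may not be the WLAN adapter, so check it against ipconfig.

```python
import socket


def guess_lan_ip() -> str:
    """Best-effort guess of this machine's outgoing LAN IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # choose the outgoing interface. 192.0.2.1 is a documentation-only address.
        sock.connect(("192.0.2.1", 80))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example an offline machine): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(f"Proxy host for the emulator: {guess_lan_ip()}, port: 8080")
```

If the script prints 127.0.0.1, the machine has no usable network route and the value must not be used as the emulator's proxy address.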

2. Verify whether the configuration is successful

  1. First open Mitmweb's web monitoring page (http://localhost:8081) and check whether basic network requests are showing up.
  2. Install an old version of Douyin around 23.7.0 (do not look for it in an app store; get the installation package from APKPure or a site that hosts historical versions).
  3. Open Douyin, browse 3 to 5 recommended videos, then return to the web monitoring page and check whether there are requests whose host starts with aweme.snssdk.com and whose responses are plaintext JSON. If there are, this hurdle is cleared and you can continue.

2. Traffic filtering: only capture useful "core APIs"

A single capture session generates hundreds or even thousands of requests. Don't panic: focus on the key domain names and response formats first.

Three principles of rapid filtering

  • Ignore static resources outright: requests to p*.douyinpic.com (images) and v*.douyinvod.com (video files) only download media and have little value for data collection.
  • Keep only plaintext JSON: focus on requests whose host starts with aweme.snssdk.com and whose Content-Type is application/json.
  • Note down the high-frequency, easy-to-use core interfaces. The main ones are:
| Function | Interface path fragment | Key required parameters |
| --- | --- | --- |
| User homepage information | /aweme/v1/user/ | sec_user_id (an encrypted, stable user ID; much more reliable than the purely numeric user_id) |
| All videos posted by a user (paginated) | /aweme/v1/aweme/post/ | sec_user_id; count (items per page, generally at most 20~30); max_cursor (paging cursor, pass 0 on the first request) |
| Single video details | /aweme/v1/aweme/detail/ | aweme_id (unique video ID) |
| Recommended feed data | /aweme/v1/feed/ | type=0; count |
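
The three filtering principles can be written down as a small predicate. This is a minimal sketch under stated assumptions: the host patterns and path fragments are the ones listed above, while the function and constant names are hypothetical and the prefix matching is deliberately loose (it also accepts longer paths that start with a listed fragment).

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hosts that only serve media (from the first principle above).
STATIC_HOST_PATTERNS = ("p*.douyinpic.com", "v*.douyinvod.com")
# Host prefix and path fragments of the core interfaces.
API_HOST_PREFIX = "aweme.snssdk.com"
CORE_PATHS = (
    "/aweme/v1/user/",
    "/aweme/v1/aweme/post/",
    "/aweme/v1/aweme/detail/",
    "/aweme/v1/feed/",
)


def is_core_api_response(url: str, content_type: str) -> bool:
    """Return True if a captured URL/Content-Type pair looks like a core API response."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if any(fnmatch(host, pattern) for pattern in STATIC_HOST_PATTERNS):
        return False  # static pictures or video files
    if not host.startswith(API_HOST_PREFIX):
        return False  # not the API host
    if "application/json" not in content_type.lower():
        return False  # not plaintext JSON
    return any(parts.path.startswith(path) for path in CORE_PATHS)
```

Applying the predicate to the URL and Content-Type columns of a capture quickly narrows hundreds of rows down to the handful worth reading.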

📌 In short:

  • sec_user_id is the highly stable form of the user ID; it does not become invalid when the purely numeric ID changes, so prefer it.
  • max_cursor is the paging cursor. Each response carries a has_more field and the max_cursor to use in the next request, so you can page through by passing that value back in.
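
The paging rule can be checked offline against pages you have already captured. The sketch below assumes each page is the parsed JSON of one /aweme/v1/aweme/post/ response containing aweme_list, has_more and max_cursor as described above; next_cursor and walk_pages are hypothetical helper names.

```python
from typing import Any, Dict, List, Optional


def next_cursor(page: Dict[str, Any]) -> Optional[int]:
    """Return the max_cursor for the next request, or None on the last page."""
    if not page.get("has_more"):
        return None  # has_more is 0/False/missing: nothing left to fetch
    cursor = page.get("max_cursor")
    return int(cursor) if cursor is not None else None


def walk_pages(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect aweme_list items from captured pages, stopping at the last page."""
    items: List[Dict[str, Any]] = []
    for page in pages:
        items.extend(page.get("aweme_list") or [])
        if next_cursor(page) is None:
            break
    return items
```

Comparing next_cursor(page) with the max_cursor parameter of the following captured request is a quick way to confirm that the app pages exactly this way in your capture.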

3. Code practice: an "API response parsing tool" for captured data

⚠️ Special note

Douyin's real signature scheme (_signature, x-gorgon, x-khronos, x-ss-stub, etc.) is extremely complex: it involves the native layer and dynamically generated functions, and is almost impossible for a beginner to reproduce in a short time. What is provided here is therefore a tool that parses and saves the real, complete URL/JSON obtained from packet capture. It does not touch signatures at all; it works directly on the plaintext data you have already captured, to help you verify the API response structure and the data extraction logic.

Complete Python code

import json
import csv
from typing import Dict, Any, List, Optional
from datetime import datetime


class DouyinDataParser:
    """Parses Douyin API responses obtained from packet capture and saves them in structured form."""

    @staticmethod
    def extract_user_posts(
        response_data: Dict[str, Any],
        only_save_public: bool = True
    ) -> List[Dict[str, Any]]:
        """
        Extract structured data from a "user posted videos" list response.
        :param response_data: the captured JSON, loaded as a Python dict
        :param only_save_public: whether to keep only publicly visible videos
        :return: list of structured video records
        """
        structured_videos = []
        raw_aweme_list = response_data.get("aweme_list", [])

        if not raw_aweme_list:
            print("No aweme_list field found; check that the correct API response was passed in!")
            return []

        for aweme in raw_aweme_list:
            try:
                # Heuristic "public" filter kept from the original logic:
                # skip a video only if it is neither pinned nor shareable.
                status = aweme.get("status") or {}
                if only_save_public and aweme.get("is_top", 0) != 1 and status.get("allow_share", 1) != 1:
                    continue

                # Convert the Unix timestamp to a readable local time
                create_datetime = datetime.fromtimestamp(
                    aweme.get("create_time", 0)
                ).strftime("%Y-%m-%d %H:%M:%S")

                # "or {}" / "or [None]" also cover null values and empty lists
                author = aweme.get("author") or {}
                video = aweme.get("video") or {}
                stats = aweme.get("statistics") or {}
                play_urls = (video.get("play_addr") or {}).get("url_list") or [None]
                cover_urls = (video.get("cover") or {}).get("url_list") or [None]

                # Extract the fields of interest
                video_data = {
                    "Video ID": aweme.get("aweme_id"),
                    "Video title": (aweme.get("desc") or "").strip().replace("\n", " "),
                    "Published at": create_datetime,
                    "Author UID": author.get("uid"),
                    "Author nickname": (author.get("nickname") or "").strip(),
                    "Author Douyin ID": (author.get("unique_id") or "").strip(),
                    "Play URL (CDN link, may expire quickly)": play_urls[0],
                    "Cover URL": cover_urls[0],
                    "Likes": stats.get("digg_count", 0),
                    "Comments": stats.get("comment_count", 0),
                    "Shares": stats.get("share_count", 0),
                    "Plays": stats.get("play_count", 0),
                    "Pinned": "yes" if aweme.get("is_top", 0) == 1 else "no",
                }
                structured_videos.append(video_data)
            except Exception as e:
                print(f"Failed to parse one video record, skipping: {str(e)}")
                continue

        return structured_videos

    @staticmethod
    def save_to_csv(
        data: List[Dict[str, Any]],
        filename: Optional[str] = None
    ) -> None:
        """
        Save structured data to a CSV file.
        :param data: the structured list returned by an extract_* function
        :param filename: output file name; defaults to a timestamped name
        """
        if not data:
            print("No data to save!")
            return

        if not filename:
            filename = f"douyin_posts_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"

        try:
            # Write UTF-8 with BOM so Excel opens the file without garbled text
            with open(filename, mode="w", newline="", encoding="utf-8-sig") as f:
                writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
                writer.writeheader()
                writer.writerows(data)
            print(f"Data saved successfully! File path: {filename}")
        except Exception as e:
            print(f"Failed to save CSV: {str(e)}")


def main():
    print("=" * 60)
    print("Douyin captured API response parsing tool (for learning the API structure only)")
    print("=" * 60)
    print("\nSteps:")
    print("1. Complete the packet capture and get the full response JSON of the user posted videos interface;")
    print("2. Copy the JSON into a file named response.json in the current directory;")
    print("3. Run this tool to parse it and save the result as CSV.")
    print("=" * 60 + "\n")

    # Read the local JSON file
    try:
        with open("response.json", mode="r", encoding="utf-8") as f:
            raw_response = json.load(f)
    except FileNotFoundError:
        print("❌ response.json not found; please follow the steps above!")
        return
    except json.JSONDecodeError:
        print("❌ response.json is malformed; please check that it is valid JSON!")
        return

    # Parse the data
    parser = DouyinDataParser()
    posts = parser.extract_user_posts(raw_response)
    print(f"\n✅ Parsed {len(posts)} video records!")

    # Save to CSV
    if posts:
        parser.save_to_csv(posts)


if __name__ == "__main__":
    main()

🛠 Usage:

  1. On the Mitmweb monitoring page, find one response from the /aweme/v1/aweme/post/ interface and copy the complete response body;
  2. Create a new response.json file in the script's directory, paste the copied content into it, and save;
  3. Run the script; a timestamped CSV file containing the structured video information is generated in the current directory.
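
Before the first real capture, the script can be dry-run against a hand-made file. The sketch below writes a minimal response.json containing only the fields the parser reads. Every value is a made-up placeholder, not real Douyin data, and a real response carries many more fields; SAMPLE_RESPONSE and write_sample are hypothetical names.

```python
import json

# Placeholder record with only the fields the parsing tool reads.
SAMPLE_RESPONSE = {
    "aweme_list": [
        {
            "aweme_id": "0000000000000000001",
            "desc": "placeholder caption",
            "create_time": 1700000000,
            "is_top": 0,
            "status": {"allow_share": 1},
            "author": {"uid": "0", "nickname": "placeholder", "unique_id": "placeholder_id"},
            "video": {
                "play_addr": {"url_list": ["https://example.com/play.mp4"]},
                "cover": {"url_list": ["https://example.com/cover.jpeg"]},
            },
            "statistics": {
                "digg_count": 0,
                "comment_count": 0,
                "share_count": 0,
                "play_count": 0,
            },
        }
    ],
    "has_more": 0,
    "max_cursor": 0,
}


def write_sample(path: str = "response.json") -> None:
    """Write the placeholder response to disk as UTF-8 JSON."""
    with open(path, mode="w", encoding="utf-8") as f:
        json.dump(SAMPLE_RESPONSE, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    write_sample()
```

Running the parsing tool afterwards should produce a CSV with a single row, which confirms the file handling works before any real data is involved.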

4. Next steps and compliance tips

Advanced data collection options (avoiding or solving the signature)

If you do not want to stay in the manual "capture first, parse later" mode and would like semi-automatic or even fully automatic data collection, the following two directions are relatively beginner-friendly:

  1. Appium + Mitmproxy together. Use Appium to simulate a real person's swipes and taps, while Mitmproxy sits in the middle, intercepts the requests and responses of the real interfaces, and saves the structured data directly. This approach does not require reproducing the signature algorithm at all, because you are only ever intercepting legitimate requests made by the real app.

  2. Frida hook on the signature function. If you already have some reverse-engineering background, you can try using Frida to hook Douyin's libxgorgon.so or the Java-layer signature generation class, call these native functions in real time to generate real x-gorgon and related parameters, and then send simulated requests with Python requests, so that the interface can be exercised independently of packet capture.
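
For the first direction, the Mitmproxy side can be sketched as a small script that saves each matching response to disk instead of requiring a manual copy. This is an illustrative sketch, not a tested production addon: response(flow) is mitmproxy's event hook, and pretty_host, path, headers and get_text() are attributes of its flow objects, while save_post_page, OUTPUT_DIR and the file naming are hypothetical choices.

```python
import json
import time
from pathlib import Path
from typing import Optional

API_HOST_PREFIX = "aweme.snssdk.com"
POST_LIST_PATH = "/aweme/v1/aweme/post/"
OUTPUT_DIR = Path("captured_pages")  # hypothetical output folder


def save_post_page(flow, output_dir: Path) -> Optional[Path]:
    """Save one posted-videos JSON response; return the file path, or None if skipped."""
    request = flow.request
    if not request.pretty_host.startswith(API_HOST_PREFIX):
        return None
    if not request.path.startswith(POST_LIST_PATH):
        return None
    content_type = flow.response.headers.get("content-type", "")
    if "application/json" not in content_type.lower():
        return None
    try:
        payload = json.loads(flow.response.get_text())
    except (TypeError, ValueError):
        return None  # empty, undecodable, or non-JSON body
    output_dir.mkdir(parents=True, exist_ok=True)
    out_file = output_dir / f"post_{time.time_ns()}.json"
    out_file.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return out_file


def response(flow) -> None:
    """mitmproxy event hook, called once for every completed HTTP response."""
    save_post_page(flow, OUTPUT_DIR)
```

Saved as, say, save_posts.py (a hypothetical file name), the script would be loaded with mitmdump -s save_posts.py -p 8080. Each saved file has the same shape as the response.json the parsing tool expects.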

⚠️ Compliance Tips (Very Important)

Please be sure to comply with relevant laws and regulations such as the Cybersecurity Law of the People's Republic of China, the Data Security Law of the People's Republic of China, and the Personal Information Protection Law of the People's Republic of China:

  1. Only crawl publicly visible non-sensitive data;
  2. Do not make high-frequency requests to avoid affecting Douyin’s normal services;
  3. Do not use the captured data for commercial purposes;
  4. Do not spread crawled personal information (such as mobile phone number, address, real-name information, etc.).
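
For the second rule, any automation loop should pace itself. The following is a minimal throttle sketch, assuming the loop calls wait() before every action that triggers a request; MinIntervalThrottle is a hypothetical name and the 2-second interval in the usage line is an arbitrary placeholder, not a documented limit.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Enforce a minimum time gap between consecutive actions."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous action

    def wait(self) -> float:
        """Block until min_interval has passed since the last call; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = MinIntervalThrottle(2.0)  # placeholder interval
    for step in range(3):
        throttle.wait()
        print(f"action {step}")
```

The clock and sleep parameters exist only so the class can be tested without real waiting; in normal use the defaults are enough.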

That is the zero-to-one practical approach to packet capture analysis of the Douyin app. There are really three core points: choose an older version to get around the basic protections, use Mitmproxy to see the API traffic, and verify the parsing logic on captured data before considering automation. With these firmly in place, both data analysis and more advanced reverse engineering become much easier.