The latest Ajax crawling technology tutorial in 2023

Preface

You have probably run into this: you request a web page with requests (for example Weibo, the Douyin web version, or the old Xiaohongshu list pages), and the returned HTML is only an empty skeleton, with the text and list data nowhere to be found. In most cases this is because the site uses Ajax (Asynchronous JavaScript and XML) to load content dynamically. The server does not put all the data into the initial HTML. Instead, once the page has loaded, the front end quietly sends requests to back-end interfaces, receives JSON or XML data, and renders it onto the page.

This tutorial takes you from capturing packets with the developer tools to a hands-on exercise on the Weibo mobile site, covering the mainstream Ajax analysis methods and basic anti-crawling countermeasures of 2023~2024. There are no complicated formulas, and you can get started in 30 minutes!


1. Modern Ajax request analysis technology

The core idea of Ajax is that the front end asynchronously requests back-end interfaces to get data, so our first step is to find these hidden interfaces. The developer tools built into every modern browser (Chrome, Edge, and Firefox all work; Chrome is recommended) are the best tool for the job.

1.1 Quick opening of developer tools

There is no need to recall the complicated right-click menu sequence, just remember these shortcut keys:

  • Windows / Linux: F12 or Ctrl + Shift + I
  • macOS: Cmd + Option + I

Recommended workflow:

  1. Open the target page first (for example, a personal homepage on the Weibo mobile site: https://m.weibo.cn/u/2830678474).
  2. Press the shortcut to launch the developer tools and switch to the Network panel at the top.
  3. Press Ctrl + R (Windows/Linux) or Cmd + R (macOS) to reload the page. The Network panel only records requests made while it is open, so reloading is the only way to capture everything triggered during page load, including static resources and dynamic interfaces.

1.2 Quickly filter hidden Ajax interfaces

After reloading, the Network panel shows a dense list of requests (CSS, JS, images, fonts...), and scanning it by eye for interfaces is too slow. Use the following filters to locate your target quickly:

  1. Fetch/XHR: covers the vast majority of modern dynamic interfaces, including both the traditional XMLHttpRequest and the newer fetch API.
  2. WS: if the page content arrives over two-way real-time WebSocket communication (such as chat messages or live-stream bullet comments), click this filter.
  3. GraphQL: some newer sites (such as some newer GitHub pages) use GraphQL. These requests normally show up under Fetch/XHR as POST requests to a single endpoint, so type a keyword such as graphql into the filter text box to narrow them down.

After filtering by Fetch/XHR, the remaining requests are mostly the dynamic interfaces we are looking for.

1.3 Quickly determine whether the interface is valid

Faced with a long list of interfaces, how do you quickly pick out the real data interfaces, such as the post list or the user information? Try these three tips:

  1. Look at the request method and URL. Most data interfaces use GET (fetch data) or POST (submit complex parameters), and keywords such as /api/, /v2/, /feed/, /list/, and /user/ often appear in the URL.

  2. Check the response preview. Click a request in the Network panel, then switch to the Preview tab on the right. If you see familiar content, such as the blogger's Weibo text or the user avatar URL, you have found the target interface.

  3. Copy the cURL command to help with debugging. If you are worried about missing a request header, right-click the useful request and select Copy → Copy as cURL (bash). This gives you a request template identical to what the browser sent, which is also easy to convert into Python code later.
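
The first tip can be turned into a rough pre-filter, for example when going through a list of captured requests. The sketch below is only a heuristic: the keyword list comes from the tip above, while the function name looks_like_data_api and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse

# Path keywords that often show up in data interfaces (taken from the tip above).
API_KEYWORDS = ("api", "v2", "feed", "list", "user")


def looks_like_data_api(method: str, url: str) -> bool:
    """Heuristic: a GET/POST request whose path contains a typical keyword."""
    if method.upper() not in {"GET", "POST"}:
        return False
    # Compare whole path segments so "/static/userguide.css" does not match "user".
    segments = [seg.lower() for seg in urlparse(url).path.split("/") if seg]
    return any(seg in API_KEYWORDS for seg in segments)


if __name__ == "__main__":
    samples = [
        ("GET", "https://example.com/api/feed/list?page=1"),
        ("GET", "https://example.com/static/app.css"),
        ("OPTIONS", "https://example.com/api/user"),
    ]
    for method, url in samples:
        print(method, url, looks_like_data_api(method, url))
```

A match only means the request is worth opening in Preview; the response content is still what confirms it.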


2. 2023~2024 Mainstream anti-crawling response basics

Finding the interface is only the first step. Many websites will set up anti-crawling mechanisms: everything may be fine if you open the interface address directly in the browser, but when you make a request using Python, 403, 401 or empty data will be returned. Here are some introductory but very practical solutions for you.

2.1 Complete simulation of browser requests (most commonly used)

Most entry-level anti-crawling checks (such as verifying the User-Agent, Referer, and Cookie request headers) can be passed simply by converting the cURL command you just copied into Python code.

It is recommended to use a free tool for one-click conversion: curlconverter.com
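
To make it clearer what such a conversion involves, here is a minimal sketch that pulls the URL, method, headers, and body out of a simple copied command. It is not how curlconverter works internally and it only handles a handful of flags; the function name parse_curl is hypothetical, and bash $'...' quoting is not supported.

```python
import shlex

DATA_FLAGS = ("-d", "--data", "--data-raw", "--data-binary")


def parse_curl(command: str) -> dict:
    """Extract method, URL, headers and body from a simple curl command."""
    # "Copy as cURL (bash)" splits the command with backslash-newline pairs.
    tokens = shlex.split(command.replace("\\\n", " "))
    parsed = {"method": None, "url": None, "headers": {}, "data": None}
    it = iter(tokens[1:])  # skip the leading "curl"
    for tok in it:
        if tok in ("-H", "--header"):
            name, _, value = next(it).partition(":")
            parsed["headers"][name.strip()] = value.strip()
        elif tok in ("-X", "--request"):
            parsed["method"] = next(it).upper()
        elif tok in DATA_FLAGS:
            parsed["data"] = next(it)
        elif tok in ("-b", "--cookie"):
            parsed["headers"]["Cookie"] = next(it)
        elif not tok.startswith("-"):
            parsed["url"] = tok
        # Any other flag (e.g. --compressed) is ignored in this sketch.
    if parsed["method"] is None:
        # curl defaults to POST when a body is supplied, otherwise GET.
        parsed["method"] = "POST" if parsed["data"] is not None else "GET"
    return parsed


if __name__ == "__main__":
    sample = """curl 'https://m.weibo.cn/api/feed/profile?uid=2830678474&page=1' \\
  -H 'X-Requested-With: XMLHttpRequest' \\
  -H 'Referer: https://m.weibo.cn/u/2830678474' \\
  --compressed"""
    print(parse_curl(sample))
```

For real work the online converter is the safer choice, since it covers far more of curl's options.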

There are two points to note when converting:

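  1. If the request carries cookies, do not hardcode cookies that may expire quickly. Let the cookie jar of an httpx client (or Python's http.cookiejar) manage them instead.
  2. Consider adding the http2=True parameter. Some newer websites expect HTTP/2, and a request without it may be rejected with a 403. Note that HTTP/2 support in httpx is an optional extra, installed with pip install "httpx[http2]".

Below is a generic request template built on httpx, an async library that supports HTTP/2. Because it is asynchronous, many requests can wait on the network at the same time, which is usually much faster than issuing them one by one with requests: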

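from typing import Optional

import httpx  # HTTP/2 support needs the optional extra: pip install "httpx[http2]"


async def fetch_api_data(
    url: str,
    method: str = "GET",
    params: Optional[dict] = None,
    headers: Optional[dict] = None,
    data: Optional[dict] = None,
):
    """
    Generic asynchronous request helper that mimics a browser request.

    :param url: target interface URL
    :param method: request method (GET or POST)
    :param params: query parameters for a GET request
    :param headers: full request headers (ideally copied from curlconverter)
    :param data: form data for a POST request
    :return: the parsed JSON body, or None if the request failed
    """
    async with httpx.AsyncClient(http2=True, follow_redirects=True) as client:
        try:
            resp = await client.request(
                method=method,
                url=url,
                params=params,
                headers=headers,
                data=data,
                timeout=10.0,  # set a timeout so the request cannot hang forever
            )
            resp.raise_for_status()  # raise an exception on 4xx/5xx responses
            return resp.json()       # most interfaces return JSON, so parse it directly
        except httpx.HTTPStatusError as e:
            print(f"Request failed, status code: {e.response.status_code}")
        except Exception as e:
            print(f"Other error: {e}")
    return None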
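
Most of the speed-up from an async client comes from overlapping the time spent waiting on the network. The sketch below illustrates this with a stand-in coroutine (fake_fetch is hypothetical and sleeps instead of doing real I/O) so that it runs offline; in real use its body would call a function such as fetch_api_data. The semaphore caps the number of requests in flight.

```python
import asyncio
import time


async def fake_fetch(page: int, delay: float = 0.2) -> dict:
    """Stand-in for a real request: waits `delay` seconds, returns a dummy payload."""
    await asyncio.sleep(delay)
    return {"page": page}


async def fetch_many(pages: list, max_concurrency: int = 3) -> list:
    """Fetch all pages concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page: int) -> dict:
        async with semaphore:
            return await fake_fetch(page)

    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(*(guarded(p) for p in pages))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(fetch_many([1, 2, 3, 4, 5, 6]))
    print(results, f"{time.perf_counter() - start:.2f}s")
```

Keep the concurrency limit low against a real site; overlapping requests is no excuse for hammering the server.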

2.2 Getting started with dynamic parameters

If the request still fails after you have reproduced the full request headers, the interface most likely uses dynamic parameters that change with every request, such as sign, token, or _t. For entry-level dynamic parameters, you can use PyExecJS to run the page's own encryption JS directly:

  1. In the Sources panel of the developer tools, press Ctrl + Shift + F to search globally for the parameter name (e.g. sign) and find the JavaScript function that generates it.
  2. Copy this JS function together with the code it depends on, making sure to fill in any other variables or functions it needs.
  3. Use PyExecJS to execute this JS and compute the dynamic parameters required for the current request.

As a simple example, suppose the function that generates token is getToken(timestamp):

import execjs  # PyExecJS; requires a JavaScript runtime such as Node.js
import time

# JS code with its dependencies filled in
js_code = """
// Assume this is the token-generating function found in Sources
function getRandomStr() {
    return Math.random().toString(36).substr(2, 9);
}
function getToken(t) {
    return getRandomStr() + t.toString().substr(-6);
}
"""

# Compile the JS
ctx = execjs.compile(js_code)
# Call the function with the current timestamp (in seconds)
timestamp = int(time.time())
token = ctx.call("getToken", timestamp)
print(f"Generated dynamic token: {token}")
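
When the JS logic is as simple as in this toy example, it can also be ported to Python, which removes the need for a JavaScript runtime. The following is a rough port of the hypothetical getToken above, not of any real site's algorithm; the names get_random_str and get_token are made up for illustration.

```python
import random
import time

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"


def get_random_str(length: int = 9) -> str:
    """Rough equivalent of Math.random().toString(36).substr(2, 9)."""
    return "".join(random.choice(BASE36) for _ in range(length))


def get_token(timestamp: int) -> str:
    """Random base-36 prefix followed by the last six digits of the timestamp."""
    return get_random_str() + str(timestamp)[-6:]


if __name__ == "__main__":
    print(get_token(int(time.time())))
```

Real signing functions are usually more involved, so executing the original JS remains the safer starting point until you are sure the port matches.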

3. Hands-on: the Weibo mobile site (tested by the author in December 2023)

That is enough theory. Now we will use the homepage of a public blogger on the Weibo mobile site (https://m.weibo.cn/u/2830678474, no personal privacy involved) for a complete practical exercise.

3.1 Capture packets to find the target interface

Follow the steps from 1.1 to 1.3:

  1. Open the target page → launch the developer tools → switch to Network → filter by Fetch/XHR → reload the page.
  2. Click through the Preview of several requests in turn. You will find that the /api/feed/profile interface returns the HTML fragments of the blogger's Weibo list (yes, some interfaces look like APIs but respond with HTML rather than pure JSON).
  3. Switch to the Headers tab and note the request URL, query parameters, and key request headers.

The core information of the target interface is as follows:

  • Request method: GET
  • URL: https://m.weibo.cn/api/feed/profile
  • Query parameters: uid (blogger ID, required), page (page number, starting from 1)
  • Required request headers: User-Agent (a mobile UA), Referer (the blogger's homepage address), X-Requested-With (XMLHttpRequest, indicating that this is an Ajax request)
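
As a quick sanity check before writing the full script, the captured details can be assembled into a concrete URL and header set. This is a small sketch using only the standard library; the helper name build_profile_request is made up, and the endpoint and parameters are simply those recorded above.

```python
from urllib.parse import urlencode

BASE_URL = "https://m.weibo.cn/api/feed/profile"
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Mobile/15E148 Safari/604.1"
)


def build_profile_request(uid: str, page: int = 1):
    """Return the (url, headers) pair for one page of a blogger's feed."""
    if page < 1:
        raise ValueError("page numbers start from 1")
    url = f"{BASE_URL}?{urlencode({'uid': uid, 'page': page})}"
    headers = {
        "User-Agent": MOBILE_UA,
        "X-Requested-With": "XMLHttpRequest",  # marks the request as Ajax
        "Referer": f"https://m.weibo.cn/u/{uid}",
    }
    return url, headers


if __name__ == "__main__":
    url, headers = build_profile_request("2830678474", page=1)
    print(url)
    print(headers)
```

Comparing the printed URL with the one shown in the Headers tab is an easy way to catch a misspelled parameter.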

3.2 Complete Python implementation code

Combine the generic request template with the packet capture results, then use parsel to parse the returned HTML fragments, and you get the complete crawling script:

import asyncio

import httpx
from parsel import Selector

# ---------------------- Configuration ----------------------
UID = "2830678474"          # target blogger ID
MAX_PAGE = 2                # maximum number of pages to crawl
HEADERS = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Mobile/15E148 Safari/604.1",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": f"https://m.weibo.cn/u/{UID}",
}

# ---------------------- Core functions ----------------------
async def fetch_weibo_page(uid: str, page: int = 1):
    """Fetch one page of Weibo data; returns the parsed JSON or None on failure."""
    async with httpx.AsyncClient(http2=True, follow_redirects=True) as client:
        try:
            resp = await client.get(
                url="https://m.weibo.cn/api/feed/profile",
                params={"uid": uid, "page": page},
                headers=HEADERS,
                timeout=10.0
            )
            resp.raise_for_status()
            return resp.json()
        except Exception as e:
            print(f"Failed to crawl page {page}: {e}")
            return None

def parse_weibo_html(html: str):
    """Parse the Weibo HTML fragments and extract the useful fields."""
    # Note: the class names below reflect the markup seen when this was written;
    # adjust the selectors if the returned fragments are structured differently.
    selector = Selector(text=html)
    weibo_list = []
    # Iterate over each Weibo card
    for card in selector.css(".card-wrap"):
        # Skip ad cards (they carry the mblog-tag--ad class)
        if card.css(".mblog-tag--ad"):
            continue
        weibo_list.append({
            "weibo_id": card.css(".card::attr(mid)").get(),
            "published_at": card.css(".time::text").get(),
            "text": "".join(card.css(".weibo-text::text, .weibo-text a::text").getall()).strip(),
            "likes": card.css(".like-count::text").get("0"),
            "comments": card.css(".comment-count::text").get("0"),
            "reposts": card.css(".repost-count::text").get("0"),
        })
    return weibo_list

async def main():
    """Entry point: crawl several pages in sequence."""
    all_weibos = []
    for page in range(1, MAX_PAGE + 1):
        print(f"Crawling page {page}...")
        data = await fetch_weibo_page(UID, page)
        if not data or not data.get("data", {}).get("cards"):
            print(f"Page {page} has no data, stopping")
            break
        # As of December 2023, the first element of cards is the pinned post (isTop: 1),
        # followed by the regular posts.
        # Concatenate the HTML fragments of all cards and parse them in one pass.
        full_html = "".join([
            card.get("mblog", {}).get("text", "")
            for card in data["data"]["cards"]
            if card.get("mblog")
        ])
        page_weibos = parse_weibo_html(full_html)
        all_weibos.extend(page_weibos)
        # Wait 3 seconds between pages to avoid being blocked for requesting too fast
        await asyncio.sleep(3)

    # Print the final result
    print(f"\nDone, collected {len(all_weibos)} valid posts:")
    for idx, weibo in enumerate(all_weibos, 1):
        print(f"\n[Post {idx}]")
        for key, value in weibo.items():
            print(f"{key}: {value}")

if __name__ == "__main__":
    asyncio.run(main())

4. Legal and ethical red lines

The technology itself is neutral, but those who use it must follow the rules; otherwise they may face serious legal risk. Please keep the following points in mind:

  1. Comply with robots.txt: before crawling, visit https://<domain>/robots.txt on the target website and check whether the path you want to access is explicitly disallowed.
  2. Set a reasonable crawl interval: at least 3 seconds per request is recommended, so that you do not put unnecessary pressure on the target server.
  3. Never crawl private personal data, such as mobile phone numbers, ID numbers, private WeChat Moments posts, or private Weibo posts.
  4. Comply with relevant laws and regulations: when collecting data in China, you must comply with laws such as the Data Security Law, the Personal Information Protection Law, and the Cybersecurity Law.
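
The robots.txt check in point 1 can be automated with Python's standard library. The sketch below parses rules supplied as a string so that it runs offline; the helper name is_allowed, the crawler name, and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    for url in ("https://example.com/public/page", "https://example.com/private/data"):
        print(url, is_allowed(sample_rules, "my-crawler", url))
```

Against a live site, RobotFileParser can download the file itself: call set_url() with the robots.txt address and then read() before using can_fetch().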

Summary

This tutorial has taken you from scratch through the basics of modern Ajax crawling:

  1. Use the developer tools to find hidden Ajax interfaces (filter by Fetch/XHR → check Preview)
  2. Reproduce the browser request in full (one-click cURL conversion → add http2=True)
  3. Handle entry-level dynamic parameters (use PyExecJS to execute the front-end encryption logic)
  4. Stay within legal and ethical red lines

If you run into more complex anti-crawling measures in practice (such as TLS fingerprinting, WebAssembly encryption, or behavioral CAPTCHAs), follow our later advanced tutorials to tackle them step by step!