The Latest Ajax Crawling Tutorial for 2023
Preface
You have probably run into this: you request a web page with the requests library (such as Weibo, the Douyin web version, or the old Xiaohongshu list pages), and the returned HTML is only an empty skeleton; the text and list data seem to have vanished. In most cases this is because the site uses Ajax (Asynchronous JavaScript and XML) to load content dynamically. The server does not put all the data into the initial HTML. Instead, after the page has loaded, the front end quietly sends requests to back-end interfaces, receives JSON or XML, and renders it into the page.
This tutorial takes you from capturing requests with the developer tools to a hands-on exercise on the Weibo mobile site, covering the mainstream Ajax analysis methods and basic anti-crawling countermeasures of 2023~2024. There are no complicated formulas, and you can get started in about 30 minutes!
1. Modern Ajax request analysis technology
The core idea of Ajax is that the front end asynchronously requests a back-end interface to get data, so our first step is to find this hidden interface. The developer tools built into every modern browser (Chrome, Edge, and Firefox all work; Chrome is recommended) are the best tool for the job.
1.1 Opening the developer tools quickly
There is no need to remember the right-click menu sequence; just remember these shortcuts:
- Windows / Linux: F12 or Ctrl + Shift + I
- macOS: Cmd + Option + I
Recommended workflow:
- Open the target page (for example, a Weibo mobile profile page: https://m.weibo.cn/u/2830678474).
- Press the shortcut to open the developer tools and switch to the Network panel at the top.
- Press Ctrl + R (Windows/Linux) or Cmd + R (macOS) to reload the page; add Shift for a hard reload that bypasses the cache. The Network panel only records requests made while it is open, so reloading is the only way to capture everything triggered during page load, including static resources and dynamic interfaces.
1.2 Filtering for hidden Ajax interfaces
After the reload, the Network panel shows a dense list of requests (CSS, JS, images, fonts...), and scanning it by eye for interfaces is too slow. Use the following filter labels to locate your target quickly:
- Fetch/XHR: covers the vast majority of modern dynamic interfaces, including both the traditional XMLHttpRequest and the newer fetch API.
- WS: if the page gets its content through two-way real-time WebSocket communication (such as chat messages or live-stream bullet comments), click this label.
- GraphQL: some newer sites (such as some new GitHub pages and Notion) use GraphQL. DevTools has no separate GraphQL label; these calls show up under Fetch/XHR, typically as POST requests whose URL contains graphql, so type graphql into the filter text box to narrow the list.
After filtering by Fetch/XHR, the remaining requests are mostly the dynamic interfaces we want.
1.3 Quickly checking whether an interface is the one you want
Faced with a long list of requests, how do you quickly pick out the real data interfaces, such as the post list or user information? Try these three tips:
- Look at the request method and URL features. Most data interfaces use GET (fetch data) or POST (submit complex parameters), and the URL often contains keywords such as /api/, /v2/, /feed/, /list/, or /user/.
- Check the response preview. Click a request in the Network panel, then switch to the Preview tab on the right. If you see familiar content, such as the blogger's post text or the user's avatar URL, congratulations: you have found the target interface!
- Copy the request as a curl command to help with debugging. If you are worried about missing a request header, right-click the useful request and choose Copy → Copy as cURL (bash). This gives you a request template identical to what the browser sent, which is also easy to convert into Python code later.
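The first tip can be turned into a rough pre-filter when you have many captured requests to sort through. The sketch below is only a heuristic built from the keywords listed above; the function name looks_like_data_api and the constant names are made up for illustration, and a match still needs to be confirmed in the Preview tab.

```python
from urllib.parse import urlsplit

# Path keywords taken from the tip above; extend the tuple for your target site.
API_PATH_HINTS = ("/api/", "/v2/", "/feed/", "/list/", "/user/")
DATA_METHODS = {"GET", "POST"}


def looks_like_data_api(method: str, url: str) -> bool:
    """Return True if the method and URL path match the heuristics above."""
    path = urlsplit(url).path.lower()
    if not path.endswith("/"):
        path += "/"  # so that a path ending in ".../user" still matches "/user/"
    method_ok = method.upper() in DATA_METHODS
    path_ok = any(hint in path for hint in API_PATH_HINTS)
    return method_ok and path_ok
```

A heuristic like this only narrows the list: static files under an /api/ path would also match, and a data interface with an unusual path would be missed.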
2. Basics of handling mainstream anti-crawling measures (2023~2024)
Finding the interface is only the first step. Many websites add anti-crawling mechanisms: the interface may work fine when you open its address directly in the browser, yet return 403, 401, or empty data when you request it from Python. Here are some introductory but very practical solutions.
2.1 Fully simulating browser requests (most commonly used)
Most entry-level anti-crawling checks (such as inspecting the User-Agent, Referer, and Cookie request headers) can be passed simply by converting the cURL command you just copied into Python code.
A free tool is recommended for one-click conversion: curlconverter.com
There are two points to note when converting:
- If the request carries cookies, do not hardcode cookies that may expire quickly. Let the cookie jar of httpx manage them instead.
- Add the http2=True parameter (HTTP/2 support in httpx requires installing the httpx[http2] extra). Some newer websites expect HTTP/2, and a request without it may be rejected outright with a 403.
A generic request template can be built on httpx, a client library that supports both HTTP/2 and async requests; when many requests are sent concurrently, it is typically much faster than calling requests one at a time.
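A minimal sketch of such a template follows. The header values, the helper names build_headers and fetch, and the 10-second timeout are assumptions for illustration; in practice you would copy the real headers from your own "Copy as cURL" output.

```python
from typing import Optional

# Placeholder desktop User-Agent (an assumption); copy the real one from your browser.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(referer: str, user_agent: str = DEFAULT_UA) -> dict:
    """Build the headers that entry-level checks usually inspect."""
    return {
        "User-Agent": user_agent,
        "Referer": referer,
        "Accept": "application/json, text/plain, */*",
    }


async def fetch(url: str, *, referer: str, params: Optional[dict] = None) -> str:
    """GET url (over HTTP/2 when the server supports it) and return the body text."""
    # Third-party: pip install "httpx[http2]". Imported here so that
    # build_headers stays usable even where httpx is not installed.
    import httpx

    async with httpx.AsyncClient(
        http2=True,
        headers=build_headers(referer),
        timeout=10.0,
        follow_redirects=True,
    ) as client:
        # The client keeps its own cookie jar: cookies set by earlier responses
        # are sent back automatically on later requests from the same client.
        response = await client.get(url, params=params)
        response.raise_for_status()  # raises httpx.HTTPStatusError on 4xx/5xx
        return response.text
```

From synchronous code, a call would look like asyncio.run(fetch("https://example.com/api/list", referer="https://example.com/")), where the example.com URLs are placeholders.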
2.2 Getting started with dynamic parameters
If the request still fails after you have reproduced the full request headers, the interface most likely uses dynamic parameters that change with every request, such as sign, token, or _t. For entry-level dynamic parameters, you can try PyExecJS, which executes the page's encryption JavaScript directly:
- In the Sources panel of the developer tools, press Ctrl + Shift + F to search globally for the parameter name (e.g. sign) and find the JavaScript function that generates it.
- Copy that function together with the code it depends on, making sure to include any other variables or functions it needs.
- Use PyExecJS to execute the JavaScript and compute the dynamic parameters required for the current request.
As a simple example, suppose the function that generates token is getToken(timestamp).
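The sketch below shows what that could look like. The body of getToken is a made-up placeholder (a real site would hash or encrypt here), and the parameter names page, _t, and token in signed_params are assumptions; replace both with what you actually find in the page's JavaScript.

```python
import time

# Hypothetical stand-in for the JavaScript copied from the target page.
JS_SOURCE = """
function getToken(timestamp) {
    // Placeholder logic only: reverse the digits and add a prefix.
    return "tk_" + String(timestamp).split("").reverse().join("");
}
"""


def compute_token(timestamp_ms: int) -> str:
    """Run getToken(timestamp) in a JavaScript runtime and return its result."""
    # Third-party: pip install PyExecJS. It also needs a JavaScript runtime
    # such as Node.js to be installed on the machine.
    import execjs

    context = execjs.compile(JS_SOURCE)
    return context.call("getToken", timestamp_ms)


def signed_params(page: int) -> dict:
    """Build query parameters carrying a fresh timestamp and its token."""
    timestamp_ms = int(time.time() * 1000)
    return {"page": page, "_t": timestamp_ms, "token": compute_token(timestamp_ms)}
```

Computing the token immediately before each request matters here, because a parameter derived from the timestamp usually stops being accepted once it is stale.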
3. Hands-on: the Weibo mobile site (personally tested and working in December 2023)
With the theory covered, we will now use the homepage of a public blogger on the Weibo mobile site (https://m.weibo.cn/u/2830678474, which involves no personal privacy) for a complete hands-on exercise.
3.1 Capturing requests to find the target interface
Follow the steps from 1.1 to 1.3:
- Open the target page → open the developer tools → switch to Network → filter by Fetch/XHR → reload the page.
- Click through the Preview of several requests in turn. You will find that the /api/feed/profile interface returns an HTML fragment of the blogger's post list (yes, some interfaces look like APIs but respond with HTML rather than pure JSON).
- Switch to the Headers tab and record the request URL, query parameters, and key request headers.
The core information of the target interface is as follows:
- Request method: GET
- URL: https://m.weibo.cn/api/feed/profile
- Query parameters: uid (blogger ID, required), page (page number, starting from 1)
- Required request headers: User-Agent (a mobile UA), Referer (the blogger's homepage address), X-Requested-With (XMLHttpRequest, indicating that this is an Ajax request)
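Because an interface that looks like an API may answer with HTML rather than JSON, it can help to check what actually came back before choosing a parser. The helper below is a small sketch of that check; the name payload_kind and its three return values are illustrative choices, not part of any library.

```python
import json


def payload_kind(content_type: str, body: str) -> str:
    """Classify a response body as "json", "html", or "unknown"."""
    declared = content_type.lower()
    text = body.lstrip()
    if "json" in declared or text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            pass  # the header or first character suggested JSON, but it does not parse
    if "html" in declared or text[:1] == "<":
        return "html"
    return "unknown"
```

With httpx, the two arguments would come from response.headers.get("content-type", "") and response.text.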
3.2 Complete Python implementation
Combining the general request template with the capture results, and then parsing the returned HTML fragment with parsel, gives the complete crawling script.
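A sketch of such a script is shown below, using the interface details recorded in section 3.1. Several parts are assumptions to verify against your own capture: the CSS selector div.weibo-text, the mobile User-Agent string, and the defaults of 3 pages and a 3-second delay. It uses the synchronous httpx.Client rather than the async client, since the requests are deliberately spaced out anyway.

```python
import time
from typing import Iterator

API_URL = "https://m.weibo.cn/api/feed/profile"  # interface found in section 3.1
UID = "2830678474"  # the public blogger used throughout this tutorial
# Placeholder mobile User-Agent (an assumption); copy the real one from DevTools.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)


def build_request(uid: str, page: int) -> tuple:
    """Return (params, headers) matching the capture results in section 3.1."""
    params = {"uid": uid, "page": page}
    headers = {
        "User-Agent": MOBILE_UA,
        "Referer": f"https://m.weibo.cn/u/{uid}",
        "X-Requested-With": "XMLHttpRequest",
    }
    return params, headers


def parse_posts(html: str) -> list:
    """Extract post texts from the returned HTML fragment."""
    from parsel import Selector  # third-party: pip install parsel

    posts = []
    # ASSUMPTION: "div.weibo-text" is a guessed selector; inspect the real
    # fragment in the Preview tab and adjust it.
    for node in Selector(text=html).css("div.weibo-text"):
        text = node.xpath("string(.)").get(default="").strip()
        if text:
            posts.append(text)
    return posts


def crawl(uid: str = UID, max_pages: int = 3, delay: float = 3.0) -> Iterator[str]:
    """Yield post texts page by page, pausing between requests."""
    import httpx  # third-party: pip install "httpx[http2]"

    with httpx.Client(http2=True, timeout=10.0) as client:
        for page in range(1, max_pages + 1):
            params, headers = build_request(uid, page)
            response = client.get(API_URL, params=params, headers=headers)
            response.raise_for_status()
            posts = parse_posts(response.text)
            if not posts:
                break  # an empty page is treated as the end of the list
            yield from posts
            time.sleep(delay)  # keep the crawl interval polite
```

Iterating over crawl() and printing each item is enough to try it out; if nothing is printed, the selector is the first thing to check.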
4. Legal and ethical red lines (must read!)
Technology itself is neutral, but the people who use it must follow the rules; otherwise serious legal risks may arise. Please keep the following points in mind:
- Comply with robots.txt: before crawling, visit https://<domain>/robots.txt on the target website and check whether the path you want to access is explicitly disallowed.
- Set a reasonable crawl interval: at least 3 seconds per request is recommended, so that you do not put unnecessary load on the target server.
- Never crawl personal private data, such as mobile phone numbers, ID numbers, private WeChat Moments posts, or private Weibo posts.
- Comply with relevant laws and regulations: when collecting data in China, you must comply with laws such as the Data Security Law, the Personal Information Protection Law, and the Cybersecurity Law.
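The first two points can be enforced in code rather than left to memory. The sketch below uses the standard-library urllib.robotparser for the robots.txt check and a small Throttle class for the interval; the names load_robots, robots_from_text, and Throttle are illustrative.

```python
import time
from urllib.robotparser import RobotFileParser


def load_robots(robots_url: str) -> RobotFileParser:
    """Download and parse a robots.txt file (performs one network request)."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser


def robots_from_text(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 3.0) -> None:
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the last request was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

Before each request, call parser.can_fetch(user_agent, url) and skip the URL if it returns False, then call throttle.wait() so that requests stay at least 3 seconds apart.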
Summary
This tutorial has taken you from scratch through the basics of modern Ajax crawling:
- Use developer tools to find hidden Ajax interfaces (filter Fetch/XHR → view Preview)
- Complete simulation of browser requests (curl one-click conversion → add http2=True)
- Entry-level dynamic parameter response (use PyExecJS to execute front-end encryption logic)
- Abide by legal and ethical red lines
If you run into more complex anti-crawling measures in practice (such as TLS fingerprinting, WebAssembly encryption, or behavioral CAPTCHAs), follow our upcoming advanced tutorials to work through them step by step!

