title: What is Ajax?
description: Modern Web Crawling Technology: Ajax Data Scraping Guide

From "What is Ajax?" to hands-on crawling: a one-stop introductory guide

Many people who are new to web development or crawling run into "Ajax data": the content is plainly visible on the page, yet right-clicking and choosing "View page source" shows none of it. This article answers three questions: **What is Ajax?** **How does it work in modern web pages?** **And how can crawler developers capture this kind of dynamic data efficiently?**


1. What is Ajax?

Ajax (Asynchronous JavaScript and XML) is not a single new technology but a combination of existing web technologies that updates part of a page through asynchronous communication, without reloading the whole page.

It combines the following technologies:

  1. Communication core: XMLHttpRequest or the modern Fetch API, responsible for exchanging data with the server in the background.
  2. Document model: the DOM (Document Object Model), used to locate and dynamically modify part of the page.
  3. Scripting language: JavaScript, which ties together the whole process of communication, data parsing, and page modification.
  4. Data format: XML was common in the early days; today it is almost always JSON, and sometimes HTML fragments are sent directly.

Core value (why can't modern web pages do without it?)

  • Interaction without refresh, smoother experience: for example, Douyin's infinite scroll and Weibo's search autocomplete. No operation reloads the whole page, so there is no white-screen flicker.
  • Less bandwidth, faster loading: only the data itself is requested from the server rather than a complete set of HTML, CSS, and JS, so unchanged parts are not transferred again.
  • Front-end/back-end separation, clearer architecture: the server only provides data interfaces, while the front end handles display and interaction logic, which makes maintenance and division of labor more efficient.

2. How is Ajax written in modern web pages?

Understanding how the front end implements Ajax makes it much easier to find interfaces and write accurate crawlers later. There are currently three mainstream approaches:

1. Traditional XMLHttpRequest (old but still worth knowing)

Most new projects no longer use it, but you will still encounter it in older systems.

const xhr = new XMLHttpRequest();

// Configure the request: GET method, interface URL, async flag (keep it true; synchronous requests are deprecated)
xhr.open('GET', 'https://api.example.com/hot-topics?page=1', true);

// Listen for the load (completion) event
xhr.onload = function () {
  if (xhr.status === 200) {
    // responseText is a JSON string and must be parsed into a JS object
    const topics = JSON.parse(xhr.responseText);
    console.log(topics);
  }
};

// Send the request (pass null for GET; for POST, pass the body to send())
xhr.send(null);

2. Modern Fetch API (native, preferred)

fetch() is Promise-based, so chained calls read more clearly; most new projects use it. Note that it is a browser Web API rather than part of the ES6 language itself.

// Promise chain style
fetch('https://api.example.com/hot-topics?page=1')
  .then(res => {
    if (!res.ok) throw new Error('Request failed');
    return res.json();  // Parse the response body as JSON
  })
  .then(topics => console.log(topics))
  .catch(err => console.error('Something went wrong:', err));

// async / await style (easier to read)
async function getHotTopics(page = 1) {
  try {
    const res = await fetch(`https://api.example.com/hot-topics?page=${page}`);
    if (!res.ok) throw new Error(`HTTP status code: ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error('Failed to fetch hot topics:', err);
  }
}

3. Third-party library wrappers (a development efficiency tool)

For example, Axios is widely used in React and Vue projects. It comes with automatic JSON conversion, interceptors, request cancellation, and other features.

import axios from 'axios';

async function getHotTopics(page = 1) {
  try {
    const { data: topics } = await axios.get('https://api.example.com/hot-topics', {
      params: { page }
    });
    return topics;
  } catch (err) {
    console.error('Axios request failed:', err.response?.data || err.message);
  }
}

Once you understand these approaches, the picture becomes clear: the data that appears dynamically on a page is requested from specific interfaces by code like this. Those interfaces are exactly what our crawler targets.
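This is also why "View page source" shows none of that data. The following is a minimal sketch, assuming a made-up initial HTML document: the `topic-list` id and the `/static/app.js` path are hypothetical and only illustrate a container that JavaScript fills in later.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML of an Ajax-driven page: the list container is
# present, but its items are only inserted later by JavaScript.
INITIAL_HTML = """
<html>
  <body>
    <ul id="topic-list"></ul>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags, i.e. the items a crawler would hope to find."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


def count_list_items(html: str) -> int:
    """Return the number of <li> elements found in the given HTML text."""
    parser = ListItemCounter()
    parser.feed(html)
    return parser.li_count


if __name__ == "__main__":
    # The static HTML contains the empty container but no items.
    print(count_list_items(INITIAL_HTML))  # 0
```

A plain HTML parser only sees what the server sent; the items exist only after the browser has run the script and called the interface.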


3. How do crawler developers capture Ajax data?

Because Ajax data is requested and inserted by JavaScript after the page loads, it is not present in the initial HTML and cannot be obtained by parsing that HTML directly. There are two main approaches:

✅ Option 1: Call the Ajax interface directly (the most efficient, preferred)

We copy the request logic the front end uses: the crawler imitates the browser, sends a request straight to the interface, and gets back pure JSON/XML data. This option is the fastest and consumes the fewest resources.

Step 1: Find the target interface

This step relies entirely on the browser's built-in developer tools (F12 / right-click → Inspect):

  1. Open the Network panel.
  2. Click the Fetch/XHR filter to show only asynchronous data requests.
  3. Perform an action on the page that triggers Ajax (such as scrolling down to load more, typing a search keyword, or paging through comments).
  4. Watch the new requests that appear in the panel, click one and open its Response tab, and find the interface that contains the data you want.
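When a page fires many requests, the same search can be done offline. Browser developer tools can usually export the recorded traffic as a HAR file, which is plain JSON. The sketch below assumes the standard HAR layout (`log.entries[].request` and `log.entries[].response.content.mimeType`); the sample data is hand-written for illustration, and `find_json_endpoints` is a hypothetical helper name.

```python
import json

# A tiny, hand-written stand-in for a HAR export (hypothetical data).
SAMPLE_HAR = json.loads("""
{
  "log": {
    "entries": [
      {"request": {"method": "GET", "url": "https://example.com/"},
       "response": {"status": 200, "content": {"mimeType": "text/html"}}},
      {"request": {"method": "GET",
                   "url": "https://api.example.com/hot-topics?page=1"},
       "response": {"status": 200,
                    "content": {"mimeType": "application/json; charset=utf-8"}}},
      {"request": {"method": "GET", "url": "https://example.com/static/app.js"},
       "response": {"status": 200,
                    "content": {"mimeType": "application/javascript"}}}
    ]
  }
}
""")


def find_json_endpoints(har: dict) -> list:
    """Return (method, url) pairs for entries whose response is JSON."""
    results = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime_type = content.get("mimeType", "")
        if "json" in mime_type.lower():
            request = entry["request"]
            results.append((request["method"], request["url"]))
    return results


if __name__ == "__main__":
    for method, url in find_json_endpoints(SAMPLE_HAR):
        print(method, url)
```

Filtering on the response MIME type is only a heuristic: some interfaces return JSON labelled as `text/plain`, so a manual look at the Response tab remains the final check.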

Step 2: "Translate" the interface request into crawler code

After finding the interface, right-click the request and choose Copy → Copy as cURL (bash), then use an online tool (such as curlconverter) to convert it to Python requests code in one click. The generated code usually needs only slight adjustment.
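To make it concrete what such a conversion extracts, here is a rough sketch that pulls the URL, method, and headers out of a copied cURL command. It is not how curlconverter works internally; `parse_curl` is a hypothetical helper that only understands `-H/--header`, `-X/--request`, and a bare URL, and the sample command is made up.

```python
import shlex

# Hypothetical command in the shape produced by "Copy as cURL (bash)".
SAMPLE_CURL = r"""curl 'https://api.example.com/hot-topics?page=1' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'X-Requested-With: XMLHttpRequest' \
  --compressed"""


def parse_curl(command: str) -> dict:
    """Extract url, method, and headers from a simple cURL command."""
    # Join backslash-continued lines, then split the way a POSIX shell would.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("not a curl command")

    parsed = {"method": "GET", "url": None, "headers": {}}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            parsed["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-X", "--request") and i + 1 < len(tokens):
            parsed["method"] = tokens[i + 1].upper()
            i += 2
        elif token.startswith("-"):
            # Flags this sketch does not understand are skipped. Flags that
            # take a value (e.g. -b, --data) are NOT handled correctly here.
            i += 1
        else:
            if parsed["url"] is None:
                parsed["url"] = token
            i += 1
    return parsed


if __name__ == "__main__":
    print(parse_curl(SAMPLE_CURL))
```

The resulting dictionary maps directly onto the `url`, `headers`, and method of a `requests` call. For real commands with cookies or request bodies, a dedicated converter is the safer choice.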

Here is a generic example with manual adjustments:

import requests
import json

# 1. Set request headers to mimic a real browser
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/128.0.0.0 Safari/537.36"),
    # Some sites check this header to decide whether the request is an Ajax request
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json, text/plain, */*",
    # If the interface requires login, also add Cookie or Authorization
}

# 2. Build the parameters: use params for GET requests, json/data for POST requests
params = {
    "page": 2,
    "size": 15,
    "category": "tech"
}

# 3. Send the request with proper exception handling
try:
    resp = requests.get(
        url="https://api.example.com/hot-topics",
        headers=headers,
        params=params,
        timeout=10  # Set a timeout so the program cannot hang
    )
    resp.raise_for_status()   # Raise an exception for 4xx/5xx status codes
    data = resp.json()        # Parse the JSON data
    print(json.dumps(data, indent=2, ensure_ascii=False))
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
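The `page` parameter suggests the interface is paginated, so a crawler usually loops over pages. The sketch below assumes, purely for illustration, that each page is a JSON list and that a page past the end comes back empty; real interfaces may instead report a total count or a "has more" flag. `iter_pages` and `fake_fetch_page` are hypothetical names, and the fetch function is passed in so the loop can be tried without network access.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int], list],
               start: int = 1,
               max_pages: int = 50) -> Iterator[list]:
    """Yield one page of items at a time.

    Stops at the first empty page, or after max_pages pages as a safety
    limit so a misbehaving interface cannot cause an endless loop.
    """
    for page in range(start, start + max_pages):
        items = fetch_page(page)
        if not items:
            break
        yield items


def fake_fetch_page(page: int) -> list:
    """Stand-in for a real request; pretends the interface has 3 pages."""
    data = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
    return data.get(page, [])


if __name__ == "__main__":
    for items in iter_pages(fake_fetch_page):
        print(items)
```

In real use, `fetch_page` would wrap a `requests.get` call like the one above and return the parsed list for the given page number.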

✅ Option 2: Use a headless browser to render the page

When the interface parameters are heavily encrypted (such as _signature or a dynamically generated token) and manual reverse engineering would take too long, you can turn to a headless browser: a real browser without a graphical interface. It executes the page's JavaScript just as it would for a normal user, issues the Ajax requests automatically, and lets the data appear in the DOM. We only need to wait for the data to load and then extract the content from the page elements.

A commonly recommended tool today is Playwright (developed by Microsoft). It is generally regarded as more stable and faster than traditional Selenium, and its API is more modern.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch headless Chromium (saves resources, runs in the background)
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(
        user_agent=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/128.0.0.0 Safari/537.36")
    )

    # Visit the target page
    page.goto("https://example.com/ajax-page")

    # ⚠️ Key point: wait for the Ajax load to finish; do not use a fixed time.sleep()
    #     Instead, wait for an element that indicates the data is present
    page.wait_for_selector(".ajax-content-item", timeout=10000)  # timeout in milliseconds

    # Extract data from the DOM
    items = page.query_selector_all(".ajax-content-item")
    for item in items:
        title = item.query_selector(".item-title").text_content()
        author = item.query_selector(".item-author").text_content()
        print(f"Title: {title} | Author: {author}")

    browser.close()

Tip: Playwright can also intercept network requests directly and read the JSON returned by the interface. This combines the power of a headless browser with the efficiency of interface crawling, and suits scenarios with strict encryption and anti-crawling measures.
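The interception idea can be sketched with Playwright's `page.expect_response`, which waits for a response matching a predicate. This is a minimal sketch under assumptions: `capture_json` and `is_target_response` are hypothetical helpers, the URL fragment used for matching is an example, and Playwright is imported inside the function so the matching logic can be exercised without a browser installed.

```python
def is_target_response(url: str, status: int, url_fragment: str) -> bool:
    """True for a successful (2xx) response whose URL contains the fragment."""
    return url_fragment in url and 200 <= status < 300


def capture_json(page_url: str, url_fragment: str, timeout_ms: int = 10000):
    """Open page_url headlessly and return the parsed JSON body of the
    first matching Ajax response. Requires Playwright to be installed."""
    # Imported lazily so is_target_response works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Start listening before navigating so the response is not missed.
            with page.expect_response(
                lambda r: is_target_response(r.url, r.status, url_fragment),
                timeout=timeout_ms,
            ) as response_info:
                page.goto(page_url)
            return response_info.value.json()
        finally:
            browser.close()


if __name__ == "__main__":
    # Example call with placeholder addresses; replace with a real page.
    data = capture_json("https://example.com/ajax-page", "hot-topics")
    print(data)
```

Because the page's own JavaScript builds the request, any signature or token parameters are generated by the site itself, and the crawler only reads the result.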


4. Common pitfalls and best practices for capturing Ajax data

1. Common anti-crawling measures and countermeasures

  • Dynamic request parameters/encryption: Start with a headless browser to obtain data reliably; if you want to call the interface directly, set breakpoints in Chrome's Sources panel and work through the encryption logic in the front-end JavaScript step by step.

  • Cookie/token expiry: If you run an interface crawler, you can log in with Playwright periodically to refresh cookies; alternatively, use Playwright to simulate user operations throughout, which removes the need to manage credentials manually.

  • IP blocked: Control the request frequency (one request every 1 to 3 seconds, or even slower, is recommended), and use a proxy IP pool if necessary.

  • Human verification (slider, click, etc.): Simple sliders can be handled by simulating the drag with Playwright; complex CAPTCHAs can be sent to third-party recognition services (such as Baidu AI or 2Captcha), but pay attention to cost and compliance.
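The request-frequency advice above can be sketched as a small generator that pauses a random 1 to 3 seconds between tasks. `throttled` is a hypothetical helper; the `sleep` parameter is an assumption added so the delay can be swapped out when experimenting.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def throttled(tasks: Iterable[Callable[[], object]],
              min_delay: float = 1.0,
              max_delay: float = 3.0,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[object]:
    """Run tasks one at a time, pausing a random interval between them.

    The random jitter avoids a perfectly regular request rhythm; no pause
    is inserted before the first task.
    """
    for index, task in enumerate(tasks):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield task()


if __name__ == "__main__":
    # Each task would normally be one request; here they are placeholders.
    jobs = [lambda n=n: f"page {n} fetched" for n in range(1, 4)]
    for result in throttled(jobs):
        print(result)
```

Each task is a zero-argument callable, so one request per page can be wrapped in a lambda or `functools.partial` and fed through the same loop.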

2. Best practice checklist

  1. Interface first: if you can call the interface directly, do not open a browser. The efficiency differs by roughly an order of magnitude, and it is less likely to trigger anti-crawling measures.
  2. Say goodbye to fixed sleeps: use dynamic waiting APIs that wait for a specific element to appear or an interface response to complete, to avoid reading empty data or wasting time.
  3. Robust design: network fluctuations and temporary interface outages are common, so be sure to add a retry mechanism for exceptions.
  4. Respect robots.txt: check the target site's /robots.txt first to see which paths are disallowed, and keep your crawler legal and compliant.
  5. Control the request rate: not placing unnecessary load on the target server is basic professional courtesy.
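Items 3 and 4 of the checklist can be sketched with the standard library alone. `with_retries` and `allowed_by_robots` are hypothetical helper names; the exponential backoff schedule (1 s, 2 s, 4 s, ...) is one common choice rather than a requirement, and the robots.txt text is passed in as a string so the check can be tried offline.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def with_retries(func, attempts: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n, then retry.

    The last failure is re-raised. Catching Exception is deliberately broad
    for this sketch; real code should catch only transient errors, such as
    requests.exceptions.RequestException.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "my-crawler", "https://example.com/hot-topics"))
    print(allowed_by_robots(rules, "my-crawler", "https://example.com/private/data"))
```

In a real crawler the robots.txt text would be downloaded once per site, and `with_retries` would wrap the function that performs a single request.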

5. Summary

  1. Ajax is a combination of technologies that lets web pages update without refreshing. Its modern core is Fetch API + JSON + DOM.
  2. When capturing Ajax data, prefer calling the underlying interface directly, which is efficient and resource-saving; when the interface is heavily encrypted, fall back on a headless browser such as Playwright.
  3. Good as the tools are, you must abide by laws, regulations, and crawler ethics, and use the technology reasonably and with restraint.

I hope this article helps you say goodbye to the confusion of "there is nothing in the source code" and capture Ajax data with ease. If you still have questions, feel free to discuss them in the comment area!