title: Ajax case crawling practice description: Python3 crawler tutorial: Ajax data crawling practice

Python3 crawler tutorial: Ajax data crawling practice

Tips at the beginning If you have written about crawlers, you must have seen this weird scene: userequestsThe web page I pulled down only contains empty code.<div id="app"></div>, where is the data? Where have you gone? Don't worry, this is not a problem with your code, but the web page uses Ajax dynamic loading - this is also a common operation of modern front-end frameworks. This time we will use Mr. Cui Qingcai’s free target site spa1 to dismantle the entire process of the Ajax crawler step by step. The interface is clear and there is no complicated anti-climbing, which is very suitable for practicing.


1. Preparation

Before you start, make sure you have the following things in place on your computer:

  • Python 3.6 and above
  • Installed dependent libraries:
    pip install requests pymongo
  • MongoDB service has been started (This article will save the data to MongoDB. If you don’t want to install the database for the time being, you can go to the step of printing the data first)
  • A basic understanding of the Network panel of the browser developer tools (F12) is enough

2. Target website analysis

Target station address: https://spa1.scrape.center/

2.1 List of website features

  • All data is loaded asynchronously through Ajax without any server-side rendering.
  • The front-end framework of the page is responsible for generating content, and the returned HTML is basically just a "skeleton"
  • Support paginated browsing, about 10 pages, 10 movies per page
  • Clicking on the movie card will jump to the details page. The detailed data also comes from the Ajax interface.

2.2 Fields we want to obtain

From the interface responses on the list page and detail page, you can get complete movie information:

  • movie title
  • Link to cover image
  • Category tags (such as "Drama" and "Suspense")
  • Release date
  • Overall rating
  • Plot synopsis

3. Preliminary verification: Can I get the data by directly requesting the page?

First write a few lines of the simplest code to test and see if you can use it directlyrequestsDoes the downloaded HTML contain the content we want:

import requests

url = 'https://spa1.scrape.center/'
response = requests.get(url)
print(response.text[:500])  # 只打印前 500 个字符,避免刷屏

The result will be an "empty shell" HTML containing only something like<div id="app"></div>Container tags, not even the shadow of keywords such as "movie" and "rating" can be seen. This just confirms our judgment: **The data is not output directly by the server and needs to find the Ajax interface. **


4. Crawl the list page

4.1 Ajax interface for analyzing list pages

Open the browser, press F12 to bring up the developer tools, switch to the Network panel, be sure to check Preserve log (so that the request record will not be cleared when turning pages), and then click XHR/Fetch to filter and only see Ajax requests.

Next, click "Page 2" and "Page 3" on the page, and you will see a batch of requests with a highly uniform format appearing in the Network:

https://spa1.scrape.center/api/movie/?limit=10&offset=0
https://spa1.scrape.center/api/movie/?limit=10&offset=10
https://spa1.scrape.center/api/movie/?limit=10&offset=20

Parameter meaning

  • limit: Fixed to10, representing how many pieces of movie data are returned per page
  • offset: Offset, page 1 is0, page 2 is10, page n is10 × (n - 1)

4.2 Encapsulate list page crawling function

In order to improve code reusability, we first write a general interface request function, and then specifically write a call function for the list page:

import requests
import logging

# 配置日志,方便追踪爬取进度和错误
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s: %(message)s'
)

# 全局常量
INDEX_URL = 'https://spa1.scrape.center/api/movie/?limit={limit}&offset={offset}'
LIMIT = 10  # 每页固定 10 条

def scrape_api(url):
    """通用的 Ajax 接口爬取函数,返回 JSON 数据或 None"""
    logging.info('正在爬取接口:%s', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()  # 直接返回解析后的 JSON 字典
        logging.error('爬取失败:接口 %s 返回状态码 %d', url, response.status_code)
    except requests.RequestException:
        logging.error('爬取出错:接口 %s 发生请求异常', url, exc_info=True)
    return None

def scrape_index(page):
    """爬取指定页的电影列表"""
    url = INDEX_URL.format(limit=LIMIT, offset=LIMIT * (page - 1))
    return scrape_api(url)

Now, callscrape_index(1)You can get the 10 movie data on page 1.


5. Crawl the details page

5.1 Ajax interface for analyzing details page

Click on a movie on the list page, enter the details page, and filter the XHR/Fetch request as well. You will find a new request similar to this:

https://spa1.scrape.center/api/movie/1/

Obviously, here1It is the unique ID of the movie. The JSON interface of the list page has already included the content of each movieidThe field is returned to us, so we don't need to parse the jump link from the HTML at all, we can just use this ID to splice the URL.

5.2 Encapsulate the details page crawling function

Continue the logic just now and reuse it.scrape_apiFunction, three lines of code to get it done:

DETAIL_URL = 'https://spa1.scrape.center/api/movie/{id}/'

def scrape_detail(movie_id):
    """爬取指定 ID 的电影详情"""
    url = DETAIL_URL.format(id=movie_id)
    return scrape_api(url)

6. Data storage: Save to MongoDB

The interface of the target site returns a JSON object with a clear structure, which is very suitable for directly plugging into a document database such as MongoDB, and the fields do not require additional processing.

6.1 Configure MongoDB connection

Import firstpymongo, and connect to the local database:

import pymongo

# MongoDB 全局常量
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'movies_spa1'
MONGO_COLLECTION = 'movies'

# 建立连接并获取集合
client = pymongo.MongoClient(MONGO_URI)
db = client[MONGO_DB]
collection = db[MONGO_COLLECTION]

6.2 Encapsulated data saving function

useupdate_oneCooperateupsert=TrueRealize data deduplication and update:

  • If a movie with the same name already exists in the library, update the field information
  • If not, insert a new record
def save_data(data):
    """将电影数据存入 MongoDB,同名电影去重更新"""
    if not data:
        return
    collection.update_one(
        {'name': data.get('name')},  # 用电影名作为去重条件
        {'$set': data},               # 传入的字段会更新或新增
        upsert=True                    # 若没有匹配记录,则插入新文档
    )
    logging.info('电影《%s》已保存成功!', data.get('name'))

7. Complete crawling process

Finally, string all the functions together and write onemain(), crawl all movies in the first 10 pages at once:

def main():
    for page in range(1, 11):          # 第 1 页到第 10 页
        index_data = scrape_index(page)
        if not index_data or not index_data.get('results'):
            continue
        # 遍历当前页的每部电影,先拿 ID → 查详情 → 存库
        for item in index_data['results']:
            movie_id = item.get('id')
            detail_data = scrape_detail(movie_id)
            save_data(detail_data)

if __name__ == '__main__':
    main()

Run the script and you will see clear log output in the console, and MongoDB'smovies_spa1There will be a batch of more movie records in the database.


8. Improvement suggestions: Make the crawler more robust (simple version)

This article is a practical introduction, and the code is written very straightforwardly, but in actual projects, it may need to be more stable and efficient. Here are a few easy-to-use optimization ideas:

  1. Asynchronous request speedup useaiohttpreplacerequests, matchasyncioBy achieving concurrency, the crawling speed will be significantly improved.

  2. Add error retry Givescrape_apiplustenacityThe library's retry decorator automatically retries several times when encountering network fluctuations, greatly improving stability.

  3. Basic anti-crawling strategy Switch randomlyUser-Agent, and join between requeststime.sleep(), simulates human browsing rhythm and reduces the risk of being blocked.

  4. Simple verification of data Check before savingnameidWait for the existence of key fields to avoid "bad data" entering the database to pollute the collection.


9. Summary

In this actual combat, we completely mastered the Ajax dynamically loaded data crawler routine:

  1. Use firstrequestsDo static testing to confirm that the data is rendered asynchronously
  2. Use the browser developer tools to find the Ajax interface of the list page and details page
  3. Analyze the rules of interface parameters and encapsulate common request functions
  4. String together the processes, add logs and store them in the database

The complete code can be copied and run directly. Don’t forget to start MongoDB before doing it!