title: Ajax case crawling practice
description: Python3 crawler tutorial: Ajax data crawling practice
Python3 crawler tutorial: Ajax data crawling practice
A note before we start: if you have written crawlers before, you have probably run into this odd situation: the page you pull down with `requests` contains nothing but an empty `<div id="app"></div>`. Where did the data go? Don't worry, this is not a problem with your code. The page loads its data dynamically through Ajax, which is standard practice for modern front-end frameworks. This time we will use spa1, a free practice site provided by Cui Qingcai, to take the whole Ajax crawling process apart step by step. The interfaces are clear and there are no complicated anti-crawling measures, which makes the site well suited for practice.
1. Preparation
Before you start, make sure you have the following things in place on your computer:
- Python 3.6 and above
- The dependent libraries used in this article installed: `requests` and `pymongo`
- A running MongoDB service (this article saves the data to MongoDB; if you don't want to install a database for now, you can stop at the step that prints the data)
- A basic understanding of the Network panel of the browser developer tools (F12) is enough
2. Target website analysis
Target site address: https://spa1.scrape.center/
2.1 List of website features
- All data is loaded asynchronously through Ajax without any server-side rendering.
- The front-end framework of the page is responsible for generating content, and the returned HTML is basically just a "skeleton"
- Supports paginated browsing: about 10 pages, 10 movies per page
- Clicking on the movie card will jump to the details page. The detailed data also comes from the Ajax interface.
2.2 Fields we want to obtain
From the interface responses on the list page and detail page, you can get complete movie information:
- Movie title
- Link to cover image
- Category tags (such as "Drama" and "Suspense")
- Release date
- Overall rating
- Plot synopsis
3. Preliminary verification: Can I get the data by directly requesting the page?
First, write a few lines of simple code to test whether the HTML downloaded directly with `requests` contains the content we want.
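The check described above can be sketched roughly as follows. This is a minimal sketch, not the article's original code: the helper names `fetch_html`, `missing_keywords`, and `check_static_page` are hypothetical, and the keywords are placeholders that you would replace with text you can actually see in the rendered page.

```python
import requests

BASE_URL = 'https://spa1.scrape.center/'  # target site from section 2


def fetch_html(url):
    """Download the raw HTML exactly as the server sends it (no JavaScript runs)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def missing_keywords(html, keywords):
    """Return the keywords that do not appear anywhere in the HTML."""
    return [word for word in keywords if word not in html]


def check_static_page(url=BASE_URL, keywords=('movie', 'rating')):
    """Print the start of the raw HTML and report which keywords are absent."""
    html = fetch_html(url)
    print(html[:500])
    print('missing keywords:', missing_keywords(html, keywords))
```

Calling `check_static_page()` performs one real request. If the keywords you chose are reported as missing, the data is most likely filled in later by JavaScript.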
The result is an "empty shell": HTML that contains only a container tag such as `<div id="app"></div>`, with no trace of keywords like "movie" or "rating".
This confirms our judgment: **the data is not rendered directly by the server, so we need to find the Ajax interface.**
4. Crawl the list page
4.1 Ajax interface for analyzing list pages
Open the browser, press F12 to bring up the developer tools, and switch to the Network panel. Be sure to check Preserve log (so that the request records are not cleared when you turn pages), then click the XHR/Fetch filter so that only Ajax requests are shown.
Next, click "Page 2" and "Page 3" on the page. A batch of requests with a highly uniform format will appear in the Network panel, each carrying two query parameters, `limit` and `offset`.
| Parameter | Meaning |
| --- | --- |
| `limit` | Fixed at `10`; the number of movies returned per page |
| `offset` | The offset; page 1 is `0`, page 2 is `10`, page n is `10 × (n - 1)` |
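The page-to-offset rule above can be written down directly. In the sketch below, the path `/api/movie/` is an assumption about the list interface (check it against what your Network panel shows), and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed list interface path; verify it in the Network panel.
API_BASE = 'https://spa1.scrape.center/api/movie/'
LIMIT = 10  # movies per page, as described above


def page_to_offset(page, limit=LIMIT):
    """Convert a 1-based page number to the offset parameter: limit * (page - 1)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return limit * (page - 1)


def build_index_url(page, limit=LIMIT):
    """Build the list-page URL for the given page number."""
    query = urlencode({'limit': limit, 'offset': page_to_offset(page, limit)})
    return f'{API_BASE}?{query}'
```

For example, `build_index_url(3)` yields a URL whose query string is `limit=10&offset=20`.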
4.2 Encapsulate list page crawling function
To improve code reusability, we first write a general-purpose interface request function, and then a dedicated function that calls it for the list page.
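One possible shape for these two functions is sketched below. The names `scrape_api` and `scrape_index` come from this article; the URL template, the 10-second timeout, and the log format are assumptions made for the sketch.

```python
import logging

import requests

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

# Assumed list interface; verify the path in the Network panel.
INDEX_URL = 'https://spa1.scrape.center/api/movie/?limit={limit}&offset={offset}'
LIMIT = 10


def scrape_api(url):
    """Request a JSON interface and return the parsed body, or None on failure."""
    logging.info('scraping %s...', url)
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
        logging.error('got invalid status code %s while scraping %s',
                      response.status_code, url)
    except requests.RequestException:
        logging.error('error occurred while scraping %s', url, exc_info=True)
    return None


def scrape_index(page):
    """Fetch one list page; page numbers start at 1."""
    url = INDEX_URL.format(limit=LIMIT, offset=LIMIT * (page - 1))
    return scrape_api(url)
```

Returning `None` instead of raising keeps the calling code simple: a failed page can be skipped with a plain `if` check.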
Now calling `scrape_index(1)` returns the data for the 10 movies on page 1.
5. Crawl the details page
5.1 Ajax interface for analyzing details page
Click a movie on the list page to open its details page, and filter for XHR/Fetch requests again. You will find a new request whose URL contains a number such as `1`.
This `1` is the movie's unique ID.
The list page's JSON response already includes each movie's `id` field, so there is no need to parse jump links out of the HTML; we can build the detail URL directly from this ID.
5.2 Encapsulate the details page crawling function
Following the same logic, we reuse the `scrape_api` function; three lines of code are enough.
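A sketch of the detail function is shown below. To keep the block self-contained, the request function is passed in as the `fetch` parameter (any callable that takes a URL and returns parsed JSON, such as the `scrape_api` function this article describes); that parameter is a choice of this sketch. The URL template and the `results` key used by the hypothetical helper `iter_movie_ids` are assumptions about the interface.

```python
# Assumed detail interface; verify the path in the Network panel.
DETAIL_URL = 'https://spa1.scrape.center/api/movie/{id}'


def scrape_detail(movie_id, fetch):
    """Build the detail URL from the movie ID and fetch it."""
    url = DETAIL_URL.format(id=movie_id)
    return fetch(url)


def iter_movie_ids(index_data):
    """Yield the id of every movie in a list-page response.

    Assumes the movies sit under a 'results' key (hypothetical structure).
    """
    for item in (index_data or {}).get('results', []):
        if 'id' in item:
            yield item['id']
```

With a real request function this would be called as `scrape_detail(1, scrape_api)`.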
6. Data storage: Save to MongoDB
The interface of the target site returns a JSON object with a clear structure, which is very suitable for directly plugging into a document database such as MongoDB, and the fields do not require additional processing.
6.1 Configure MongoDB connection
First import `pymongo` and connect to the local database.
6.2 Encapsulated data saving function
Use `update_one` together with `upsert=True` to deduplicate and update the data:
- If a movie with the same name already exists in the collection, update its fields
- If not, insert a new record
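The connection and the upsert logic might look like the sketch below. The connection string (default local port) and the collection name `movies` are assumptions; `movies_spa1` is the database name this article mentions. `pymongo` is imported inside `get_collection` only so that the rest of the crawler can still run on a machine without MongoDB, and the collection is passed into `save_data` as a parameter for the same reason.

```python
MONGO_CONNECTION_STRING = 'mongodb://localhost:27017'  # assumed local default
MONGO_DB_NAME = 'movies_spa1'       # database name mentioned in this article
MONGO_COLLECTION_NAME = 'movies'    # hypothetical collection name


def get_collection():
    """Connect to the local MongoDB and return the target collection."""
    import pymongo  # lazy import: MongoDB is optional in this tutorial

    client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
    return client[MONGO_DB_NAME][MONGO_COLLECTION_NAME]


def save_data(collection, data):
    """Insert the movie, or update it if one with the same name already exists."""
    collection.update_one(
        {'name': data.get('name')},  # filter: match on the movie name
        {'$set': data},              # overwrite the stored fields with new values
        upsert=True,                 # insert when nothing matches the filter
    )
```

Typical usage would be `collection = get_collection()` once at startup, followed by `save_data(collection, detail_data)` for each movie.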
7. Complete crawling process
Finally, string all the functions together in a `main()` function that crawls every movie on the first 10 pages in one run.
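The overall flow can be sketched as follows. To stay self-contained, this version of `main` receives the three building blocks as parameters, which is a choice of the sketch and not necessarily how the original script is organized; the `results` key is again an assumption about the list response.

```python
import logging

TOTAL_PAGE = 10  # the site has about 10 pages


def main(scrape_index, scrape_detail, save_data, total_page=TOTAL_PAGE):
    """Crawl every list page, fetch each movie's detail, and save it.

    scrape_index(page) and scrape_detail(movie_id) return parsed JSON or None;
    save_data(detail) stores one movie. Returns the number of movies saved.
    """
    saved = 0
    for page in range(1, total_page + 1):
        index_data = scrape_index(page)
        if not index_data:
            continue  # skip pages that failed to load
        for item in index_data.get('results', []):  # assumed key name
            detail_data = scrape_detail(item.get('id'))
            if not detail_data:
                continue  # skip movies whose detail request failed
            logging.info('detail data %s', detail_data)
            save_data(detail_data)
            saved += 1
    return saved
```

Skipping failed pages and details keeps one bad response from aborting the whole run.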
Run the script and you will see clear log output in the console, and a new batch of movie records will appear in MongoDB's `movies_spa1` database.
8. Improvement suggestions: Make the crawler more robust (simple version)
This article is a practical introduction, and the code is written very straightforwardly, but in actual projects, it may need to be more stable and efficient. Here are a few easy-to-use optimization ideas:
- Asynchronous request speedup: replace `requests` with `aiohttp` and combine it with `asyncio` to achieve concurrency; the crawling speed improves significantly.
- Add error retry: add the retry decorator from the `tenacity` library to `scrape_api` so that it automatically retries several times on network fluctuations, which greatly improves stability.
- Basic anti-crawling strategy: switch the `User-Agent` randomly and add `time.sleep()` between requests to simulate a human browsing rhythm and reduce the risk of being blocked.
- Simple data validation: before saving, check that key fields such as `name` and `id` exist, so that "bad data" does not enter the database and pollute the collection.
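Two of the ideas above, retrying and validating, can be illustrated with the standard library alone. The sketch below deliberately uses a small hand-written wrapper in place of `tenacity`; the names `with_retry` and `is_valid_movie`, the attempt count, and the delay values are all hypothetical choices.

```python
import functools
import random
import time

REQUIRED_FIELDS = ('id', 'name')  # key fields named in the list above


def is_valid_movie(data, required=REQUIRED_FIELDS):
    """True only if data is a dict and every required field is present and non-empty."""
    return isinstance(data, dict) and all(
        data.get(field) not in (None, '') for field in required
    )


def with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap func so it is retried when it raises or returns None."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                result = func(*args, **kwargs)
                if result is not None:
                    return result
            except Exception:  # broad on purpose in this sketch
                if attempt == attempts:
                    raise  # out of attempts: let the caller see the error
            if attempt < attempts:
                # Wait a little longer each time, with random jitter.
                sleep(base_delay * attempt + random.uniform(0, 0.5))
        return None
    return wrapper
```

A request function could then be wrapped once, for example `robust_scrape = with_retry(scrape_api)`, and each result checked with `is_valid_movie` before it is saved.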
9. Summary
In this hands-on exercise, we worked through the full routine for crawling data that is loaded dynamically through Ajax:
- Use `requests` for a static test first to confirm that the data is rendered asynchronously
- Use the browser developer tools to find the Ajax interfaces of the list page and the details page
- Analyze the pattern of the interface parameters and encapsulate a common request function
- String the steps together, add logging, and store the results in the database
The complete code can be copied and run directly. Don't forget to start MongoDB before running it!

