cdut-admission-auto
🎯 Project background
This article walks through a real-world automated crawl of the Chengdu University of Technology admissions website (🔗 college majors page). The site combines Ruishu fifth-generation dynamic verification, automation behavior detection, dynamic cookie and request-header verification, and other heavy anti-crawling barriers. A bare requests/urllib connection cannot retrieve the real data; getting through requires a solution with full browser rendering capability.
In the end, we implemented a one-stop pipeline: bypass all protection → simulate human operations → extract college and major data → export to Excel.
🕵️ Quick analysis of web pages
Let’s take a quick look at the target page structure:
- On the surface, the page looks like a static list of colleges and majors, but when you refresh it you will notice a brief Ruishu security redirect.
- Each college is wrapped in an `li.li1` under `ul.xy-list`, and under each college `dd > a` holds the major names and links.
Observe the details of the protection layer through the Network + Console of the developer tools:
- The first request returns the Ruishu obfuscated script, which generates dynamic cookies such as `sMLAeTqisZbFP`.
- The script checks `navigator.webdriver`, browser feature variables, and mouse/keyboard interaction.
- 412 errors typically indicate missing request headers, and SSL certificate validation occasionally fails when testing locally.
🚩 Core Problem List
- Dynamic cookie updates: cookies generated by Ruishu have a very short validity period, so a statically maintained cookie expires almost immediately.
- Ruishu fifth-generation bypass: obfuscated JS generates the verification data dynamically, which makes static reverse engineering extremely difficult.
- SSL certificate verification: some test environments intercept HTTPS requests, causing connection failures.
- Complete request header construction: if any of Referer, Accept-Language, or the Sec-Fetch-* fields is missing, the request may be intercepted.
- Anti-automation detection: browser automation fingerprints must be hidden and real user operations simulated.
🏗️ Technical Architecture
We adopted a hybrid solution of "DrissionPage browser automation + anti-detection JS injection + urllib safe request":
- DrissionPage: lighter than Selenium, with a built-in smart waiting mechanism, and well suited to pages with complex rendering.
- Anti-detection JS: injected at the very start of page loading to mask the automation fingerprints the browser exposes.
- urllib: once a valid cookie has been obtained, it makes lightweight requests for data capture, so the browser does not have to stay busy.
The core idea of this combination is: **let the browser pass Ruishu verification, then complete the subsequent data extraction with lightweight requests, which is both safe and efficient.**
💻 Core function implementation
1. Browser initialization configuration
⚠️ Key parameter description:
- `timeout=15`: gives the Ruishu script enough time to execute, avoiding verification failures caused by acting too early.
- `.set.window.max()`: maximizes the window to avoid typical automation fingerprints such as small windows and fixed resolutions.
2. Anti-automated JS injection
Ruishu identifies crawlers by checking `navigator.webdriver`, the characteristic variables left behind by CDP injection, and even `debugger` breakpoints. We inject JS before the page loads to override these detection points.
This JS runs as soon as the browser starts loading the target page, so the gaps are closed before the Ruishu script collects its fingerprints.
3. Human Behavior Simulation
Hiding fingerprints alone is not enough. Ruishu also monitors mouse, keyboard, scrolling, and other interaction. Adding a few simple random actions can greatly improve the pass rate.
All time intervals here are deliberately jittered at random to imitate the imprecise rhythm of a human operator.
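To make the jitter idea concrete, here is a minimal, browser-independent sketch. It only plans a series of scroll distances and pauses; actually performing them through the browser is left out. The function name `plan_scroll_steps` and every numeric range are illustrative assumptions, not values from the original script.

```python
import random
from typing import List, Optional, Tuple


def plan_scroll_steps(
    total_pixels: int,
    steps: int = 5,
    pause_range: Tuple[float, float] = (0.3, 1.2),  # assumed range, in seconds
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, float]]:
    """Split a scroll into uneven chunks, each followed by a random pause."""
    rng = rng or random.Random()
    low, high = pause_range
    # Uneven weights so that no two chunks have the same size.
    weights = [rng.uniform(0.5, 1.5) for _ in range(steps)]
    scale = total_pixels / sum(weights)
    return [(round(w * scale), rng.uniform(low, high)) for w in weights]


if __name__ == "__main__":
    for distance, pause in plan_scroll_steps(2000):
        print(f"scroll {distance}px, then wait {pause:.2f}s")
```

Passing a seeded `random.Random` makes a plan reproducible while debugging; in a real run the default unseeded generator keeps the rhythm irregular.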
4. Ruishu Core Security Bypass
The key logic of the whole solution: let the browser face the verification head-on first, then hand the valid cookies to urllib for the subsequent lightweight requests.
⚠️ If verification fails, don't panic. First check whether the anti-detection JS still matches the current Ruishu version, then capture traffic and update it if necessary.
📡 Request construction module
After obtaining the valid cookies, we wrap the requests with urllib so that browser pages are not opened and closed repeatedly, which also lowers the risk of sustained monitoring by the anti-crawling system.
1. Dynamic Cookie Extraction
Two key cookies are extracted here:
- `JSESSIONID`: general session ID
- `sMLAeTqisZbFP`: Ruishu fifth-generation dynamic token (the name may change dynamically, but the pattern is similar)
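As a rough illustration of this step, the sketch below filters a cookie list down to the wanted names and formats it as a `Cookie` header value. It assumes the browser exports cookies as a list of dicts with `name` and `value` keys; the helper names `pick_cookies` and `to_cookie_header` are hypothetical. Because the token name may change, the wanted names are passed in rather than hard-coded.

```python
from typing import Dict, Iterable, Mapping


def pick_cookies(
    cookies: Iterable[Mapping[str, str]], wanted: Iterable[str]
) -> Dict[str, str]:
    """Keep only the cookies whose names are in `wanted`."""
    wanted_set = set(wanted)
    return {c["name"]: c["value"] for c in cookies if c.get("name") in wanted_set}


def to_cookie_header(picked: Mapping[str, str]) -> str:
    """Format name/value pairs as the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in picked.items())


if __name__ == "__main__":
    # Placeholder values; real ones come from the browser session.
    exported = [
        {"name": "JSESSIONID", "value": "abc123"},
        {"name": "unrelated", "value": "x"},
        {"name": "sMLAeTqisZbFP", "value": "token-value"},
    ]
    picked = pick_cookies(exported, ["JSESSIONID", "sMLAeTqisZbFP"])
    print(to_cookie_header(picked))
```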
2. Complete request header construction
Ruishu is extremely sensitive to request headers. The Referer, User-Agent, Accept-Language, and Sec-Fetch-* fields must all be present.
This set of headers closely imitates the request fingerprint of a real Chrome browser, which avoids the 412 interceptions caused by missing headers.
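A header set along these lines might look like the sketch below. The field names are the ones listed above; the concrete values are illustrative assumptions modeled on an ordinary Chrome page navigation, not values captured from the original script. The User-Agent is passed in so that it can be taken from the same browser session that produced the cookies.

```python
from typing import Dict


def build_headers(referer: str, cookie_header: str, user_agent: str) -> Dict[str, str]:
    """Assemble the request headers for a page navigation request."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",  # assumed value
        "Referer": referer,
        "Cookie": cookie_header,
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }
```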
3. Security request encapsulation
Add random delays, a retry mechanism, and disabled SSL verification (test environments only).
Random jitter is added to both the retry interval and the request interval so that an overly regular rhythm is not recognized by the anti-crawling system.
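The wrapper described here can be sketched with the standard library alone. This is a minimal version under assumptions: the function names, the retry count, the 15-second timeout, and the delay formula are illustrative, and SSL verification stays on unless `verify_ssl=False` is passed explicitly for a test environment. The `opener` and `sleep` parameters exist only so the logic can be exercised without a network.

```python
import random
import ssl
import time
import urllib.error
import urllib.request
from typing import Callable, Mapping


def make_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Return a default context, or an unverified one for test environments."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be turned off before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch_with_retry(
    url: str,
    headers: Mapping[str, str],
    retries: int = 3,
    base_delay: float = 1.0,
    verify_ssl: bool = True,
    opener: Callable = urllib.request.urlopen,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """GET `url`, retrying on URLError with a growing, jittered delay."""
    request = urllib.request.Request(url, headers=dict(headers))
    context = make_ssl_context(verify_ssl)
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(request, timeout=15, context=context) as response:
                return response.read()
        except urllib.error.URLError as exc:  # HTTPError is a subclass
            last_error = exc
            if attempt < retries:
                # Linear backoff plus up to one second of random jitter.
                sleep(base_delay * attempt + random.uniform(0, 1))
    raise RuntimeError(f"all {retries} attempts failed for {url}") from last_error
```

Between successive page requests, the same `random.uniform` pattern can provide the request-interval jitter.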
📊 Data extraction and export
After getting the real HTML, parse the structure with BeautifulSoup and export it to Excel in one step with pandas.
Note: `urllib.parse.urljoin` is used here to resolve relative links so that the exported major links are fully accessible.
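The link handling can be shown in isolation. In the sketch below, the base URL, the row layout, and the helper name `absolutize` are hypothetical placeholders; only the `urljoin` call itself is the technique the note refers to.

```python
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urljoin

# Hypothetical page URL, used only to show how relative links resolve.
BASE_URL = "https://admissions.example.edu/colleges/index.html"


def absolutize(
    rows: Iterable[Tuple[str, str, str]], base_url: str
) -> List[Dict[str, str]]:
    """Turn (college, major, href) tuples into records with absolute links."""
    return [
        {"college": college, "major": major, "link": urljoin(base_url, href)}
        for college, major, href in rows
    ]


if __name__ == "__main__":
    sample = [
        ("College A", "Major 1", "detail.html?id=3"),
        ("College A", "Major 2", "../majors/geology.html"),
        ("College B", "Major 3", "/info/1.html"),
    ]
    for record in absolutize(sample, BASE_URL):
        print(record["link"])
```

A list of records in this shape can be passed straight to `pandas.DataFrame` for the Excel export.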
🛡️ Exception-handling solution
📝 Environment and Execution
Environmental requirements
- Python 3.8+
- Chrome/Chromium 100+ (built-in Chromium can also be automatically downloaded by DrissionPage)
- One-click installation of dependent libraries:
Execute command
Output example
📌 Notes
- For learning and communication only: Please do not use it for commercial purposes or large-scale crawling, and respect the school's server resources.
- The Ruishu version may be updated: if the anti-detection JS stops working, use the developer tools to find the new detection points and update it promptly.
- Use proxy IP with caution: The anti-crawling of this website focuses more on feature detection, and IP bans are less frequent. Frequent proxy IP changes can easily increase suspicious features.
- Data structure changes: It is recommended to regularly check the page structure of the school admissions website and adjust the parsing logic of BeautifulSoup in a timely manner to ensure accurate data capture.

