cdut-admission-auto

🎯 Project background

This article walks through a real-world automated crawl of the Chengdu University of Technology admissions website (🔗 Colleges and Majors page). The site combines Ruishu (瑞数) fifth-generation dynamic verification, automated-behavior detection, and dynamic cookie/request-header checks into a heavy anti-crawling barrier. A bare requests/urllib connection cannot retrieve the real data; getting through requires a solution with full browser rendering capabilities.

In the end, we implemented a one-stop process: bypass all protection → simulate human operations → extract college and major data → export to Excel.


🕵️ Quick analysis of web pages

Let’s take a quick look at the target page structure:

  • On the surface it looks like a static list of colleges and majors, but refreshing the page reveals a brief Ruishu security redirect.
  • The data sits in ul.xy-list: each li.li1 wraps one college, and the dd > a elements under it hold the major names and links.

Observe the details of the protection layer through the Network + Console of the developer tools:

  • The first request returns the obfuscated Ruishu script, which generates dynamic cookies such as sMLAeTqisZbFP
  • The script checks navigator.webdriver, browser feature variables, and mouse/keyboard interaction behavior
  • A 412 error typically indicates missing request headers, and SSL certificate validation occasionally fails when testing locally

🚩 Core Problems

  1. Dynamic cookie refresh: Ruishu-generated cookies have a very short lifetime, so a statically maintained cookie expires almost immediately.
  2. Ruishu fifth-generation bypass: obfuscated JS generates the verification data dynamically, which makes static reverse engineering extremely difficult.
  3. SSL certificate verification: some test environments intercept HTTPS requests, causing connection failures.
  4. Complete request headers: a request missing any of Referer, Accept-Language, or the Sec-Fetch-* fields may be blocked.
  5. Anti-automation detection: browser automation features must be hidden and real user operations simulated.

🏗️ Technical Architecture

We adopted a hybrid solution of "DrissionPage browser automation + anti-detection JS injection + urllib safe request":

  • DrissionPage: lighter than Selenium, with a built-in smart waiting mechanism, and well suited to pages with complex rendering
  • Anti-detection JS: injected at the very start of page loading to mask the automation features the browser exposes
  • urllib: once a valid cookie has been obtained, it makes lightweight requests to fetch the data, so the browser does not have to stay busy

The core idea of this combination is: **let the browser pass Ruishu verification, then complete the data extraction with lightweight requests, which is both safe and efficient.**


💻 Core function implementation

1. Browser initialization configuration

from DrissionPage import ChromiumPage

class AntiAntiSpider:
    def __init__(self):
        self.browser = ChromiumPage(timeout=15)
        self.browser.set.window.max()  # Maximize the window to reduce resolution/viewport fingerprints
        self.target_url = "https://www.zs.cdut.edu.cn/xyzy.htm"

⚠️ Key parameter description:

  • timeout=15: gives the Ruishu script enough time to execute, so verification does not fail because an operation ran too early.
  • set.window.max(): maximizes the window to avoid typical automation fingerprints such as a small window or a fixed resolution.

2. Anti-automation JS injection

Ruishu identifies crawlers by checking navigator.webdriver, the feature variables left behind by CDP injection, and even debugger breakpoints. We inject JS before the page loads and cover these detection points directly:

// Core injection code
// 1. Stub out a window-level "debugger" property.
//    Note: `debugger` is a language statement, so this stub cannot stop a
//    `debugger;` statement by itself; the timer hooks in step 4 do the filtering.
window.debugger = function(){};
Object.defineProperty(window, 'debugger', {
    get: function(){ return null; },
    set: function(){},
    configurable: false
});

// 2. Hide the webdriver flag set by Selenium/CDP.
//    Only the single property is overridden: replacing window.navigator with a
//    stub object would drop userAgent, plugins, etc. and be easy to detect.
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});

// 3. Delete CDP feature variables (commonly left by Selenium/Playwright)
const propsToDelete = [
    'cdc_adoQpoasnfa76pfcZLmcfl_Array',
    'cdc_adoQpoasnfa76pfcZLmcfl_Object',
    'cdc_adoQpoasnfa76pfcZLmcfl_Promise',
    'cdc_adoQpoasnfa76pfcZLmcfl_Proxy',
    'cdc_adoQpoasnfa76pfcZLmcfl_Symbol'
];
propsToDelete.forEach(prop => delete window[prop]);

// 4. Drop setInterval/setTimeout callbacks whose source contains "debugger";
//    all other callbacks are forwarded unchanged, including extra arguments
const originalSetInterval = window.setInterval;
window.setInterval = function(callback, delay, ...args) {
    if (String(callback).includes('debugger')) return 0;
    return originalSetInterval.call(window, callback, delay, ...args);
};
const originalSetTimeout = window.setTimeout;
window.setTimeout = function(callback, delay, ...args) {
    if (String(callback).includes('debugger')) return 0;
    return originalSetTimeout.call(window, callback, delay, ...args);
};

This JS runs as soon as the browser starts loading the target page, so the detection points are covered before the Ruishu script collects its fingerprints.


3. Human Behavior Simulation

Hiding features alone is not enough. Ruishu also monitors mouse, keyboard, scrolling, and other interaction behavior. Adding a few simple random operations can noticeably improve the pass rate:

import time
import random

def simulate_human_behavior(self):
    # Scroll 3-5 times by random, uneven distances
    for _ in range(random.randint(3, 5)):
        scroll_px = random.randint(200, 900)
        self.browser.scroll.down(scroll_px)
        time.sleep(random.uniform(0.4, 1.8))  # Random pause of 0.4-1.8 seconds

    # Scroll back to the top of the page
    self.browser.scroll.to_top()
    time.sleep(random.uniform(0.6, 1.2))

    # Click the page body (so the session is not completely interaction-free)
    self.browser.ele("tag:body").click()
    time.sleep(random.uniform(0.8, 1.5))

All time intervals here are intentionally randomly jittered to imitate the imprecise human operating rhythm.
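
The jittered pacing can also be factored into a small helper so the ranges live in one place. This is a minimal sketch: the name `jittered_delays` is hypothetical and not part of the original script, and it only generates the pause lengths, leaving the browser calls to the caller.

```python
import random

def jittered_delays(count, low, high, rng=None):
    """Return `count` pause lengths drawn uniformly from [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    rng = rng or random.Random()  # pass a seeded Random for reproducible tests
    return [rng.uniform(low, high) for _ in range(count)]

# Pauses for 3-5 scroll steps, using the same 0.4-1.8 s range as above.
steps = random.randint(3, 5)
pauses = jittered_delays(steps, 0.4, 1.8)
```

Each value in `pauses` would then be passed to time.sleep() between scroll steps.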


4. Core Ruishu bypass

The key logic of the entire solution: let the browser handle the heavy verification up front, then hand the valid cookies to urllib for the subsequent lightweight requests.

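def bypass_ruishi(self):
    try:
        # Register the anti-detection JS (self.anti_detection_js holds the script
        # from step 2) before navigating, so it runs ahead of the page's own
        # scripts. add_init_js is available in recent DrissionPage releases.
        self.browser.add_init_js(self.anti_detection_js)
        # The first visit triggers Ruishu verification
        self.browser.get(self.target_url)

        # Core wait: wait for the real content to start loading, then another 4-6 seconds
        self.browser.wait.load_start()
        wait_time = random.uniform(4, 6)
        time.sleep(wait_time)
        print(f"瑞数验证等待时间:{wait_time:.1f}s")

        # Check whether verification passed (the markers match the Chinese page text)
        if "验证" in self.browser.title or "瑞数" in self.browser.html:
            raise Exception("瑞数验证触发,可能需要更新反检测JS或手动确认")

        return True

    except Exception as e:
        print(f"❌ 瑞数验证处理失败: {str(e)}")
        return False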

⚠️ If verification fails, don't panic. First check whether the anti-detection JS still matches the current Ruishu version; if not, capture the traffic again and update the script.
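
As an alternative to a fixed 4-6 second sleep, the wait can poll for a success condition and stop as soon as it holds. The sketch below is an assumption rather than part of the original script: `wait_until` is a hypothetical helper, and the predicate passed to it is up to the caller.

```python
import time

def wait_until(predicate, timeout=15.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False  # timed out without the condition becoming true
        sleep(interval)
```

For example, `wait_until(lambda: "li1" in self.browser.html, timeout=10)` would poll for the college list marker; choosing "li1" as the marker is an assumption based on the page structure described earlier.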


📡 Request construction module

After obtaining valid cookies, we wrap urllib in a safe request helper. This avoids repeatedly opening and closing browser pages and reduces the risk of being continuously monitored by the anti-crawling system.

1. Cookie extraction

def get_cookies_dict(self):
    cookies = self.browser.cookies()
    return {cookie['name']: cookie['value'] for cookie in cookies}

Two key cookies are extracted here:

  • JSESSIONID: the general session ID
  • sMLAeTqisZbFP: the Ruishu fifth-generation dynamic token (the name may change dynamically, but the pattern is similar)
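
Because the token cookie expires quickly, it can help to fail fast when an expected cookie is absent instead of sending a request that is bound to be rejected. A minimal sketch under stated assumptions: `build_cookie_header` is a hypothetical helper, the input is a list of dicts with "name" and "value" keys as in get_cookies_dict above, and the cookie values are made up.

```python
def build_cookie_header(cookies, required=("JSESSIONID",)):
    """Join browser cookies into a Cookie header; raise if a required one is absent."""
    jar = {c["name"]: c["value"] for c in cookies}
    missing = [name for name in required if name not in jar]
    if missing:
        raise KeyError(f"missing cookies: {', '.join(missing)}")
    return "; ".join(f"{k}={v}" for k, v in jar.items())

# Made-up sample values, for illustration only.
sample = [
    {"name": "JSESSIONID", "value": "abc123"},
    {"name": "sMLAeTqisZbFP", "value": "token-xyz"},
]
header = build_cookie_header(sample)
```

Since the Ruishu token name may change, only JSESSIONID is required by default; the caller can pass a different `required` tuple once the current token name is known.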

2. Complete request header construction

Ruishu is extremely sensitive to request headers. None of Referer, User-Agent, Accept-Language, or the Sec-Fetch-* fields may be missing:

def create_request(self, url, cookies=None, headers=None):
    if headers is None:
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Pragma": "no-cache",
            "Referer": self.target_url,  # Must match the target domain
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
        }

    req = urllib.request.Request(url, headers=headers)

    if cookies:
        cookie_str = "; ".join([f"{k}={v}" for k, v in cookies.items()])
        req.add_header("Cookie", cookie_str)

    return req

This set of request headers closely imitates what a real Chrome browser sends, which avoids the 412 rejections caused by missing headers.
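
A quick pre-flight check can catch an incomplete header set before any request is sent. This is only a sketch: `missing_headers` and `REQUIRED_HEADERS` are hypothetical names, and the required list simply mirrors the fields named above. The comparison is case-insensitive because HTTP header names are, and because urllib.request.Request normalizes the capitalization of the names it stores.

```python
REQUIRED_HEADERS = (
    "Referer", "User-Agent", "Accept-Language",
    "Sec-Fetch-Dest", "Sec-Fetch-Mode", "Sec-Fetch-Site",
)

def missing_headers(headers, required=REQUIRED_HEADERS):
    """Return the required header names that are absent (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in required if name.lower() not in present]

# Example: a header dict that forgot the Sec-Fetch-* fields.
incomplete = {"Referer": "https://www.zs.cdut.edu.cn/xyzy.htm",
              "User-Agent": "Mozilla/5.0", "Accept-Language": "zh-CN"}
gaps = missing_headers(incomplete)
```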

3. Safe request wrapper

The wrapper adds a random delay and a retry mechanism, and disables SSL verification (test environments only):

import ssl
import urllib.request
from urllib.error import URLError, HTTPError

def safe_urlopen(self, url, max_retries=3, timeout=30):
    cookies = self.get_cookies_dict()

    for i in range(max_retries):
        try:
            req = self.create_request(url, cookies=cookies)
            # Disable SSL verification (test environments only; in production, install the target's certificate instead)
            context = ssl._create_unverified_context()
            # Random delay of 1-3 seconds
            time.sleep(random.uniform(1, 3))

            with urllib.request.urlopen(req, timeout=timeout, context=context) as response:
                return response.read().decode('utf-8')

        except (URLError, HTTPError) as e:
            print(f"⚠️ 尝试 {i + 1}/{max_retries} 失败: {str(e)}")
            if i < max_retries - 1:
                time.sleep(random.uniform(2, 5))
                continue
            raise Exception(f"❌ 所有 {max_retries} 次请求均失败")

Both the request interval and the retry interval are randomized, so the anti-crawling system does not see an overly regular rhythm.
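
The retry logic can also be separated from the request itself, which makes it easier to test without network access. A minimal sketch, assuming a hypothetical helper `retry_with_jitter` that is not part of the original script; the default of catching OSError covers urllib's URLError, which is a subclass of it.

```python
import random
import time

def retry_with_jitter(func, max_retries=3, wait_range=(2.0, 5.0),
                      exceptions=(OSError,), sleep=time.sleep):
    """Call `func` up to `max_retries` times, pausing a random interval between tries."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except exceptions as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(random.uniform(*wait_range))  # jittered back-off
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error
```

A call such as `retry_with_jitter(lambda: spider.safe_urlopen(url, max_retries=1))` would then keep the pacing policy in one place; `spider` here stands for an AntiAntiSpider instance.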


📊 Data extraction and export

After getting the real HTML, use BeautifulSoup to parse the structure, and combine it with pandas to export it to Excel in one step:

import urllib.parse

import pandas as pd
from bs4 import BeautifulSoup

# Load the saved HTML and parse it
with open("result.html", "r", encoding="utf-8") as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
data = []

# Iterate over the li element of each college
for li in soup.find_all('li', class_='li1'):
    h6 = li.find('h6')
    if h6:
        college_name = h6.find('i').text.strip()
        # Iterate over all majors under this college
        for dd in li.find_all('dd'):
            a_tag = dd.find('a')
            if a_tag and a_tag.get('href'):
                major_name = a_tag.text.strip()
                # Build the absolute URL of the major page (handles relative links)
                major_url = urllib.parse.urljoin("https://www.zs.cdut.edu.cn", a_tag['href'])
                data.append({
                    '学院名称': college_name,
                    '专业名称': major_name,
                    '专业链接': major_url
                })

# Export to Excel
excel_file = '成都理工大学学院专业信息.xlsx'
with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
    df = pd.DataFrame(data)
    df.to_excel(writer, index=False, sheet_name='学院专业列表')

print(f"✅ 数据已成功保存到 {excel_file}")

Note: urllib.parse.urljoin is used here to handle relative links, so every exported major link is a complete, directly accessible URL.
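
To illustrate how urljoin resolves different kinds of href values, here is a small sketch. The helper name `absolutize` and the sample hrefs are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

BASE = "https://www.zs.cdut.edu.cn/xyzy.htm"  # the page the links were found on

def absolutize(href, base=BASE):
    """Resolve an href found on the list page into an absolute URL."""
    return urljoin(base, href)

# Hypothetical hrefs: relative, root-relative, and already absolute.
samples = ["info/1001/1234.htm", "/info/1001/1234.htm", "https://example.com/a.htm"]
resolved = [absolutize(h) for h in samples]
```

Relative and root-relative links both resolve against the site root here, while an href that is already absolute is returned unchanged.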


🛡️ Exception Handling

| Error type | Trigger condition | Solution |
| --- | --- | --- |
| Ruishu verification failed | Automation features detected | 1. Update the anti-detection JS; 2. Increase the random wait time; 3. Check for new CDP feature variables |
| 412 Precondition Failed | The request headers are incomplete | 1. Complete the Referer and Sec-Fetch-* fields; 2. Update the User-Agent to the latest Chrome version |
| Connection reset / 503 error | IP is temporarily blocked | 1. Reduce the request frequency to one request every 2-5 seconds; 2. Enable the proxy function of DrissionPage |
| SSL certificate verification failed | Local HTTPS interception | 1. In test environments, use ssl._create_unverified_context(); 2. In production, install the root certificate of the target website |
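
The urllib-related rows of this table can be turned into a small dispatcher that labels a failure before deciding how to react. This is a sketch under stated assumptions: `classify_failure` and its return labels are hypothetical, and a Ruishu verification failure is not covered because it shows up in the page content rather than as an exception.

```python
import ssl
from urllib.error import HTTPError, URLError

def classify_failure(exc):
    """Map an exception raised by urlopen to one of the table's error types."""
    if isinstance(exc, HTTPError):  # checked first: HTTPError subclasses URLError
        if exc.code == 412:
            return "incomplete_headers"
        if exc.code == 503:
            return "ip_blocked"
        return "other_http_error"
    # URLError keeps the low-level cause in .reason
    cause = exc.reason if isinstance(exc, URLError) else exc
    if isinstance(cause, ssl.SSLError):
        return "ssl_verification"
    if isinstance(cause, ConnectionResetError):
        return "ip_blocked"
    return "unknown"
```

The returned label could then select the matching remedy, for example rebuilding the headers for "incomplete_headers" or lengthening the delay for "ip_blocked".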

📝 Environment and Execution

Environment requirements

  • Python 3.8+
  • Chrome/Chromium 100+ (built-in Chromium can also be automatically downloaded by DrissionPage)
  • Install the dependencies with a single command:
pip install DrissionPage pandas beautifulsoup4 openpyxl

Execute command

python cdut_spider.py

Output example

✅ 瑞数验证等待时间:4.7s
开始模拟人类操作...
获取页面数据...
✅ 成功获取受保护数据!
关闭浏览器...
✅ 数据已成功保存到 成都理工大学学院专业信息.xlsx

📌 Notes

  1. For learning and exchange only: do not use this for commercial purposes or large-scale crawling, and respect the school's server resources.
  2. Ruishu may be updated: if the anti-detection JS stops working, use the developer tools to find the new detection points and update the script promptly.
  3. Use proxy IPs with caution: this site's anti-crawling relies mainly on feature detection, and IP bans are rare. Switching proxy IPs frequently tends to add suspicious signals.
  4. The page structure may change: check the page structure of the admissions website regularly and adjust the BeautifulSoup parsing logic accordingly, so the extracted data stays accurate.
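
One way to notice a structure change early is to probe the fetched HTML for the elements the parser depends on before running the full extraction. The sketch below uses only the standard library; `StructureProbe` and `structure_looks_intact` are hypothetical names, and the check is deliberately coarse (at least one li.li1 and at least one link inside a dd).

```python
from html.parser import HTMLParser

class StructureProbe(HTMLParser):
    """Count the elements the extraction logic relies on."""

    def __init__(self):
        super().__init__()
        self.colleges = 0      # <li class="li1"> elements
        self.major_links = 0   # <a> elements nested inside a <dd>
        self._dd_depth = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "li1" in classes:
            self.colleges += 1
        elif tag == "dd":
            self._dd_depth += 1
        elif tag == "a" and self._dd_depth:
            self.major_links += 1

    def handle_endtag(self, tag):
        if tag == "dd" and self._dd_depth:
            self._dd_depth -= 1

def structure_looks_intact(html):
    """Return True if the page still contains colleges and major links."""
    probe = StructureProbe()
    probe.feed(html)
    return probe.colleges > 0 and probe.major_links > 0
```

A False result suggests that either the layout changed or the response is still a verification page, so the saved HTML is worth inspecting by hand.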