Simulated login technology in modern web crawlers

Every crawler developer has probably lived this moment: you finally finish the crawler logic, run it with excitement, and get a cold `401 Unauthorized` or `403 Forbidden` in response. To make things worse, login mechanisms on modern websites keep getting more complex. From traditional form logins to WebAssembly encryption, from simple image CAPTCHAs to invisible behavioral verification, every step is a roadblock.

This article will start with the mainstream login principles, cover everything from simple cookie reuse solutions to practical techniques for browser automation, and include important reminders on security compliance at the end.


📌 Reading Navigation

Jump directly to the corresponding chapter according to your needs:

  1. **New to crawling public websites that require login?** → 2.1 Direct cookie reuse + 2.5 Practical case: GitHub simulated login
  2. **Need to crawl in batches but don't want to copy cookies from the browser every time?** → 2.2 Automated form submission + 3.2 Session maintenance and renewal
  3. **Running into complex CAPTCHAs, sliders, or two-step verification?** → 2.3 Browser automation tools + 2.4 Complex authentication scenarios
  4. **Afraid of having your IP or account blocked?** → 3.1 Account pool management + 3.3 Anti-anti-crawling strategy
  5. **Worried about crossing legal red lines?** → 4 Security and Compliance Recommendations

1. Modern website login verification mechanism

If you want to simulate a login, you must first understand how the website "remembers who you are." Once you understand this logic, you will have a clear idea of the subsequent operations.

1.1 Session-Cookie

This is the most classic model and is still widely used by small and medium-sized websites. The whole process is like getting a temporary pass at a gym.

Simplified version process:

  1. You enter your username and password in the browser and click login
  2. After the server verifies that the information is correct, it creates a "Session" on the server, which stores your login status, expiration time and other information.
  3. The server generates a unique Session ID and returns it to your browser in the `Set-Cookie` response header
  4. After that, the browser will automatically carry this cookie with every request, and the server will be able to recognize you after checking it.

Common improvements in 2024:

  • No longer store the Session in the memory of a single server, but instead use distributed storage such as Redis to facilitate the sharing of status among multiple servers.
  • Cookies now carry the `HttpOnly` (blocks JavaScript access, limiting cookie theft via XSS), `Secure` (sent over HTTPS only), and `SameSite` (mitigates cross-site request forgery) attributes
  • Automatically change the Session ID after successful login to prevent session fixation attacks
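
The flow and the hardening measures above can be sketched from the server's point of view. This is only an illustration under stated assumptions: `SessionStore` is a hypothetical in-memory stand-in for a shared store such as Redis, and the cookie name `sessionid` is an example, not something every site uses.

```python
import secrets
import time
from http.cookies import SimpleCookie


class SessionStore:
    """Hypothetical in-memory stand-in for a shared store such as Redis."""

    def __init__(self, ttl_seconds=1800):
        self.ttl_seconds = ttl_seconds
        self._sessions = {}  # session_id -> {"user": ..., "expires_at": ...}

    def create(self, user, now=None):
        """Create a session after a successful login and return its ID."""
        now = time.time() if now is None else now
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = {
            "user": user,
            "expires_at": now + self.ttl_seconds,
        }
        return session_id

    def get_user(self, session_id, now=None):
        """Return the user for a session ID, or None if unknown or expired."""
        now = time.time() if now is None else now
        record = self._sessions.get(session_id)
        if record is None or record["expires_at"] <= now:
            self._sessions.pop(session_id, None)
            return None
        return record["user"]

    def rotate(self, old_session_id):
        """Move a session to a fresh ID (guards against session fixation)."""
        record = self._sessions.pop(old_session_id)
        new_session_id = secrets.token_urlsafe(32)
        self._sessions[new_session_id] = record
        return new_session_id


def build_set_cookie(session_id):
    """Build the value of a Set-Cookie header with the hardening attributes."""
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    cookie["sessionid"]["path"] = "/"
    cookie["sessionid"]["httponly"] = True
    cookie["sessionid"]["secure"] = True
    cookie["sessionid"]["samesite"] = "Lax"
    return cookie["sessionid"].OutputString()
```

For a crawler, the practical takeaway is that the session ID in the cookie is the only thing the server checks, and that an ID captured before login may be replaced right after login.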

1.2 JWT(JSON Web Token)

This is currently the preferred solution for mobile apps and for web applications with a separated front end and back end. It is equivalent to issuing you a tamper-proof ID card.

Simplified version process:

  1. You submit your login credentials
  2. After verifying them, the server keeps no state. Instead it generates a signed token and returns it to you; the client usually stores it in `localStorage` or in a cookie
  3. On every subsequent request, you send the token in the `Authorization: Bearer xxx` request header
  4. The server verifies the token's signature and reads your identity and expiration time directly from it

The core difference between the two mechanisms is that Session-Cookie is "the server remembers you", while JWT is "the token proves you".
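
To make "the token proves you" concrete, here is a minimal sketch of an HS256-signed JWT built with the standard library only. The secret and the payload fields are made up for illustration, and real services normally rely on a maintained JWT library rather than hand-rolled code like this.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def sign_jwt(payload: dict, secret: bytes) -> str:
    """Server side: build header.payload.signature using HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(signature)


def verify_jwt(token: str, secret: bytes, now=None):
    """Server side: return the payload if signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signature = _b64url_decode(signature_b64)
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None
    signing_input = f"{header_b64}.{payload_b64}".encode("utf-8")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    now = time.time() if now is None else now
    if "exp" in payload and payload["exp"] <= now:
        return None
    return payload


def peek_payload(token: str) -> dict:
    """Client side: read the claims without verifying (no secret needed)."""
    return json.loads(_b64url_decode(token.split(".")[1]))
```

A crawler never holds the signing secret, but it can still call something like `peek_payload` to read the `exp` claim and decide when a token needs to be refreshed.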


1.3 OAuth 2.0 / OpenID Connect

The "Log in with WeChat/GitHub/Google" buttons you see everywhere are built on this family of standard protocols. Among its variants, the authorization code flow (Authorization Code) is the one crawlers encounter most often.

In short: you log in on the third-party platform. Once the third party confirms it is you, it hands the target website an "authorization code". The target website then exchanges this code for your basic information (such as nickname and avatar), and your password is never exposed to the target website at any point.
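
The first and last hops of the authorization code flow are plain URLs, which is what a crawler actually sees in its traffic. The sketch below builds the authorization request and pulls the code out of the callback. The endpoint, client ID, and redirect URI in the usage are placeholders, and the parameter names follow the OAuth 2.0 specification.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope):
    """Return (url, state) for the first hop of the authorization code flow."""
    state = secrets.token_urlsafe(16)  # random value echoed back in the callback
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{authorize_endpoint}?{query}", state


def extract_authorization_code(callback_url, expected_state):
    """Read the authorization code from the redirect back to the target site."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: possible CSRF or stale callback")
    return params["code"][0]
```

Usage with placeholder values: `build_authorization_url("https://auth.example.com/authorize", "demo-client", "https://app.example.com/callback", "profile")`. Because the login itself happens on the third party's pages, this flow is usually driven with a real browser rather than bare HTTP requests.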


2. Modern crawler simulated login technology

Now that we know how a website "recognizes people", let's turn around and see how a crawler "pretends" to be one.


2.1 Direct cookie reuse

Applicable scenarios

  • ✅ I need to crawl some data temporarily
  • ✅ The target account does not have two-step verification or behavioral verification code
  • ✅ Cookie validity period is relatively long, enough for you to use for a while

Implementation steps

  1. Open Chrome or Edge, press `F12` to open the developer tools, and switch to the Network tab
  2. Log in to the target website normally in the browser
  3. In the Network list, find the first request to the target site's domain with status code `200` or `302`, then click it to view its Request Headers
  4. Copy the entire string after `Cookie:`, or only the key items (such as sessionid, csrftoken)
  5. Send these cookies with your crawler's requests

Code example (Python requests)

import requests

# Recommended: pass cookies as a dict, which is clear and easy to maintain
cookies = {
    "sessionid": "abc123def456...",
    "csrftoken": "xyz789..."
}

response = requests.get(
    "https://example.com/protected-page",
    cookies=cookies
)
print(response.status_code)  # 200 suggests success; also check the body is not the login page

Tip: If the cookies stop working sooner than expected, check their lifetime first: open F12 → Application → Cookies and look at the `Expires / Max-Age` field.
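
If you copied the whole `Cookie:` string in step 4, a small helper can turn it into the dict form used above. This is a minimal sketch; `cookie_header_to_dict` is a name made up for this example.

```python
def cookie_header_to_dict(raw_header):
    """Convert a copied 'Cookie: a=1; b=2' string into a {name: value} dict."""
    raw_header = raw_header.strip()
    # Tolerate the header name being copied along with the value
    if raw_header.lower().startswith("cookie:"):
        raw_header = raw_header[len("cookie:"):]
    cookies = {}
    for pair in raw_header.split(";"):
        # Split on the first "=" only, since values may contain "=" themselves
        name, separator, value = pair.strip().partition("=")
        if separator and name:
            cookies[name] = value
    return cookies
```

The result can be passed straight to `requests.get(..., cookies=...)`.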


2.2 Automated form submission

Applicable scenarios

  • ✅ Need to log in to multiple accounts in batches
  • ✅ No complicated verification codes or sliders in the login process
  • ✅ The login process does not jump to third-party pages

Implementation steps

  1. Capture and analyze the login page traffic to find hidden input fields (such as a CSRF token or timestamp). These dynamic values must be fetched first
  2. Capture and analyze the login request to confirm its URL, request method (usually POST), and all required parameters
  3. Use `requests.Session()` to manage the whole flow. It stores and sends cookies automatically, much like a real browser does

Why is it necessary to use Session?

Never issue the two requests with standalone `requests.get()` and `requests.post()` calls! Standalone calls do not share cookies, so the cookie set by the first GET is never sent with the following POST, and the server treats the two requests as unrelated sessions.

import requests

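# Correct approach: create one Session and use it throughout
session = requests.Session()
# GET the login page to collect hidden parameters (the Session stores cookies the server returns)
login_page = session.get("https://example.com/login")
# POST the login form (the Session sends the stored cookies automatically)
# {...} is a placeholder for the real form fields
response = session.post("https://example.com/api/login", data={...})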
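
For step 1, the hidden fields can be collected from the login page's HTML. Here is a minimal sketch using only the standard library's `html.parser`; the field names in the usage note are hypothetical and depend entirely on the target site.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of hidden form fields found in the given HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

The result can then be merged with the credentials before the POST, for example `{**extract_hidden_fields(login_page.text), "username": "...", "password": "..."}`, where `username` and `password` stand in for whatever the real form calls them.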

2.3 Browser automation tools

When simple HTTP request simulation is no longer enough, a real browser is needed.

Applicable scenarios

  • ✅ The site uses slider, click-based, or behavioral CAPTCHAs
  • ✅ The login process involves third-party page jumps
  • ✅ The website has complex browser fingerprint detection (such as detecting whether you are Headless Chrome)

Tool recommendation

| Tool | Features | Recommendation |
| --- | --- | --- |
| Playwright | Open-sourced by Microsoft, simple API, built-in auto-waiting, actively maintained | ⭐⭐⭐⭐⭐ |
| Selenium 4.0+ | Long-established tool with a mature ecosystem, but configuration is somewhat cumbersome | ⭐⭐⭐⭐ |
| Pyppeteer | Python port of Puppeteer, unmaintained for a long time | ⭐⭐ (not recommended) |

Playwright example (simulating login to GitHub)

Install first: `pip install playwright && playwright install chromium`

from playwright.sync_api import sync_playwright

def github_login_playwright(username, password):
    with sync_playwright() as p:
        # Launch the browser
        # headless=False shows the browser window, which helps with debugging
        # slow_mo=100 adds a 100 ms delay to each operation to mimic human speed
        browser = p.chromium.launch(headless=False, slow_mo=100)
        page = browser.new_page(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            )
        )

        # Navigate to the login page
        page.goto("https://github.com/login")

        # Fill in the form; Playwright waits for the elements automatically
        page.fill("#login_field", username)
        page.fill("#password", password)
        page.click("input[type='submit']")

        # Check whether the login succeeded
        try:
            page.wait_for_selector(
                "button[aria-label='View profile and more']",
                timeout=10000
            )
            print("✅ Login succeeded!")
            # Save the cookies so they can be reused next time
            cookies = page.context.cookies()
            return cookies
        except Exception:
            print("❌ Login failed!")
            return None
        finally:
            browser.close()
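
The cookies returned by `page.context.cookies()` are a list of dicts with keys such as `name`, `value`, `domain`, and `expires`. One way to reuse them later is sketched below, assuming a local JSON file as storage; the helper names are made up for this example.

```python
import json
import time
from pathlib import Path


def save_cookies(cookies, path):
    """Persist the list of cookie dicts returned by Playwright as JSON."""
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def load_valid_cookies(path, now=None):
    """Load saved cookies and return {name: value} for those not yet expired."""
    now = time.time() if now is None else now
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    valid = {}
    for cookie in cookies:
        expires = cookie.get("expires", -1)
        # Playwright reports session cookies (no expiry) with expires == -1
        if expires == -1 or expires > now:
            valid[cookie["name"]] = cookie["value"]
    return valid
```

The returned dict has the same shape as the `cookies=` argument in the cookie reuse example, so a browser login can be done once and followed by lightweight `requests` calls. Note that this simple version drops the domain and path, which is fine only when all cookies belong to one site.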

2.4 Complex authentication scenarios

Login in reality is often much more complicated than the example code. The following is a breakdown of common difficulties:

1. Dynamic CSRF Token

The token usually sits in an `<input type="hidden">` element on the login page and can be extracted with BeautifulSoup or a regular expression. The key point is to fetch a fresh value before every login; it must never be hard-coded.

2. Simple alphanumeric CAPTCHA

  • OCR: Tesseract OCR has a mediocre recognition rate and suits only simple cases
  • CAPTCHA-solving service: high recognition rate, pay-per-use, suitable for batch operations

3. Slider/click CAPTCHA

Prefer Playwright for simulating a human drag trajectory. The core idea is to add random jitter and uneven speed changes so the movement is not recognized as mechanical. If the pass rate is still unsatisfactory, look for a solution specialized in slider CAPTCHAs.
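
As a rough illustration of "random jitter and uneven speed", the sketch below generates a drag trajectory that starts fast, slows down near the target, and wobbles slightly on both axes. The easing curve, step count, and jitter size are arbitrary choices for this example, not values known to pass any particular CAPTCHA.

```python
import random


def human_like_track(distance, steps=30, jitter=1.5, rng=None):
    """Return a list of (dx, dy, pause_seconds) moves covering `distance` pixels."""
    if rng is None:
        rng = random.Random()
    track = []
    previous_x = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        # Cubic ease-out: large moves at first, small ones near the end
        target_x = distance * (1 - (1 - t) ** 3)
        if i < steps:
            # Horizontal wobble on every step except the last,
            # so the track still ends exactly at `distance`
            target_x += rng.uniform(-jitter, jitter)
        dx = target_x - previous_x
        dy = rng.uniform(-jitter, jitter)     # small vertical wobble
        pause = rng.uniform(0.008, 0.03)      # uneven timing between moves
        track.append((dx, dy, pause))
        previous_x = target_x
    return track
```

With Playwright, each step could be replayed by accumulating the offsets and calling `page.mouse.move(x, y)` between `page.mouse.down()` and `page.mouse.up()`, sleeping for `pause` after each move.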

4. Two-step verification (TOTP dynamic code)

If the site uses standard TOTP (as in Google Authenticator), Python's `pyotp` library can generate the one-time codes directly from the account's TOTP secret, with no phone required.
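
To show what happens under the hood, here is a standard-library sketch of the TOTP algorithm from RFC 6238 (HMAC-SHA1, 30-second window). It assumes you have the Base32 secret that the site displays when two-step verification is set up.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_base32, for_time=None, digits=6, period=30):
    """Compute a TOTP code (RFC 6238, HMAC-SHA1) from a Base32 secret."""
    for_time = time.time() if for_time is None else for_time
    # Secrets are often shown in groups with spaces and without padding
    normalized = secret_base32.replace(" ", "").upper()
    normalized += "=" * (-len(normalized) % 8)
    key = base64.b32decode(normalized)
    # The moving factor is the number of whole periods since the Unix epoch
    counter = int(for_time // period)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select a 4-byte window
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

In practice `pyotp.TOTP(secret).now()` does the same job in one line; the sketch is mainly useful for understanding why the secret alone is enough.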


2.5 Practical case: GitHub simulated login (requests version)

Next, combine the Session-Cookie mechanism and the extraction of dynamic CSRF Token to write a complete GitHub login script.

Install first: `pip install requests beautifulsoup4`

import requests
from bs4 import BeautifulSoup

def github_login_requests(username, password):
    # A Session is required here!
    session = requests.Session()
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Referer": "https://github.com/"
    })

    try:
        # Step 1: visit the login page and get the CSRF token
        login_page = session.get(
            "https://github.com/login",
            timeout=10
        )
        login_page.raise_for_status()
        soup = BeautifulSoup(login_page.text, "html.parser")

        # GitHub's CSRF token field is named authenticity_token
        token_input = soup.find("input", {"name": "authenticity_token"})
        if not token_input:
            raise ValueError("authenticity_token not found; the page structure may have changed")
        authenticity_token = token_input["value"]

        # Step 2: build the login data and submit it
        login_data = {
            "commit": "Sign in",
            "authenticity_token": authenticity_token,
            "login": username,
            "password": password,
            "trusted_device": "",
            "webauthn-support": "supported",
        }

        response = session.post(
            "https://github.com/session",
            data=login_data,
            timeout=10,
            allow_redirects=True
        )
        response.raise_for_status()

        # Step 3: check the login result
        if "Sign out" in response.text:
            print("✅ Login succeeded!")
            return session
        else:
            print("❌ Login failed! Check the username and password, or whether two-step verification is required.")
            return None

    except requests.RequestException as e:
        print(f"❌ Network request failed: {e}")
        return None
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return None

3. Advanced techniques and best practices

Once you've mastered the basics, the following tips can help you go further.

3.1 Account pool management

**Never rely on a single account for batch crawling!** Once it is blocked, all your work is wasted.

You can maintain a simple account list, pick an account at random for each request, and pair each account with a different IP, which greatly reduces the risk of being blocked.
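
A minimal sketch of such a pool is shown below. `AccountPool`, the `username` key, and the cooldown length are all assumptions made for this example; a real project would likely also track the proxy assigned to each account.

```python
import random
import time


class AccountPool:
    """Pick accounts at random and rest the ones that got blocked."""

    def __init__(self, accounts, cooldown_seconds=600, rng=None):
        self._accounts = list(accounts)       # e.g. [{"username": ..., "password": ...}]
        self._cooldown_seconds = cooldown_seconds
        self._blocked_until = {}              # username -> timestamp
        self._rng = rng if rng is not None else random.Random()

    def pick(self, now=None):
        """Return a random account that is not cooling down."""
        now = time.time() if now is None else now
        available = [
            account for account in self._accounts
            if self._blocked_until.get(account["username"], 0) <= now
        ]
        if not available:
            raise RuntimeError("no account available; all are cooling down")
        return self._rng.choice(available)

    def mark_blocked(self, username, now=None):
        """Take an account out of rotation for the cooldown period."""
        now = time.time() if now is None else now
        self._blocked_until[username] = now + self._cooldown_seconds
```

The crawler would call `pick()` before each request and `mark_blocked()` whenever a response looks like a ban, such as an unexpected 403 or a redirect to a verification page.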

3.2 Session maintenance and renewal

Both session-cookies and JWTs have expiration dates. If your crawler needs to run for a long time, you can encapsulate an automatic renewal class: automatically log in again to obtain new session credentials before the cookie is about to expire.
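
One possible shape for such a renewal class is sketched below. It assumes you supply a `login` callable (hypothetical, for example a wrapper around one of the login functions in this article) that returns the credential together with its expiry timestamp.

```python
import time


class RenewingSession:
    """Cache a login credential and refresh it shortly before it expires."""

    def __init__(self, login, renew_margin=60, clock=time.time):
        self._login = login                # callable returning (credential, expires_at)
        self._renew_margin = renew_margin  # seconds before expiry to renew early
        self._clock = clock                # injectable for testing
        self._credential = None
        self._expires_at = 0.0

    def get(self):
        """Return a credential that is valid for at least `renew_margin` seconds."""
        about_to_expire = self._clock() >= self._expires_at - self._renew_margin
        if self._credential is None or about_to_expire:
            self._credential, self._expires_at = self._login()
        return self._credential
```

Every request path then calls `get()` instead of holding on to a cookie or token directly, so renewal happens in one place.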

3.3 Anti-anti-crawling strategy

There is only one core idea: **the more your crawler behaves like a real person, the better.**

# Examples of common practices in real projects
import time
import random
from fake_useragent import UserAgent

# Random User-Agent
ua = UserAgent()
headers = {
    "User-Agent": ua.random,
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "https://previous-page.com/",
}

# Wait a random 1 to 3 seconds before each request
time.sleep(random.uniform(1, 3))

In addition, a proxy pool is essential for high-volume crawling. Free proxies are usually not stable enough, so paid proxy services are recommended for commercial projects.


4. Security and Compliance Recommendations

⚠️ **This section is very important, please read it carefully. **

Crawlers that simulate login operate in a legally high-risk area. Please keep the following points in mind:

  1. Respect the robots protocol: check the target site's `/robots.txt` first, and do not crawl any path it explicitly disallows
  2. Do not collect personal data: never collect sensitive information such as mobile phone numbers, ID numbers, or bank card numbers
  3. Use test account: Do not use your real important account in the crawler
  4. Control request frequency: Do not put pressure on the target server. This is the most basic courtesy.
  5. Obtain authorization first for commercial purposes: Acting within a compliance framework is the safest course of action in the long run.
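
The robots.txt check in point 1 is easy to automate with the standard library's `urllib.robotparser`. The sketch below parses rules from a string so it can run offline; the sample rules in the usage note are invented.

```python
from urllib.robotparser import RobotFileParser


def build_robot_checker(robots_txt, user_agent="*"):
    """Return a function telling whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def is_allowed(url):
        return parser.can_fetch(user_agent, url)

    return is_allowed
```

For example, with the rules `"User-agent: *\nDisallow: /private/\n"`, the checker rejects `https://example.com/private/data` and accepts other paths. Against a live site, `RobotFileParser` can also fetch the file itself via `set_url()` followed by `read()`.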

The battle between crawlers and anti-crawlers will never end. Here are a few directions worth paying attention to:

  • WebAssembly Encryption: More and more websites compile core encryption logic into Wasm, making reverse engineering significantly more difficult.
  • Popularization of behavioral verification codes: From mouse movement trajectory to typing rhythm, the characteristics of machine behavior are continuously refined
  • AI-driven dynamic defense: analyze traffic patterns in real time through machine learning and dynamically adjust interception strategies
  • Headless browser detection upgrade: from simple User-Agent detection to GPU characteristics, browser plug-in fingerprints, etc.

Conclusion

Simulated login is a core skill that cannot be bypassed in crawler development. In the face of increasingly complex web security mechanisms, it is recommended to follow the following principles in actual projects:

  1. Prioritize finding legal and compliant solutions
  2. Start with simple cookie reuse and gradually deal with complex scenarios.
  3. Establish a complete error handling and log monitoring mechanism
  4. Keep code maintainable and avoid hard coding

Technical boundaries are often also moral boundaries. I hope this article can help you avoid some detours and also remind you to always act within the framework of compliance.