Simulated login technology in modern web crawlers
Every crawler developer has probably experienced this moment: you finally finish the crawler logic, run it excitedly, and are met with a cold `401 Unauthorized` or `403 Forbidden`. Worse, the login mechanisms of modern websites keep getting more complex. From traditional form login to WebAssembly encryption, from simple image verification codes to invisible behavioral verification, every step is a roadblock.
This article will start with the mainstream login principles, cover everything from simple cookie reuse solutions to practical techniques for browser automation, and include important reminders on security compliance at the end.
📌 Reading Navigation
Jump directly to the corresponding chapter according to your needs:
- **New to crawling public websites that require login?** → 2.1 Direct Cookie Reuse + 2.5 Practical case: GitHub simulated login
- **Need to crawl in batches but don't want to grab cookies from the browser every time?** → 2.2 Automated form submission + 3.2 Session maintenance and renewal
- **Encountered complex verification codes, sliders or two-step verification?** → 2.3 Browser automation tools + 2.4 Complex authentication scenarios
- **Afraid of having your IP or account blocked?** → 3.1 Account pool management + 3.3 Anti-anti-crawling strategy
- **Worried about stepping on legal red lines?** → 4 Security and Compliance Recommendations
1. Modern website login verification mechanism
If you want to simulate a login, you must first understand how the website "remembers who you are." Once you understand this logic, you will have a clear idea of the subsequent operations.
1.1 Traditional Session-Cookie mechanism
This is the most classic model that is still widely used by small and medium-sized websites. The whole process is like going to the gym and applying for a temporary card.
Simplified version process:
- You enter your username and password in the browser and click login
- After the server verifies that the information is correct, it creates a "Session" on the server, which stores your login status, expiration time and other information.
- The server generates a unique Session ID and returns it to your browser in the `Set-Cookie` response header
- After that, the browser automatically sends this cookie with every request, and the server recognizes you by checking it
Common improvements in 2024:
- No longer store the Session in the memory of a single server, but instead use distributed storage such as Redis to facilitate the sharing of status among multiple servers.
- Cookies now carry the attributes `HttpOnly` (blocks JavaScript access, so an XSS payload cannot read the cookie), `Secure` (sent over HTTPS only), and `SameSite` (mitigates cross-site request forgery)
- The Session ID is regenerated after a successful login to prevent session fixation attacks
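To make the `Set-Cookie` step concrete, here is a minimal sketch that parses one such header with Python's standard library and reads back the security attributes. The header value is a made-up example, not taken from any real site, and the `SameSite` attribute is only understood by Python 3.8 or newer.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value, as a server might send it after a login.
RAW_SET_COOKIE = "sessionid=abc123; Path=/; HttpOnly; Secure; SameSite=Lax; Max-Age=3600"


def parse_set_cookie(header_value: str) -> dict:
    """Return the name, value and security attributes of one Set-Cookie value."""
    jar = SimpleCookie()
    jar.load(header_value)
    name, morsel = next(iter(jar.items()))
    return {
        "name": name,
        "value": morsel.value,
        "httponly": bool(morsel["httponly"]),  # flag attribute: JS cannot read it
        "secure": bool(morsel["secure"]),      # flag attribute: HTTPS only
        "samesite": morsel["samesite"],
        "max_age": morsel["max-age"],          # lifetime in seconds, as a string
    }


info = parse_set_cookie(RAW_SET_COOKIE)
print(info)
```

For a crawler, the `name` and `value` pair is what has to be sent back on later requests; the other attributes tell you how long the session is likely to last.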
1.2 JWT (JSON Web Token)
This is currently the preferred solution for mobile apps and for web applications whose front end and back end are decoupled. It is equivalent to issuing you a tamper-proof ID card.
Simplified version process:
- You submit your login credentials
- After verifying them, the server does not store any state. Instead it generates a signed token (a JWT is usually signed rather than encrypted) and returns it to you; the client usually keeps it in `LocalStorage` or in a cookie
- Every subsequent request must carry this token in the `Authorization: Bearer xxx` request header
- The server verifies the token's signature itself and reads your identity and expiration time from it
The core difference between the two mechanisms is that Session-Cookie is "the server remembers you", while JWT is "the token proves you".
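Because a typical JWT is only signed, a crawler can read the payload to find out when the token expires and log in again in time. The sketch below builds a demo HS256 token locally (the secret and claims are invented for illustration) and then decodes it the way a client would, without verifying the signature.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # JWT segments drop the '=' padding, so restore it before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_demo_jwt(payload: dict, secret: bytes) -> str:
    """Build an HS256-signed token roughly the way a server would (demo only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"


def read_jwt_payload(token: str) -> dict:
    """Decode the payload WITHOUT verifying the signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def is_expired(token: str, now=None) -> bool:
    exp = read_jwt_payload(token).get("exp")
    now = time.time() if now is None else now
    return exp is not None and now >= exp


token = make_demo_jwt({"sub": "user-42", "exp": 2_000_000_000}, b"server-secret")
claims = read_jwt_payload(token)
print(claims, is_expired(token))
```

Only the server, which holds the secret, can check the signature; the client side just needs `exp` to decide when to refresh.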
1.3 OAuth 2.0 / OpenID Connect
The "Log in with WeChat/GitHub/Google" buttons you now see everywhere are built on this family of standard protocols. Of its flows, the authorization code flow (Authorization Code) is the one crawlers encounter most often.
In short: you log in on the third-party platform. Once it confirms your identity, it hands the target website an "authorization code". The target website exchanges that code for an access token, uses the token to fetch your basic information (such as nickname and avatar), and never sees your password at any point.
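The two URLs a crawler actually sees in this flow are the authorization request and the redirect back to the target site. The following sketch builds the first and parses the second. The endpoint, client ID and redirect URI are hypothetical placeholders; the parameter names (`response_type`, `client_id`, `redirect_uri`, `scope`, `state`, `code`) are the standard OAuth 2.0 ones.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical authorization endpoint of the third-party platform.
AUTHORIZE_URL = "https://auth.example.com/oauth/authorize"


def build_authorize_url(client_id: str, redirect_uri: str, scope: str, state: str) -> str:
    """URL the browser is sent to so the user can log in on the third party."""
    query = urlencode({
        "response_type": "code",   # selects the authorization code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,            # random value echoed back to detect forgery
    })
    return f"{AUTHORIZE_URL}?{query}"


def extract_code(callback_url: str, expected_state: str) -> str:
    """Pull the authorization code out of the redirect back to the target site."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: forged or stale redirect")
    return params["code"][0]


state = secrets.token_urlsafe(16)
url = build_authorize_url("demo-client", "https://target.example.com/callback", "profile", state)
code = extract_code(f"https://target.example.com/callback?code=abc123&state={state}", state)
print(url)
print(code)
```

In practice a crawler rarely performs the code-for-token exchange itself, since that happens server to server. It usually just follows the redirects in a session or a real browser and keeps the cookies the target site sets at the end.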
2. Modern crawler simulated login technology
Now that we know how a website "recognizes people", let's look at how a crawler "pretends" to be one.
2.1 Direct Cookie Reuse
Applicable scenarios
- ✅ You only need to crawl some data temporarily
- ✅ The target account does not have two-step verification or behavioral verification code
- ✅ Cookie validity period is relatively long, enough for you to use for a while
Implementation steps
- Open the Chrome/Edge browser, press `F12` to open the developer tools, and switch to the Network tab
- Log in to the target website normally in the browser
- In the Network list, find the first request to the target website's domain with status code `200` or `302`, and click it to view its Request Headers
- Copy the entire string that follows `Cookie:`, or only a few key items (such as sessionid, csrftoken)
- Send these cookies with your crawler's requests
Code example (Python requests)
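A minimal sketch of the steps above. The cookie string, the User-Agent and the URL in the comment are placeholders; replace them with the values copied from your own browser.

```python
import requests

# Paste the string copied from the "Cookie:" request header (placeholder value).
RAW_COOKIE = "sessionid=abc123; csrftoken=xyz789"


def cookie_header_to_dict(raw: str) -> dict:
    """Split 'k1=v1; k2=v2' into a dict; a value may itself contain '='."""
    pairs = (item.split("=", 1) for item in raw.split(";") if "=" in item)
    return {name.strip(): value.strip() for name, value in pairs}


def build_session(raw_cookie: str) -> requests.Session:
    """Create a session that sends the copied cookies with every request."""
    session = requests.Session()
    # Use the same User-Agent as the browser the cookies were copied from.
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    session.cookies.update(cookie_header_to_dict(raw_cookie))
    return session


session = build_session(RAW_COOKIE)
# Hypothetical page that requires login:
# response = session.get("https://example.com/profile", timeout=10)
# print(response.status_code)
```

If the response is still `401` or redirects to the login page, the cookie has most likely expired or is bound to other request attributes.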
Tip: If the cookie expires quickly, first check whether its lifetime is reasonable: open F12 → Application → Cookies and look at the `Expires / Max-Age` field.
2.2 Automated form submission
Applicable scenarios
- ✅ Need to log in to multiple accounts in batches
- ✅ No complicated verification codes or sliders in the login process
- ✅ The login process does not jump to third-party pages
Implementation steps
- Capture packets and analyze the login page to find hidden input fields (such as CSRF Token, timestamp). These dynamic values must be obtained first
- Capture the packet and analyze the login request, confirm the URL, request method (usually POST) and all necessary parameters
- Use `requests.Session()` to manage the entire process. It automatically saves and sends cookies, mimicking the behavior of a real browser
Why is it necessary to use Session?
Never call `requests.get()` and `requests.post()` as separate, standalone requests! The cookies obtained by the first GET are not carried by the second POST, so the two requests do not belong to the same "session" and the server will not recognize the login.
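The whole flow can be sketched roughly as follows. The form field names `username` and `password` are assumptions; use whatever names your packet capture shows. Hidden fields are collected with the standard-library HTML parser so that the CSRF token and similar values are always fresh.

```python
from html.parser import HTMLParser

import requests


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html: str) -> dict:
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def login(session, login_page_url, post_url, username, password):
    """GET the login page, then POST the credentials through the SAME session."""
    page = session.get(login_page_url, timeout=10)
    page.raise_for_status()
    form = hidden_fields(page.text)  # fresh CSRF token, timestamp, ...
    form.update({"username": username, "password": password})  # assumed field names
    return session.post(post_url, data=form, timeout=10)


# Usage (hypothetical URLs):
# session = requests.Session()
# response = login(session, "https://example.com/login",
#                  "https://example.com/login", "alice", "secret")
```

Because `session` is reused, the cookies set by the GET are sent with the POST, and the cookies set by the POST are available to every later request.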
2.3 Browser automation tools
When simple HTTP request simulation is no longer enough, a real browser is needed.
Applicable scenarios
- ✅ Encountered slider, click, and behavior verification codes
- ✅ The login process involves third-party page jumps
- ✅ The website has complex browser fingerprint detection (such as detecting whether you are Headless Chrome)
Tool recommendation
Playwright example (simulating login to GitHub)
Install first: `pip install playwright && playwright install chromium`
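A minimal sketch using Playwright's synchronous API. The selectors `#login_field`, `#password` and `input[name='commit']` are assumptions based on GitHub's login form and may change; the "logged in" check is a simple URL heuristic, and accounts with two-step or device verification will need extra handling.

```python
from urllib.parse import urlparse


def looks_logged_in(url: str) -> bool:
    """Heuristic (an assumption): a failed or unfinished sign-in stays on /login or /session*."""
    path = urlparse(url).path
    return not (path.startswith("/login") or path.startswith("/session"))


def github_login(username: str, password: str, state_file: str = "github_state.json") -> bool:
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://github.com/login")
        page.fill("#login_field", username)   # assumed selector
        page.fill("#password", password)      # assumed selector
        page.click("input[name='commit']")    # assumed selector of the submit button
        page.wait_for_load_state("networkidle")
        ok = looks_logged_in(page.url)
        if ok:
            # Save cookies and local storage so later runs can skip the login.
            context.storage_state(path=state_file)
        browser.close()
        return ok


# Usage: github_login("your-test-account", "your-password")
```

The saved state file can be loaded later with `browser.new_context(storage_state="github_state.json")`, which is usually cheaper and less conspicuous than logging in on every run.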
2.4 Complex authentication scenarios
Login in reality is often much more complicated than the example code. The following is a breakdown of common difficulties:
1. Dynamic CSRF Token
Usually hidden in an `<input type="hidden">` element on the login page. BeautifulSoup or a regular expression can extract it. The key point is to fetch it fresh before every login; it cannot be hard-coded.
2. Simple alphanumeric verification code
- OCR solution: Tesseract OCR; mediocre recognition rate, suitable for simple scenarios
- CAPTCHA-solving platforms: high recognition rate, pay-per-use, suitable for batch operations
3. Slider/click verification code
Prioritize using Playwright to simulate the sliding trajectory of real people. The core idea is to add random jitter and uneven speed changes to avoid being recognized as mechanical operations. If the recognition rate is still not ideal, you can look for a specialized slider cracking solution.
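One way to sketch the "random jitter plus uneven speed" idea is to generate the drag as a list of small moves along an ease-out curve (fast at first, slowing near the target). The step count, jitter size and pause range below are arbitrary assumptions to tune per site.

```python
import random


def slider_track(distance: int, steps: int = 30, jitter: float = 1.5, seed=None):
    """Return per-step (dx, dy, pause_seconds) moves covering `distance` px in x."""
    rng = random.Random(seed)
    track, covered = [], 0
    for i in range(1, steps + 1):
        t = i / steps
        eased = 1 - (1 - t) ** 3                  # ease-out cubic, position in [0, 1]
        target = round(distance * eased)
        if i < steps:
            # Horizontal jitter on every step except the last, so the
            # track still ends exactly on the target.
            target += round(rng.uniform(-jitter, jitter))
        dx = target - covered
        covered = target
        dy = round(rng.uniform(-jitter, jitter))  # small vertical wobble
        pause = rng.uniform(0.008, 0.03)          # uneven timing between moves
        track.append((dx, dy, pause))
    return track


track = slider_track(260, seed=1)
print(len(track), sum(dx for dx, _, _ in track))
```

With Playwright, each step can be replayed between `page.mouse.down()` and `page.mouse.up()` by accumulating the offsets into absolute coordinates for `page.mouse.move(x, y)` and sleeping for `pause` in between.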
4. Two-step verification (TOTP dynamic code)
If the site supports standard TOTP (as used by Google Authenticator) and you have the account's TOTP secret, Python's `pyotp` library can generate the dynamic codes directly, with no mobile phone needed at all.
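With `pyotp`, generating a code is a single call: `pyotp.TOTP(secret).now()`. To show what happens underneath, the sketch below implements the same RFC 6238 calculation using only the standard library instead of `pyotp`. It assumes the common defaults of SHA-1, a 30-second step and 6 digits; the secret is the RFC's public test key, not a real one.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32: str, for_time=None, step: int = 30, digits: int = 6) -> str:
    """Compute a TOTP code (RFC 6238, HMAC-SHA1) from a Base32 secret."""
    key = base64.b32decode(secret_b32.upper().replace(" ", ""))
    now = time.time() if for_time is None else for_time
    counter = int(now // step)                       # number of elapsed time steps
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation offset
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


# Base32 form of the RFC 6238 test key "12345678901234567890".
DEMO_SECRET = base64.b32encode(b"12345678901234567890").decode("ascii")
print(totp(DEMO_SECRET))
```

The Base32 secret is the string shown next to the QR code when two-step verification is first enabled on the account.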
2.5 Practical case: GitHub simulated login (requests version)
Next, combine the Session-Cookie mechanism and the extraction of dynamic CSRF Token to write a complete GitHub login script.
Install first: `pip install requests beautifulsoup4`
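A sketch of such a script. Several details are assumptions about GitHub's login page that may change at any time: the form posts to `/session`, the credential fields are named `login` and `password`, the CSRF token is named `authenticity_token`, and a `logged_in=yes` cookie signals success. Accounts with two-step or device verification will not pass with this approach.

```python
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://github.com/login"
SESSION_URL = "https://github.com/session"  # assumed form action


def login_form_fields(html: str) -> dict:
    """Collect every named input of the login form, including hidden tokens."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", action="/session")
    if form is None:
        raise ValueError("login form not found; the page layout may have changed")
    fields = {
        field["name"]: field.get("value", "")
        for field in form.find_all("input")
        if field.get("name")
    }
    if "authenticity_token" not in fields:
        raise ValueError("authenticity_token not found in the login form")
    return fields


def github_login(username: str, password: str) -> requests.Session:
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    page = session.get(LOGIN_URL, timeout=10)
    page.raise_for_status()
    form = login_form_fields(page.text)   # fetched fresh on every login
    form.update({"login": username, "password": password})
    response = session.post(SESSION_URL, data=form, timeout=10)
    response.raise_for_status()
    if session.cookies.get("logged_in") != "yes":  # assumed success marker
        raise RuntimeError("login failed or extra verification is required")
    return session


# Usage: session = github_login("your-test-account", "your-password")
```

Submitting every input of the form, rather than only the token, keeps the script working when the page carries extra hidden fields such as timestamps.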
3. Advanced techniques and best practices
Once you've mastered the basics, the following tips can help you go further.
3.1 Account pool management
**Never use a single account for batch crawling!** Once it is blocked, all your work is wasted.
You can maintain a simple account list, randomly select an account for each request, and use it with different IPs, which can greatly reduce the risk of being blocked.
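A minimal sketch of such a pool. The class and field names are illustrative, and pinning each account to one proxy is an assumption (it avoids the same account appearing from many IPs); adapt it to your own setup.

```python
import random
from dataclasses import dataclass


@dataclass
class Account:
    username: str
    password: str
    proxy: str            # exit IP this account is always used with
    banned: bool = False


class AccountPool:
    """Hand out a random usable account and remember which ones are blocked."""

    def __init__(self, accounts, seed=None):
        self._accounts = list(accounts)
        self._rng = random.Random(seed)

    def pick(self) -> Account:
        alive = [a for a in self._accounts if not a.banned]
        if not alive:
            raise RuntimeError("no usable accounts left")
        return self._rng.choice(alive)

    def mark_banned(self, account: Account) -> None:
        account.banned = True


pool = AccountPool([
    Account("test_user_1", "pw1", "http://10.0.0.1:8080"),  # placeholder values
    Account("test_user_2", "pw2", "http://10.0.0.2:8080"),
])
account = pool.pick()
print(account.username, account.proxy)
```

When a request comes back with a ban or forced logout, call `pool.mark_banned(account)` and pick again instead of retrying with the same account.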
3.2 Session maintenance and renewal
Both session-cookies and JWTs have expiration dates. If your crawler needs to run for a long time, you can encapsulate an automatic renewal class: automatically log in again to obtain new session credentials before the cookie is about to expire.
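One possible shape for such a renewal class is sketched below. `login_func` stands for any function that performs the login and returns fresh credentials (cookies, a token, a `requests.Session`); the lifetime and safety margin are assumptions you would take from the cookie's `Max-Age` or the JWT's `exp`.

```python
import time


class RenewingSession:
    """Cache login credentials and refresh them shortly before they expire."""

    def __init__(self, login_func, ttl_seconds, margin_seconds=60, clock=time.time):
        self._login = login_func
        self._ttl = ttl_seconds
        self._margin = margin_seconds
        self._clock = clock          # injectable so the class is easy to test
        self._credentials = None
        self._expires_at = 0.0

    def get(self):
        """Return valid credentials, logging in again if they are about to expire."""
        about_to_expire = self._clock() >= self._expires_at - self._margin
        if self._credentials is None or about_to_expire:
            self._credentials = self._login()
            self._expires_at = self._clock() + self._ttl
        return self._credentials


# Usage with a placeholder login function:
renewing = RenewingSession(lambda: {"sessionid": "fresh"}, ttl_seconds=3600)
print(renewing.get())
```

Every request then goes through `renewing.get()`, so the rest of the crawler never has to think about expiry.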
3.3 Anti-anti-crawling strategy
There is only one core idea: **the more your crawler behaves like a real person, the better.**
In addition, a proxy pool is essential for large-volume crawling. Free proxies are usually not stable enough, so paid proxy services are recommended for commercial projects.
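As a rough illustration of "behaving like a person", the sketch below draws irregular pauses between requests and prepares per-session request settings with a proxy. The delay parameters, User-Agent strings and proxy URL are all assumptions; the returned dictionary uses keyword names that `requests` accepts (`headers`, `proxies`, `timeout`).

```python
import random

# Example desktop User-Agent strings; pick ONE per session and keep it,
# since a real user does not change browsers between page views.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def human_delay(base: float = 2.0, spread: float = 1.5, rng=random) -> float:
    """Seconds to wait before the next request: irregular, never below 0.5."""
    return max(0.5, rng.gauss(base, spread))


def request_settings(proxy_url: str, rng=random) -> dict:
    """Keyword arguments to reuse for every request of one session."""
    return {
        "headers": {
            "User-Agent": rng.choice(USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9",
        },
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": 10,
    }


settings = request_settings("http://127.0.0.1:8080")  # placeholder proxy
print(settings["headers"]["User-Agent"], round(human_delay(), 2))
```

A crawl loop would call `time.sleep(human_delay())` between pages and pass `**settings` to `session.get(...)`.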
4. Security and Compliance Recommendations
⚠️ **This section is very important, please read it carefully.**
Crawlers that simulate login carry real legal risk. Please keep the following points in mind:
- Comply with the robots.txt protocol: check `<target site>/robots.txt` first; if a path is explicitly disallowed, don't crawl it
- Do not collect personal data: never collect sensitive information such as mobile phone numbers, ID numbers, or bank card numbers
- Use test account: Do not use your real important account in the crawler
- Control request frequency: Do not put pressure on the target server. This is the most basic courtesy.
- Obtain authorization first for commercial purposes: Acting within a compliance framework is the safest course of action in the long run.
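The robots.txt check and the request-frequency rule can both be automated with Python's standard library. In this sketch the robots.txt content and the crawler name `my-crawler` are made up; in a real crawler you would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS_TXT)
private_ok = rules.can_fetch("my-crawler", "https://example.com/account/settings")
public_ok = rules.can_fetch("my-crawler", "https://example.com/public/page")
delay = rules.crawl_delay("my-crawler")  # seconds to wait between requests, if declared
print(private_ok, public_ok, delay)
```

Checking `can_fetch` before every request and sleeping for at least the declared delay covers the first and fourth points of the list above with a few lines of code.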
5. Future trends
The battle between crawlers and anti-crawlers will never end. Here are a few directions worth paying attention to:
- WebAssembly Encryption: More and more websites compile core encryption logic into Wasm, making reverse engineering significantly more difficult.
- Wider adoption of behavioral verification: from mouse movement trajectory to typing rhythm, the detection of machine-like behavior keeps getting finer-grained
- AI-driven dynamic defense: analyze traffic patterns in real time through machine learning and dynamically adjust interception strategies
- Headless browser detection upgrade: from simple User-Agent detection to GPU characteristics, browser plug-in fingerprints, etc.
Conclusion
Simulated login is a core skill that cannot be bypassed in crawler development. In the face of increasingly complex web security mechanisms, it is recommended to follow the following principles in actual projects:
- Prioritize finding legal and compliant solutions
- Start with simple cookie reuse and gradually deal with complex scenarios.
- Establish a complete error handling and log monitoring mechanism
- Keep code maintainable and avoid hard coding
Technical boundaries are often also moral boundaries. I hope this article can help you avoid some detours and also remind you to always act within the framework of compliance.

