title: Session + Cookie simulated login crawling practice description: Modern web crawler technology: Session + Cookie authenticated simulated login
Modern web crawler technology: Session + Cookie authenticated simulated login
1. Preface
If you have just started learning web crawling, you have probably run into this awkward situation:
You can scrape public pages without any problem, but as soon as you try to access your orders, favorites, or member area, you either get a `401 Unauthorized` or are redirected to a big "Please log in first" prompt.
Many people panic at this point. In fact, Session + Cookie authentication, one of the oldest and most widely used login schemes, is the easiest and most reliable place for a beginner to start.
This article walks you through a complete simulated login step by step on a real example website, from principles to code. It closes with best practices and a list of pitfalls drawn from production projects, so that you know what to do when you meet similar scenarios later.
2. Technical preparation
First set up the runtime environment. The dependencies are not complicated.
2.1 Environment requirements
- Python 3.8+ (3.10 is recommended for better requests compatibility)
- A modern browser (Chrome or Edge will work, for developer tool debugging)
- You don't need to download ChromeDriver manually; `webdriver-manager` will handle it automatically later
2.2 Install dependencies
`beautifulsoup4` is only used later to visually verify the login result. You don't need to install it if you don't want to read the page content.
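The packages this article relies on (`requests`, `selenium`, `webdriver-manager`, and optionally `beautifulsoup4`) are normally installed with pip, for example `pip install requests selenium webdriver-manager beautifulsoup4`. As a quick sanity check, a minimal sketch like the following reports which of them are still missing. The helper names `missing_packages` and `report` are made up for illustration.

```python
from importlib.util import find_spec

# Maps each pip package name to the module name it is imported under.
REQUIRED = {
    "requests": "requests",
    "selenium": "selenium",
    "webdriver-manager": "webdriver_manager",
    "beautifulsoup4": "bs4",  # optional: only used to inspect page content
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


def report():
    """Print a pip command for anything that is missing."""
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All dependencies are available.")
```

Calling `report()` before running the later examples tells you whether the environment is ready.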
3. Analyze the target website
The practice site we used this time is Teacher Cui Qingcai’s simulated login demonstration site:
https://login2.scrape.center/
It has no CAPTCHA and no complex JS encryption, yet it reproduces the entire flow of a real Session + Cookie login, which makes it especially suitable for getting started.
3.1 Quick overview of site behavior
- Traditional MVC architecture, the page is rendered directly by the backend
- Pure Session + Cookie authentication, does not involve modern solutions such as OAuth/JWT
- After a successful login, you are redirected with a 302 to the homepage `https://login2.scrape.center/`, which shows a list of movies
3.2 Break down the login process (scout it out with DevTools first)
The worst habit in crawler work is writing code the moment you start. First open Chrome's developer tools (F12 → Network), and let's walk through the entire login process step by step:
- Check Preserve log (Important! Otherwise the redirection log will be lost)
- Manually enter the account and password `admin`/`admin`, then click login
- Set the request type filter in the Network panel to `Doc` so that only document requests are shown
You will see three key requests, which correspond to a complete login authentication chain:
Click the first POST request and open its Response Headers. You will see a very important response header: `Set-Cookie`, carrying a `sessionid` value.
This `sessionid` is the "access card" issued by the backend. From then on, every time the browser sends a request to the same domain, it automatically attaches it in the `Cookie` request header.
Core point: in Session + Cookie mode, the backend issues the identifier through `Set-Cookie` only once, when the login succeeds; all subsequent communication relies on the browser carrying this cookie. What the crawler has to do is imitate the browser: save this "authentication credential" and send it back unchanged in subsequent requests.
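To make the hand-off concrete, here is a minimal sketch of how a `Set-Cookie` response header turns into the `Cookie` request header that the browser sends back. It uses only the standard library. The header value below is a made-up placeholder, not the real value returned by the demo site, and `cookie_header_from_set_cookie` is a hypothetical helper name.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value: str) -> str:
    """Turn a Set-Cookie response header into a Cookie request header value.

    Attributes such as Path or HttpOnly are instructions for the client and
    are not sent back; only the name=value pair is returned to the server.
    """
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in parsed.items())


# Placeholder header value for illustration only.
example_set_cookie = "sessionid=abc123placeholder; HttpOnly; Path=/; SameSite=Lax"
cookie_header = cookie_header_from_set_cookie(example_set_cookie)
```

Here `cookie_header` contains just the `sessionid=...` pair, which is exactly what the server needs to recognize the session.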
4. Three ways to implement simulated login
From the most common pitfalls for novices to the recommended production-level solutions, we will show them one by one.
4.1 Error-prone entry-level version: Manually manage cookies (❌ not recommended)
Many beginners do this: first send a POST request, extract the `Set-Cookie` value, and then manually stuff it into subsequent GET requests. It can be made to work, but each request behaves like a freshly opened browser window, and managing cookies by hand is cumbersome and error-prone.
The biggest problem with this method is not whether it runs but its high maintenance cost: cookies must be updated manually when they expire, and they must be passed explicitly with every request. Once the request chain gets longer, the code turns into a mess.
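A minimal sketch of this manual approach might look like the following. The `/login` path and the `username`/`password` form field names are assumptions to confirm in DevTools, and all function names are illustrative.

```python
import requests

BASE_URL = "https://login2.scrape.center"
LOGIN_URL = BASE_URL + "/login"  # assumed login path, confirm in DevTools


def login_and_extract_cookies(username, password):
    """POST the credentials and pull the cookies out of the response by hand."""
    resp = requests.post(
        LOGIN_URL,
        data={"username": username, "password": password},  # assumed field names
        allow_redirects=False,  # stop at the 302 so its Set-Cookie is on this response
        timeout=10,
    )
    return resp.cookies.get_dict()


def build_cookie_header(cookies):
    """Assemble the Cookie header manually from a name -> value dict."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def fetch_home(cookies):
    """Every later request has to carry the cookie explicitly."""
    resp = requests.get(
        BASE_URL + "/",
        headers={"Cookie": build_cookie_header(cookies)},
        timeout=10,
    )
    return resp.status_code, resp.text
```

A typical call would be `fetch_home(login_and_extract_cookies("admin", "admin"))`. Note how the cookie dict has to be threaded through every function, which is exactly the maintenance burden described above.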
4.2 Standard recommended version: use `requests.Session` (✅ must learn)
`requests` ships with its own `Session` object, which behaves like a persistent browser session. It automatically keeps track of cookies, default request headers, and the connection pool for you. You only need to log in once, and every subsequent request can go through the same session.
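A minimal sketch of the session-based login, under the same assumptions as before (the `/login` path and the `username`/`password` field names are not confirmed by this article and should be checked in DevTools):

```python
import requests

BASE_URL = "https://login2.scrape.center"
LOGIN_URL = BASE_URL + "/login"  # assumed login path


def login(session, username, password):
    """Log in once; the session stores whatever cookies the server sets."""
    resp = session.post(
        LOGIN_URL,
        data={"username": username, "password": password},  # assumed field names
        timeout=10,
    )
    resp.raise_for_status()
    return resp


def fetch_home(session):
    """No cookie handling here: the session attaches stored cookies itself."""
    resp = session.get(BASE_URL + "/", timeout=10)
    resp.raise_for_status()
    return resp.text


def main():
    with requests.Session() as session:
        login(session, "admin", "admin")
        html = fetch_home(session)
        print(len(html), "bytes fetched while logged in")
```

Running `main()` performs the login and one authenticated request against the demo site.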
**Why is this method recommended?**
- Simple code, no need to manually manage cookies
- State is maintained automatically between requests, so nothing gets left out
- You can easily add custom request headers, proxies and other configurations, which will take effect for the entire session
For most small and medium-sized websites and internal admin backends, `requests.Session` is all you need for login.
4.3 Advanced combination version: Selenium retrieves cookies + Requests for efficient crawling (✅ essential for advanced use)
Some websites put image CAPTCHAs, slider verification, fingerprint detection, or complex JS encryption in front of the login. In these cases, forging the login request directly with `requests` is very difficult, if not impossible.
At this time you can adopt a "hybrid engine" idea:
- First use Selenium to open a real browser and simulate user operations to complete the login
- After logging in, export the cookies in the browser
- Feed the cookies to `requests.Session`, and hand all subsequent crawling over to lightweight HTTP requests
This way you can get past complex human-machine verification and still enjoy the high crawling speed of `requests`.
Applicable scenarios: the login phase has to deal with complex human-machine verification, but most pages after login are static or use simple asynchronous loading.
Performance advantage: Selenium only handles the most troublesome part, the login, while large-scale crawling still uses `requests`, which is much faster than driving a browser for everything.
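The hybrid flow can be sketched roughly as follows. The CSS selectors, the `/login` path, and the function names are assumptions for illustration; on a real site with a CAPTCHA, the form-filling step would be replaced by whatever it takes to pass the verification. Selenium and `webdriver-manager` are imported inside the browser function so that the cookie-transfer helper can be used on its own.

```python
import requests

BASE_URL = "https://login2.scrape.center"
LOGIN_URL = BASE_URL + "/login"  # assumed login path


def login_with_browser(username, password):
    """Drive a real Chrome window through the login and return its cookies."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(LOGIN_URL)
        # Assumed selectors: inspect the real form before relying on them.
        driver.find_element(By.CSS_SELECTOR, 'input[name="username"]').send_keys(username)
        driver.find_element(By.CSS_SELECTOR, 'input[name="password"]').send_keys(password)
        driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()
        # Wait until the browser has left the login page.
        WebDriverWait(driver, 10).until(EC.url_changes(LOGIN_URL))
        return driver.get_cookies()  # list of dicts with name, value, domain, path...
    finally:
        driver.quit()


def session_from_browser_cookies(browser_cookies):
    """Copy Selenium-style cookie dicts into a fresh requests.Session."""
    session = requests.Session()
    for cookie in browser_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

A typical call is `session = session_from_browser_cookies(login_with_browser("admin", "admin"))`, after which `session.get(...)` is used for all further crawling.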
5. Pitfall avoidance guide & production-level best practices
The sample website has no anti-crawling measures, but real websites will not be so gentle. The following lessons can save you a lot of detours.
5.1 Advanced cookie management techniques
- Persistent storage: serialize the cookies obtained at login into a JSON file or store them in a database. The next time the crawler starts, try loading these cookies first. If they have not expired, use them directly to avoid frequent logins.
- Cookie pool: if the collection volume is large, prepare multiple accounts in advance, each with its own set of valid cookies, and rotate among them at random while crawling. This reduces the risk of a single account being blocked and improves overall stability.
- Active validity checks: sessions on many sites have a fixed validity period (for example, 24 hours). Before each batch crawl, first request an endpoint that is only accessible after login; if it returns `401`, trigger the re-login flow automatically.
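The three ideas above can be sketched as follows. This is a simplified illustration: cookies are stored as a plain name-to-value mapping (domain and path information is dropped), the function names are made up, and treating a redirect as "not logged in" is an assumption that goes beyond the `401` case described above.

```python
import json
import random
from pathlib import Path

import requests


def save_cookies(session, path):
    """Persist the session's cookies as a simple name -> value JSON object."""
    cookies = requests.utils.dict_from_cookiejar(session.cookies)
    Path(path).write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(session, path):
    """Load saved cookies into the session; return False if no file exists."""
    file = Path(path)
    if not file.exists():
        return False
    session.cookies.update(json.loads(file.read_text(encoding="utf-8")))
    return True


def is_logged_in(session, protected_url):
    """Probe a login-only URL; a 401 (or, by assumption, a redirect) means expired."""
    resp = session.get(protected_url, allow_redirects=False, timeout=10)
    return resp.status_code == 200


def pick_session(pool):
    """Cookie pool: choose one of several pre-authenticated sessions at random."""
    return random.choice(pool)
```

A startup routine would call `load_cookies`, then `is_logged_in`, and only fall back to a fresh login followed by `save_cookies` when the check fails.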
5.2 Essential details for dealing with anti-crawling measures
- Always change the User-Agent: a default UA such as `python-requests/x.x.x` practically announces that you are a script, so replace it with the value of a real browser.
- Don't forget the Referer: if the business logic depends on where the request came from, such as navigating from a list page to a detail page, set the `Referer` request header to the URL of the list page.
- Random delay: add `time.sleep(random.uniform(1, 3))` between consecutive requests to simulate the browsing rhythm of a real person. Abrupt, uninterrupted access easily triggers rate limits.
- Control crawl speed and total volume: even with delays, avoid requesting hundreds of thousands of records in a short period. This reduces the load on the server and lowers the risk of your own IP being blocked.
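Put together, these habits might look like the sketch below. The User-Agent string is just one example of a desktop Chrome value, and the function names and the `max_pages` cap are illustrative assumptions.

```python
import random
import time

import requests

# Example desktop Chrome UA; any current real-browser value works.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_session():
    """Create a session whose default UA no longer says python-requests."""
    session = requests.Session()
    session.headers.update({"User-Agent": BROWSER_UA})
    return session


def next_delay(low=1.0, high=3.0):
    """Pick a random pause length in seconds."""
    return random.uniform(low, high)


def polite_get(session, url, referer=None):
    """Sleep for a random interval, then fetch the URL with an optional Referer."""
    headers = {"Referer": referer} if referer else {}
    time.sleep(next_delay())
    return session.get(url, headers=headers, timeout=10)


def crawl_details(session, list_url, detail_urls, max_pages=50):
    """Fetch at most max_pages detail pages, citing the list page as Referer."""
    return [
        polite_get(session, url, referer=list_url)
        for url in detail_urls[:max_pages]
    ]
```

The `max_pages` cap is a crude way to bound total volume per run; a real project would tune it to what the target site can reasonably tolerate.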
6. Summary
Although Session + Cookie authentication is an "old" technology, it is stable, easy to debug, and widely applicable. It is still the preferred login method for a large number of websites.
For novices, the learning sequence is recommended:
- First master `requests.Session`; it can solve 80% of the login scenarios you will meet.
- When you run into CAPTCHAs and complex JS, bring in the combination of Selenium for cookies + Requests for crawling.
As long as you understand how the "authentication credential" travels along the request chain, most websites that require login will not be a problem for you.
Extended Reading (Stay tuned):
- Simulated login with OAuth 2.0 authentication
- JWT certified crawler processing
- Crawling strategy for WebAssembly websites

