Modern crawler techniques: Session and Cookie mechanisms explained
Have you ever hit this stumbling block when getting started with crawlers?
You send a POST request to the login endpoint with the requests library, and the username and password are both correct. Then you try to crawl "My Orders" or "Personal Center", but the endpoint returns 401 Unauthorized or simply redirects you back to the login page.
The root cause is hidden in one short sentence, "HTTP is a stateless protocol": the server cannot remember who just logged in. The standard web mechanism that solves this problem is today's topic: Cookie + Session.
1. Evolution from stateless to state management (minimalist version)
1.1 The "Plain Text Dilemma" of Early Static Web Pages
The earliest web pages were hard-coded HTML files, served identically to every visitor:
- ✅ Advantages: the server load is tiny and pages load fast; drop the files into Apache or Nginx and the site runs
- ❌ Problem: users cannot be told apart at all. Whether the visitor is Zhang San, Li Si, or a hacker, everyone sees the same content
1.2 "Identity needs" generated by dynamic web pages
Later, technologies such as PHP, ASP, and Java Web appeared. Web pages could generate content in real time based on requests, but new problems followed:
After a user logs in once, how can the server always remember that this is the same logged-in user when browsing the shopping cart and submitting an order?
The HTTP protocol is inherently "forgetful": each request is an independent conversation with a stranger, so an identity tag must be attached to every conversation. This is the prototype of Cookie and Session.
2. Cookie: a "small note" stored in the browser
2.1 What are Cookies?
In one sentence: a cookie is a small piece of text data that the server hands to the browser and the browser stores locally. A single cookie is generally limited to about 4 KB.
The whole process is like this:
- The browser sends its first request to the server
- The server adds a line to the response headers: Set-Cookie: name=value; attribute1; attribute2...
- The browser receives it and saves this "little note" according to its attributes
- On every later request to the same domain/path, the browser automatically adds a line to the request headers: Cookie: all matching name=value pairs
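The four steps above can be sketched with Python's standard http.cookies module. This is only an illustration under assumptions: the cookie value A1B2C3 and the three helper functions are made up for the example, and a real browser applies many more rules (domain, path, expiry) than this dictionary-based jar does.

```python
from http.cookies import SimpleCookie


def server_set_cookie_value():
    """Server side: build the value of a Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie["JSESSIONID"] = "A1B2C3"  # made-up value for illustration
    cookie["JSESSIONID"]["path"] = "/"
    cookie["JSESSIONID"]["httponly"] = True
    return cookie["JSESSIONID"].OutputString()


def browser_store(set_cookie_value, jar):
    """Browser side: parse the header value and remember name=value."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    for name, morsel in parsed.items():
        jar[name] = morsel.value


def browser_cookie_header(jar):
    """Browser side: join every stored cookie into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


jar = {}
browser_store(server_set_cookie_value(), jar)
print(browser_cookie_header(jar))  # JSESSIONID=A1B2C3
```

Note that attributes such as Path and HttpOnly stay in the browser; only the name=value pair travels back to the server.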
2.2 Key Cookie Attributes (required for crawlers and also core for security)
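The attributes a crawler most often meets in a Set-Cookie header are Domain and Path (which requests the cookie is sent with), Expires or Max-Age (how long it lives), Secure (HTTPS only), HttpOnly (hidden from page JavaScript), and SameSite (cross-site sending rules). As a minimal sketch, the standard library can pull them out of a header value. The header string below is invented for illustration, and SameSite parsing assumes Python 3.8 or later.

```python
from http.cookies import SimpleCookie

# Invented Set-Cookie value; real sites choose their own names and attributes.
raw = (
    "JSESSIONID=A1B2C3; Domain=example.com; Path=/; "
    "Max-Age=3600; Secure; HttpOnly; SameSite=Lax"
)

cookie = SimpleCookie()
cookie.load(raw)
morsel = cookie["JSESSIONID"]

# Attribute keys are lowercase inside a Morsel; flag attributes become True.
attributes = {
    key: morsel[key]
    for key in ("domain", "path", "max-age", "secure", "httponly", "samesite")
}

for key, value in attributes.items():
    print(f"{key:10} {value}")
```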
2.3 Crawler perspective: What does a cookie look like?
Open the browser developer tools (F12 → Network → select a request → Headers). The response headers contain Set-Cookie lines (sent by the server), and the request headers contain a single Cookie line (attached automatically by the browser).
The browser joins all cookies that match the request into that one line and silently sends it with every request.
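The same joining behaviour can be observed offline with the requests library, which builds the Cookie header when it prepares a request. This is a sketch under assumptions: the cookie names, values, and domains are invented, and cookie_header_for is a hypothetical helper. No network traffic is sent, because the request is only prepared.

```python
import requests


def cookie_header_for(session, url):
    """Return the Cookie header requests would send to url, or None."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


session = requests.Session()
# Invented cookies: two for example.com, one for an unrelated domain.
session.cookies.set("JSESSIONID", "A1B2C3", domain="example.com", path="/")
session.cookies.set("theme", "dark", domain="example.com", path="/")
session.cookies.set("tracker", "xyz", domain="other.example.org", path="/")

header = cookie_header_for(session, "https://example.com/orders")
print(header)  # only the two example.com cookies are included
```

The cookie stored for other.example.org is left out, which is the domain matching a browser performs as well.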
3. Session: the "user file" on the server
3.1 What is Session?
Although cookies can store tags, they must not store sensitive information such as user IDs or password hashes. Some sites do this anyway, but it is extremely dangerous because cookies are easily stolen.
So the idea of Session is this:
- The server generates a globally unique Session ID (such as JSESSIONID)
- It treats this Session ID as a "file number" and stores the real sensitive information (login status, user ID, shopping cart contents) on the server; memory, Redis, and MySQL are all acceptable
- It passes the Session ID to the browser via Set-Cookie
- On every request from the browser, the server uses the Session ID in the request headers to look up the file. If the file is found, the server knows who you are; if not, it treats you as not logged in
Key point: The browser only has one "key" (Session ID), and the real "treasure trove" (user data) is on the server side.
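The "key and treasure trove" split can be sketched as a tiny in-memory store. This is an illustrative assumption, not how any particular framework implements sessions: SessionStore and its methods are hypothetical names, and a production system would add expiry and typically keep the data in Redis or a database as mentioned above.

```python
import secrets


class SessionStore:
    """Minimal in-memory session store: Session ID -> user data."""

    def __init__(self):
        self._sessions = {}

    def create(self, user_data):
        """Store user data server-side and return the ID to put in Set-Cookie."""
        session_id = secrets.token_hex(16)  # random, hard-to-guess identifier
        self._sessions[session_id] = dict(user_data)
        return session_id

    def lookup(self, session_id):
        """Return the stored data, or None if the ID is unknown (not logged in)."""
        return self._sessions.get(session_id)

    def destroy(self, session_id):
        """Log out: forget the session."""
        self._sessions.pop(session_id, None)


store = SessionStore()
sid = store.create({"user_id": 42, "cart": ["book"]})
print(store.lookup(sid))             # the server recognises the visitor
print(store.lookup("forged-value"))  # None: treated as not logged in
```

The browser only ever holds sid; the dictionary with the user data never leaves the server.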
3.2 Common Session storage solutions
4. How does the crawler handle Session and Cookie? (core code)
The most common beginner mistake is sending every request with a standalone requests.get() or requests.post() call. Each such call behaves like a brand-new browser session: cookies are not carried over, so the login naturally does not stick.
4.1 Basic operation: Use requests.Session() to automatically manage cookies
The requests library's Session object simulates a browser session: it automatically saves, updates, and sends cookies.
Core idea: send all requests through the same session object, and cookie management becomes fully automatic, just like in a real browser.
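The core idea can be checked end to end without touching a real website. The sketch below starts a toy local server and compares standalone calls with a shared Session. Everything about the server is an assumption made for the demo: the /login and /orders paths, the demo123 cookie value, and the credentials are invented, and the handler accepts any login.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoSite(BaseHTTPRequestHandler):
    """Toy site: POST /login issues a cookie, GET /orders requires it."""

    def do_POST(self):
        # Read and discard the form body; the demo accepts any credentials.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Set-Cookie", "JSESSIONID=demo123; Path=/; HttpOnly")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "JSESSIONID=demo123" in self.headers.get("Cookie", "")
        body = b"order list" if logged_in else b"please log in"
        self.send_response(200 if logged_in else 401)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    credentials = {"username": "demo", "password": "demo"}
    try:
        # Mistake: standalone calls do not share cookies with each other.
        requests.post(f"{base}/login", data=credentials, timeout=5)
        standalone_status = requests.get(f"{base}/orders", timeout=5).status_code

        # Fix: one Session stores the cookie from /login and resends it.
        with requests.Session() as session:
            session.post(f"{base}/login", data=credentials, timeout=5)
            session_status = session.get(f"{base}/orders", timeout=5).status_code
    finally:
        server.shutdown()
        server.server_close()
    return standalone_status, session_status


if __name__ == "__main__":
    print(run_demo())
```

With the toy server above, the standalone pair should end in 401 and the Session pair in 200. Against a real site, only the Session part is needed, with the site's actual login URL and form field names.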
4.2 Advanced skills: Cookie persistence (no need to log in every time)
If the target site's sessions stay valid for a long time (for example, one week), you can save the cookies to a local file and load them on the next run instead of logging in again.
Tips:
pickle is part of the Python standard library and is convenient to use. Never load .pkl files from unknown sources, though: unpickling untrusted data can execute arbitrary code.
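A minimal sketch of this persistence, assuming the cookies live in a requests.Session. The helper names save_cookies and load_cookies and the file name cookies.pkl are hypothetical, and the cookie set in the usage part stands in for one obtained by a real login.

```python
import pickle
from pathlib import Path

import requests


def save_cookies(session, path):
    """Write the session's cookie jar to a local file."""
    with open(path, "wb") as f:
        pickle.dump(session.cookies, f)


def load_cookies(session, path):
    """Load a previously saved jar into the session; return False if absent."""
    if not Path(path).exists():
        return False
    with open(path, "rb") as f:
        # Only unpickle files this script wrote itself.
        session.cookies.update(pickle.load(f))
    return True


if __name__ == "__main__":
    first = requests.Session()
    first.cookies.set("JSESSIONID", "A1B2C3", domain="example.com", path="/")
    save_cookies(first, "cookies.pkl")

    second = requests.Session()
    if load_cookies(second, "cookies.pkl"):
        print(second.cookies.get("JSESSIONID"))
```

If the server has already expired the session, the loaded cookie is useless, so a crawler should still fall back to a fresh login when a protected page comes back as unauthorized.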
5. Security and compliance issues that crawlers need to pay attention to
5.1 Small details related to anti-crawling
- Don't use the default User-Agent of the requests library: many websites blacklist it outright, so replace it before you send anything.
- Keep the request rate down: dozens of requests per second can easily trigger a ban. Add time.sleep() between requests or route traffic through a proxy pool.
- Imitate the request sequence of a real browser: visit the home page first to obtain the hidden CSRF token, then send the login request, instead of hitting the login endpoint directly.
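The three points above might look like this in code. Treat it as a sketch under assumptions: the hidden field name csrf_token is invented (each site picks its own), the helper names are hypothetical, and the one-second delay is an arbitrary example rather than a recommended value.

```python
import time
from html.parser import HTMLParser

import requests


class _HiddenInputFinder(HTMLParser):
    """Collect the value of the first <input> with the given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            tag == "input"
            and attributes.get("name") == self.field_name
            and self.value is None
        ):
            self.value = attributes.get("value")


def extract_csrf_token(html, field_name="csrf_token"):
    """Return the hidden token from a page, or None if it is not there."""
    finder = _HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


def build_session(user_agent):
    """Create a Session that sends a custom User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def polite_get(session, url, delay_seconds=1.0):
    """Wait before each request so the crawl rate stays low."""
    time.sleep(delay_seconds)
    return session.get(url, timeout=10)
```

A login flow would then call polite_get on the home page, pass response.text to extract_csrf_token, and include the token in the data of the following session.post.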
5.2 Ethics and Legal Compliance
- Comply with the website's robots.txt: although it is not legally enforceable, respecting it is basic good practice for crawler developers
- Do not crawl sensitive personal information: data such as ID numbers, mobile phone numbers, and bank card numbers should be off limits
- Don't consume the target website's server resources excessively: for example, do not open 1000 concurrent threads against a small website
- Prefer the official API: if the website provides a public API, use it instead of a crawler
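Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses an inline sample so that it runs offline. The rules, the my-crawler agent name, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real crawler would download the site's own.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("my-crawler", "https://example.com/account/orders"))  # False
print(parser.can_fetch("my-crawler", "https://example.com/products"))        # True
```

For a live site, calling set_url() with the robots.txt address followed by read() fetches and parses the file in place of the inline sample.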
6. Summary
This article covered the following core questions:
- Why Session and Cookie are needed: HTTP is stateless, so the server does not remember clients on its own.
- What a cookie is: a small note stored locally in the browser that carries an identity tag.
- What a Session is: a server-side user file that holds the sensitive data.
- How crawlers deal with them: use requests.Session() to manage cookies automatically, and pickle for persistence.
- Security and compliance: avoid getting blocked and stay within the law. That is the bottom line.
A solid understanding of how Session and Cookie work is a required course for anyone getting started with crawlers. Once you have it, you will no longer be stuck on the "cannot get data after logging in" problem.

