Modern Crawler Techniques: Session and Cookie Mechanisms Explained

Have you ever hit this stumbling block when getting started with crawlers?

You use the requests library to send a POST request to the login endpoint, and the username and password are both correct. Then you try to crawl "My Orders" or "Personal Center", but the endpoint returns 401 Unauthorized, or simply redirects you back to the login page.

The root cause lies in one short sentence: "HTTP is a stateless protocol". The server cannot remember who just logged in. The standard web mechanism that solves this problem is today's topic: Cookie + Session.


1. Evolution from stateless to state management (minimalist version)

1.1 The "Plain Text Dilemma" of Early Static Web Pages

The earliest web pages were all hard-coded HTML files, which probably looked like this:

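<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>A personal home page from the year 2000</title>
  </head>
  <body>
    <h1>Welcome to Zhang San's static page!</h1>
    <p>Content can only be updated by editing the code by hand every day...</p>
  </body>
</html>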

  • Advantages: Server load is minimal and pages load fast; drop the files into Apache or Nginx and the site runs.
  • Problem: The server cannot tell users apart. Zhang San, Li Si, and an attacker all see exactly the same content.

1.2 "Identity needs" generated by dynamic web pages

Later, technologies such as PHP, ASP, and Java Web appeared. Web pages could generate content in real time based on requests, but new problems followed:

After a user logs in once, how can the server always remember that this is the same logged-in user when browsing the shopping cart and submitting an order?

HTTP is inherently "forgetful": the server treats every request as a conversation with a stranger, so each request must carry an identity tag. This is the origin of Cookie and Session.


2. Cookie: the "little note" stored in the browser

2.1 What are Cookies?

In one sentence: a cookie is a small piece of text data that the server hands to the browser, which stores it locally. A single cookie is generally limited to about 4 KB.

The whole process is like this:

  1. The browser sends its first request to the server
  2. The server adds a line to the response headers: Set-Cookie: name=value; attribute1; attribute2...
  3. The browser receives it and saves this "little note" according to its attributes
  4. On every later request to the same domain/path, the browser automatically adds a request header: Cookie: followed by all matching name=value pairs
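
The four steps above can be sketched with Python's standard library. This is only an illustration of what a browser does internally: the two Set-Cookie values are made-up samples, and the plain dict used as a cookie store ignores attributes such as Domain and Path.

```python
from http.cookies import SimpleCookie

# Step 2: Set-Cookie values as a server might send them (sample data).
set_cookie_lines = [
    "JSESSIONID=abc123def456; Path=/; HttpOnly",
    "user_theme=dark; Max-Age=86400; Domain=example.com",
]

# Step 3: the client parses each header and keeps the name=value pair.
stored = {}
for line in set_cookie_lines:
    parsed = SimpleCookie()
    parsed.load(line)
    for name, morsel in parsed.items():
        stored[name] = morsel.value

# Step 4: on the next request, all stored pairs are joined into one Cookie header.
cookie_header = "; ".join(f"{name}={value}" for name, value in stored.items())
print("Cookie:", cookie_header)
```

A real browser also checks each cookie's attributes before sending it, which is what the attribute table covers.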

| Attribute | Description | Effect on crawlers |
| --- | --- | --- |
| Domain | The hosts the cookie is sent to. Without this attribute the cookie goes only to the exact host that set it (example.com, but not test.example.com); with Domain=example.com it is also sent to subdomains | When crawling across subdomains, check that the cookie's Domain actually matches |
| Path | The URL path the cookie applies to, for example only under /api | When crawling an API, don't copy only the site-wide cookies; you may miss the ones scoped to the API path |
| Expires / Max-Age | Expiration time. If neither is set, it is a "session cookie" (deleted when the browser closes); if set, it is a "persistent cookie" | A persistent cookie obtained after login can be saved and reused, for example to crawl an e-commerce site again the next day without logging in |
| Secure | The cookie is sent only over HTTPS | Most websites use HTTPS today, so this rarely matters unless you crawl over plain HTTP |
| HttpOnly | JavaScript cannot read the cookie (this mitigates XSS theft); the browser still sends it with requests | Don't try to read this kind of cookie with JavaScript; take it from the Set-Cookie response header or the cookie jar |
| SameSite | Controls whether the cookie is sent on cross-site requests; Chrome defaults to Lax (sent on same-site requests and top-level navigations) | Keep this restriction in mind when crawling content embedded across sites |
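
To make the Domain and Path rows concrete, here is a simplified sketch of the matching rules from RFC 6265. The function names are hypothetical, and the sketch leaves out IP addresses and public-suffix checks that real browsers apply.

```python
def domain_matches(request_host, cookie_domain, host_only):
    """Return True if a cookie scoped to cookie_domain may be sent to request_host.

    host_only is True when the Set-Cookie header had no Domain attribute.
    """
    request_host = request_host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    if host_only:
        return request_host == cookie_domain
    return request_host == cookie_domain or request_host.endswith("." + cookie_domain)


def path_matches(request_path, cookie_path):
    """Return True if cookie_path covers request_path."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # "/api" covers "/api/orders" but not "/apiv2".
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


print(domain_matches("test.example.com", "example.com", host_only=False))
print(domain_matches("test.example.com", "example.com", host_only=True))
print(path_matches("/api/orders", "/api"))
print(path_matches("/apiv2", "/api"))
```

If a request after login keeps coming back unauthorized, comparing the request URL against the cookie's Domain and Path in this way is a quick first check.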

Open the browser developer tools (F12 → Network → Select a request → Headers) and you will see:

Set-Cookie in the response header (sent by the server):

Set-Cookie: JSESSIONID=abc123def456; Path=/; HttpOnly; SameSite=Lax
Set-Cookie: user_theme=dark; Max-Age=86400; Domain=example.com

Cookie in the request header (automatically brought by the browser):

Cookie: JSESSIONID=abc123def456; user_theme=dark

Notice that the browser automatically joins all matching cookies into a single header and silently sends it with every request.


3. Session: "User File" on the server

3.1 What is Session?

Cookies can store tags, but they should not store sensitive information such as user IDs or password hashes. (Some sites do this anyway, but it is very dangerous because cookies are easily stolen.)

So the idea behind Session is this:

  1. The server generates a globally unique Session ID (such as the JSESSIONID you just saw)
  2. It treats this Session ID as a "file number" and stores the actual sensitive data (login status, user ID, shopping cart contents) on the server (memory, Redis, or MySQL all work)
  3. It sends the Session ID to the browser via Set-Cookie
  4. On every request, the server uses the Session ID from the Cookie request header to look up the file. If it is found, the server knows who you are; if not, you are treated as not logged in.

Key point: The browser only has one "key" (Session ID), and the real "treasure trove" (user data) is on the server side.
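
The flow above can be sketched as a tiny in-memory session store. Everything here is hypothetical and for illustration only: the SessionStore class, the SESSIONID cookie name, and the plain-text password check stand in for whatever a real framework does.

```python
import secrets


class SessionStore:
    """Minimal in-memory session store: Session ID -> server-side user data."""

    COOKIE_NAME = "SESSIONID"  # hypothetical cookie name

    def __init__(self):
        self._sessions = {}

    def login(self, username, password, user_db):
        """Return a Set-Cookie value on success, or None on bad credentials."""
        if user_db.get(username) != password:  # demo only; never store plain passwords
            return None
        session_id = secrets.token_hex(16)  # random, hard-to-guess ID
        self._sessions[session_id] = {"username": username, "cart": []}
        return f"{self.COOKIE_NAME}={session_id}; Path=/; HttpOnly"

    def current_user(self, cookie_header):
        """Look up the user from a Cookie request header, or return None."""
        for pair in cookie_header.split(";"):
            name, _, value = pair.strip().partition("=")
            if name == self.COOKIE_NAME and value in self._sessions:
                return self._sessions[value]["username"]
        return None


store = SessionStore()
set_cookie = store.login("alice", "s3cret", {"alice": "s3cret"})
cookie_pair = set_cookie.split(";")[0]  # the browser keeps only name=value
print(store.current_user(cookie_pair))
print(store.current_user("SESSIONID=forged"))
```

The client only ever sees the random ID; the username and cart never leave the server.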

3.2 Common Session storage solutions

| Storage method | Features | Common scenarios |
| --- | --- | --- |
| Server memory | Fastest, but all sessions are lost on restart | Single-server small sites, test environments |
| MySQL / PostgreSQL | Persistent and durable, but slower to query | Low-traffic sites that need long-term storage |
| Redis (mainstream recommendation) | Very fast, supports automatic expiry, well suited to distributed deployments | Medium and large e-commerce and social platforms |
| JWT (stateless alternative) | The session data lives in a signed token held by the client, so the server stores nothing; the downside is that an issued token cannot easily be revoked before it expires | Mobile apps, APIs with a separate front end and back end |
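
To illustrate the "stateless alternative" row, here is a minimal sketch of a signed token built with the standard library. It is not a real JWT implementation: the token layout, the SECRET_KEY value, and the function names are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-secret-change-me"  # hypothetical server-side secret


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload):
    return _b64(hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).digest())


def issue_token(user_id, ttl_seconds, now=None):
    """Pack the user data and an expiry time into a signed token."""
    now = time.time() if now is None else now
    claims = {"uid": user_id, "exp": now + ttl_seconds}
    payload = _b64(json.dumps(claims).encode("utf-8"))
    return f"{payload}.{_sign(payload)}"


def verify_token(token, now=None):
    """Return the claims if the signature is valid and not expired, else None."""
    now = time.time() if now is None else now
    payload, sep, signature = token.partition(".")
    if not sep:
        return None
    if not hmac.compare_digest(signature.encode("utf-8"), _sign(payload).encode("utf-8")):
        return None
    claims = json.loads(_unb64(payload))
    return claims if claims["exp"] >= now else None


token = issue_token(42, ttl_seconds=60, now=1000)
print(verify_token(token, now=1030))  # still valid
print(verify_token(token, now=2000))  # expired
```

The server keeps no per-user record here, which is also why it has no simple way to cancel a token early.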

4. How crawlers handle Session and Cookie

The most common beginner mistake is to send every request with a separate requests.get() or requests.post() call. Each call then behaves like a brand-new browser session: cookies are not carried over, so the login never sticks.
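
The difference is easy to reproduce offline. The sketch below starts a throwaway local server and, so that it runs without third-party packages, uses the standard library's urllib and http.cookiejar in place of requests; the /login and /profile paths and the SESSIONID cookie are invented for the demo.

```python
import http.cookiejar
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """/login sets a cookie; every other path requires that cookie."""

    def do_GET(self):
        if self.path == "/login":
            self.send_response(200)
            self.send_header("Set-Cookie", "SESSIONID=abc123; Path=/")
            self.end_headers()
            self.wfile.write(b"logged in")
        elif "SESSIONID=abc123" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"profile data")
        else:
            self.send_response(401)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(opener, url):
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"
no_proxy = urllib.request.ProxyHandler({})  # always talk to localhost directly

# Without a cookie jar: the login cookie is thrown away after each request.
plain = urllib.request.build_opener(no_proxy)
fetch_status(plain, base + "/login")
status_without_jar = fetch_status(plain, base + "/profile")

# With a cookie jar: the cookie from /login is sent again on /profile.
jar = http.cookiejar.CookieJar()
with_jar = urllib.request.build_opener(no_proxy, urllib.request.HTTPCookieProcessor(jar))
fetch_status(with_jar, base + "/login")
status_with_jar = fetch_status(with_jar, base + "/profile")

server.shutdown()
server.server_close()
print(status_without_jar, status_with_jar)
```

A requests.Session object plays the same role as the cookie jar here.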

4.1 Basic operation: Use requests.Session() to automatically manage cookies

The Session object of the requests library is a handy tool that simulates a browser session: it saves, updates, and sends cookies automatically:

import requests

# 1. Create a "browser session"
session = requests.Session()

# 2. Set the User-Agent (pose as Chrome; many sites block the default one)
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/128.0.0.0 Safari/537.36"
    )
})

# 3. Send the login request (cookies are stored in the session automatically)
login_url = "https://example.com/api/login"
login_data = {"username": "test_user", "password": "test_password123"}
resp = session.post(login_url, data=login_data, timeout=10)

# 4. Check whether the login succeeded (the "code" field depends on the site)
if resp.status_code == 200 and resp.json().get("code") == 0:
    print("Login succeeded!")
else:
    print(f"Login failed: {resp.text}")

# 5. Send later requests through the same session; cookies are attached automatically
profile_url = "https://example.com/api/profile"
profile_resp = session.get(profile_url, timeout=10)
print(profile_resp.text)  # the personal center data is now available

Core idea: send every request through the same session object. Cookie management then becomes fully automatic, just like in a real browser.
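
To check what a session would actually send, without touching the network, one option is to prepare a request and inspect its headers. A small sketch, assuming the requests package is installed; the cookie value and both URLs are placeholders.

```python
import requests

session = requests.Session()
# Pretend the server set this cookie during login (sample value).
session.cookies.set("JSESSIONID", "abc123def456", domain="example.com", path="/")

# prepare_request() builds the outgoing request but does not send it.
same_site = session.prepare_request(
    requests.Request("GET", "https://example.com/api/profile")
)
other_site = session.prepare_request(
    requests.Request("GET", "https://other.example.org/")
)

print(same_site.headers.get("Cookie"))   # the cookie is attached
print(other_site.headers.get("Cookie"))  # no match for this domain
```

This is a convenient way to debug a login that "does not stick": if the Cookie header is missing, the Domain or Path of the stored cookie probably does not match the URL.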

If the target website's sessions stay valid for a long time (say, one week), you can save the cookies to a local file and load them directly on the next run:

import requests
import pickle  # Python standard library, used for object serialization

# ---- Step 1: save the cookies after a successful login ----
session = requests.Session()
session.post("https://example.com/api/login", data={"username": "...", "password": "..."})

with open("example_cookies.pkl", "wb") as f:
    pickle.dump(session.cookies, f)
print("Cookies saved to a local file")

# ---- Step 2: load them directly on the next run ----
session = requests.Session()
with open("example_cookies.pkl", "rb") as f:
    session.cookies.update(pickle.load(f))

# Verify that the cookies are still valid
profile_resp = session.get("https://example.com/api/profile")
if "test_user" in profile_resp.text:
    print("Cookies are valid; logged in without credentials!")
else:
    print("Cookies have expired; log in again")

Tip: pickle is part of the Python standard library and very convenient to use. But never load .pkl files from unknown sources; unpickling untrusted data is a security risk.
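
If the pickle risk is a concern, one possible alternative is to store the cookies as JSON. The sketch below assumes the requests package is installed; save_cookies and load_cookies are hypothetical helper names. Note the trade-off: only name/value pairs are kept, so Domain, Path, and expiry information is lost.

```python
import json
import os
import tempfile

import requests


def save_cookies(session, path):
    """Write the session's cookies to a JSON file as a name -> value mapping."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session, path):
    """Read a name -> value mapping from JSON and add it to the session."""
    with open(path, "r", encoding="utf-8") as f:
        session.cookies.update(requests.utils.cookiejar_from_dict(json.load(f)))


# Round trip with a sample cookie instead of a real login.
first = requests.Session()
first.cookies.set("JSESSIONID", "abc123def456")

cookie_file = os.path.join(tempfile.mkdtemp(), "example_cookies.json")
save_cookies(first, cookie_file)

second = requests.Session()
load_cookies(second, cookie_file)
print(second.cookies.get("JSESSIONID"))
```

Loading a JSON file only ever produces plain strings and dicts, so a tampered file cannot run code the way a crafted pickle can.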


5. Security and compliance issues that crawlers need to pay attention to

  • Don't use the default User-Agent of the requests library: many websites blacklist it outright, so replace it before you start.
  • Keep the request rate low: dozens of requests per second can easily get you banned. Add time.sleep() between requests or use a proxy pool.
  • Imitate the request sequence of a real browser: visit the home page first to obtain the hidden CSRF token, then send the login request, instead of hitting the login endpoint directly.
  • Comply with the website's robots.txt: although it is not legally enforceable, respecting it is basic etiquette for crawler developers.
  • Do not crawl sensitive personal information: leave data such as ID numbers, mobile phone numbers, and bank card numbers alone.
  • Don't overload the target website's servers: for example, don't open 1000 concurrent threads against a small site.
  • Prefer the official API: if the website provides a public API, use it instead of crawling.
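
Two of these points, respecting robots.txt and throttling requests, can be sketched with the standard library alone. The robots.txt content, the demo-crawler user agent, and the helper names below are made up for illustration; a real crawler would download the site's actual robots.txt first.

```python
import random
import time
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

USER_AGENT = "demo-crawler"  # hypothetical crawler name


def load_rules(robots_text):
    """Parse robots.txt text into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def allowed_urls(rules, urls):
    """Keep only the URLs that robots.txt lets this crawler fetch."""
    return [url for url in urls if rules.can_fetch(USER_AGENT, url)]


def polite_delay(rules, default_seconds=1.0, jitter_seconds=0.5):
    """Seconds to wait: the site's Crawl-delay (or a default) plus random jitter."""
    base = rules.crawl_delay(USER_AGENT) or default_seconds
    return base + random.uniform(0, jitter_seconds)


def crawl(rules, urls, fetch):
    """Fetch the allowed URLs one by one, sleeping between requests."""
    results = []
    for url in allowed_urls(rules, urls):
        results.append(fetch(url))
        time.sleep(polite_delay(rules))
    return results


rules = load_rules(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/products/1",
    "https://example.com/admin/users",
]
print(allowed_urls(rules, candidates))
```

The crawl() function takes the fetch callable as a parameter, so it can be paired with session.get from the earlier examples.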

6. Summary

This article covered the following core questions:

  1. Why Session and Cookie are needed: HTTP is stateless, so the server does not remember who you are.
  2. What a cookie is: a small note stored locally in the browser that carries an identity tag
  3. What a Session is: a server-side user file that holds the sensitive data
  4. How crawlers deal with them: use requests.Session() to manage cookies automatically, and pickle to persist them
  5. Security and compliance: don't get blocked and don't break the law; that is the bottom line

A solid understanding of how Session and Cookie work is essential for anyone getting started with crawlers. Once you have mastered it, you will no longer be stuck on the problem of "cannot get data after logging in".


Reference resources

  1. MDN Web Docs - HTTP Cookies
  2. OWASP Session Management Cheat Sheet
  3. requests official documentation - Session Objects