Modern Web Crawling Techniques: Simulated Login with Session + Cookie Authentication

1. Preface

If you have just started learning web crawling, you have probably hit this snag: public pages are easy to fetch, but as soon as you try to reach your orders, favorites, or member area, you get either a 401 Unauthorized or a redirect to a big "Please log in first" prompt.

Many people panic at this point. In fact, Session + Cookie authentication, one of the oldest and most widely used login schemes, is the easiest and most reliable place for a beginner to start.

This article walks through a complete simulated login step by step on a real practice site, from principles to code. It closes with best practices and a list of pitfalls from production projects, so that you know what to do when you meet similar scenarios.


2. Technical preparation

First set up the environment; the dependencies are simple.

2.1 Environment requirements

  • Python 3.8+ (3.10 is recommended for better requests compatibility)
  • A modern browser (Chrome or Edge will work, for developer tool debugging)
  • No need to download ChromeDriver manually; webdriver-manager will handle it automatically later

2.2 Install dependencies

pip install requests selenium webdriver-manager beautifulsoup4

beautifulsoup4 is only used later to inspect the page and verify the login result. You can skip it if you do not need to read the page content.
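
To confirm that the install worked, a quick check along the following lines may help. This is a minimal sketch using only the standard library; installed_versions and report are hypothetical helper names, and the package list simply mirrors the pip command above.

```python
from importlib import metadata

# Distribution names exactly as passed to pip above.
REQUIRED = ["requests", "selenium", "webdriver-manager", "beautifulsoup4"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report(names=REQUIRED):
    """Print one line per package so missing ones stand out."""
    for name, version in installed_versions(names).items():
        print(f"{name:20} {version or 'NOT INSTALLED'}")
```

Calling report() prints one line per dependency; anything marked NOT INSTALLED needs another pip install.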


3. Analyze the target website

The practice site used here is Cui Qingcai's simulated-login demo site:

https://login2.scrape.center/

It has no CAPTCHA and no complex JS encryption, but it reproduces the full flow of a real Session + Cookie login, which makes it well suited to beginners.

3.1 Quick overview of site behavior

  • Traditional MVC architecture; pages are rendered directly by the backend
  • Pure Session + Cookie authentication; no modern schemes such as OAuth or JWT
  • After a successful login, the server responds with a 302 redirect to the homepage https://login2.scrape.center/, which shows a list of movies

3.2 Breaking down the login flow (scout with DevTools first)

The worst habit in crawler work is writing code before looking. First open Chrome's developer tools (F12 → Network) and walk through the login flow step by step:

  1. Check "Preserve log" (important: otherwise the log is cleared on redirect)
  2. Enter the account and password admin / admin manually and click the login button
  3. Set the request type filter in the Network panel to Doc so that only document requests are shown

You will see three key entries, which together form the complete login chain:

Step  Method/Status  URL     Function
1     POST           /login  Submit the username and password for backend verification
2     302            /       After verification passes, the backend redirects to the homepage
3     GET            /       The redirected homepage request must carry the session cookie issued by the backend to render normally
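
The same chain can also be read from Python rather than DevTools. The following is a minimal sketch, assuming a response object shaped like requests.Response (with history, request.method, url, status_code and headers); describe_chain is a hypothetical helper name, not part of requests.

```python
def describe_chain(response):
    """Return one line per hop: method, URL, status, and whether a cookie was set.

    `response.history` is expected to hold the intermediate redirect
    responses, oldest first, as requests.Response does.
    """
    lines = []
    for hop in [*response.history, response]:
        line = f"{hop.request.method} {hop.url} -> {hop.status_code}"
        if "Set-Cookie" in hop.headers:
            line += " (Set-Cookie)"
        lines.append(line)
    return lines
```

Passing it the result of a requests.post call to the login URL, with redirects left enabled, should list the POST hop followed by the redirected GET.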

Click the first POST request and look at its Response Headers. You will find one very important response header:

Set-Cookie: sessionid=abc123xyz; Path=/; HttpOnly

This is the "access card" issued by the backend: sessionid. From then on, the browser automatically sends it back in the Cookie header on every request to the same domain.

Key point: in Session + Cookie mode, the backend issues the identifier through Set-Cookie only once, when login succeeds; all later requests rely on the browser carrying that cookie. The crawler's job is to imitate the browser: save this credential and send it back unchanged on subsequent requests.
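
To make the round trip concrete, here is a small sketch using only the standard library's http.cookies module. It parses the Set-Cookie value shown above and rebuilds the Cookie header a client would send back; parse_set_cookie and to_cookie_header are hypothetical helper names.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Parse one Set-Cookie header value into {name: (value, attributes)}."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {
        # Keep only the attributes that were actually present in the header.
        name: (morsel.value, {k: v for k, v in morsel.items() if v})
        for name, morsel in jar.items()
    }


def to_cookie_header(parsed):
    """Build the Cookie request header: only name=value pairs go back."""
    return "; ".join(f"{name}={value}" for name, (value, _attrs) in parsed.items())


parsed = parse_set_cookie("sessionid=abc123xyz; Path=/; HttpOnly")
print(to_cookie_header(parsed))  # sessionid=abc123xyz
```

Note that attributes such as Path and HttpOnly tell the client how to store the cookie; they are not echoed back to the server.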


4. Three ways to implement simulated login

We go through them one by one, from the pitfall beginners most often fall into to the approach recommended for production.

4.1 Error-prone entry-level version: Manually manage cookies (❌ not recommended)

Many beginners do this: send a POST request, extract the cookie from Set-Cookie, and then manually attach it to subsequent GET requests. It works, but each request behaves like a freshly opened browser window, and managing cookies by hand is tedious and error-prone.

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://login2.scrape.center/login"
INDEX_URL = "https://login2.scrape.center/"
USERNAME = "admin"
PASSWORD = "admin"

# 1. Send the login request with redirects disabled (otherwise the returned response is not the 302 that carries Set-Cookie)
login_resp = requests.post(
    LOGIN_URL,
    data={"username": USERNAME, "password": PASSWORD},
    allow_redirects=False
)

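# 2. Extract the cookies manually
cookies = login_resp.cookies

# 3. Request the homepage with those cookies
index_resp = requests.get(INDEX_URL, cookies=cookies)

# Verify that the login succeeded
soup = BeautifulSoup(index_resp.text, "html.parser")
print(soup.find("h2", class_="mb-3"))  # expected on success: the welcome heading (the original gives <h2>欢迎回来,admin</h2>, "Welcome back, admin")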

The biggest problem with this approach is not whether it runs but its maintenance cost: cookies must be refreshed by hand when they expire, and every request must pass them explicitly. As the request chain grows, the code turns into a mess.

4.2 Recommended version: requests.Session manages cookies automatically (✅ recommended)

requests ships with its own Session object, which behaves like a persistent browser session. It automatically tracks cookies, default request headers, and the connection pool. You log in once, and every later request can go through the same session.

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://login2.scrape.center/login"
INDEX_URL = "https://login2.scrape.center/"
USERNAME = "admin"
PASSWORD = "admin"

# 1. Create a persistent session
session = requests.Session()

# 2. Set a common User-Agent for the whole session (instead of the library default)
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/128.0.0.0 Safari/537.36"
    )
})

# 3. Send the POST through the Session (no need to disable redirects; the Session keeps cookies across them)
login_resp = session.post(
    LOGIN_URL,
    data={"username": USERNAME, "password": PASSWORD}
)
print("Status code after login:", login_resp.status_code)  # normally 200

# 4. Keep using the same Session for the homepage (cookies are sent automatically)
index_resp = session.get(INDEX_URL)
soup = BeautifulSoup(index_resp.text, "html.parser")
print(soup.find("h2", class_="mb-3"))

Why is this method recommended?

  • Simple code, no need to manage cookies manually
  • State is maintained automatically across requests, so nothing gets dropped
  • Custom request headers, proxies and other settings are easy to add and apply to the whole session

For most small and medium-sized websites and internal admin backends, requests.Session is all you need for login.
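
How session-level state gets merged into each request can be seen without touching the network. The sketch below assumes the login has already happened and sets the cookie by hand (the sessionid value is the placeholder from section 3.2); it then uses Session.prepare_request to build a request without sending it.

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/128.0.0.0 Safari/537.36"
    )
})
# Assumption for illustration: pretend the server already issued this cookie.
session.cookies.set("sessionid", "abc123xyz",
                    domain="login2.scrape.center", path="/")

# prepare_request() merges session headers and cookies but sends nothing.
prepared = session.prepare_request(
    requests.Request("GET", "https://login2.scrape.center/")
)
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])

# A request to another domain does not receive the cookie.
other = session.prepare_request(requests.Request("GET", "https://example.com/"))
print("Cookie" in other.headers)
```

The last line illustrates that the cookie jar respects the cookie's domain, which is the same scoping a browser applies.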

4.3 Advanced combined version: Selenium obtains the cookies, Requests does the crawling (✅ essential for advanced use)

Some websites put image CAPTCHAs, slider checks, fingerprint detection, or complex JS encryption in front of the login. In that case, forging the login request directly with requests is very difficult, if not impossible.

Here you can take a "hybrid engine" approach:

  1. First use Selenium to open a real browser and simulate user actions to complete the login
  2. After logging in, export the cookies from the browser
  3. Feed the cookies into requests.Session and hand all subsequent crawling to lightweight HTTP requests

This way you get past the complex human verification and still enjoy the crawling speed of requests.

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

LOGIN_URL = "https://login2.scrape.center/login"
INDEX_URL = "https://login2.scrape.center/"
USERNAME = "admin"
PASSWORD = "admin"

# ========== Step 1: log in with Selenium and export the cookies ==========
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()

try:
    driver.get(LOGIN_URL)
    time.sleep(1)  # wait for the page to load

    # Simulate typing and clicking
    driver.find_element(By.CSS_SELECTOR, 'input[name="username"]').send_keys(USERNAME)
    driver.find_element(By.CSS_SELECTOR, 'input[name="password"]').send_keys(PASSWORD)
    driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
    time.sleep(2)  # wait for the redirect to finish

    if driver.current_url == INDEX_URL:
        print("Selenium simulated login succeeded!")
        selenium_cookies = driver.get_cookies()
    else:
        print("Selenium simulated login failed!")
        selenium_cookies = []
finally:
    driver.quit()

# ========== Step 2: load the cookies into a Requests Session ==========
if selenium_cookies:
    session = requests.Session()
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/128.0.0.0 Safari/537.36"
        )
    })

    # Inject the cookies one by one
    for cookie in selenium_cookies:
        session.cookies.set(
            name=cookie["name"],
            value=cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/")
        )

    # Verify, then print the first 3 movie titles on the homepage
    index_resp = session.get(INDEX_URL)
    soup = BeautifulSoup(index_resp.text, "html.parser")
    print("First 3 movies on the homepage:")
    for i, movie in enumerate(soup.find_all("h5", class_="card-title")[:3], 1):
        print(f"{i}. {movie.text.strip()}")

Applicable scenarios: the login stage requires complex human verification, but the pages behind the login are mostly static or load data with simple asynchronous requests.

Performance advantage: Selenium handles only the hardest part, the login; large-scale crawling still goes through requests, which is much faster than driving a browser for everything.
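
The cookie hand-off can also carry over the secure flag and expiry time that Selenium reports. The following is a sketch under assumptions: session_from_selenium_cookies is a hypothetical helper name, and the input is taken to be the list of dicts returned by driver.get_cookies() (keys such as name, value, domain, path, secure, httpOnly, expiry). It relies on requests.cookies.create_cookie to build each cookie.

```python
import requests
from requests.cookies import create_cookie


def session_from_selenium_cookies(selenium_cookies, user_agent=None):
    """Build a requests.Session from the dicts returned by driver.get_cookies()."""
    session = requests.Session()
    if user_agent:
        session.headers["User-Agent"] = user_agent
    for c in selenium_cookies:
        session.cookies.set_cookie(create_cookie(
            name=c["name"],
            value=c["value"],
            domain=c.get("domain", ""),
            path=c.get("path", "/"),
            secure=c.get("secure", False),
            expires=c.get("expiry"),  # Unix timestamp; absent for session cookies
            rest={"HttpOnly": c.get("httpOnly", False)},
        ))
    return session
```

Keeping the expiry means the cookie jar stops sending a cookie once it has expired, which makes a stale login easier to notice.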


5. Pitfall avoidance guide & production-level best practices

The sample site has no anti-crawling measures, but real websites are rarely so "gentle". The following lessons can save you a lot of detours.

5.1 Cookie management

  1. Persistent storage: serialize the cookies obtained at login to a JSON file or a database. On the next start, try loading them first; if they have not expired, reuse them and avoid logging in repeatedly.
  2. Cookie pool: for large collection volumes, prepare several accounts in advance, each with its own set of valid cookies, and rotate among them at random while crawling. This lowers the risk of any single account being blocked and improves overall stability.
  3. Active validity checks: sessions on many sites have a fixed lifetime (for example, 24 hours). Before each batch, request an endpoint that is only reachable when logged in; if it returns 401, trigger the login flow again automatically.
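
Persistence and the validity check from the list above might be sketched as follows. The assumptions: cookies are handled as a list of dicts in the shape Selenium's driver.get_cookies() returns (with an optional expiry timestamp), the session argument is a requests.Session or anything with a compatible get method, and all three function names are hypothetical.

```python
import json
import time
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts to a JSON file."""
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def load_cookies(path, now=None):
    """Load saved cookies, dropping any whose 'expiry' timestamp has passed.

    Returns [] when the file is missing, so the caller can fall back to a
    fresh login.
    """
    file = Path(path)
    if not file.exists():
        return []
    now = time.time() if now is None else now
    cookies = json.loads(file.read_text(encoding="utf-8"))
    return [c for c in cookies if c.get("expiry") is None or c["expiry"] > now]


def session_is_valid(session, protected_url):
    """Probe a login-only URL without following redirects.

    Anything other than 200 (a 401, or a 302 to the login page) is treated
    as an expired session.
    """
    resp = session.get(protected_url, allow_redirects=False, timeout=10)
    return resp.status_code == 200
```

A typical start-up sequence would load the cookies, inject them into a session, call session_is_valid, and only run the login flow when that returns False. Redirects are disabled in the probe because some sites answer an expired session with a redirect to the login page rather than a 401.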

5.2 Essential details for countering anti-crawling measures

  • Always change the User-Agent: a default UA such as python-requests/x.x.x all but announces that you are a script; replace it with a real browser's value.
  • Do not forget the Referer: if the business logic depends on where the request came from, for example navigating from a list page to a detail page, set the Referer request header to the list page URL.
  • Random delays: add time.sleep(random.uniform(1, 3)) between consecutive requests to mimic a human browsing rhythm. Back-to-back requests with no pause easily trigger rate limits.
  • Control crawl speed and total volume: even with delays, avoid requesting hundreds of thousands of records in a short period. This eases the load on the server and lowers the risk of your own IP being blocked.
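
The first three points can be combined in one small wrapper. This is a sketch, not a complete rate limiter: polite_get is a hypothetical helper name, the UA string is the one used in the earlier listings, and session is assumed to be a requests.Session or any object with a compatible get method.

```python
import random
import time

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/128.0.0.0 Safari/537.36"
)


def polite_get(session, url, referer=None, delay_range=(1.0, 3.0)):
    """Sleep a random interval, then GET `url` with browser-like headers."""
    time.sleep(random.uniform(*delay_range))
    headers = {"User-Agent": BROWSER_UA}
    if referer:
        # Tell the server which page the request nominally came from.
        headers["Referer"] = referer
    return session.get(url, headers=headers, timeout=10)
```

When moving from a list page to a detail page, the list page URL goes in referer; the total-volume limit from the last point still has to be enforced by the calling loop.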

6. Summary

Session + Cookie authentication may be an "old" technology, but it is stable, easy to debug, and widely applicable, and it is still the login method of choice for a large number of websites.

For beginners, the recommended learning order is:

  1. Master requests.Session first; it covers about 80% of the login scenarios you will meet;
  2. When you run into CAPTCHAs or complex JS, bring in the Selenium cookies + Requests crawling combination.

Once you can trace the "authentication credential" thread through the request chain, most sites that "require login" will stop being a problem.


Extended Reading (Stay tuned):

  • Simulated login with OAuth 2.0 authentication
  • Handling JWT authentication in crawlers
  • Crawling strategy for WebAssembly websites