Modern Selenium Crawler Tutorial (2024)

Have you ever run into this: fetching static pages with requests is extremely fast, but as soon as you hit a dynamic page with scroll-to-load content, slider CAPTCHAs, nested iframes, and real-time DOM updates, you are completely stuck? Don't worry, this article walks you through the new API of Selenium 4.

1. Core Value — Why choose Selenium 4?

Although rising stars such as Playwright and Puppeteer are gaining momentum, Selenium is still the preferred choice for crawlers that deal with complex web compatibility scenarios, legacy system automation, and cross-browser verification, for very practical reasons:

  • ✅ The most mature ecosystem: it supports all mainstream browsers and programming languages, and is backed by a large community;
  • ✅ The widest range of solutions: ready-made answers for anti-detection, Grid distribution, and continuous integration are easy to find;
  • ✅ Significant progress in 4.x: the API is more standardized, CDP (Chrome DevTools Protocol) is supported natively, and headless mode has become more stable.

2. Minimal environment setup (say goodbye to the nightmare of driver version matching)

There is no need to manually download a driver that exactly matches the browser version. With webdriver-manager, this painful step is history.

2.1 Install dependency packages

# Version 4.15.0 is recommended; it has proven very stable
pip install selenium==4.15.0 webdriver-manager tenacity

💡 tenacity will be used later for automatic retries, making the crawler more robust.

2.2 (Optional) Install Chrome browser

If Chrome is already installed, you can skip this step. If you are running on a Linux server, you can install it quickly from the command line:

# Debian/Ubuntu example
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo apt-get install -f   # automatically fix any missing dependencies

For Windows/macOS, just go to the official website to download Chrome or Edge.

3. Basic operations: the unified Selenium 4 style

Selenium 4 deprecated a batch of old APIs such as find_element_by_id (they were removed entirely in 4.3.0). You must now locate elements through the By class, which makes the code more standardized and more compatible across versions.

3.1 Start the browser with one click

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# ✅ Modern initialization: download a matching driver automatically
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")   # maximize the window so elements are easier to locate
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

try:
    driver.get("https://www.baidu.com")
    print(f"Page title: {driver.title}")   # prints: Page title: 百度一下,你就知道
finally:
    driver.quit()   # always quit, even on error, to avoid leftover processes

3.2 Locating and interacting with common elements

from selenium.webdriver.common.by import By

# Recommended locator priority: ID > NAME > CSS_SELECTOR > XPATH
search_box = driver.find_element(By.ID, "kw")
search_btn = driver.find_element(By.CSS_SELECTOR, "#su")

# Basic interaction
search_box.send_keys("Selenium 4 crawler")  # simulate typing
search_box.clear()                          # clear the input box
search_box.send_keys("Python crawler")      # type a new query
search_btn.click()                          # click the "百度一下" (search) button

4. Advanced features: handling difficult scenarios such as anti-crawling and iframes

Knowing the basics is not enough. The following techniques are what keep you out of trouble in real projects.

4.1 Explicit waits: throw away the unreliable time.sleep()

Hard-coded sleep times are not only inefficient, they can also be treated as a "robot signature" by anti-crawling systems. An explicit wait continues only once an element reaches a specific state, which is both faster and more stable.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # Wait up to 10 seconds for the first search result link to become clickable
    first_result = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "#content_left h3 a"))
    )
    first_result.click()
finally:
    driver.quit()
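
To make the idea behind explicit waits concrete, here is a minimal polling loop written with the standard library only. It is a simplified sketch of the pattern, not Selenium's actual implementation: WebDriverWait additionally ignores NoSuchElementException by default, polls every 0.5 seconds unless told otherwise, and raises Selenium's own TimeoutException. The names wait_until and element_ready are made up for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Demo without a browser: the fake "element" becomes ready on the third check.
checks = {"count": 0}


def element_ready():
    checks["count"] += 1
    return "ready" if checks["count"] >= 3 else None


outcome = wait_until(element_ready, timeout=2.0, poll_interval=0.01)
```

The loop returns as soon as the condition holds, so the script never waits longer than necessary, while a fixed time.sleep() always pays the full delay.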

4.2 Handling iframes and multiple windows

Nested iframes, links that open in a new tab: these scenarios are straightforward with Selenium 4's switch_to syntax.

# 1️⃣ Switch into an iframe (three ways: ID/NAME, element object, index)
iframe = driver.find_element(By.ID, "login-frame")
driver.switch_to.frame(iframe)
# Work inside the iframe
driver.find_element(By.NAME, "username").send_keys("test")
# Always remember to return to the main document!
driver.switch_to.default_content()

# 2️⃣ Switch to a newly opened window
main_window = driver.current_window_handle   # remember the current handle
driver.find_element(By.LINK_TEXT, "New tab link").click()
# Loop over all handles and switch to the one that is not the main window
for handle in driver.window_handles:
    if handle != main_window:
        driver.switch_to.window(handle)
        break
# When finished with the new window, close it and switch back to the main window
driver.close()
driver.switch_to.window(main_window)
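
For several layers of nested iframes, the same switch has to be repeated once per layer, from the outside in, because find_element only searches the frame you are currently in. A small helper can keep that tidy. This is a sketch: the function name enter_nested_frames and the frame IDs in the usage comment are hypothetical, and the function only relies on the find_element and switch_to calls shown above.

```python
def enter_nested_frames(driver, frame_locators):
    """Switch into a chain of nested iframes, outermost first.

    Each locator is a (by, value) tuple such as (By.ID, "outer-frame").
    The chain always starts from the top-level document so it is predictable.
    """
    driver.switch_to.default_content()
    for by, value in frame_locators:
        # The lookup happens inside the frame entered in the previous iteration.
        frame = driver.find_element(by, value)
        driver.switch_to.frame(frame)


# Usage with a real driver (the frame IDs are hypothetical):
# enter_nested_frames(driver, [(By.ID, "outer-frame"), (By.ID, "login-frame")])
# driver.find_element(By.NAME, "username").send_keys("test")
# driver.switch_to.default_content()
```

If you only need to go up one level rather than all the way out, driver.switch_to.parent_frame() does that.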

4.3 Anti-detection in 2024: disguising as a real browser

Most anti-crawling systems identify you by detecting automation-specific markers (such as navigator.webdriver). With the help of CDP, Selenium 4 can hide many of these markers.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()

# 1. Hide the automation flags
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

# 2. Use a real user's User-Agent (can also be randomized with the fake_useragent library)
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

# 3. Inject a script that overrides navigator.webdriver and other probed properties
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5]
        });
    """
})

⚠️ Remember: the anti-crawling arms race keeps escalating. If you run into a site with strong detection, you can also combine this setup with undetected-chromedriver or similar specialized solutions.
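
It is worth checking that the disguise actually took effect before relying on it. The sketch below reads navigator.webdriver back through execute_script; the helper name automation_flag_hidden is made up for this example. Note that a script registered with Page.addScriptToEvaluateOnNewDocument only applies to documents loaded afterwards, so call the check after driver.get().

```python
def automation_flag_hidden(driver):
    """Return True if navigator.webdriver is not truthy on the current page."""
    flag = driver.execute_script("return navigator.webdriver")
    # JavaScript `undefined` arrives in Python as None; a disabled flag may be False.
    return not flag


# Usage with a real driver:
# driver.get("https://www.baidu.com")
# print(automation_flag_hidden(driver))
```

This only covers one marker. Passing the check does not mean a site cannot detect automation by other means.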

4.4 Robustness: automatic retries with tenacity

Network fluctuations and occasional slow-loading elements are routine. Hand-written try...except blocks make the code ugly and are easy to forget. The tenacity decorator lets you declare the retry policy instead.

from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

# Retry on common element exceptions: up to 3 attempts, 2 seconds apart
@retry(
    stop=stop_after_attempt(3),
    wait=wait_fixed(2),
    retry=retry_if_exception_type((TimeoutException, NoSuchElementException)),
)
def click_login_button(driver):
    driver.find_element(By.ID, "login-btn").click()

# Call it like any ordinary function
click_login_button(driver)
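
If the decorator feels like magic, the following standard-library sketch shows roughly the same policy (a fixed number of attempts, a fixed wait, only for selected exception types). It is an illustration of the idea, not how tenacity is implemented, and the names retry_on and flaky_fetch are made up. One behavioral difference: by default tenacity wraps the final failure in a RetryError unless you pass reraise=True, whereas this sketch always re-raises the original exception.

```python
import functools
import time


def retry_on(exceptions, attempts=3, wait_seconds=2.0):
    """Retry the decorated function when it raises one of `exceptions`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


# Demo: a flaky call that fails twice, then succeeds on the third attempt.
calls = {"count": 0}


@retry_on((ConnectionError,), attempts=3, wait_seconds=0.01)
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_fetch()
```

Exceptions outside the listed types are not retried and propagate immediately, which is usually what you want for genuine bugs.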

5. Performance optimization: make crawlers run faster and use fewer resources

5.1 Modern headless mode (a must on Chrome 112+)

The old headless mode rendered the DOM differently from a normal browser. Chrome 112 introduced --headless=new, which resolves this problem.

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")                         # new headless mode
options.add_argument("--window-size=1920,1080")                # set an explicit window size in headless mode
options.add_argument("--blink-settings=imagesEnabled=false")   # disable images (saves a lot of bandwidth)
options.add_argument("--disable-dev-shm-usage")                # work around shared-memory limits in Docker/Linux
options.add_argument("--no-sandbox")                           # likewise; commonly needed when running on a server

Pass these options via webdriver.Chrome(options=options) and the crawler can run silently and quickly in the background.
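
Not every flag belongs on every machine: the container-related ones are only useful on servers. One way to keep that explicit is to build the flag list from a few switches. This is just a sketch, and the function name build_chrome_args and its parameters are made up for this example. The flags themselves are the ones listed above.

```python
def build_chrome_args(headless=True, block_images=True, in_container=False):
    """Return the list of Chrome command-line flags for a crawl profile."""
    args = []
    if headless:
        args += ["--headless=new", "--window-size=1920,1080"]
    if block_images:
        args.append("--blink-settings=imagesEnabled=false")
    if in_container:
        # Only add these in Docker or similarly restricted Linux environments.
        args += ["--disable-dev-shm-usage", "--no-sandbox"]
    return args


local_args = build_chrome_args(in_container=False)
server_args = build_chrome_args(in_container=True)

# Usage with Selenium:
# options = webdriver.ChromeOptions()
# for arg in server_args:
#     options.add_argument(arg)
```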

5.2 Hybrid requests + Selenium: speed and dynamic rendering together

Many websites only need dynamic interaction during login and cookie acquisition; the data endpoints behind them can then be fetched at high speed with requests. This "heavy up front, light afterwards" combination remains a classic approach in 2024.

import requests
from selenium import webdriver

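# 1. Log in with Selenium first and collect the cookies
#    (service and options are the objects created in the earlier sections)
driver = webdriver.Chrome(service=service, options=options)
try:
    driver.get("https://需要登录的网站.com")   # placeholder URL: a site that requires login
    # ... perform the login (enter username and password, click login, wait for the redirect) ...
    # Convert the cookies into a dict that requests can use
    cookies = {c["name"]: c["value"] for c in driver.get_cookies()}
finally:
    driver.quit()

# 2. Fetch data at high speed with requests, carrying the cookies
session = requests.Session()
session.cookies.update(cookies)
response = session.get("https://需要登录的网站.com/api/data")
print(response.json())   # work with the JSON data directly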
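
The name-to-value dict above drops each cookie's domain and path, and the requests session will send its own default User-Agent. If a site is picky about either, a slightly more careful hand-off may help. The sketch below assumes a requests.Session-like object (anything with cookies.set and a headers mapping); the helper name transfer_session_state is made up, and it has to be called before driver.quit().

```python
def transfer_session_state(driver, session):
    """Copy cookies and the User-Agent from a Selenium driver to a requests session."""
    for cookie in driver.get_cookies():
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    # Reuse the browser's User-Agent so both clients look the same to the server.
    session.headers["User-Agent"] = driver.execute_script("return navigator.userAgent")


# Usage (before driver.quit()):
# session = requests.Session()
# transfer_session_state(driver, session)
```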

6. Summary: the right way to run a Selenium crawler

  1. Unified API: fully embrace Selenium 4.x By locators and CDP commands;
  2. Anti-detection first: hide the webdriver flag, randomize the User-Agent, and avoid rigid hard-coded waits;
  3. Maximum performance: use the new headless mode, disable image loading, and combine with requests where it fits;
  4. Stability first: replace time.sleep() with explicit waits, and use tenacity to give the code built-in retries;
  5. Flexible tool choice: if the target is just a modern website and you do not need cross-browser support, Playwright is also worth trying.

With the methods in this article, you can already build a Selenium crawler that is efficient, stable, and reasonably resistant to detection. If you run into tricky problems, feel free to continue the discussion in the comments.