Modern Selenium Crawler Tutorial (2024)
Have you ever been in this situation: fetching static pages with requests is extremely fast, but as soon as you hit a dynamic page with infinite-scroll loading, slider verification, nested iframes, or real-time DOM refreshes, you are flying blind? Don't worry. This article walks you through the new API of Selenium 4.
1. Core Value — Why choose Selenium 4?
Although newcomers such as Playwright and Puppeteer are gaining momentum, Selenium remains a common first choice for crawlers that face complex compatibility requirements, legacy-system automation, or cross-browser verification, for very practical reasons:
- ✅ The most mature ecosystem: supports all mainstream browsers and programming languages, and has a deep community;
- ✅ The most abundant solutions: ready-made solutions such as anti-detection, Grid distribution, and continuous integration are everywhere;
- ✅ 4.x has made significant progress: the API is more standardized, natively supports CDP (Chrome DevTools Protocol), and the headless mode has become more stable.
2. Minimal environment setup (say goodbye to the nightmare of driver version matching)
You no longer need to manually download a driver that exactly matches your browser version. With webdriver-manager, that pain point is history.
2.1 Install dependency packages
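The packages used in this article (selenium, webdriver-manager, and tenacity) can all be installed with a single pip install command.
💡 tenacity is used later for automatic retries, which makes the crawler more robust.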
2.2 (Optional) Install Chrome browser
If Chrome is already installed, skip this step. On a Linux server, Chrome can be installed from the command line.
For Windows/macOS, just go to the official website to download Chrome or Edge.
3. Basic operations: the unified Selenium 4 style
Selenium 4 deprecated a batch of old APIs (such as find_element_by_id), and later 4.x releases removed them. Elements must now be located through the By class, which makes the code more consistent and more portable across versions.
3.1 Start the browser with one click
3.2 Positioning and interaction of common elements
4. Advanced features: handling difficult scenarios such as anti-crawling and iframes
Knowing the basics is not enough. The following techniques are what keep you out of trouble in real projects.
4.1 Explicit waits: throw away the unreliable time.sleep()
Hard-coded sleeps are inefficient, and their fixed timing can be flagged as a "robot feature" by anti-crawling systems. An explicit wait continues only once an element reaches a specific state, which is both faster and more stable.
4.2 Handling iframes and multiple windows
Nested iframes, links that open in a new tab: these scenarios are straightforward with Selenium 4's switch_to syntax.
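The two helpers below sketch one way to use the switch_to API. The function names and the idea of passing in the set of already-known window handles are choices made for this example; switch_to.frame, switch_to.default_content, switch_to.window, and window_handles are the Selenium calls doing the work.

```python
def read_inside_iframe(driver, frame_locator, inner_locator) -> str:
    """Enter an iframe, read one element's text, and always switch back out."""
    frame = driver.find_element(*frame_locator)
    driver.switch_to.frame(frame)
    try:
        return driver.find_element(*inner_locator).text
    finally:
        # Return to the top-level document so later lookups are not
        # silently scoped to the iframe.
        driver.switch_to.default_content()


def switch_to_new_window(driver, known_handles) -> str:
    """Focus a window/tab that is not in `known_handles` and return its handle."""
    new_handles = [h for h in driver.window_handles if h not in known_handles]
    if not new_handles:
        raise RuntimeError("no new window or tab has been opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]
```

A typical flow records known = set(driver.window_handles) before clicking the link, calls switch_to_new_window(driver, known) afterwards, and later returns with driver.close() followed by driver.switch_to.window(original_handle). Selenium 4 can also open a tab itself with driver.switch_to.new_window("tab").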
4.3 Anti-detection in 2024: disguising as a real browser
Most anti-crawling systems identify you by checking for automation-specific markers (such as navigator.webdriver). With the help of CDP, Selenium 4 can hide the most common of these markers.
⚠️ Remember: anti-crawling is an arms race that keeps escalating. If you run into a website with strong detection, you can additionally use customized solutions such as undetected-chromedriver.
4.4 Robustness: automatic retries with tenacity
Network fluctuations and occasional element-loading delays are commonplace. Hand-written try...except blocks make the code ugly and are easy to forget or get wrong. The tenacity decorator lets you declare the retry policy instead.
5. Performance optimization: crawlers run faster and save resources
5.1 Modern headless mode (recommended for Chrome 112+)
The old headless mode rendered the DOM differently from a normal browser. The new mode, enabled with --headless=new (available since Chrome 109 and reworked in Chrome 112), closes that gap.
Pass these options to webdriver.Chrome(options=options), and the crawler runs silently and quickly in the background.
5.2 Combining requests + Selenium: speed plus dynamic rendering
Many websites need dynamic interaction only during login and cookie acquisition; the data endpoints that follow can be called directly with requests at full speed. This "heavy front end, light back end" combination is still a classic approach in 2024.
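The hand-off can be sketched as copying the browser's cookies and User-Agent into a requests.Session once the Selenium-driven login has finished. The function name session_from_driver is an assumption for this example; get_cookies(), execute_script(), and session.cookies.set() are the real calls involved.

```python
import requests


def session_from_driver(driver) -> requests.Session:
    """Copy cookies and the User-Agent from a Selenium driver into requests."""
    session = requests.Session()

    # Reuse the browser's User-Agent so both clients look the same to the server.
    user_agent = driver.execute_script("return navigator.userAgent;")
    session.headers.update({"User-Agent": user_agent})

    # driver.get_cookies() returns a list of dicts with name/value/domain/path.
    for cookie in driver.get_cookies():
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

After logging in with Selenium, session = session_from_driver(driver) followed by driver.quit() frees the browser, and the remaining data endpoints can be fetched with session.get(...). This only works while the site accepts those cookies outside the browser; some sites bind sessions to additional signals.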
6. Summary: the right way to approach a Selenium crawler
- Unified API: fully embrace Selenium 4.x By locators and CDP commands;
- Anti-detection first: hide the webdriver flag, randomize the User-Agent, and avoid rigid hard-coded waits;
- Maximum performance: use the new headless mode, disable image loading, and combine with requests where it fits;
- Stability first: replace time.sleep() with explicit waits, and use tenacity to give the code built-in retries;
- Flexible tool choice: if the target is only modern websites and cross-browser support is not needed, Playwright is also worth trying ~
With the methods in this article, you can build a Selenium crawler that is efficient, stable, and reasonably resistant to detection. If you run into tricky problems, feel free to continue the discussion in the comments~

