Playwright Crawler Tutorial (Updated for 2024)
Still struggling to match Selenium driver versions? Fed up with Puppeteer being built mainly around Chromium? If you need to scrape dynamically rendered pages, Microsoft's open-source Playwright is well worth a try. After four years of iteration it has matured into one of the leading browser automation tools, and writing dynamic crawlers with it is refreshingly painless.
1. Why choose Playwright for your crawler?
Compared with other tools, Playwright addresses most of the core pain points of crawler engineers head-on:
- Full engine coverage: Chromium, Firefox, and WebKit (the Safari engine) are all supported, so you can test anti-crawling strategies across browsers without rewriting your code.
- Consistent across platforms: on Windows, macOS, and Linux the browsers and drivers install with a single command. Goodbye to the miserable experience of failed chromedriver.exe downloads.
- Smart auto-waiting: no need to litter the script with time.sleep() calls or complex wait conditions. Playwright waits until an element is ready for interaction before acting on it, which makes scripts far less brittle.
- Anti-crawling friendly: built-in features such as device emulation, geolocation spoofing, and network interception make it easy to handle basic anti-crawling measures.
- Strong performance: generally faster than Selenium and more full-featured than Puppeteer, with an asynchronous API that is well suited to high-concurrency collection.
2. Quick installation and configuration
Environment requirements
Python 3.8 or later, on any mainstream operating system (Windows, macOS, or Linux).
Accelerated installation via mirrors (for users in China)
If your connection to the official sources is fast, you can install directly from them. Users in mainland China, however, will most likely get stuck while the dependencies download, so a mirror is recommended for both the pip package and the browser binaries.
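The two installation steps (the pip package, then the browser binaries) can be sketched in Python as follows. This is only an illustration: the two mirror URLs are assumptions that should be verified or replaced with a mirror you trust, and the helper names build_install_commands and install are made up for this example. PLAYWRIGHT_DOWNLOAD_HOST is the environment variable Playwright reads to redirect browser downloads.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional

# Assumed mirror endpoints; verify them (or substitute your own) before use.
PYPI_MIRROR = "https://pypi.tuna.tsinghua.edu.cn/simple"
BROWSER_MIRROR = "https://npmmirror.com/mirrors/playwright"


def build_install_commands(pypi_mirror: Optional[str] = None) -> List[List[str]]:
    """Return the two install steps: the pip package, then the browser binaries."""
    pip_cmd = [sys.executable, "-m", "pip", "install", "playwright"]
    if pypi_mirror:
        pip_cmd += ["-i", pypi_mirror]  # -i is pip's --index-url option
    browsers_cmd = [sys.executable, "-m", "playwright", "install"]
    return [pip_cmd, browsers_cmd]


def install(use_mirror: bool = False) -> None:
    """Run both steps, optionally routing the downloads through the mirrors."""
    env: Dict[str, str] = dict(os.environ)
    if use_mirror:
        # Redirects the browser-binary downloads to the mirror host.
        env["PLAYWRIGHT_DOWNLOAD_HOST"] = BROWSER_MIRROR
    for cmd in build_install_commands(PYPI_MIRROR if use_mirror else None):
        subprocess.run(cmd, check=True, env=env)
```

Calling install() uses the official sources, while install(use_mirror=True) goes through the mirrors. The same commands can of course be typed directly in a terminal.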
3. Basic introduction: synchronous vs. asynchronous
Playwright provides both a synchronous and an asynchronous API. The synchronous API is recommended for everyday small scripts because the logic reads top to bottom; for jobs that need to run dozens of pages at once, switch to the asynchronous API, which is significantly more efficient.
Synchronous mode: open Baidu with a few lines of code
The following is the simplest getting-started example: open the Baidu homepage, print the title, and save a screenshot:
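A minimal sketch of such a script, assuming Chromium has been installed and using a made-up function name (open_baidu) and output file name (baidu.png):

```python
def open_baidu(screenshot_path: str = "baidu.png") -> str:
    """Open the Baidu homepage, print its title, and save a screenshot."""
    # Imported inside the function so the module loads even where
    # Playwright is not installed yet.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.baidu.com")
        title = page.title()
        print(title)
        page.screenshot(path=screenshot_path)
        # Leaving the with block stops Playwright and the browser it launched.
        return title
```

Run it with open_baidu(); pass headless=False to launch() instead if you want to watch the browser window.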
The with statement manages the browser life cycle automatically, so there is no need to call close() by hand and the code stays clean and safe.
Asynchronous mode: more suitable for batch operations
If you want to collect multiple pages at the same time, asynchronous mode helps you make full use of the network and CPU:
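One way to do this is asyncio.gather with a semaphore that caps how many pages are open at once. The sketch below is an illustration under assumptions: the function names, the max_pages parameter, and the choice of Chromium are not from the original article.

```python
import asyncio
from typing import Dict, List


async def fetch_title(context, url: str, limit: asyncio.Semaphore) -> str:
    """Open one page, return its title, and always close the page."""
    async with limit:  # cap the number of pages open at the same time
        page = await context.new_page()
        try:
            await page.goto(url)
            return await page.title()
        finally:
            await page.close()


async def fetch_titles(urls: List[str], max_pages: int = 5) -> Dict[str, str]:
    """Fetch the titles of all urls concurrently in one browser context."""
    # Imported lazily so fetch_title can be reused without Playwright installed.
    from playwright.async_api import async_playwright

    limit = asyncio.Semaphore(max_pages)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        titles = await asyncio.gather(
            *(fetch_title(context, url, limit) for url in urls)
        )
        await browser.close()
    return dict(zip(urls, titles))
```

A typical call would be asyncio.run(fetch_titles(list_of_urls)). Lowering max_pages is the simplest way to be gentler on the target site.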
4. A shortcut for the lazy: the code generation tool
Is locating elements the most annoying part of writing a crawler? Playwright ships with the codegen recorder, which lets you say goodbye to hunting for selectors by hand. Whatever you do in the browser is turned into the corresponding Python code automatically.
For example, to record the steps for logging in to Bilibili ("site B"), a single command is all you need:
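The original command is not preserved here, so the following is a reconstruction under assumptions: it builds a playwright codegen invocation with Firefox as the browser, and the target URL and the output file name login_script.py are placeholders. Typing the same arguments in a terminal has the same effect.

```python
import subprocess
import sys
from typing import List


def codegen_command(
    url: str,
    browser: str = "firefox",
    output: str = "login_script.py",  # placeholder output file name
) -> List[str]:
    """Build the command line that starts the codegen recorder."""
    return [
        sys.executable, "-m", "playwright", "codegen",
        "--target", "python",  # emit synchronous Python code
        "-b", browser,         # browser used for the recording session
        "-o", output,          # save the generated script to this file
        url,
    ]


def record(url: str) -> None:
    """Open the recorder; it returns once the recording windows are closed."""
    subprocess.run(codegen_command(url), check=True)
```

For instance, record("https://www.bilibili.com") would start a session on the Bilibili homepage (the exact URL is an assumption).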
After it runs, a Firefox window and the code generation window open side by side. Enter your account details and click to log in; the script on the right updates in real time and can be used directly after minor modifications.
5. Core functions: essential operations for crawling dynamically rendered pages
Element positioning
Playwright supports a variety of positioning strategies, and newbies can start with the most intuitive way:
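The most common strategies are sketched below. Every selector and label in this snippet is hypothetical and has to be adapted to the target page; page is assumed to be a Playwright Page object from the synchronous API.

```python
def demo_locators(page) -> dict:
    """Return one locator per strategy; nothing is queried until it is used."""
    return {
        # CSS selector: usually the first thing to try
        "by_css": page.locator("div.result h3"),
        # XPath: handy when the structure matters more than the classes
        "by_xpath": page.locator("xpath=//a[@class='next']"),
        # Visible text
        "by_text": page.get_by_text("Next page", exact=True),
        # ARIA role plus accessible name: tends to survive markup changes
        "by_role": page.get_by_role("button", name="Search"),
        # Placeholder text of an input box
        "by_placeholder": page.get_by_placeholder("Enter a keyword"),
    }
```

A locator is lazy: it is resolved against the page only when you call something like click(), fill(), or all_inner_texts() on it, for example demo_locators(page)["by_css"].all_inner_texts().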
Waiting mechanism (adding explicit waits when auto-waiting is not enough)
Although Playwright waits automatically, you may still need explicit waits in scenarios such as infinite scrolling and asynchronously loaded content:
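For infinite scrolling, one possible approach is to keep scrolling until the number of matching items stops growing. The sketch below assumes a synchronous Page object; the function name, the scroll distance, and the default pause are illustrative choices, not values from the original article.

```python
def scroll_until_loaded(
    page,
    item_selector: str,
    max_rounds: int = 10,
    pause_ms: int = 1000,
) -> int:
    """Scroll down until no new items appear; return the final item count."""
    # Explicit wait: block until the first batch of items is in the DOM.
    page.wait_for_selector(item_selector)
    seen = page.locator(item_selector).count()
    for _ in range(max_rounds):
        page.mouse.wheel(0, 10000)       # scroll down to trigger lazy loading
        page.wait_for_timeout(pause_ms)  # give the next batch time to arrive
        count = page.locator(item_selector).count()
        if count == seen:                # nothing new was loaded: stop
            break
        seen = count
    return seen
```

For pages that load data once after navigation rather than on scroll, page.wait_for_load_state("networkidle") or a wait_for_selector call on the final element is often enough.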
Network interception: block images and ads for a significant speedup
Dynamic pages often issue a large number of useless image and ad requests. Intercepting them with route can greatly speed up collection:
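A minimal sketch of such a filter, assuming a synchronous Page object. The blocked resource types and the ad-host keywords are example values, and the helper names are made up for illustration.

```python
# Example values only; tune both lists for the site you are collecting.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}
BLOCKED_URL_KEYWORDS = ("doubleclick.net", "googlesyndication.com")


def should_block(resource_type: str, url: str) -> bool:
    """Decide whether a request is dead weight for the crawler."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    return any(keyword in url for keyword in BLOCKED_URL_KEYWORDS)


def handle_route(route) -> None:
    """Abort unwanted requests and let everything else through."""
    request = route.request
    if should_block(request.resource_type, request.url):
        route.abort()
    else:
        route.continue_()


def enable_blocking(page) -> None:
    """Register the handler for every request the page makes."""
    page.route("**/*", handle_route)
```

Call enable_blocking(page) before page.goto(...) so that the very first navigation already benefits from the filter.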
6. Practical anti-crawling skills (basic version)
Simulate mobile devices
Many websites apply looser anti-crawling measures to mobile traffic. Playwright ships with a rich set of built-in device profiles, so the disguise takes a single step:
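The built-in profiles live in the devices dictionary of the Playwright object. The sketch below assumes the synchronous API; the helper name new_mobile_context and the choice of "iPhone 13" are illustrative.

```python
def new_mobile_context(playwright, browser, device_name: str = "iPhone 13"):
    """Create a browser context that emulates one of the built-in devices."""
    # Each profile bundles the user agent, viewport, scale factor,
    # touch support, and the mobile flag.
    device = playwright.devices[device_name]
    return browser.new_context(**device)
```

Inside a with sync_playwright() as p: block this could be used as browser = p.webkit.launch(), context = new_mobile_context(p, browser), page = context.new_page(). Chromium also works; Firefox does not support the mobile flag.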
Set geographical location
Besides using an IP proxy, you can simulate a geographical location directly through the browser context, which helps with sites that check the visitor's region:
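A minimal sketch, assuming a synchronous Browser object. The helper name and the default locale and time zone are illustrative assumptions; matching them to the spoofed coordinates simply keeps the overall profile consistent.

```python
def new_located_context(
    browser,
    latitude: float,
    longitude: float,
    locale: str = "zh-CN",
    timezone_id: str = "Asia/Shanghai",
):
    """Create a context that reports the given coordinates to web pages."""
    return browser.new_context(
        geolocation={"latitude": latitude, "longitude": longitude},
        permissions=["geolocation"],  # pre-grant the permission prompt
        locale=locale,
        timezone_id=timezone_id,
    )
```

For example, new_located_context(browser, 39.9042, 116.4074) makes pages opened from that context see a position in central Beijing. This changes only what the browser reports; the IP address still reveals the real network location unless a proxy is used as well.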
7. Frequently asked questions and solutions
Q1: Element click fails?
- Check whether the element is covered by a pop-up or an advertisement.
- Try a forced click: page.click(selector, force=True)
- Scroll the element into view first: page.locator(selector).scroll_into_view_if_needed()
Q2: Page loading timeout?
- Extend the timeout (in milliseconds): page.goto(url, timeout=60000)
- Block unnecessary resources (such as images and advertisements) to reduce waiting.
- Check whether the target website requires a proxy configuration.
Q3: How to deal with CAPTCHAs?
- For complex CAPTCHAs (slider, click, puzzle), connecting to a third-party recognition service is recommended.
- Use a persistent browser context to keep the login state and cookies, which cuts down on repeated logins and lowers the chance of triggering a CAPTCHA.
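The persistent-context idea can be sketched as follows, assuming the synchronous API. The helper name and the profile directory ./user_data are placeholders.

```python
def open_persistent_page(
    playwright,
    user_data_dir: str = "./user_data",  # placeholder profile directory
    headless: bool = False,
):
    """Launch Chromium with an on-disk profile and return (context, page)."""
    # Cookies and local storage are written to user_data_dir, so a login
    # performed once is still there on the next run.
    context = playwright.chromium.launch_persistent_context(
        user_data_dir, headless=headless
    )
    # A persistent context normally starts with one blank page; reuse it.
    page = context.pages[0] if context.pages else context.new_page()
    return context, page
```

On the first run, log in by hand in the visible window; later runs that point at the same directory start out logged in. Call context.close() when finished so the profile is flushed to disk.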
8. Resource recommendation
Don't worry if you run into problems: the official documentation is detailed, and there is plenty of Chinese-language material as well.
Playwright is not only good for dynamic crawlers; it is also a strong player in automated testing. I hope this tutorial helps you get started quickly. There are plenty of more advanced techniques waiting for you to explore!

