Playwright crawler tutorial (latest 2024 edition)

Still struggling to match Selenium driver versions? Fed up with Puppeteer being largely limited to Chromium? If you need to scrape dynamically rendered pages, Microsoft's open-source Playwright is well worth a try. After four years of iteration it has matured into a first-rate automation tool, and it makes writing crawlers for dynamic pages far more pleasant.


1. Why choose Playwright for crawling?

Compared with other tools, Playwright addresses the core pain points of crawler engineers directly:

  • Full engine coverage: Chromium, Firefox, and WebKit (the Safari engine) are all supported, so testing an anti-crawling strategy against a different browser does not require rewriting code.
  • Uniform across platforms: on Windows, macOS, and Linux the browser drivers install with a single command. Say goodbye to failed chromedriver.exe downloads.
  • Intelligent auto-waiting: no need to litter the script with time.sleep() calls or complex wait conditions. Playwright waits until an element is actionable before acting on it, which makes scripts far less brittle.
  • Anti-crawling friendly: built-in device emulation, geolocation spoofing, and network interception make basic anti-crawling measures easy to handle.
  • Strong performance: generally faster than Selenium and more full-featured than Puppeteer, with an asynchronous API that suits high-concurrency collection.

2. Quick installation and configuration

Environment requirements

Python 3.8 or later, on any mainstream operating system (Windows, macOS, or Linux).

Accelerated installation for users in mainland China

If your connection to the official servers is good, you can install straight from the official source, but users in mainland China will most likely get stuck while downloading dependencies. Using a mirror is recommended:

# Install the Playwright Python library (Tsinghua mirror)
pip install playwright -i https://pypi.tuna.tsinghua.edu.cn/simple

# Download the three browser engines (about 200 to 300 MB), accelerated via npmmirror
PLAYWRIGHT_DOWNLOAD_HOST=https://npmmirror.com/mirrors/playwright playwright install

If your network connection is good, use the official source directly:

pip install playwright
playwright install

3. Getting started: synchronous vs. asynchronous

Playwright offers both a synchronous and an asynchronous API. For everyday small scripts the synchronous API is recommended because its logic is straightforward; for workloads that run dozens of pages at once, switch to the asynchronous API, which is considerably more efficient.

Synchronous mode: open Baidu with a few lines of code

The simplest getting-started example opens the Baidu homepage, prints the title, and saves a screenshot:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False shows the browser window; True runs it in the background
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.baidu.com")
    print(f"Page title: {page.title()}")
    page.screenshot(path="baidu_home.png")

The with statement manages the Playwright lifecycle automatically: when the block exits, Playwright shuts down and the browser it launched is closed, so no manual close() is needed and the code stays clean and safe.

Asynchronous mode: better suited to batch operations

If you want to collect multiple pages at the same time, asynchronous mode lets you make much fuller use of the network and CPU:

import asyncio
from playwright.async_api import async_playwright

async def open_baidu():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://www.baidu.com")
        print(await page.title())
        await browser.close()

asyncio.run(open_baidu())
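
The example above still opens a single page. To illustrate collecting several pages at once, here is a minimal sketch that caps the number of pages open concurrently. The helper names run_limited and fetch_titles are hypothetical (they are not part of Playwright), and the default limit of 5 is an arbitrary assumption to tune for your machine and the target site.

```python
import asyncio


async def run_limited(jobs, limit):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather() returns results in the same order as the jobs were given
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fetch_titles(urls, limit=5):
    """Open each URL in its own page and return the page titles."""
    # Imported lazily so run_limited() can be reused without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def title_of(url):
            page = await browser.new_page()
            try:
                await page.goto(url)
                return await page.title()
            finally:
                await page.close()

        # Bind url as a default argument so each job keeps its own URL
        jobs = [lambda url=url: title_of(url) for url in urls]
        try:
            return await run_limited(jobs, limit)
        finally:
            await browser.close()
```

A call such as asyncio.run(fetch_titles(["https://www.baidu.com", "https://www.bilibili.com"])) would return the titles in the same order as the URLs. The semaphore is one simple way to avoid opening every page at once; a task queue would work as well.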

4. A shortcut for the lazy: the code generation tool

Is locating elements the most tedious part of writing a crawler? Playwright ships with the codegen recorder, which spares you from hunting for selectors by hand. Whatever you do in the browser, it generates the corresponding Python code automatically.

For example, recording a Bilibili login takes a single command:

playwright codegen --target python -o bilibili_login.py -b firefox https://www.bilibili.com

Running it opens a Firefox window and the code generation window side by side. Enter your account details and click to log in; the script in the code generation window updates in real time and can be used after minor adjustments.


5. Core functions: essential operations for crawling dynamically rendered pages

Element positioning

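Playwright supports a variety of locating strategies; beginners can start with the most intuitive ones:

# Text selector (suits pages with weak anti-crawling); "登录" means "Log in"
page.click("text=登录")

# CSS selectors (the same syntax as in BeautifulSoup, so nothing new to learn)
page.fill("#login-username", "your_username")
page.fill(".login-password", "your_password")

# Combined selector (pinpoints an element containing specific text, here "Playwright crawler tutorial")
page.click("article:has-text('Playwright爬虫教程')")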
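
The snippets above click and fill; a crawler usually also needs to read data back out. As a hedged sketch, the same selectors can be passed to page.locator() to extract text and attributes. The function name collect_links and the default selector "a" are assumptions made for illustration; page is expected to be a playwright.sync_api.Page.

```python
def collect_links(page, selector="a"):
    """Return [{"text": ..., "href": ...}] for every element matching selector.

    `page` is expected to be a playwright.sync_api.Page.
    """
    links = page.locator(selector)
    results = []
    for i in range(links.count()):
        item = links.nth(i)
        href = item.get_attribute("href")
        if href:  # skip matches that have no href attribute
            results.append({"text": item.inner_text().strip(), "href": href})
    return results
```

For instance, collect_links(page, "article a") would gather every link inside article elements on the current page.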

Waiting mechanism (adding explicit waits when auto-waiting falls short)

Although Playwright waits automatically, scenarios such as infinite scrolling and asynchronously loaded content may still need an explicit wait:

# Wait for an element to be attached to the DOM and visible (the most common case)
page.wait_for_selector(".dynamic-data-item", timeout=30000)

# Wait for the network to go idle (no network connections for at least 500 ms)
page.wait_for_load_state("networkidle")

# Hard pause, for rare cases only (avoid it whenever possible)
page.wait_for_timeout(2000)
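
For the infinite-scrolling case mentioned above, one possible approach is to keep scrolling to the bottom until the number of matching items stops growing. The sketch below is an assumption about how such a loop might look, not an official recipe: scroll_until_stable is a hypothetical helper, and the fixed pause between rounds is a simplification that a real crawler may want to replace with a more precise wait.

```python
def scroll_until_stable(page, item_selector, max_rounds=20, pause_ms=1000):
    """Scroll to the bottom repeatedly until no new items appear.

    `page` is expected to be a playwright.sync_api.Page.
    Returns the final number of elements matching item_selector.
    """
    items = page.locator(item_selector)
    previous = items.count()
    for _ in range(max_rounds):
        # Jump to the bottom so the site loads the next batch
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazy-loaded content a moment to arrive
        page.wait_for_timeout(pause_ms)
        current = items.count()
        if current == previous:  # nothing new was loaded, so stop
            break
        previous = current
    return previous
```

The max_rounds cap keeps the loop from running forever on feeds that never end.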

Network interception: block images and ads for a significant speedup

Dynamic pages often issue a large number of useless image and ad requests. Intercepting them with route can greatly speed up collection:

import json

# Block all image requests
page.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())

# Block ad endpoints
page.route("**/ad/**", lambda route: route.abort())

# You can even rewrite an API response directly (for testing or to get past some checks)
page.route("**/api/user-info", lambda route: route.fulfill(
    status=200,
    content_type="application/json",
    body=json.dumps({"name": "test_user", "level": 10})
))
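
URL patterns miss images served without a file extension. A complementary sketch, under the assumption that blocking by resource type is acceptable for your target site, inspects route.request.resource_type instead. The names BLOCKED_TYPES, BLOCKED_URL_PARTS, should_block, and handle_route are hypothetical, and the ad path fragments are made-up examples.

```python
BLOCKED_TYPES = {"image", "media", "font"}
BLOCKED_URL_PARTS = ("/ad/", "/ads/")  # hypothetical ad path fragments


def should_block(resource_type, url):
    """Decide whether a request is worth skipping."""
    if resource_type in BLOCKED_TYPES:
        return True
    return any(part in url for part in BLOCKED_URL_PARTS)


def handle_route(route):
    """Route handler: abort unwanted requests, let everything else through."""
    request = route.request
    if should_block(request.resource_type, request.url):
        route.abort()
    else:
        route.continue_()
```

Register it once per page with page.route("**/*", handle_route). Keeping the decision in a plain function makes the blocking rules easy to adjust without touching Playwright code.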

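6. Practical techniques against anti-crawling (basics)

Simulate mobile devices

Many websites apply looser anti-crawling measures to mobile clients. Playwright ships with a rich set of device descriptors, so disguising the browser as a phone takes one line:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Available descriptor names depend on the installed Playwright version
    iphone_15 = p.devices["iPhone 15 Pro"]
    # Pairing it with WebKit (the Safari engine) looks more realistic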
    browser = p.webkit.launch(headless=False)
    context = browser.new_context(**iphone_15)
    page = context.new_page()
    page.goto("https://www.taobao.com")
    page.screenshot(path="taobao_mobile.png")

Set geographical location

Besides using an IP proxy, you can simulate a geographic location directly in the browser context, which helps with sites that make region-based decisions:

context = browser.new_context(
    **iphone_15,
    # Coordinates of Tiananmen, Beijing
    geolocation={"latitude": 39.913904, "longitude": 116.39014},
    permissions=["geolocation"]
)
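
A spoofed location is more convincing when the locale and time zone agree with it. The sketch below assembles the keyword arguments for browser.new_context() in one place. The helper name build_context_options is hypothetical, and the defaults zh-CN and Asia/Shanghai are assumptions chosen to match the Beijing coordinates above.

```python
def build_context_options(device, latitude, longitude,
                          locale="zh-CN", timezone_id="Asia/Shanghai"):
    """Merge a device descriptor with a consistent location, locale and time zone."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return {
        **device,  # e.g. p.devices["iPhone 15 Pro"]
        "geolocation": {"latitude": latitude, "longitude": longitude},
        "permissions": ["geolocation"],
        "locale": locale,
        "timezone_id": timezone_id,
    }
```

It would be used as context = browser.new_context(**build_context_options(iphone_15, 39.913904, 116.39014)).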

7. Frequently asked questions and solutions

Q1: Element click fails?

  • Check whether the element is covered by a pop-up or an advertisement
  • Try a forced click: page.click(selector, force=True)
  • Scroll the element into view first: page.locator(selector).scroll_into_view_if_needed()

Q2: Page loading timeout?

  • Extend the timeout: page.goto(url, timeout=60000)
  • Block unnecessary resources (such as images and advertisements) to shorten the wait
  • Check whether the target website requires a proxy
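
When a timeout is intermittent rather than permanent, retrying is another option. The following is a minimal, generic sketch: with_retries is a hypothetical helper, and the attempt count and linear backoff are arbitrary assumptions. It is not tied to Playwright, so the exception types to retry on are passed in by the caller.

```python
import time


def with_retries(action, attempts=3, delay_s=2.0, retry_on=(Exception,)):
    """Call action() until it succeeds or attempts run out, then re-raise."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_s * attempt)  # linear backoff between tries


# Possible use with Playwright's synchronous API:
# from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
# with_retries(lambda: page.goto(url, timeout=60000),
#              retry_on=(PlaywrightTimeoutError,))
```

Restricting retry_on to the timeout error keeps genuine bugs, such as a malformed URL, from being retried pointlessly.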

Q3: How to deal with CAPTCHAs?

  • For complex CAPTCHAs (slider, click, puzzle), integrating a third-party recognition service is recommended
  • Use a persistent browser context to keep login state and cookies; fewer repeated logins means fewer chances of triggering a CAPTCHA
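
One way to keep login state between runs is Playwright's storage state: save cookies and local storage to a file after logging in, then load that file when creating the next context. The sketch below assumes this approach; the helper names and the file name login_state.json are hypothetical.

```python
from pathlib import Path

STATE_FILE = Path("login_state.json")  # hypothetical file name


def context_options(state_file=STATE_FILE):
    """Return new_context() kwargs that reuse saved state when it exists."""
    state_file = Path(state_file)
    if state_file.exists():
        return {"storage_state": str(state_file)}
    return {}  # first run: start with a clean context


def save_login_state(context, state_file=STATE_FILE):
    """Write the context's cookies and local storage to disk."""
    context.storage_state(path=str(state_file))
```

A typical flow would create the context with browser.new_context(**context_options()), log in only if the site still asks for it, and then call save_login_state(context) before closing.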

8. Recommended resources

Don't worry if you run into problems: the official documentation is detailed, and there is plenty of Chinese-language material as well.


Playwright is not only good for dynamic crawlers; it is also a strong player in automated testing. I hope this tutorial helps you get started quickly. There are plenty of more advanced techniques left to explore from here.