Complete Guide to Integrating Scrapy with Selenium/Playwright


In modern web development, more and more pages render their content dynamically with JavaScript. Scrapy's default downloader only receives the static HTML returned by the server and does not execute JavaScript, so it never sees data that is loaded asynchronously. In these cases, a browser automation tool such as Selenium or Playwright can drive a real browser, render the page fully, and hand the resulting HTML back to the crawler.

This tutorial walks step by step through integrating Selenium and Playwright into Scrapy, and covers performance optimization and anti-detection techniques for crawling dynamic pages.




Overview of Selenium and Playwright

Selenium is a long-established browser automation tool with a large community and extensive documentation. Playwright is a newer framework from Microsoft with a simpler API and faster execution. The comparison table below summarizes the differences.

| Feature | Selenium | Playwright |
| --- | --- | --- |
| Maturity | High; many tutorials and strong community support | Newer, but the ecosystem is growing rapidly |
| Performance | Average; high resource usage | Excellent; fast startup, low memory usage |
| API complexity | More cumbersome; waits must be written by hand | Simple; built-in auto-waiting |
| Browser support | Chrome, Firefox, Edge, etc. | Chromium, Firefox, WebKit |
| Applicable scenarios | Legacy projects or multi-browser compatibility | New projects that need high performance and stable crawling |

When is browser automation needed?

Not every crawler needs browser automation. Consider introducing Selenium or Playwright only when the target page meets one or more of the following conditions:

  • A large amount of content is loaded dynamically through AJAX/Fetch, and the data does not appear when you view the page source directly;
  • The site is a single-page application (SPA), such as a page rendered by Vue or React;
  • You need to simulate complex user interactions, such as clicking buttons, scrolling, or filling out forms;
  • You want to capture content rendered as graphics, such as Canvas or WebGL;
  • A form must be submitted dynamically, and its token or verification parameters are generated on the front end.

⚠️ The overhead of browser automation is much greater than that of ordinary HTTP requests. Please use it only when you confirm that it is really needed.
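
A quick way to test the first condition is to fetch the page without a browser and check whether a value you can see on screen appears in the raw HTML. The sketch below uses only the standard library; the helper names and the sample markup are made up for illustration.

```python
from urllib.request import Request, urlopen


def fetch_static_html(url, timeout=10):
    """Download the page as a plain HTTP client sees it (no JavaScript runs)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def needs_browser(static_html, visible_text):
    """True when text that is visible in the browser is missing from the raw HTML."""
    return visible_text not in static_html


# Hypothetical SPA shell: the product list is filled in later by JavaScript.
spa_shell = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>'
print(needs_browser(spa_shell, "Wireless Mouse"))  # True
```

If the text is already present in the static HTML, the plain Scrapy downloader is usually enough.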


Selenium integration solution

The most common way to integrate Selenium into Scrapy is to write a downloader middleware, start the Chrome or Firefox browser in the middleware, then use the browser to load the page, and return the rendered HTML to Scrapy.

Basic Selenium middleware

Here is an example Selenium middleware. It checks whether each request's meta contains the use_selenium flag; if so, it loads the page in a headless browser, otherwise the request goes to the default downloader.

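import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware:
    """Basic Selenium downloader middleware."""

    def __init__(self):
        self.driver = self._create_driver()

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed explicitly; Scrapy does not call it on a
        # middleware unless it is registered as a signal handler.
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def _create_driver(self):
        """Create and configure the Chrome driver."""
        chrome_options = Options()
        chrome_options.add_argument('--headless')          # headless mode, no visible window
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--disable-blink-features=AutomationControlled')

        driver = webdriver.Chrome(options=chrome_options)

        # Hide navigator.webdriver to lower the risk of detection. The script is
        # registered through CDP so that it runs before each new page's scripts;
        # a plain execute_script() call would only affect the current document.
        driver.execute_cdp_cmd(
            'Page.addScriptToEvaluateOnNewDocument',
            {'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
        )
        return driver

    def process_request(self, request, spider):
        """Render requests flagged with use_selenium in the browser."""
        if not request.meta.get('use_selenium'):
            return None  # let Scrapy's default downloader handle the request

        try:
            self.driver.get(request.url)

            # Wait until the page body is present
            wait = WebDriverWait(self.driver, 10)
            wait.until(EC.presence_of_element_located((By.TAG_NAME, "body")))

            # Extra pause for dynamic content (replace with a more precise wait)
            time.sleep(2)

            body = self.driver.page_source.encode('utf-8')
            return HtmlResponse(
                url=request.url,
                body=body,
                encoding='utf-8',
                request=request
            )
        except Exception as e:
            spider.logger.error(f"Selenium error: {e}")
            # Fall back to Scrapy's default downloader. Returning the same
            # request would re-queue it, where the duplicate filter normally drops it.
            return None

    def spider_closed(self, spider):
        """Release the browser when the spider closes."""
        if self.driver:
            self.driver.quit()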

How to use: enable the middleware in settings.py, then set the flag on every request that needs browser rendering:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}

# In the spider, send a request that carries the flag
yield scrapy.Request(
    url='https://example.com/dynamic-page',
    meta={'use_selenium': True},
    callback=self.parse
)

🔧 Tip: Rather than rendering every request in the browser, you can decide page by page. For example, when normal parsing cannot find the target data, retry the same URL with a request that sets use_selenium.
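
The fallback described in the tip could look roughly like this. It is a sketch, not part of Scrapy: browser_fallback is a hypothetical helper, and the div.item selector in the usage comment is an assumed example. It relies only on response.meta, response.request and Request.replace(), which Scrapy provides.

```python
def browser_fallback(response, found_items):
    """Return a retry request flagged for Selenium, or None if no retry is needed."""
    already_rendered = response.meta.get("use_selenium", False)
    if found_items or already_rendered:
        return None
    # Request.replace() copies the request with the given attributes changed.
    # dont_filter=True keeps the duplicate filter from dropping the retry,
    # because the URL has already been seen once.
    return response.request.replace(
        meta={**response.meta, "use_selenium": True},
        dont_filter=True,
    )


# Usage inside a spider callback (selector is an assumed example):
#
#     def parse(self, response):
#         items = response.css("div.item")
#         retry = browser_fallback(response, found_items=bool(items))
#         if retry is not None:
#             yield retry
#             return
#         ...  # normal parsing
```

Checking the flag before retrying matters: without it, a page that is empty even after rendering would be retried forever.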


Playwright Integration Solution

Playwright provides a simple synchronous API that is easy to wrap in a Scrapy middleware. As with Selenium, the middleware starts a browser instance, loads the page, and returns the rendered HTML. Keep in mind that the synchronous API blocks while each page renders.

Basic Playwright middleware

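from playwright.sync_api import sync_playwright
from scrapy import signals
from scrapy.http import HtmlResponse


class PlaywrightMiddleware:
    """Basic Playwright downloader middleware."""

    def __init__(self):
        self.playwright = None
        self.browser = None
        self.setup_browser()

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed explicitly; Scrapy does not call it on a
        # middleware unless it is registered as a signal handler.
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def setup_browser(self):
        """Start Playwright and launch the browser instance."""
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(
            headless=True,
            args=[
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage'
            ]
        )

    def process_request(self, request, spider):
        """Render requests flagged with use_playwright in the browser."""
        if not request.meta.get('use_playwright'):
            return None  # let Scrapy's default downloader handle the request

        page = None
        try:
            page = self.browser.new_page()
            # Wait for the network to go idle so asynchronous data has loaded
            page.goto(request.url, wait_until="networkidle")

            content = page.content()
            return HtmlResponse(
                url=request.url,
                body=content.encode('utf-8'),
                encoding='utf-8',
                request=request
            )
        except Exception as e:
            spider.logger.error(f"Playwright error: {e}")
            # Fall back to Scrapy's default downloader. Returning the same
            # request would re-queue it, where the duplicate filter normally drops it.
            return None
        finally:
            # Always close the page, even after an error, to release memory
            if page is not None:
                page.close()

    def spider_closed(self, spider):
        """Clean up resources when the spider finishes."""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()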

Usage is almost the same as with Selenium: enable the middleware in settings.py and set meta={'use_playwright': True} on the requests that need rendering.

💡 Selection Suggestion: If the project is brand new and does not need to be compatible with particularly old browsers, it is recommended to use Playwright first, as the API is more friendly and the performance is better. Selenium is still reliable if your team has extensive experience using Selenium, or if you need to support multiple unusual browsers.


Anti-detection strategy

A headless browser started with its default configuration is easy for a website to recognize as an automation tool, which can get the IP blocked or cause the site to return a blank page. The browser therefore needs to disguise the fingerprints that common detection scripts look for.

Anti-detection configuration fragment

Below is a configuration class that integrates common anti-detection logic and can be applied in Selenium or Playwright.

import random

class AntiDetectionConfig:
    """Anti-detection configuration."""

    @staticmethod
    def get_stealth_args():
        """Return browser launch arguments used for anti-detection."""
        return [
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--disable-gpu',
            '--disable-extensions',
            '--no-sandbox',
            # Disables the same-origin policy; not a stealth flag, keep only if needed
            '--disable-web-security'
        ]

    @staticmethod
    def get_stealth_script():
        """Return the anti-detection JS script to run in the page."""
        return """
        // Hide the webdriver property
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
        });
        // Pretend that browser plugins are installed
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
        });
        // Simulate the browser language settings
        Object.defineProperty(navigator, 'languages', {
            get: () => ['zh-CN', 'zh', 'en'],
        });
        """

    @staticmethod
    def get_realistic_user_agents():
        """Return real browser User-Agent strings to choose from at random."""
        return [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0"
        ]

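Pass these launch arguments when initializing the browser, and register the anti-detection script so that it runs before the page's own scripts. A script executed only after the page has loaded runs too late, because detection code may already have read the original values. Applied this way, the configuration can defeat many simple JavaScript-based fingerprint checks, though not all of them.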
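
One way to register the script early is sketched below. The function names and the shortened constants are assumptions for illustration; in a real project the values would come from AntiDetectionConfig. The Playwright variant uses browser.new_context() and context.add_init_script(); the Selenium variant is Chrome-only and goes through the DevTools protocol.

```python
import random

# Shortened stand-ins for AntiDetectionConfig values (illustrative only).
STEALTH_SCRIPT = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def new_stealth_context(browser, rng=random):
    """Playwright: create a context with a random User-Agent and an init script."""
    context = browser.new_context(user_agent=rng.choice(USER_AGENTS))
    # add_init_script registers JS that runs in every new page of this context
    # before the page's own scripts.
    context.add_init_script(STEALTH_SCRIPT)
    return context


def install_stealth_script(driver):
    """Selenium with Chrome: register the script through a CDP command."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": STEALTH_SCRIPT},
    )
```

With Playwright, pages are then opened with context.new_page() instead of browser.new_page(), so that they inherit the User-Agent and the script.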

More advanced anti-detection details

In addition to the above basic operations, you can also consider:

  • Randomize browser viewport size;
  • Use page.evaluate (Playwright) or driver.execute_script (Selenium) to trigger mouse movement events at random;
  • Combine the browser with a proxy IP pool so that each browser instance exits through a different IP.
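
The first two ideas can be sketched for Playwright as follows. The viewport list and the helper names are assumptions for illustration; page.mouse.move() with its steps argument is part of Playwright's API, and is used here in place of synthetic events because it drives the real input pipeline.

```python
import random

# A few common desktop resolutions (illustrative list, adjust as needed).
COMMON_VIEWPORTS = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]


def random_viewport(rng=random):
    """Pick a viewport in the dict shape that Playwright's new_context() accepts."""
    width, height = rng.choice(COMMON_VIEWPORTS)
    return {"width": width, "height": height}


def wander_mouse(page, viewport, moves=3, rng=random):
    """Move the mouse to a few random points inside the viewport."""
    for _ in range(moves):
        x = rng.randint(0, viewport["width"] - 1)
        y = rng.randint(0, viewport["height"] - 1)
        # steps > 1 makes Playwright emit intermediate mousemove events
        # instead of jumping straight to the target.
        page.mouse.move(x, y, steps=rng.randint(5, 15))
```

A typical call sequence would be viewport = random_viewport(), then browser.new_context(viewport=viewport), then wander_mouse(page, viewport) after the page has loaded.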

Anti-crawling is a continuous game, and the above strategies need to be continuously adjusted according to the specific conditions of the target site.


Performance Optimization Tips

Browser automation is the "heavy weapon" in crawlers and has a huge performance overhead. If not optimized, not only will the crawling speed be slow, but system resources will also be easily exhausted. Here are some practical tips.

Core optimization methods

  1. Reuse browser instances. Starting a new browser for every request is expensive. Create the browser when the middleware is initialized and reuse it across requests, or manage a small number of instances in a pool.

  2. Limit concurrency. Use a semaphore or a queue to cap the number of browsers open at the same time (for example, 2-3), depending on what the machine can handle.

  3. Cache rendered content. If the same URL may be requested several times within a short period, add an in-memory cache so that the browser is not invoked repeatedly.

  4. Block unnecessary resources. Intercepting images, videos, fonts, and other irrelevant resources can significantly reduce rendering time. In Playwright this is done with page.route; in Selenium, with a Chrome extension or launch parameters.

  5. Prefer explicit waits to fixed sleeps. Use WebDriverWait (Selenium) or Playwright's wait_for_selector to wait for exactly the element you need, and avoid casual use of time.sleep().
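
Tips 4 and 5 can be combined in a small Playwright sketch. The function names and the set of blocked types are assumptions; page.route(), route.abort(), route.continue_() and page.wait_for_selector() are Playwright's own calls.

```python
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def block_heavy_resources(route):
    """Route handler: abort images, media and fonts, let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


def render(page, url, selector, timeout_ms=10_000):
    """Load url and return the HTML once an element matching selector is visible."""
    page.route("**/*", block_heavy_resources)
    # Do not wait for full network idle; the explicit wait below is the real gate.
    page.goto(url, wait_until="domcontentloaded")
    page.wait_for_selector(selector, timeout=timeout_ms)
    return page.content()
```

Waiting on a selector that only exists after the data has arrived is usually both faster and more reliable than a fixed pause, because the wait ends as soon as the element appears.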

A simple performance optimization example

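class PerformanceOptimizedMiddleware:
    """Example middleware skeleton with a cache and a simple driver pool."""

    def __init__(self):
        self.driver_pool = []
        self.max_pool_size = 3
        self.cache = {}

    def get_cached_result(self, url):
        """Read from the cache."""
        return self.cache.get(url)

    def cache_result(self, url, content):
        """Write to the cache, keeping at most 100 entries."""
        if url not in self.cache and len(self.cache) >= 100:
            # dicts keep insertion order, so this evicts the oldest entry
            self.cache.pop(next(iter(self.cache)))
        self.cache[url] = content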

You can integrate this logic into the Selenium or Playwright middleware shown earlier: check the cache first and, on a hit, return the cached HTML directly and skip the browser rendering step.
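
The cache-first flow might be sketched like this. RenderCache and fake_render are hypothetical names; render stands for whatever function actually drives the browser and returns HTML.

```python
from collections import OrderedDict


class RenderCache:
    """Small in-memory cache that evicts the least recently used entry."""

    def __init__(self, max_entries=100):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get_or_render(self, url, render):
        """Return cached HTML for url, calling render(url) only on a miss."""
        if url in self._entries:
            self._entries.move_to_end(url)  # mark as recently used
            return self._entries[url]
        html = render(url)
        self._entries[url] = html
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        return html


# Demo with a stand-in renderer that records how often it is called.
calls = []

def fake_render(url):
    calls.append(url)
    return f"<html>{url}</html>"

cache = RenderCache(max_entries=2)
cache.get_or_render("https://example.com/a", fake_render)
cache.get_or_render("https://example.com/a", fake_render)
print(len(calls))  # 1: the second lookup was served from the cache
```

Inside process_request, the cached HTML would be wrapped in an HtmlResponse exactly as the rendered HTML is.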


Frequently Asked Questions and Solutions

Various problems come up in practice. Here are the most common ones and how to handle them.

Browser startup failed

Symptom: ChromeDriver or Playwright cannot start the browser, and the console reports a "session not created" error or says the browser cannot be found.

Solution:

  • Make sure the locally installed Chrome/Chromium version matches the driver version. With older Selenium releases you need to download the matching ChromeDriver yourself; Selenium 4.6 and later include Selenium Manager, which can fetch a matching driver automatically;
  • When running in a container environment such as Docker, be sure to add the --no-sandbox and --disable-dev-shm-usage arguments;
  • If you use Playwright, run playwright install to download the matching browsers automatically and avoid manual configuration.

Memory leak

Symptom: After the crawler has been running for a while, memory usage keeps growing and can eventually crash the process.

Solution:

  • Restart the browser instance regularly (for example, recreate it after every N requests);
  • Use connection pools to manage browsers and pages, and close unused pages in a timely manner;
  • In Playwright, call page.close() explicitly after each request so that pages do not accumulate.
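
The first suggestion, restarting the browser after every N requests, could be sketched as below. RecyclingBrowser is a hypothetical helper; factory and close are callables you supply, for example lambda: webdriver.Chrome(options=opts) and lambda d: d.quit().

```python
class RecyclingBrowser:
    """Hand out one browser instance and replace it after max_uses requests."""

    def __init__(self, factory, close, max_uses=200):
        self._factory = factory      # creates a new browser or driver
        self._close = close          # shuts an instance down
        self._max_uses = max_uses
        self._browser = None
        self._uses = 0

    def acquire(self):
        """Return the current browser, recreating it once the budget is spent."""
        if self._browser is not None and self._uses >= self._max_uses:
            self.shutdown()
        if self._browser is None:
            self._browser = self._factory()
            self._uses = 0
        self._uses += 1
        return self._browser

    def shutdown(self):
        """Close the current browser, if any."""
        if self._browser is not None:
            self._close(self._browser)
            self._browser = None
```

The middleware would call acquire() at the start of process_request and shutdown() from spider_closed. The right value for max_uses depends on how heavy the target pages are, so treat 200 as a placeholder.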

Recognized as an automated tool by the website

Symptom: The site returns blank content, a CAPTCHA, or an access-denied page.

Solution:

  • Apply the anti-detection script and startup parameters introduced earlier;
  • Control the request speed, add random delays, and simulate the human operating rhythm;
  • Perform a few random scrolls or mouse movements after the page loads (in Playwright, page.mouse.move() can do this);
  • Change User-Agent and proxy IP regularly.
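
A random delay can be added inside the middleware itself, just before the browser navigates. This placement is deliberate: the middlewares in this guide return the response from process_request, so the request does not pass through Scrapy's downloader, and a DOWNLOAD_DELAY configured in settings.py is not expected to apply to it. human_pause is a hypothetical helper, and the 1 to 3 second range is only an illustrative default.

```python
import random
import time


def human_pause(min_s=1.0, max_s=3.0, rng=random, sleep=time.sleep):
    """Sleep for a random duration in [min_s, max_s] and return the delay used."""
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Usage inside process_request, before navigation:
#
#     human_pause()
#     self.driver.get(request.url)
```

The rng and sleep parameters exist only so the helper can be exercised without actually waiting; normal callers can ignore them.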

💡 Core Points: Selenium and Playwright are powerful tools for handling JavaScript rendering, but they are also "performance killers" for crawlers. Through reasonable use of caching, connection pooling, resource interception and anti-detection strategies, you can find the best balance between crawling effect and efficiency.

