Python Web Crawler Tutorial: A Lightweight, Practical Introduction for 2026

This article is a **hands-on, code-first walkthrough** with minimal theory. It focuses on getting a 2026-era tool chain running end to end, and every step follows the compliance recommendations in Section 5.


1. Overview and application scenarios of crawlers

1.1 One-sentence definition

A web crawler is a program that automatically follows links, downloads web pages, and parses their content. Think of it as a rule-driven "copy-paste spider" that collects specified data by walking the link graph of the web.

1.2 Common scenarios for hobbyists and developers

No AI + blockchain + metaverse buzzwords are needed here. A few uses that work for everyday practice or for solving small problems:

  • E-commerce monitoring: track stock levels and price changes ahead of a coupon sale
  • Content aggregation: when a favorite WeChat official account or blog has no working RSS feed, crawl the summaries yourself
  • Learning materials: collect links and descriptions for a given kind of Python tutorial on GitHub or Juejin

2. Modern crawler technology stack

2.1 Minimalist core process

There is no need to memorize a complicated four-stage pipeline. Three small modules are enough:

  1. Send a request and get the page source: either call the HTTP/HTTPS endpoint directly (preferably asynchronously), or drive a real browser when the page is rendered dynamically
  2. Extract data from the source: use CSS selectors, XPath, or regex to pick out what you need (avoid running regex over an entire HTML document)
  3. Save it for later use: CSV/JSON for small datasets, MongoDB for medium ones, and a distributed cluster for large ones

| Module | Recommended tool | Why |
| --- | --- | --- |
| Request library | httpx | Supports HTTP/2 and offers synchronous and asynchronous clients with nearly the same API, which suits the modern web better than requests |
| Parsing library | parsel | The selector library used by Scrapy; supports both CSS and XPath, with a more uniform syntax than BeautifulSoup |
| Browser automation | Playwright | Drives Chromium, Firefox, and WebKit; installs the browsers for you; ships a handy code generator; typically needs less anti-bot tuning than Selenium |
| Data validation and cleanup | pydantic + pandas | pydantic for validating individual records, pandas for sorting and exporting larger datasets |
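
Step 3 (saving small datasets as CSV/JSON) can be sketched with the standard library alone. This is only an illustration: `save_records` is a hypothetical helper name, and the record fields are made up rather than taken from a real crawl.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def save_records(records: List[Dict[str, str]], csv_path: Path, json_path: Path) -> None:
    """Write the same records to a CSV file and a JSON file."""
    if not records:
        return  # nothing to write
    # Use the keys of the first record as the CSV header
    fieldnames = list(records[0].keys())
    # utf-8-sig adds a BOM so spreadsheet tools detect the encoding
    with csv_path.open("w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    # ensure_ascii=False keeps non-ASCII text readable in the JSON file
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Calling `save_records(rows, Path("pages.csv"), Path("pages.json"))` with a list of dictionaries that share the same keys writes both files. For medium or large datasets the same records would go to a database instead.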

3. Basics of crawler development

3.1 Environment configuration in 2026 (Python 3.10+ must be installed)

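# 1. Create an isolated virtual environment (keeps the global Python clean)
python -m venv crawler_env
# 2. Activate the virtual environment
# Linux/Mac
source crawler_env/bin/activate
# Windows PowerShell
.\crawler_env\Scripts\Activate.ps1
# Windows CMD
.\crawler_env\Scripts\activate.bat
# 3. Install the packages (the http2 extra is required for httpx's http2=True option)
pip install "httpx[http2]" parsel playwright pandas
# 4. Install the Playwright browser binaries (Chromium only here; plain `playwright install` fetches all supported browsers)
playwright install chromium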

3.2 A 5-minute asynchronous example: crawling basic information from example.com

This code is lightweight and asynchronous, includes error handling, and sends browser-like request headers. It can serve as practice material or as a starting point for a small project:

import asyncio
from typing import Dict, Optional

import httpx
from parsel import Selector

# 1. Browser-like request headers (some sites reject the default client User-Agent)
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

# 2. Fetch the page source asynchronously (the speed-up over synchronous code
#    only shows up when many pages are fetched concurrently)
async def fetch_page(url: str, headers: Optional[Dict[str, str]] = None) -> str:
    try:
        # http2=True needs the optional dependency: pip install "httpx[http2]"
        async with httpx.AsyncClient(http2=True, follow_redirects=True, timeout=10.0) as client:
            response = await client.get(url, headers=headers or DEFAULT_HEADERS)
            # Raise httpx.HTTPStatusError on 4xx/5xx responses such as 404 or 502
            response.raise_for_status()
            return response.text
    except httpx.HTTPError as e:
        # Covers network errors, timeouts, and bad status codes
        print(f"Failed to fetch {url}: {e}")
        return ""

# 3. Parse the data with parsel (plain function: parsing does no I/O, so it need not be async)
def parse_example(html: str) -> Dict:
    if not html:
        return {}
    selector = Selector(text=html)
    return {
        # CSS selector: text of the first h1 (empty string if the page has none)
        "page_title": selector.css("h1::text").get(default="").strip(),
        # CSS selector: content attribute of meta[name="description"]
        "page_desc": selector.css('meta[name="description"]::attr(content)').get(default="No description").strip(),
        # CSS selector: href of every a tag (only absolute links are kept, for practice)
        "valid_links": [link for link in selector.css("a::attr(href)").getall() if link.startswith("http")],
    }

# 4. Main coroutine
async def main():
    url = "https://example.com"
    html = await fetch_page(url)
    data = parse_example(html)
    # Print the result
    print("=== example.com basic information ===")
    print(f"Page title: {data.get('page_title')}")
    print(f"Page description: {data.get('page_desc')}")
    print(f"Number of valid links: {len(data.get('valid_links', []))}")

if __name__ == "__main__":
    # Start the asyncio event loop
    asyncio.run(main())
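
The asynchronous style only pays off once several pages are fetched at the same time. The sketch below shows one way to do that with a concurrency cap. `crawl_many` and `fake_fetch` are hypothetical names introduced for illustration, the cap of 3 is an arbitrary value, and `fake_fetch` stands in for `fetch_page` so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def crawl_many(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> Dict[str, str]:
    """Fetch all URLs concurrently, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> str:
        async with semaphore:  # wait here while the cap is reached
            return await fetch(url)

    # gather preserves input order, so results line up with urls
    pages = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(zip(urls, pages))


async def fake_fetch(url: str) -> str:
    """Stand-in for fetch_page so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page/{i}" for i in range(10)]
    result = asyncio.run(crawl_many(demo_urls, fake_fetch, max_concurrency=3))
    print(f"Fetched {len(result)} pages")
```

Passing `fetch_page` instead of `fake_fetch` would issue real requests. Keeping the cap small is also what keeps the crawler within the frequency limits discussed in Section 5.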

4. Modern crawler challenges and solutions

4.1 The two most common pitfalls

Pitfall 1: The data only appears after the page has loaded (dynamic rendering)

On pages such as Weibo hot searches or Taobao product listings, the source fetched directly with httpx contains no data. The data is either embedded in JavaScript or loaded from a separate API call and rendered afterwards.

Low-cost solution: let Playwright render the page

There is no need to reverse-engineer the site's JavaScript API calls at first. Use Playwright to drive a real browser and wait until the data has loaded (here, until the network goes idle):

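from playwright.sync_api import sync_playwright

# Same browser-like User-Agent as DEFAULT_HEADERS in section 3.2, repeated so this snippet runs on its own
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"

with sync_playwright() as p:
    # Launch headed Chromium (tends to trip fewer anti-bot checks than headless mode)
    browser = p.chromium.launch(headless=False, args=["--disable-blink-features=AutomationControlled"])
    # Open a new page with a browser-like User-Agent and locale
    page = browser.new_page(
        user_agent=USER_AGENT,
        locale="zh-CN",
    )
    # Navigate to the target page (example.com stands in for a real product page)
    page.goto("https://example.com", wait_until="networkidle")  # wait until the network is idle
    # Either locate elements with page.locator(...) or grab the full rendered source
    html = page.content()
    browser.close()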

Pitfall 2: IP blocked for sending requests too quickly

Low-cost solutions (suitable for practice):
  1. Add a random interval between requests: import random, time; time.sleep(random.uniform(1.5, 3.0))
  2. Use a free proxy pool, for example scrapy-proxies or proxy-pool (free proxies are unstable, so use them with caution even in small projects)
  3. Rotate the User-Agent with the fake_useragent library (use a recently updated release so the browser strings are current)
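
The random interval from item 1 can be wrapped in a small helper so that every request goes through the same delay logic. This is a minimal sketch: `throttled` is a hypothetical name, and the `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def throttled(
    urls: Iterable[str],
    min_delay: float = 1.5,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs one at a time, pausing a random interval between consecutive ones."""
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(random.uniform(min_delay, max_delay))
        yield url
```

Looping with `for url in throttled(url_list):` and sending one request per iteration spaces the requests 1.5 to 3 seconds apart. The randomness avoids the perfectly regular rhythm that rate limiters tend to flag.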

5. Laws and Ethics

5.1 Three red lines that must not be crossed

  1. Pages that robots.txt forbids crawling: for example, the Disallow entries in https://www.baidu.com/robots.txt
  2. Personal privacy and confidential business data: such as other people's ID numbers, mobile phone numbers, and undisclosed financial data
  3. Crawling so aggressively that the site goes down: even on pages you are allowed to crawl, do not open 1,000 concurrent connections
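
The first red line can be checked in code before any request is sent. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt content is a made-up sample inlined so the example runs offline, and `build_parser` and `is_allowed` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for target in ("https://example.com/index.html", "https://example.com/private/data"):
        print(target, "allowed" if is_allowed(rules, "MyCrawler", target) else "disallowed")
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads and parses the file in one step. Skipping any URL for which `can_fetch` returns False keeps the crawler on the right side of this rule.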

5.2 Low-cost compliance guidelines (sufficient for practice and small projects)

  1. Prefer the official API: for example the GitHub API or the Weibo Open Platform API, which need no anti-bot workarounds and are more stable
  2. Keep the request interval at 2 seconds or more: this is generally fine for ordinary small websites
  3. Add your contact information to the User-Agent: for example MyCrawler/1.0 (+https://myblog.com/contact), so the site administrator can reach you if the crawl is affecting the site
  4. Only crawl public data, and only for non-commercial use: practice is fine, but crawling data to sell it is not

6. Learning path suggestions

6.1 A minimalist route from getting started to being able to write small projects

  1. Basic (1-2 days):
  • Understand the basics of HTTP (GET/POST requests, response codes, request headers)
  • Use httpx + parsel to crawl static pages (such as the Douban Movie Top 250 as practice)
  • Export data to CSV/Excel using pandas
  2. Intermediate (3-5 days):
  • Use Playwright to crawl dynamically rendered pages
  • Add random request intervals and rotate the User-Agent
  • Validate data formats with pydantic
  3. Advanced (learn on demand):
  • Scrapy framework (suitable for large-scale crawling)
  • Distributed crawlers (Scrapy + Redis)
  • JS reverse engineering (only when no API can be found)
  4. Official documentation:
  5. Open-source practice projects:
  • Douban Movie Top 250 crawler (search the keywords on GitHub and pick repositories with more stars)
  • A summary-aggregation crawler for a technical blog

7. Summary

This article is a lightweight, hands-on introduction to Python crawlers in 2026. It shows how to run an up-to-date tool chain end to end, how to avoid the common pitfalls, and where the compliance red lines are.

Suggested next steps:

  1. Start with the Douban Movie Top 250 (static pages, no complicated anti-crawling measures)
  2. Then practice on a dynamically rendered test page
  3. Finally, try writing a small project with Scrapy

Hands-on practice is the only shortcut to learning crawlers. Start with small projects, and do not take on large, complex websites at the beginning!