Python 异步爬虫教程 (2024 更新版)

1. 异步爬虫概述

1.1 为什么需要异步爬虫

爬虫是典型的 IO 密集型任务，传统同步爬虫在发出请求后必须等待响应返回才能继续执行，导致大量时间浪费在等待上。异步爬虫通过协程技术可以显著提高爬取效率。

1.2 异步爬虫优势

高并发：单线程即可处理大量并发请求
低资源消耗：相比多线程/进程更节省资源
高性能：IO 等待时可切换执行其他任务

2. 核心概念

2.1 协程 (Coroutine)

协程是用户态的轻量级线程，具有以下特点：

由程序控制调度，非抢占式
切换开销远小于线程切换
单线程即可实现高并发

2.2 事件循环 (Event Loop)

异步编程的核心机制，负责：

调度协程执行
处理IO事件
执行回调函数

2.3 async/await 语法

Python 3.5+ 引入的异步编程语法：

async def：定义异步函数
await：挂起协程，等待异步操作完成

3. 异步HTTP客户端

3.1 aiohttp 基础用法

import aiohttp
import asyncio

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def main():
    html = await fetch('http://example.com')
    print(html)

asyncio.run(main())

3.2 新版特性 (aiohttp 3.8+)

支持 HTTP/2
改进的连接池管理
更好的错误处理机制
增强的代理支持

4. 实战案例

4.1 高性能爬虫实现

import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch_page(session, url):
    try:
        async with session.get(url, timeout=10) as response:
            if response.status == 200:
                return await response.text()
            return None
    except Exception as e:
        print(f"Error fetching {url}: {str(e)}")
        return None

async def parse_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    # 提取数据逻辑
    return data

async def crawl(urls):
    connector = aiohttp.TCPConnector(limit=100)  # 控制并发量
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
        
        parse_tasks = [parse_page(html) for html in pages if html]
        results = await asyncio.gather(*parse_tasks)
        return results

4.2 性能优化技巧

连接池管理：

connector = aiohttp.TCPConnector(
    limit=100,  # 最大连接数
    limit_per_host=20,  # 单域名最大连接
    force_close=False,  # 保持长连接
    enable_cleanup_closed=True  # 自动清理关闭的连接
)

超时设置：

timeout = aiohttp.ClientTimeout(
    total=30,  # 总超时
    connect=10,  # 连接超时
    sock_connect=10,  # socket连接超时
    sock_read=10  # socket读取超时
)

重试机制：

from aiohttp_retry import RetryClient, ExponentialRetry

async with RetryClient(
    retry_options=ExponentialRetry(attempts=3)
) as client:
    async with client.get(url) as response:
        ...

5. 高级主题

5.1 异步数据库访问

# 使用 asyncpg 访问 PostgreSQL
import asyncpg

async def query_db():
    conn = await asyncpg.connect('postgresql://user:pass@localhost/db')
    result = await conn.fetch('SELECT * FROM table')
    await conn.close()
    return result

5.2 分布式异步爬虫

# 使用 Redis 作为消息队列
import aioredis

async def process_queue():
    redis = await aioredis.create_redis_pool('redis://localhost')
    while True:
        url = await redis.lpop('url_queue')
        if not url:
            break
        # 处理URL

5.3 最新技术整合

Playwright 异步支持：

from playwright.async_api import async_playwright

async def scrape_js_site():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('http://example.com')
        content = await page.content()
        await browser.close()
        return content

HTTPX 异步客户端：

import httpx

async def fetch_httpx():
    async with httpx.AsyncClient() as client:
        response = await client.get('http://example.com')
        return response.text

6. 最佳实践

速率控制：

from aiolimiter import AsyncLimiter

limiter = AsyncLimiter(10, 1)  # 每秒10个请求

async def limited_request(url):
    async with limiter:
        return await fetch(url)

错误处理：

async def robust_fetch(session, url, retries=3):
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                return await response.text()
        except Exception as e:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(1)

监控与日志：

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger('async_crawler')

async def logged_fetch(url):
    logger.info(f'Fetching {url}')
    # 请求逻辑

7. 性能对比

测试环境：

Python 3.10
aiohttp 3.8.5
100个请求 (https://httpbin.org/delay/1)

方法	耗时(秒)	内存占用(MB)
同步(requests)	102.4	12
多线程(10线程)	10.8	45
异步(aiohttp)	1.2	15

8. 总结

异步爬虫技术已经成为现代爬虫开发的标准配置，通过合理使用协程和异步IO，可以大幅提升爬虫性能。本教程涵盖了从基础概念到高级实践的完整内容，帮助开发者构建高性能的异步爬虫系统。

#Python 异步爬虫教程 (2024 更新版)

#1. 异步爬虫概述

#1.1 为什么需要异步爬虫

#1.2 异步爬虫优势

#2. 核心概念

#2.1 协程 (Coroutine)

#2.2 事件循环 (Event Loop)

#2.3 async/await 语法

#3. 异步HTTP客户端

#3.1 aiohttp 基础用法

#3.2 新版特性 (aiohttp 3.8+)

#4. 实战案例

#4.1 高性能爬虫实现

#4.2 性能优化技巧

#5. 高级主题

#5.1 异步数据库访问

#5.2 分布式异步爬虫

#5.3 最新技术整合

#6. 最佳实践

#7. 性能对比

#8. 总结

#进一步学习资源