道满PythonAI

title: 🛠️ Python crawler in practice: a hands-on guide to synchronous and asynchronous crawling of Douban Movie Top250
description: In web crawling, efficiency is everything. When tens of thousands of records need to be collected, the waiting time of single-threaded synchronous crawling is unacceptable. This article uses working code to take you from the most basic synchronous logic to a fully concurrent coroutine solution.
https://github.com/Annyfee/spider-js-reverse
https://github.com/Annyfee/spider-defense-bypass/tree/main

Foreword: In web crawling, efficiency is king. When the data volume jumps from a few hundred records to tens of thousands, the stop-and-wait rhythm of single-threaded synchronous crawling becomes unbearable. Rather than piling on theory, this article starts from the simplest synchronous crawler, upgrades it step by step to multi-threading, and finally pushes performance further with coroutines for very high throughput. By the end, you should be able to write a fast crawler and also understand the core ideas of concurrent programming.

🛠️ 1. Core tool stack

The stack is deliberately pragmatic; each library addresses a specific crawler pain point:

Data collection: requests (synchronous HTTP library, the first choice for getting started) / aiohttp (asynchronous HTTP library, the fastest option here)
Data parsing: lxml.etree → uses XPath to locate page content precisely, which is cleaner and more efficient than regular expressions
Storage: DataRecorder → handles Excel writing, file locking and table headers automatically, and supports use from multiple threads or processes
Data alignment: itertools.zip_longest → pads the shorter lists with a fill value so that no row is dropped when some movies lack a short review. Padding goes at the end of the shorter list, so it guards against lost rows rather than guaranteeing row-by-row alignment.
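
A minimal sketch of how zip_longest behaves on page-level lists, using made-up sample values (not real Douban data). It assumes the second movie has no short review, so the review list is one element shorter than the others:

```python
from itertools import zip_longest

# Hypothetical sample data: "Movie B" has no short review,
# so the page-level review list has only two entries.
titles = ["Movie A", "Movie B", "Movie C"]
scores = ["9.7", "9.6", "9.5"]
comments = ["review of A", "review of C"]

# zip_longest keeps going until the longest list is exhausted
# and fills the gaps in shorter lists with fillvalue.
rows = list(zip_longest(titles, scores, comments, fillvalue="无"))

for row in rows:
    print(row)
```

All three movies are kept, but the fill value lands on the last row and "review of C" ends up next to "Movie B". If exact alignment matters, extracting the fields movie by movie is the safer route.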

🧠 2. Core logic of task splitting

The different mindset behind synchronous and concurrent code leads to a very different code structure:

Synchronous flow: "send request → parse data → write file" is strung into a single line. Only one thing happens at a time, the next step starts only when the previous one has finished, and the program waits on I/O the whole way through.
Concurrent/asynchronous flow: the "request + parse" work for each page is wrapped into an independent, atomic task whose only job is to return that page's list of records. When it runs and how it is scheduled is left to the thread pool or the event loop. The main thread only gathers the results and writes them to disk at the end, which keeps the code clean and efficient.

Put simply: synchronous code is one person carrying bricks one at a time; multi-threading is a group of people carrying bricks at once; a coroutine is a single person who switches to another load whenever the current one would make him wait.
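
The task-splitting idea can be sketched without touching the network. In the sketch below, fake_fetch is a hypothetical stand-in for "request + parse" of one page (it just sleeps), and run_serial / run_pooled are illustrative names, not part of the article's crawler:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page_index):
    """Hypothetical stand-in for 'request + parse' of one page."""
    time.sleep(0.05)  # simulates waiting on network I/O
    return [f"page{page_index}-item{k}" for k in range(3)]


def run_serial(pages):
    # One task after another: total time is roughly the sum of all waits.
    return [fake_fetch(i) for i in pages]


def run_pooled(pages, workers=5):
    # The same atomic task, scheduled by a thread pool.
    # executor.map returns results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_fetch, pages))


if __name__ == "__main__":
    pages = range(10)
    t0 = time.perf_counter()
    serial = run_serial(pages)
    t1 = time.perf_counter()
    pooled = run_pooled(pages)
    t2 = time.perf_counter()
    print(f"serial: {t1 - t0:.2f}s, pooled: {t2 - t1:.2f}s")
    print("same results:", serial == pooled)
```

Because the task only returns data and never writes to disk itself, the scheduler can be swapped without changing the task.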

🚀 3. Practical code implementation of the whole solution

Preparation: one-click environment configuration

Open a terminal and run the following command to install all dependencies at once:

pip install requests aiohttp lxml DataRecorder openpyxl

1. Synchronous crawling: the starting point of the crawler

The logic is straightforward. Every step waits on the network or the disk, but this is the easiest version to understand and debug, which makes it ideal for learning the overall flow.

import os, time
from itertools import zip_longest
from lxml import etree
import requests
from DataRecorder import Recorder


def get_excel(mode):
    filename = f'top250_{mode}.xlsx'
    if os.path.exists(filename): os.remove(filename)
    recorder = Recorder(filename)
    recorder.show_msg = False   # disable automatic printing; output is handled in one place
    return recorder


def run_sync():
    recorder = get_excel('同步')
    session = requests.Session()       # reuse connections to cut handshake overhead
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36'}
    for j in range(10):
        url = f'https://movie.douban.com/top250?start={j * 25}'
        # timeout keeps a stalled connection from hanging the whole run
        res = session.get(url, headers=headers, timeout=10).text
        tree = etree.HTML(res)
        # extract the three fields with XPath
        titles = tree.xpath('//ol[@class="grid_view"]//span[@class="title"][1]/text()')
        scores = tree.xpath('//span[@class="rating_num"]/text()')
        comments = tree.xpath('//span[@class="inq"]/text()')
        # zip_longest pads the shorter lists with '无' so that no row is dropped
        for title, score, comment in zip_longest(titles, scores, comments, fillvalue='无'):
            recorder.add_data({
                '电影名': title,
                '评分': score,
                '短评': comment
            })
        recorder.record()   # write once per page (fine for a small job; for large volumes move this outside the loop)
        print(f"已完成第 {j + 1} 页采集")


if __name__ == '__main__':
    start = time.time()
    run_sync()
    print(f'同步爬取耗时: {time.time() - start:.2f}秒')
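
If a movie in the middle of a page has no short review, page-level lists of titles, scores and reviews no longer line up. One possible alternative, sketched below, is to walk each movie's li node and read the fields relative to it. SAMPLE_HTML is a simplified, hypothetical snippet that only mimics the class names used by the XPath expressions in this article; first_text and parse_items are illustrative helper names:

```python
from lxml import etree

# Hypothetical, simplified markup; the real page has more nesting.
SAMPLE_HTML = """
<ol class="grid_view">
  <li><div class="hd"><a><span class="title">Movie A</span></a></div>
      <span class="rating_num">9.7</span><span class="inq">review of A</span></li>
  <li><div class="hd"><a><span class="title">Movie B</span></a></div>
      <span class="rating_num">9.6</span></li>
  <li><div class="hd"><a><span class="title">Movie C</span></a></div>
      <span class="rating_num">9.5</span><span class="inq">review of C</span></li>
</ol>
"""


def first_text(node, xpath, default="无"):
    """Return the first text match of a relative XPath, or a default."""
    found = node.xpath(xpath)
    return found[0].strip() if found else default


def parse_items(html):
    """Build one record per <li>, so a missing field only affects its own row."""
    tree = etree.HTML(html)
    rows = []
    for li in tree.xpath('//ol[@class="grid_view"]/li'):
        rows.append({
            '电影名': first_text(li, './/span[@class="title"][1]/text()'),
            '评分': first_text(li, './/span[@class="rating_num"]/text()'),
            '短评': first_text(li, './/span[@class="inq"]/text()'),
        })
    return rows


if __name__ == "__main__":
    for row in parse_items(SAMPLE_HTML):
        print(row)
```

With this layout the fill value stays on the movie that actually lacks a review, and later rows are unaffected.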

2. Multi-threading solution: the smoothest way to speed things up

A thread pool is the quickest way to upgrade synchronous code. It uses fewer resources than multiple processes, and you can keep using the requests library you already know.

from concurrent.futures import ThreadPoolExecutor
import os, time
from itertools import zip_longest
from lxml import etree
import requests
from DataRecorder import Recorder

def get_excel(mode):
    filename = f'top250_{mode}.xlsx'
    if os.path.exists(filename): os.remove(filename)
    recorder = Recorder(filename)
    recorder.show_msg = False
    return recorder

def fetch_page(page_index):
    """Fetch task for a single page: takes a page index and returns all movie records on that page."""
    url = f'https://movie.douban.com/top250?start={page_index*25}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36',
        'Referer': 'https://movie.douban.com/top250'
    }
    try:
        res = requests.get(url, headers=headers, timeout=10).text
        tree = etree.HTML(res)
        titles = tree.xpath('//ol[@class="grid_view"]//li//div[@class="hd"]/a/span[1]/text()')
        scores = tree.xpath('//span[@class="rating_num"]/text()')
        comments = tree.xpath('//span[@class="inq"]/text()')
        
        page_data = []
        for t, s, c in zip_longest(titles, scores, comments, fillvalue='无'):
            page_data.append({'电影名': t, '评分': s, '短评': c})
        
        print(f"线程已完成第 {page_index + 1} 页抓取")
        return page_data
    except Exception as e:
        print(f"抓取第 {page_index + 1} 页失败: {e}")
        return []

if __name__ == '__main__':
    recorder = get_excel('多线程')
    start = time.time()
    
    # Thread pool: fetch pages concurrently; map returns results in the same order as the tasks
    with ThreadPoolExecutor(max_workers=5) as executor:
        all_results = list(executor.map(fetch_page, range(10)))
    
    # write all data to disk in one go to avoid frequent I/O
    for page_data in all_results:
        for item in page_data:
            recorder.add_data(item)
    recorder.record()
    
    print(f'\n全部完成！')
    print(f'多线程耗时: {time.time() - start:.2f}秒')

3. Coroutine solution: maximum performance from a single thread

Coroutines are the best fit for I/O-bound tasks: a single thread can schedule hundreds or thousands of requests with very little CPU and memory overhead. The catch is that they need an asynchronous HTTP library such as aiohttp.

import asyncio
import aiohttp
import os, time
from itertools import zip_longest
from lxml import etree
from DataRecorder import Recorder

def get_excel(mode):
    filename = f'top250_{mode}.xlsx'
    if os.path.exists(filename): os.remove(filename)
    recorder = Recorder(filename)
    recorder.show_msg = False
    return recorder

async def fetch_async(page_index, session):
    """Fetch a single page asynchronously."""
    url = f'https://movie.douban.com/top250?start={page_index*25}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36',
        'Referer': 'https://movie.douban.com/top250'
    }
    try:
        async with session.get(url, headers=headers) as resp:
            html = await resp.text()
        tree = etree.HTML(html)
        titles = tree.xpath('//ol[@class="grid_view"]//li//div[@class="hd"]/a/span[1]/text()')
        scores = tree.xpath('//span[@class="rating_num"]/text()')
        comments = tree.xpath('//span[@class="inq"]/text()')
        
        page_data = []
        for t, s, c in zip_longest(titles, scores, comments, fillvalue='无'):
            page_data.append({'电影名': t, '评分': s, '短评': c})
        return page_data
    except Exception as e:
        print(f"协程抓取第 {page_index + 1} 页失败: {e}")
        return []

async def main_async():
    recorder = get_excel('协程')
    start = time.time()
    # a single ClientSession serves all requests and reuses TCP connections
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_async(i, session) for i in range(10)]
        all_results = await asyncio.gather(*tasks)   # run all tasks concurrently
    
    # gather the results and write them out
    for page_data in all_results:
        for item in page_data:
            recorder.add_data(item)
    recorder.record()
    print(f'协程爬取耗时: {time.time() - start:.2f}秒')

if __name__ == '__main__':
    asyncio.run(main_async())
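
The coroutine version starts all ten page requests at once. With more pages, it may be worth capping how many run at the same time. A minimal sketch using asyncio.Semaphore follows; fake_fetch is a hypothetical stand-in for fetch_async that sleeps instead of making an HTTP request, and the stats dictionary exists only to show the cap working:

```python
import asyncio


async def fake_fetch(page_index, semaphore, stats):
    """Hypothetical stand-in for fetch_async; sleeps instead of doing HTTP."""
    async with semaphore:  # at most `limit` coroutines pass this point at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.05)  # simulates waiting for the response
        stats["active"] -= 1
    return page_index


async def crawl(pages, limit=5):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # gather returns results in the order the awaitables were passed in
    results = await asyncio.gather(
        *(fake_fetch(i, semaphore, stats) for i in pages)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(crawl(range(10), limit=5))
    print("results:", results)
    print("peak concurrency:", peak)
```

In the real crawler the same pattern would wrap the session.get call inside fetch_async, with the semaphore created once in main_async and passed to every task.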

📊 4. Selection and optimization suggestions

1. Solution selection guide

Scenario	Recommended solution	Reason
Beginner exercises / fewer than 1,000 records	Synchronous	Simple logic, no concurrency-safety issues, easy to debug with breakpoints
Medium to large projects dominated by network I/O	Coroutines	A single thread handles very high concurrency with very low resource usage
Heavy computation or encryption/decryption work	Multiple processes	Sidesteps the Python GIL and makes use of all CPU cores
Legacy `requests` code you do not want to rewrite	Multi-threading	Small changes, quick results, steady speed-up

2. Pitfall avoidance and optimization suggestions

Control concurrency: Sites such as Douban limit request frequency. Keep the number of threads or concurrent coroutines between 5 and 10; going higher makes a ban much more likely.
Write to disk in batches: Do not call record() for every item inside the loop. Collect all the data first and write it in one go, which can cut disk I/O by more than 90%.
Reuse session objects: requests.Session() and aiohttp.ClientSession() reuse the underlying TCP connection, which can speed up requests by roughly 30%. Do not create a new session for every request.
Handle exceptions: Network fluctuations and anti-crawling blocks can make a single page fail, so wrap each page task in try/except as a safety net; otherwise one bad page can ruin the whole run.
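
Beyond catching the exception, a failed page can also be retried a few times before giving up. The helper below is a rough sketch, not part of the original crawler: with_retry, its parameters and the doubling delay are all assumptions chosen for illustration, and the sleep argument exists only so the delay can be replaced in tests.

```python
import time


def with_retry(task, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call task() up to `retries` times, doubling the delay after each failure.

    Returns the task's result, or None if every attempt raised.
    """
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                print(f"giving up after {attempt} attempts: {exc}")
                return None
            # delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Hypothetical task that fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated network error")
        return "ok"

    print(with_retry(flaky, base_delay=0.1))
```

The retry only triggers when the task raises. A page function that catches its own exceptions and returns an empty list, as the ones in this article do, would need to let the error propagate (or be checked for an empty result) for this helper to be useful.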
