Python crawler practice: PyQuery + MongoDB guide

1. Preface

When you first get started with crawlers, three things tend to be the most frustrating:

  • Pagination stalls the crawl: 10 pages of data can take half an hour to collect, with frequent errors along the way;
  • Repeated runs leave hundreds of duplicate movies in MongoDB, and the data becomes too dirty to use;
  • You want to extract the "release date" from the HTML, but the tag has neither an id nor a class, so locating it comes down to guesswork and the code breaks as soon as the site is redesigned.

This article uses a lightweight movie-data crawler prototype, one that still covers the basic logic of a production-grade crawler, to tackle all three problems at once:

  • Use PyQuery to parse HTML quickly, in a jQuery-like style;
  • Use MongoDB's upsert mechanism to deduplicate and update automatically;
  • Use Python's built-in multi-process pool to sidestep the GIL and collect pages concurrently.

The complete code is no more than 150 lines, but the pattern can be reused directly in your own projects.


2. Preparation

Set up the environment first, so you are not chasing setup errors halfway through writing the code.

Dependency library installation

# Three core libraries: HTTP requests, HTML parsing, NoSQL storage
pip install requests pyquery pymongo

MongoDB startup

After installing MongoDB locally, start the service from the command line; a visual tool (such as MongoDB Compass) is handy for confirming that it is reachable. The default listening port is 27017, and the code below connects directly to mongodb://localhost:27017.
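
Before running the crawler, it can help to confirm that something is actually listening on that port. The sketch below uses only the standard library; the helper name is_port_open is made up for illustration, and a successful TCP connection only suggests that MongoDB is up, it does not prove it.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False

if __name__ == "__main__":
    print("MongoDB port reachable:", is_port_open("localhost", 27017))
```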


3. Complete code + module disassembly

The complete code follows, with detailed comments, organized in the order "Configuration → General Request → Index Page Processing → Detail Page Processing → Data Storage → Concurrent Scheduling". Skim the whole thing first, then focus on the module breakdown afterwards.

import requests
import logging
import re
import pymongo
from pyquery import PyQuery as pq
from urllib.parse import urljoin
import multiprocessing

# ======================== 1. Global configuration ========================
# Logging: timestamp and level make crawl/parse problems easier to locate
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

# Constants (everything adjustable lives here; to target another site, change only these)
BASE_URL = 'https://ssr1.scrape.center'  # Teacher Cui's public crawler test site
TOTAL_PAGE = 10                             # total number of pages to crawl
MONGO_URI = 'mongodb://localhost:27017'    # MongoDB connection URI
MONGO_DB = 'scrape_movies'                  # database name
MONGO_COL = 'movies'                         # collection name

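# ======================== 2. MongoDB initialization ========================
# Caveat: PyMongo documents MongoClient as not fork-safe, so with multiprocessing
# it is safer to create the client inside each worker process.
client = pymongo.MongoClient(MONGO_URI)
db = client[MONGO_DB]
collection = db[MONGO_COL]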

# ======================== 3. Generic request module ========================
def scrape_page(url):
    """
    Generic single-page fetch: sets a timeout and catches exceptions.
    Returns the HTML text on success, otherwise None.
    """
    logging.info('🚀 Start scraping: %s', url)
    try:
        # Always set a timeout so one stalled page cannot hang the whole process
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            return resp.text
        logging.error('❌ Unexpected status code: %d, URL: %s', resp.status_code, url)
    except requests.RequestException as e:
        logging.error('❌ Request failed: %s, URL: %s', e, url)
    return None

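# ======================== 4. List (index) page handling ========================
def scrape_index(page_num):
    """Fetch the list page for the given page number via the generic request function."""
    index_url = f"{BASE_URL}/page/{page_num}"
    return scrape_page(index_url)

def parse_index(html):
    """
    Parse a list page: use PyQuery CSS selectors to extract every detail-page URL.
    Yields URLs lazily (generator) instead of building the full list in memory.
    """
    doc = pq(html)
    # Locate the title link inside each movie card (the test site's structure is clean)
    link_nodes = doc('.el-card .name')
    for node in link_nodes.items():
        # Convert the relative path into an absolute URL
        detail_url = urljoin(BASE_URL, node.attr('href'))
        logging.info('📍 Found detail page: %s', detail_url)
        yield detail_url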

# ======================== 5. Detail page handling ========================
def scrape_detail(url):
    """Fetch a detail page by reusing the generic request function."""
    return scrape_page(url)

def parse_detail(html):
    """
    Parse a detail page: clean the fields and return them as a structured dict.
    Key points: handle missing values; use :contains() to locate elements with no class/id.
    """
    doc = pq(html)
    # 1. Cover image URL
    cover = doc('img.cover').attr('src')
    # 2. Movie name
    name = doc('a > h2').text()
    # 3. Category list
    categories = [item.text() for item in doc('.categories button span').items()]
    # 4. Release date (some movies have none, and its position is not fixed)
    published_info = doc('.info:contains(上映)').text()  # 上映 = "released"; :contains comes from jQuery, not standard CSS
    published_at = None
    if published_info:
        match = re.search(r'\d{4}-\d{2}-\d{2}', published_info)
        # The pattern has no capture group, so use group() (whole match), not group(1)
        published_at = match.group() if match else None
    # 5. Plot summary
    drama = doc('.drama p').text()
    # 6. Score (converted to float for later sorting/analysis)
    score = doc('p.score').text()
    score = float(score.strip()) if score and score.strip() else None
    return {
        'cover': cover,
        'name': name,
        'categories': categories,
        'published_at': published_at,
        'drama': drama,
        'score': score
    }

# ======================== 6. Data storage module ========================
def save_movie(data):
    """
    Store movie data using MongoDB upsert:
    - the movie name serves as the unique identifier;
    - update if it exists, insert if it does not, so reruns create no duplicates.
    """
    if not data:
        return
    result = collection.update_one(
        {'name': data['name']},
        {'$set': data},
        upsert=True
    )
    if result.upserted_id:
        logging.info('✅ Inserted movie: %s', data['name'])
    else:
        logging.info('🔄 Updated movie: %s', data['name'])

# ======================== 7. Per-page workflow + multi-process scheduling ========================
def process_page(page_num):
    """
    Complete workflow for one page:
    fetch list page → extract detail-page URLs → fetch each detail page → parse → store
    """
    index_html = scrape_index(page_num)
    if not index_html:
        return
    detail_urls = parse_index(index_html)
    for url in detail_urls:
        detail_html = scrape_detail(url)
        if not detail_html:
            continue
        movie_data = parse_detail(detail_html)
        save_movie(movie_data)

if __name__ == '__main__':
    """
    Multi-process entry point; it must live inside if __name__ == '__main__'!
    Pool() defaults to one process per CPU core; crawling is I/O-bound, so more can be used.
    """
    logging.info('🎬 Movie crawler starting!')
    pool = multiprocessing.Pool(processes=5)  # the test site is small, 5 processes are enough
    pages = range(1, TOTAL_PAGE + 1)
    pool.map(process_page, pages)  # passes each page number to process_page
    pool.close()  # close the pool; no new tasks are accepted
    pool.join()   # wait for all worker processes to finish
    logging.info('🎉 All tasks finished!')

Code Reading Tips: The code above can be copied and run as-is (provided MongoDB is running and the target test site is reachable). Below we pick out several key modules and look closely at their design and common pitfalls.
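
One small detail from parse_index that is easy to check in isolation is how urljoin resolves href values against BASE_URL. The helper name to_absolute and the sample paths below are illustrative assumptions, not values copied from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'https://ssr1.scrape.center'

def to_absolute(href: str) -> str:
    """Resolve an href found on a list page against the site root."""
    return urljoin(BASE_URL, href)

# A root-relative path is attached to the site root
print(to_absolute('/detail/1'))
# An href that is already absolute is returned unchanged
print(to_absolute('https://example.com/x'))
```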


4. Core technology highlights

This code looks simple, but each module reflects good practice for crawler development. Let's discuss the three most important points in detail.

🎨 PyQuery's :contains() selector

When parsing HTML, the biggest headache is an element with no id or class, which can only be located by its text content (such as the "release" information in this example). The traditional approach is to count tag positions or write complex XPath, and the code breaks as soon as the page structure is tweaked.

PyQuery carries over jQuery's :contains() syntax, which solves the problem in one line:

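# ❌ Not recommended: depends on a fixed position and breaks when the page changes
# doc('.info').eq(3).text()

# ✅ Recommended: locate dynamically by text content (上映 = "released")
published_info = doc('.info:contains(上映)').text()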

Usage suggestion: :contains() handles the locating step, and a regular expression can then do the final extraction, which makes the combination well suited to semi-structured fields.
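
The second stage, pulling the date out of the located text, can be sketched on its own with the standard library. The helper name extract_date and the sample strings are assumptions for illustration; the real text returned by the selector may look different.

```python
import re
from typing import Optional

DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_date(text: Optional[str]) -> Optional[str]:
    """Return the first YYYY-MM-DD date found in text, or None."""
    match = DATE_PATTERN.search(text or '')
    # group() with no argument returns the whole match; the pattern
    # has no capture groups, so group(1) would raise IndexError.
    return match.group() if match else None

# Hypothetical strings of the kind the selector might return
print(extract_date('1993-07-26 上映'))
print(extract_date('上映'))
```

Returning None for a missing date keeps the field explicit in the stored document instead of failing the whole page.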

🛡️ MongoDB’s upsert deduplication/update

The insert_one() call that beginners usually reach for has two big pitfalls:

  1. If the collection has a unique index, inserting the same key again raises an error;
  2. If there is no unique index, every run stuffs in duplicate data, and the database fills with garbage within a few days.

update_one(..., upsert=True) avoids both:

  • First, use {'name': data['name']} as the query filter (the movie name serves as a natural unique key);
  • If a match is found, $set updates the supplied fields (for example, a changed rating is synchronized automatically);
  • If no match is found, a new document is inserted.

Whether it is the first crawl or an incremental update, the same function handles both.
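
To make the filter/update pair concrete, here is a minimal sketch that only builds the two arguments, so it runs without a database. The helper build_upsert and the created_at/updated_at fields are additions for illustration and are not part of the article's code; $setOnInsert is MongoDB's operator for fields written only when the upsert inserts a new document.

```python
from datetime import datetime, timezone

def build_upsert(movie: dict, now: datetime) -> tuple:
    """Build the (filter, update) pair for update_one(..., upsert=True)."""
    flt = {'name': movie['name']}
    update = {
        '$set': {**movie, 'updated_at': now},   # refreshed on every run
        '$setOnInsert': {'created_at': now},    # written only on insert
    }
    return flt, update

# Hypothetical movie record, for illustration only
flt, update = build_upsert({'name': 'Example Movie', 'score': 9.5},
                           datetime(2024, 1, 1, tzinfo=timezone.utc))
# With a real collection this would be:
# collection.update_one(flt, update, upsert=True)
print(flt)
print(update)
```

If several workers might upsert the same name at the same moment, a unique index on name is the usual safeguard against racing inserts.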

⚡ Multi-process pool scheduling, breaking through GIL limitations

Because of Python's GIL, only one thread per process executes Python bytecode at any moment, so a single process cannot spread CPU work such as HTML parsing across cores. A crawler, however, spends roughly 90% of its time waiting for network responses, which is I/O blocking, and the CPU is mostly idle.

This is where multiprocessing.Pool() comes in handy:

  • Each process has its own GIL, so it can genuinely use multiple CPU cores;
  • While process A waits for a list page to return, the CPU can switch to process B to parse a detail page;
  • While process B waits for a detail page to return, process C can save data and issue new requests.

Hand process_page to the process pool and use pool.map() to process all page numbers in parallel; the total time can drop to 1/4 of the original or even less (depending on the target site's response speed and anti-crawling strategy).
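
For comparison, because most of the time goes to waiting on the network, a thread pool is a commonly used alternative for this kind of workload: CPython releases the GIL while a thread blocks on I/O. The sketch below swaps in concurrent.futures.ThreadPoolExecutor and a made-up fake_fetch that merely sleeps, so it illustrates the scheduling pattern and is not the article's crawler.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page_num: int) -> str:
    """Hypothetical stand-in for a network request: just sleeps briefly."""
    time.sleep(0.05)
    return f'page-{page_num}'

def crawl_pages(pages, workers: int = 5) -> list:
    """Run fake_fetch over pages concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_fetch, pages))

if __name__ == '__main__':
    start = time.perf_counter()
    results = crawl_pages(range(1, 11))
    print(results, f'{time.perf_counter() - start:.2f}s')
```

Processes remain the better fit when parsing dominates the run time; threads are lighter when waiting dominates.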


5. Extensible direction (advanced)

The above prototype is enough for basic data collection, but in an actual production environment, you can continue to add modules:

  1. Disguise identity: maintain a User-Agent pool and rotate Referers randomly to reduce the chance of being identified as a crawler.
  2. Anti-crawling countermeasures: plug in a proxy IP pool and handle CAPTCHAs (such as image CAPTCHAs and slider verification, using Selenium or Playwright).
  3. Incremental crawling: record the last crawled page number or timestamp and crawl only new or updated content, instead of re-scanning from the first page.
  4. Error retry: use the tenacity library to add automatic retries to the request function, so network hiccups no longer require a manual rerun.
  5. Data validation: use pydantic to define a data model and validate each field's type and format before it enters the database, so dirty data has nowhere to hide.
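
As an illustration of the error-retry item, the sketch below shows the idea in plain Python instead of with tenacity, so it is a stand-in for the pattern and not that library's API. The names retry_call, attempts, and base_delay are invented for this example.

```python
import time

def retry_call(func, attempts: int = 3, base_delay: float = 0.5,
               exceptions: tuple = (Exception,)):
    """Call func(); on failure wait base_delay * 2**n, then try again.

    Re-raises the last exception once all attempts are used up.
    """
    for n in range(attempts):
        try:
            return func()
        except exceptions:
            if n == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** n)

# Usage with the article's request call might look like:
# html = retry_call(lambda: requests.get(url, timeout=10).text,
#                   exceptions=(requests.RequestException,))
```

The growing delay gives a struggling server time to recover instead of hammering it with immediate retries.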

Tips: Crawling public data is generally legal, but be sure to obey the target website's robots.txt rules, limit concurrency, and avoid putting excessive pressure on the server. Do not crawl private or commercially sensitive data.
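
Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses a hypothetical robots.txt given as a list of lines, so it runs offline; the rules, the user-agent name my-crawler, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_lines):
    """Build a parser from robots.txt content given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = build_parser(SAMPLE_ROBOTS)
print(parser.can_fetch('my-crawler', 'https://example.com/page/1'))
print(parser.can_fetch('my-crawler', 'https://example.com/private/data'))
```

Against a live site, the same class can load the file itself via set_url() followed by read() before can_fetch() is called.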