Python crawler practice: PyQuery + MongoDB guide
1. Preface
When you first get started with crawlers, the three most frustrating things are often:
- The crawl stalls as soon as it turns the page: 10 pages of data can take half an hour to fetch, with frequent errors along the way;
- After a few repeated runs, MongoDB holds hundreds of duplicate movies and the data is too dirty to use;
- You want to extract the "release time" from the HTML, but the tag has neither an id nor a class. Locating it is pure guesswork, and the code breaks as soon as the site is redesigned.
This article uses a lightweight movie-data crawler prototype, one that still covers the basic logic of an industrial-grade crawler, to solve all three problems at once:
- Use PyQuery to parse HTML with jQuery-style selectors;
- Use MongoDB's upsert mechanism to deduplicate and update automatically;
- Use Python's built-in multi-process pool to bypass the GIL and collect pages concurrently.
The whole program is under 150 lines, but the pattern can be reused directly in your own projects.
2. Preparation
Set up the environment first, so you are not chasing installation errors halfway through writing the code.
Dependency library installation
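Before going further, it helps to confirm that the third-party libraries can actually be imported. The check below is a minimal sketch: the package list (`requests`, `pyquery`, `pymongo`) is an assumption based on the tools this article names, and `missing_packages` is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Assumed dependency list: the article names PyQuery and MongoDB;
# `requests` is a guess at the HTTP client being used.
REQUIRED = ["requests", "pyquery", "pymongo"]


def missing_packages(names):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are importable.")
```

If anything is reported missing, install it with pip before moving on to the database setup.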
MongoDB startup
After installing MongoDB locally, start the service from the command line or with a visual tool (such as MongoDB Compass). The default listening port is 27017; later we will connect directly to mongodb://localhost:27017.
3. Complete code + module disassembly
The following is the complete code with detailed comments. It runs in the order "Configuration → General Request → Index Page Processing → Detail Page Processing → Data Storage → Concurrent Scheduling". Skim the whole thing first, then come back to the individual modules broken down later.
Code reading tips: the above code can be copied and run directly, provided that MongoDB is started and the target test site is reachable. Below we pick several key modules and look closely at their design and common pitfalls.
4. Core technology highlights
This set of code seems simple, but each module hides the best practices for crawler development. Let’s pick the 3 most important ones to discuss in detail.
🎨 PyQuery :contains() selector
When parsing HTML, the biggest headache is that there is no id/class, and you can only rely on text content to locate elements (such as the "release" information in this example). The traditional approach is to count the tag order or write complex XPath. Once the page structure is fine-tuned, the code will be completely useless.
PyQuery carries over jQuery's :contains() syntax, so an element can be located by its text content in a single line of code.
Usage tip: :contains() is not only good for locating elements. Combine it with a regular expression for the final extraction, and it handles semi-structured fields very well.
🛡️ MongoDB’s upsert deduplication/update
The insert_one() method that beginners usually reach for has two big pitfalls:
- If the collection has a unique index, an error will be reported if the same key is inserted repeatedly;
- If there is no unique index, duplicate data will be stuffed into it every time it is run, and the database will be full of garbage after a few days.
update_one(..., upsert=True) avoids both:
- It first uses {'name': data['name']} as the search condition (the natural unique key);
- If a match is found, $set updates all fields (for example, a changed rating is synchronized automatically);
- If no match is found, a new document is inserted.
Whether it is the first crawl or an incremental update, the same function does the job.
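The storage step could look roughly like the sketch below. The function names (`get_collection`, `save_data`), the database and collection names, and the sample fields are assumptions for illustration; `update_one`, `create_index`, and `upserted_id` are standard pymongo calls.

```python
def get_collection(uri="mongodb://localhost:27017", db_name="movies", coll_name="movies"):
    """Connect to MongoDB and return the target collection (names are assumed)."""
    # Imported here so the rest of the sketch can be read and tested
    # without the driver installed.
    from pymongo import MongoClient

    collection = MongoClient(uri)[db_name][coll_name]
    # Optional safety net: a unique index on the natural key.
    collection.create_index("name", unique=True)
    return collection


def save_data(collection, data):
    """Insert the movie if its name is new, otherwise update the stored fields.

    Returns True when a new document was inserted, False when an existing
    one was matched.
    """
    result = collection.update_one(
        {"name": data["name"]},  # search condition: the natural unique key
        {"$set": data},          # fields written on both update and insert
        upsert=True,
    )
    return result.upserted_id is not None


if __name__ == "__main__":
    movies = get_collection()
    print(save_data(movies, {"name": "Example Movie", "score": 9.5}))
    print(save_data(movies, {"name": "Example Movie", "score": 9.6}))
```

Passing the collection in as a parameter keeps `save_data` free of connection details, so the same function works against a test double or a different database.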
⚡ Multi-process pool scheduling, breaking through GIL limitations
Because of Python's GIL, only one thread in a process can execute Python bytecode at a time, so a single process never uses more than one CPU core for Python code. A crawler, however, spends about 90% of its time waiting for network responses; during this I/O blocking the CPU is actually idle.
This is where the multiprocessing.Pool() process pool comes in handy:
- Each process has its own GIL, which can truly utilize multi-core CPUs;
- While process A is waiting for the list page to return, the CPU can switch to process B to parse the details page;
- While process B is waiting for the details page to return, process C can save data and initiate new requests.
Hand process_page to the process pool and let pool.map() process all page numbers in parallel. The total time can drop to roughly a quarter of the sequential run or even lower, depending on the number of worker processes, the response speed of the target website, and its anti-crawling strategy.
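The scheduling pattern can be sketched as follows. Note that this `process_page` is only a stand-in that sleeps to imitate network latency; the page count and the number of worker processes are hypothetical values chosen for the demo.

```python
import time
from multiprocessing import Pool

TOTAL_PAGES = 8  # hypothetical number of list pages


def process_page(page):
    """Stand-in for the real page handler: pretend to fetch and parse one page."""
    time.sleep(0.2)  # imitates waiting for the network response
    return page, f"page {page} done"


def crawl(pages, workers=4):
    """Distribute the page numbers over a pool of worker processes."""
    with Pool(processes=workers) as pool:
        # map() blocks until every page is processed and keeps input order.
        return pool.map(process_page, pages)


if __name__ == "__main__":
    start = time.perf_counter()
    results = crawl(range(1, TOTAL_PAGES + 1))
    print(f"{len(results)} pages in {time.perf_counter() - start:.2f}s")
```

The `if __name__ == "__main__":` guard is required on platforms that start workers by re-importing the main module (Windows and macOS by default); without it, each worker would try to create its own pool.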
5. Extensible direction (advanced)
The above prototype is enough for basic data collection, but in an actual production environment, you can continue to add modules:
- Disguise identity: Maintain the User-Agent pool and randomly switch Referers to reduce the probability of being identified as a crawler.
- Anti-crawling countermeasures: plug in a proxy IP pool and handle CAPTCHAs (such as image CAPTCHAs and slider verification) with Selenium or Playwright.
- Incremental crawling: Record the last crawled page number or timestamp, and only crawl new/updated content to avoid re-scanning from the first page.
- Error retry: use the tenacity library to add an automatic retry mechanism to the request function, so that network hiccups no longer force a manual rerun.
- Data verification: use pydantic to define a data model and automatically validate the type and format of each field before it enters the database, so dirty data has nowhere to hide.
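To make the retry idea concrete, here is a hand-rolled decorator built only on the standard library. It is a simplified stand-in for what tenacity provides, not tenacity's API; the decorator name, its parameters, and the `fetch` function are all hypothetical.

```python
import functools
import random
import time


def retry(max_attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Retry the wrapped function with exponential backoff plus jitter."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the error
                    # Wait 1x, 2x, 4x ... the base delay, plus random jitter.
                    delay = base_delay * 2 ** (attempt - 1)
                    time.sleep(delay + random.uniform(0, delay / 2))

        return wrapper

    return decorator


@retry(max_attempts=3, base_delay=0.5, exceptions=(ConnectionError, TimeoutError))
def fetch(url):
    """Hypothetical request function; replace the body with a real HTTP call."""
    raise ConnectionError(f"could not reach {url}")
```

Listing the exception types explicitly keeps the decorator from retrying on programming errors such as a `KeyError` in the parsing code, which a retry would never fix.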
Tips: crawling publicly available data is generally permissible, but be sure to obey the robots.txt rules of the target website, limit your concurrency, and avoid putting excessive pressure on the server. Do not crawl private or commercially sensitive data.
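Checking robots.txt does not have to be manual: the standard library ships a parser for it. The sketch below parses a made-up rule set offline; the crawler name, the sample rules, and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-movie-crawler"  # hypothetical crawler name

# Made-up robots.txt content, parsed offline for the demo.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()


def build_parser(robots_lines):
    """Create a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def allowed(parser, url, user_agent=USER_AGENT):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS)
    for url in ("https://example.com/page/1", "https://example.com/private/x"):
        print(url, allowed(rules, url))
```

Against a live site you would instead call `set_url()` with the site's robots.txt address followed by `read()`, then run the same `can_fetch()` check before each request.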

