title: Efficient and practical MongoDB document storage
description: Python3 crawler data storage: MongoDB operation guide
Python3 crawler data storage: MongoDB operation guide
In crawler projects, we often encounter unstructured or semi-structured web page data: nested comment lists, dynamically changing fields, and pages whose fields differ widely from one another. In these cases, a traditional relational database can feel constraining. As a representative document database, MongoDB is well suited to crawler data storage thanks to its flexible schema design and JSON-like data structure.
This article starts from environment setup and walks through the core skills of operating MongoDB in Python3: basic CRUD, advanced queries, index-based deduplication, and aggregation statistics. It also offers best practices for crawler scenarios that you can apply directly in real projects.
1. Introduction to NoSQL and MongoDB
Advantages of NoSQL databases
NoSQL (Not Only SQL) databases are designed for large data volumes, high concurrency, and flexible data models. Several of their features suit crawler scenarios well:
- No strict schema: newly captured fields can be written directly, without modifying a table structure in advance;
- Nested data support: complex structures such as lists and objects can be stored directly, like JSON, matching the nested content of web pages;
- High read and write performance: writes are fast, which suits crawlers that insert into the database frequently.
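Common NoSQL categories
- Key-value stores, for example Redis;
- Document databases, for example MongoDB;
- Wide-column stores, for example Cassandra and HBase;
- Graph databases, for example Neo4j.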
What is MongoDB?
MongoDB is an open source document database written in C++ that:
- Uses BSON (binary JSON) to store data, natively supporting nested documents and arrays;
- Has a distributed architecture and scales horizontally with ease;
- Offers a query syntax close to JavaScript objects, which is also friendly to Python developers and easy to use through PyMongo.
2. Environment preparation
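Install MongoDB (Docker one-command startup recommended)
If no MongoDB service is available locally, Docker is the fastest way to get one, and it is especially suitable for development and testing. For example, docker run -d --name mongodb -p 27017:27017 mongo starts the official image on the default port 27017.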
Install PyMongo driver
PyMongo is published on PyPI and can be installed with pip install pymongo.
3. Basic connection operations
Connect to MongoDB
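A minimal connection sketch, assuming a MongoDB server on localhost at the default port 27017. The helper names build_mongo_uri and connect are hypothetical and chosen for illustration; MongoClient, serverSelectionTimeoutMS and the ping command are part of PyMongo.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, username=None, password=None):
    """Build a mongodb:// URI, percent-escaping credentials when given."""
    if username and password:
        user = quote_plus(username)
        pwd = quote_plus(password)
        return f"mongodb://{user}:{pwd}@{host}:{port}"
    return f"mongodb://{host}:{port}"


def connect(uri):
    """Create a client and check that the server answers."""
    from pymongo import MongoClient  # requires the pymongo package

    # Fail after 3 seconds instead of the much longer default timeout.
    client = MongoClient(uri, serverSelectionTimeoutMS=3000)
    client.admin.command("ping")  # raises if the server cannot be reached
    return client


# Usage (needs a running server):
# client = connect(build_mongo_uri())
```

Escaping the credentials matters when a password contains characters such as @ or /, which would otherwise break the URI.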
Select database and collection
MongoDB is organized as client → database → collection → document, which corresponds to connection → database → table → row in a relational database.
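The hierarchy can be walked with plain subscript access on a client. This is a small sketch; the database name crawler and the collection name articles are assumptions made for illustration. MongoDB creates both lazily on the first write, so nothing has to be declared in advance.

```python
def get_collection(client, db_name="crawler", coll_name="articles"):
    """Return the collection handle client[db_name][coll_name]."""
    db = client[db_name]      # database handle
    return db[coll_name]      # collection handle


# Usage, given a MongoClient instance named client:
# articles = get_collection(client)
```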
4. Core CRUD operations
Insert data: single or batch
MongoDB automatically generates an _id field that serves as the primary key; you can also specify it yourself.
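Both insert styles might look like the sketch below. The helper names are hypothetical; insert_one and insert_many are PyMongo collection methods, and collection is assumed to be a PyMongo collection handle.

```python
def save_article(collection, article):
    """Insert one document and return its _id (generated if absent)."""
    result = collection.insert_one(article)
    return result.inserted_id


def save_articles(collection, articles):
    """Insert many documents in one round trip and return their _id values."""
    articles = list(articles)
    if not articles:
        return []  # insert_many rejects an empty list
    # ordered=False lets the server keep going after a failed document.
    result = collection.insert_many(articles, ordered=False)
    return result.inserted_ids


# Usage:
# save_article(articles, {"url": "https://example.com/a/1", "title": "Hello"})
```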
Query data: flexible filtering
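A short sketch of the two basic query calls, find_one and find. The field names url, author and title are assumed for illustration and are not taken from a real schema.

```python
def find_by_url(collection, url):
    """Return the first document with this URL, or None if there is none."""
    return collection.find_one({"url": url})


def find_titles_by_author(collection, author):
    """Return the titles of all documents written by the given author."""
    # The second argument is a projection: drop _id, keep only title.
    cursor = collection.find({"author": author}, {"_id": 0, "title": 1})
    return [doc["title"] for doc in cursor]
```

find returns a lazy cursor, so documents are fetched from the server only as the loop consumes them.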
Update data: partial updates are more efficient
Use $set for partial updates to avoid overwriting the entire document.
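A partial update could be sketched as follows. The field names url and views and both helper names are assumptions; update_one, $set and the upsert flag are standard PyMongo and MongoDB features.

```python
def update_views(collection, url, views):
    """Change only the views field of the document with this URL."""
    result = collection.update_one({"url": url}, {"$set": {"views": views}})
    return result.matched_count, result.modified_count


def upsert_article(collection, article):
    """Update the document with this URL, or insert it if it is missing."""
    result = collection.update_one(
        {"url": article["url"]},
        {"$set": article},
        upsert=True,
    )
    # upserted_id is None when an existing document was updated.
    return result.upserted_id
```

Fields that are not named under $set stay untouched, which is what makes this cheaper and safer than replacing the whole document.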
Delete data
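Deletion follows the same filter style as queries. In this sketch the field crawled_at is an assumed timestamp field, and the helper names are hypothetical.

```python
def delete_by_url(collection, url):
    """Delete at most one document with this URL; return how many were removed."""
    return collection.delete_one({"url": url}).deleted_count


def delete_older_than(collection, cutoff):
    """Delete every document crawled before cutoff; return how many were removed."""
    result = collection.delete_many({"crawled_at": {"$lt": cutoff}})
    return result.deleted_count
```

Note that delete_many with an empty filter removes every document in the collection, so it is worth double-checking filters before running them.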
5. Advanced queries and practical skills
Commonly used comparison operators
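Comparison operators such as $gt, $gte, $lt, $ne and $in are written as nested dictionaries inside the filter. The builders below are a sketch; the field names views, tags, author and crawled_at are assumed for illustration.

```python
def views_between(low, high):
    """Match documents with low <= views < high."""
    return {"views": {"$gte": low, "$lt": high}}


def has_any_tag(tags):
    """Match documents whose tags array contains at least one of the given tags."""
    return {"tags": {"$in": list(tags)}}


def not_by_author(author):
    """Match documents whose author differs from the given one."""
    return {"author": {"$ne": author}}


def popular_since(min_views, since):
    """Several conditions in one filter are combined with an implicit AND."""
    return {"views": {"$gt": min_views}, "crawled_at": {"$gte": since}}


# Usage: collection.find(views_between(100, 1000))
```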
Counting, sorting and paging
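Counting, sorting and paging can be combined into one helper. This is a sketch under the assumption that documents carry a crawled_at field to sort on; fetch_page is a hypothetical name, while count_documents, sort, skip and limit are PyMongo methods.

```python
def fetch_page(collection, query, page, page_size=20):
    """Return (total matches, documents on the given 1-based page), newest first."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    total = collection.count_documents(query)
    cursor = (
        collection.find(query)
        .sort("crawled_at", -1)          # -1 means descending
        .skip((page - 1) * page_size)
        .limit(page_size)
    )
    return total, list(cursor)
```

skip has to walk past all skipped documents on the server, so for very deep pages a range filter on the sort field is usually the cheaper design.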
6. Index management: a powerful tool for crawler deduplication and speed-up
Indexes can greatly improve query speed. A unique index in particular is a common way for crawlers to deduplicate URLs.
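URL deduplication with a unique index might be sketched like this. It assumes each document has a url field; the helper names are hypothetical, while create_index and DuplicateKeyError come from PyMongo.

```python
def ensure_url_index(collection):
    """Create a unique index on url; calling it again is harmless."""
    return collection.create_index("url", unique=True)


def insert_if_new(collection, doc):
    """Insert doc and return True, or return False if its URL already exists."""
    from pymongo.errors import DuplicateKeyError  # requires the pymongo package

    try:
        collection.insert_one(doc)
        return True
    except DuplicateKeyError:
        # The unique index rejected a URL that is already stored.
        return False
```

With this design the database enforces uniqueness, so the crawler does not need a separate lookup before every insert.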
7. Simple aggregation: data statistics
An aggregation pipeline processes data in multiple stages. The following takes tag distribution statistics as an example:
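A tag distribution pipeline could be written as below, assuming each document stores its tags in an array field named tags. The function names are hypothetical; $unwind, $group, $sort and $limit are standard aggregation stages.

```python
def tag_distribution_pipeline(top_n=10):
    """Build a pipeline that counts documents per tag, most frequent first."""
    return [
        {"$unwind": "$tags"},                                  # one document per tag
        {"$group": {"_id": "$tags", "count": {"$sum": 1}}},    # count per tag
        {"$sort": {"count": -1}},                              # largest count first
        {"$limit": top_n},                                     # keep the top N tags
    ]


def tag_distribution(collection, top_n=10):
    """Run the pipeline and return a list of {"_id": tag, "count": n} documents."""
    return list(collection.aggregate(tag_distribution_pipeline(top_n)))
```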
8. Best Practices
- Reuse the MongoClient instance: PyMongo's MongoClient comes with a built-in connection pool. Do not create a new connection for each operation; create one instance globally in the crawler.
- Prefer batch operations: use insert_many, update_many, or the more flexible bulk_write where possible to reduce network round-trip overhead.
- Use indexes appropriately: create indexes for fields commonly used in queries, sorting, and grouping, but do not overuse them, because indexes take up storage space and slow down writes.
- Error handling: catch PyMongo exceptions so that database errors do not interrupt the crawler.
- Secure the production environment: change the default port, enable authentication, and do not expose the database directly to the public network.
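The batch advice above can be sketched as a small buffer that collects documents and writes them with insert_many once enough have accumulated. BatchWriter is a hypothetical helper, not part of PyMongo, and the batch size of 500 is an arbitrary assumption.

```python
class BatchWriter:
    """Buffer documents and write them in fixed-size batches."""

    def __init__(self, collection, batch_size=500):
        self.collection = collection
        self.batch_size = batch_size
        self._buffer = []

    def add(self, doc):
        """Queue one document; write the batch once it is full."""
        self._buffer.append(doc)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write whatever is buffered and return how many documents were sent."""
        if not self._buffer:
            return 0
        batch, self._buffer = self._buffer, []
        self.collection.insert_many(batch, ordered=False)
        return len(batch)


# Usage: call writer.add(doc) per crawled item and writer.flush() at shutdown.
```

In a real crawler, wrapping the flush call in a try block that catches pymongo.errors.PyMongoError would be one place to apply the error handling advice.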
9. Complete example
The following is a complete process for simulating a crawler to store article data:
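An end-to-end sketch, assuming articles are identified by their URL. The fake_crawl generator stands in for a real crawler and its data is invented for illustration; store_articles is a hypothetical helper built on create_index and update_one with upsert.

```python
from datetime import datetime, timezone


def fake_crawl():
    """Stand-in for a real crawler: yields made-up article dictionaries."""
    yield {"url": "https://example.com/a/1", "title": "Intro to MongoDB",
           "tags": ["mongodb", "nosql"]}
    yield {"url": "https://example.com/a/2", "title": "PyMongo basics",
           "tags": ["python", "mongodb"]}
    # The same page crawled a second time.
    yield {"url": "https://example.com/a/1", "title": "Intro to MongoDB",
           "tags": ["mongodb", "nosql"]}


def store_articles(collection, articles):
    """Upsert articles by URL and report how many were new or refreshed."""
    collection.create_index("url", unique=True)
    inserted = updated = 0
    for article in articles:
        doc = dict(article, crawled_at=datetime.now(timezone.utc))
        result = collection.update_one(
            {"url": doc["url"]}, {"$set": doc}, upsert=True
        )
        if result.upserted_id is not None:
            inserted += 1   # URL seen for the first time
        else:
            updated += 1    # existing document refreshed
    return {"inserted": inserted, "updated": updated}


# Usage (needs pymongo and a running server):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# print(store_articles(client["crawler"]["articles"], fake_crawl()))
```

Upserting by URL keeps the collection free of duplicates while still refreshing pages that are crawled again.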
10. Recommended resources
After working through this article, you should have a grasp of the core PyMongo operations. Apply these techniques to your crawler project to benefit from the efficiency and flexibility MongoDB offers.

