title: Efficient and practical MongoDB document storage
description: Python3 crawler data storage: MongoDB operation guide

Python3 crawler data storage: MongoDB operation guide

In crawler projects, we often encounter unstructured or semi-structured web page data: nested comment lists, fields that change dynamically, and large differences in fields from page to page. In such cases a traditional relational database can feel constraining. As a representative document database, MongoDB suits crawler data storage well thanks to its flexible schema design and JSON-like data structure.

This article starts from environment setup and walks through the core skills of operating MongoDB from Python3: basic CRUD, advanced queries, deduplication with indexes, and aggregation statistics. It also gives best practices for crawler scenarios so that you can apply them with confidence in real projects.


1. Introduction to NoSQL and MongoDB

Advantages of NoSQL databases

A NoSQL (Not Only SQL) database is designed for large-scale data, high concurrency, and flexible data models. Several of its features are particularly friendly to crawler scenarios:

  • No strict schema: newly captured fields can be written directly, without modifying a table structure in advance;
  • Nested data support: complex structures such as lists and objects can be stored directly, as in JSON, mapping naturally onto nested content in web pages;
  • High read/write performance: writes are fast, which suits crawlers that store data frequently.

Common NoSQL categories

| Type | Core features | Representative product |
|---|---|---|
| Key-value store | Simple and efficient, often used for caching | Redis |
| Document store | JSON-like format, the most flexible | MongoDB |
| Column store | Suited to massive data analysis | HBase |
| Graph database | Good at handling complex relationships | Neo4j |

What is MongoDB?

MongoDB is an open-source document database written in C++. It:

  • Stores data as BSON (binary JSON), which natively supports nested documents and arrays;
  • Has a distributed architecture and is easy to scale horizontally;
  • Uses a query syntax very close to JavaScript objects, which is also friendly to Python developers and easy to use through PyMongo.
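
To make the point about nested documents concrete, here is a minimal sketch of how one crawled page might look as a Python dict before it is handed to PyMongo. The field names (`comments`, `replies`, `crawled_at`) and all values are made up for illustration.

```python
from datetime import datetime, timezone

# A hypothetical crawled page: an array of tags and nested comments in one document.
page = {
    'url': 'https://example.com/post/1',
    'tags': ['Python', 'MongoDB'],
    'crawled_at': datetime(2024, 1, 1, tzinfo=timezone.utc),  # PyMongo stores datetime as a BSON date
    'comments': [
        {'user': 'alice', 'text': 'Nice post', 'replies': [
            {'user': 'bob', 'text': 'Agreed'},
        ]},
    ],
}

# Dot notation reaches into nested documents and arrays in a query filter.
# This filter matches pages where any comment was written by "alice".
query = {'comments.user': 'alice'}

# With a live collection (not created here) these would be used as:
#   collection.insert_one(page)
#   collection.find_one(query)
```

No table definition or join is needed for the comments; the whole page travels as one document.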

2. Environment preparation

Install MongoDB (Docker recommended for one-command startup)

If no MongoDB service is running locally, Docker is the fastest way to start one, and is especially suitable for development and testing:

docker run -d -p 27017:27017 --name mongodb mongo:latest

Install the PyMongo driver

pip install pymongo

3. Basic connection operations

Connect to MongoDB

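from pymongo import MongoClient

# Option 1: specify the host and port separately
client = MongoClient('localhost', 27017)

# Option 2: use a connection string (recommended; makes it easy to add authentication details later)
client = MongoClient('mongodb://localhost:27017/')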
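
Once authentication is enabled, the connection string also carries a username and password. The sketch below shows one way to build such a string; the helper name `build_mongo_uri`, the credentials, and the `admin` authentication database are assumptions for illustration. Credentials are escaped with `urllib.parse.quote_plus` so that characters such as `@` or `:` are not mistaken for URI delimiters.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host='localhost', port=27017, auth_source='admin'):
    """Build a MongoDB connection string with percent-escaped credentials."""
    # '@', ':' and '/' inside credentials must be escaped,
    # otherwise they would be read as URI delimiters.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?authSource={auth_source}"
    )

uri = build_mongo_uri('crawler', 'p@ss:word/1')
print(uri)

# With PyMongo installed and a server running, the URI goes straight to the client:
#   client = MongoClient(uri, serverSelectionTimeoutMS=3000)
```

The `serverSelectionTimeoutMS` option in the commented usage makes a misconfigured connection fail after a few seconds instead of waiting for the default timeout.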

Select database and collection

MongoDB organizes data as client → database → collection → document, which corresponds to connection → database → table → row in a relational database.

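# Select a database (created automatically on first write if it does not exist)
db = client['spider_data']   # equivalent to db = client.spider_data

# Select a collection (also created automatically on first write)
collection = db['articles']  # equivalent to collection = db.articles

# Dictionary-style access is recommended: it avoids clashes between names and Python keywords or existing attributes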

4. Core CRUD operations

Insert data: single or batch

MongoDB automatically generates an _id field to serve as the primary key; you can also specify it yourself.

# Insert a single document
article = {
    'title': 'MongoDB 爬虫入门',
    'url': 'https://example.com/mongodb-spider',
    'tags': ['Python', 'MongoDB', '爬虫'],
    'views': 100
}
result = collection.insert_one(article)
print(result.inserted_id)   # prints the automatically generated _id

# Batch insert (fewer round trips; the preferred choice for crawlers)
articles = [
    {'title': 'PyMongo 基础', 'url': 'https://example.com/pymongo', 'tags': ['Python'], 'views': 50},
    {'title': '爬虫去重技巧', 'url': 'https://example.com/spider-dedup', 'tags': ['爬虫'], 'views': 80}
]
result = collection.insert_many(articles)
print(result.inserted_ids)
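
Since _id can be specified by hand, one option for crawlers is to derive it from the page URL, so that the same page always maps to the same primary key. The following is a minimal sketch of that idea, not something the examples above rely on; the helper name `with_url_id` and the choice of SHA-1 are assumptions.

```python
import hashlib

def with_url_id(doc):
    """Return a copy of `doc` whose _id is the SHA-1 hex digest of its URL."""
    url_hash = hashlib.sha1(doc['url'].encode('utf-8')).hexdigest()
    return {**doc, '_id': url_hash}

raw_article = {'title': 'PyMongo 基础', 'url': 'https://example.com/pymongo'}
doc = with_url_id(raw_article)
print(doc['_id'])   # 40 hexadecimal characters

# With a live collection: collection.insert_one(doc)
# Inserting the same URL a second time would raise DuplicateKeyError,
# because _id is always unique.
```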

Query data: flexible filtering

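# Query one document (returns the first match, or None)
result = collection.find_one({'title': 'PyMongo 基础'})
print(result)

# Query many documents (returns a cursor that must be iterated)
results = collection.find({'views': {'$gt': 60}})  # $gt means "greater than"
for res in results:
    print(res)

# Regex query (for example, titles containing "爬虫", i.e. "crawler");
# $regex matches anywhere in the string, so no leading or trailing .* is needed
results = collection.find({'title': {'$regex': '爬虫'}})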

Update data: partial updates are more efficient

Use $set to make partial updates, so that the entire document is not overwritten.

# Update a single document
result = collection.update_one(
    {'title': 'PyMongo 基础'},
    {'$set': {'views': 60}}   # only the views field is updated
)
print(f"Matched: {result.matched_count}, modified: {result.modified_count}")

# Update many documents, using $inc to increment (here: add 10 to views in every document)
result = collection.update_many(
    {},
    {'$inc': {'views': 10}}
)
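
Crawlers often revisit pages, in which case "insert if new, otherwise update" is a convenient pattern. The sketch below uses `update_one` with `upsert=True` for this. The helper names and the `first_seen` / `last_crawled` fields are assumptions for illustration, and the item is assumed not to contain fields with those two names.

```python
from datetime import datetime, timezone

def build_upsert(item, now=None):
    """Return (filter, update) arguments for an upsert keyed on the item's URL."""
    now = now or datetime.now(timezone.utc)
    fields = {k: v for k, v in item.items() if k != 'url'}
    fields['last_crawled'] = now
    update = {
        '$set': fields,                       # refreshed on every crawl
        '$setOnInsert': {'first_seen': now},  # written only when the document is created
    }
    return {'url': item['url']}, update

def save_item(collection, item):
    """Insert the item if its URL is new, otherwise update the stored copy."""
    query, update = build_upsert(item)
    return collection.update_one(query, update, upsert=True)
```

Calling `save_item(collection, {'url': ..., 'views': ...})` twice with the same URL leaves one document: the second call only refreshes the `$set` fields.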

Delete data

# Delete a single document
result = collection.delete_one({'title': 'PyMongo 基础'})
print(result.deleted_count)

# Delete many documents (careful: an empty filter clears the whole collection!)
# result = collection.delete_many({})

5. Advanced queries and practical skills

Commonly used comparison operators

| Operator | Meaning | Example |
|---|---|---|
| $lt | Less than | {'views': {'$lt': 50}} |
| $gt | Greater than | {'views': {'$gt': 60}} |
| $lte / $gte | Less than or equal to / greater than or equal to | {'views': {'$lte': 100}} |
| $ne | Not equal to | {'title': {'$ne': 'test'}} |
| $in / $nin | In / not in the given list | {'tags': {'$in': ['Python', '爬虫']}} |
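
These operators can be combined in a single filter. As a rough sketch, the hypothetical helper below (`build_filter` is not part of PyMongo) assembles a filter from optional conditions on the `views`, `tags`, and `title` fields used in this article.

```python
import re

def build_filter(min_views=None, max_views=None, tags=None, keyword=None):
    """Combine optional conditions into one MongoDB query filter."""
    query = {}
    views = {}
    if min_views is not None:
        views['$gte'] = min_views
    if max_views is not None:
        views['$lte'] = max_views
    if views:
        query['views'] = views               # several operators on one field act as AND
    if tags:
        query['tags'] = {'$in': list(tags)}  # matches if the array holds any of these tags
    if keyword:
        # Escape the keyword so it is matched literally, not as a pattern
        query['title'] = {'$regex': re.escape(keyword)}
    return query

print(build_filter(min_views=50, max_views=100, tags=['Python', '爬虫']))

# With a live collection: collection.find(build_filter(min_views=50))
```

Conditions on different fields in the same filter are also combined with AND, so no explicit `$and` is needed here.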

Counting, sorting and paging

import pymongo

# Count the documents that match a condition
count = collection.count_documents({'views': {'$gt': 60}})
print(f"Matching documents: {count}")

# Sort (pymongo.ASCENDING for ascending, pymongo.DESCENDING for descending)
results = collection.find().sort('views', pymongo.DESCENDING)

# Paging (note: skip performs poorly on large collections; _id range paging is recommended instead)
results = collection.find().skip(2).limit(2)
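
The _id range paging mentioned in the comment above can be sketched as follows. Instead of skipping documents, each page starts after the last _id seen on the previous page, so the server can use the default _id index. The helper names `page_query` and `iter_pages` are assumptions for illustration.

```python
def page_query(last_id=None):
    """Filter for the next page: everything after the last _id already seen."""
    return {} if last_id is None else {'_id': {'$gt': last_id}}

def iter_pages(collection, page_size=100):
    """Yield lists of documents, page by page, ordered by _id."""
    last_id = None
    while True:
        # 1 is the value of pymongo.ASCENDING
        cursor = collection.find(page_query(last_id)).sort('_id', 1).limit(page_size)
        page = list(cursor)
        if not page:
            break
        yield page
        last_id = page[-1]['_id']

# With a live collection:
#   for page in iter_pages(collection, page_size=2):
#       print([doc['title'] for doc in page])
```

This works for any sort key that is unique and indexed; _id is simply the one every collection already has.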

6. Index management: a powerful tool for crawler deduplication and speed-up

Indexes can greatly improve query speed. A unique index in particular is a common way for crawlers to deduplicate URLs.

import pymongo

# Create a unique index (here on the url field, to avoid storing the same page twice)
collection.create_index([('url', pymongo.ASCENDING)], unique=True)

# List all indexes
indexes = collection.list_indexes()
for idx in indexes:
    print(idx)

# Drop a specific index by name
collection.drop_index('url_1')
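
With a unique index in place, a batch insert that contains duplicates raises an error. One possible way to keep the batch going is `insert_many(..., ordered=False)`, which attempts every document and reports all failures at the end in a `BulkWriteError`. The sketch below assumes a unique index on `url` already exists; the helper names are made up, and `docs` must not be empty.

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB server error code for a unique-index violation

def split_write_errors(write_errors):
    """Separate duplicate-key errors from all other write errors."""
    duplicates = [e for e in write_errors if e.get('code') == DUPLICATE_KEY_CODE]
    others = [e for e in write_errors if e.get('code') != DUPLICATE_KEY_CODE]
    return duplicates, others

def insert_skip_duplicates(collection, docs):
    """Insert docs in one batch; return (inserted_count, skipped_duplicates)."""
    # Local import: the pure helper above stays usable without PyMongo installed
    from pymongo.errors import BulkWriteError
    try:
        result = collection.insert_many(docs, ordered=False)
    except BulkWriteError as exc:
        duplicates, others = split_write_errors(exc.details.get('writeErrors', []))
        if others:
            raise  # anything other than a duplicate is a real problem
        return exc.details.get('nInserted', 0), len(duplicates)
    return len(result.inserted_ids), 0
```

Compared with inserting one document at a time and catching `DuplicateKeyError`, this keeps the round-trip savings of a batch insert.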

7. Simple aggregation: data statistics

An aggregation pipeline processes data in multiple stages. The following example counts how documents are distributed across tags:

pipeline = [
    {'$unwind': '$tags'},                   # split the tags array into one document per tag
    {'$group': {'_id': '$tags', 'count': {'$sum': 1}}},  # group by tag and count
    {'$sort': {'count': -1}}                # sort by count, descending
]
results = collection.aggregate(pipeline)
for res in results:
    print(res)
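
Because a pipeline is just a list of stage dictionaries, it can be built by an ordinary function. The sketch below extends the tag count with a leading `$match` stage and a per-tag average; the function name `tag_stats_pipeline` and its parameters are assumptions for illustration.

```python
def tag_stats_pipeline(min_views=0, top_n=10):
    """Build a pipeline: filter by views, then count and average views per tag."""
    return [
        # Filter first so that later stages process fewer documents
        {'$match': {'views': {'$gte': min_views}}},
        {'$unwind': '$tags'},
        {'$group': {
            '_id': '$tags',
            'count': {'$sum': 1},
            'avg_views': {'$avg': '$views'},
        }},
        {'$sort': {'count': -1}},
        {'$limit': top_n},
    ]

pipeline = tag_stats_pipeline(min_views=60, top_n=5)

# With a live collection:
#   for row in collection.aggregate(pipeline):
#       print(row['_id'], row['count'], row['avg_views'])
```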

8. Best Practices

  1. Reuse the MongoClient instance: PyMongo's MongoClient has a built-in connection pool. Do not create a new connection for each operation; create one instance globally in the crawler.

  2. Prefer batch operations: use insert_many, update_many, or the more flexible bulk_write where possible to reduce network round-trip overhead.

  3. Use indexes appropriately: create indexes on fields commonly used for queries, sorting, and grouping, but do not overuse them, since indexes take up storage space and slow down writes.

  4. Handle errors: catch PyMongo exceptions so that a database error does not interrupt the crawler.

    from pymongo.errors import PyMongoError

    try:
        collection.insert_one(article)
    except PyMongoError as e:
        print(f"Database operation failed: {e}")

  5. Secure production environments: change the default port, enable authentication, and do not expose the database directly to the public network.
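
As one way to apply the batch-operation advice, a crawler can buffer scraped items and write them in groups rather than one at a time. The class below is a minimal sketch; `BatchBuffer` and its parameters are hypothetical, and the flush function is passed in so that any batch writer such as `collection.insert_many` can be used.

```python
class BatchBuffer:
    """Collect documents and hand them to `flush_fn` in batches."""

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self._items = []

    def add(self, doc):
        """Buffer one document; flush automatically when the batch is full."""
        self._items.append(doc)
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write out whatever is buffered (call once more when the crawl ends)."""
        if self._items:
            self.flush_fn(self._items)
            self._items = []

# With a live collection:
#   buffer = BatchBuffer(collection.insert_many, batch_size=500)
#   for item in crawl():      # crawl() is a placeholder for your own generator
#       buffer.add(item)
#   buffer.flush()
```

The final `flush()` matters: without it, up to `batch_size - 1` items would be lost when the crawler stops.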


9. Complete example

The following complete example simulates a crawler storing article data:

import pymongo
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

# 1. Connect to the database
client = MongoClient('mongodb://localhost:27017/')
db = client['spider_demo']
articles = db['articles']

# 2. Create a unique index to prevent storing the same URL twice
articles.create_index([('url', pymongo.ASCENDING)], unique=True)

# 3. Simulate inserting crawled data
sample_articles = [
    {'title': 'MongoDB 与爬虫', 'url': 'https://demo.com/1', 'tags': ['MongoDB', '爬虫'], 'views': 120},
    {'title': 'Python 基础', 'url': 'https://demo.com/2', 'tags': ['Python'], 'views': 80},
    {'title': 'MongoDB 与爬虫', 'url': 'https://demo.com/1', 'tags': ['重复数据'], 'views': 0}  # duplicate url
]

for art in sample_articles:
    try:
        articles.insert_one(art)
        print(f"Inserted: {art['title']}")
    except DuplicateKeyError:
        print(f"Duplicate skipped: {art['url']}")

# 4. Count the tag distribution
print("\nTag distribution:")
pipeline = [
    {'$unwind': '$tags'},
    {'$group': {'_id': '$tags', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}}
]
for res in articles.aggregate(pipeline):
    print(f"{res['_id']}: {res['count']}")

# 5. Clean up the demo data
articles.delete_many({})
client.close()

10. Resource recommendation


Having worked through this article, you now know the core operations of PyMongo. Apply these techniques to your crawler project and enjoy the efficiency and flexibility that MongoDB brings!