Python Serialization and Deserialization Guide

1. Serialization overview

While a program is running, all of its variables, whether simple dictionaries and lists or instances of complex custom classes, live only in memory. Once the program exits, the operating system reclaims that memory and the data is gone.

In real-world development, however, we often need data to outlive the process:

  • Save game progress to a file so the player can continue next time;
  • Cache data fetched by a crawler locally to avoid repeated requests;
  • Pass structured information through APIs between front ends, back ends, and microservices.

To do this, the live objects in memory have to be converted into a format that can be stored (written to a file or a database) or transmitted (sent over the network). This process is called serialization. (Python's own binary tool calls it pickling, as in preserving the data for later.)

Conversely, restoring that flattened data into objects that can be manipulated directly in memory is called deserialization (unpickling).

Next, let’s take a look at the two most commonly used serialization schemes in Python.


2. Python pickle module

pickle is the binary serialization tool that ships with Python, and the most direct one to use. It handles almost all of Python's built-in types, including instances of custom classes that define methods.

2.1 Basic usage

There are only four core APIs, which are very easy to remember:

  • dumps(obj): serialize the object to a binary string of type bytes;
  • dump(obj, file): serialize the object and write it directly to a file object;
  • loads(bytes): deserialize a binary string back into an object;
  • load(file): deserialize from a file object back into an object.

import pickle

# A dictionary mixing several types; pickle handles it without trouble
game_save = {
    "player": "Xiao Ming",
    "level": 15,
    "inventory": ["iron sword", "health potion x3", "120 gold coins"],
    "hp": 88.5,
    "is_alive": True
}

# 1. Serialize to a bytes object in memory
serialized_bytes = pickle.dumps(game_save)
print(f"Serialized bytes (first 20): {serialized_bytes[:20]}")

# 2. Write directly to a local file (binary write mode 'wb' is required)
with open("game_save.pkl", "wb") as f:
    pickle.dump(game_save, f)

# 3. Deserialize from the local file (binary read mode 'rb')
with open("game_save.pkl", "rb") as f:
    loaded_save = pickle.load(f)

print(f"Loaded player: {loaded_save['player']}, level {loaded_save['level']}")
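
The same calls also work for instances of custom classes. The sketch below is only an illustration: the Player class and its fields are hypothetical, and the explicit protocol number is one possible choice, not a recommendation from this guide.

```python
import pickle


class Player:
    """A small custom class; pickle stores its attributes, not its code."""

    def __init__(self, name, level):
        self.name = name
        self.level = level

    def level_up(self):
        self.level += 1


hero = Player("Alice", 15)

# Passing an explicit protocol number pins the output format
# (protocol 4 has been available since Python 3.4).
blob = pickle.dumps(hero, protocol=4)

# The restored object is a real Player, so its methods work again.
restored = pickle.loads(blob)
restored.level_up()
print(type(restored).__name__, restored.name, restored.level)
```

Because only the attribute values and the class name are stored, the Player class must be importable under the same name when the data is loaded.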

2.2 Limitations of pickle

Convenient as it is, pickle fits only a narrow range of scenarios and has three main flaws:

  1. Python only. The binary data that pickle produces is a Python-specific format that other languages (Java, Go, JavaScript, etc.) have no standard way to read, so it is only suitable for exchanging data between Python programs.

  2. Fragile compatibility. Pickle protocols themselves are backward compatible, so a newer interpreter can read data written with an older protocol. The reverse does not hold: data written with a protocol that the reading interpreter does not support (for example protocol 5 on Python 3.7) cannot be loaded. Pickles that cross the Python 2.x/3.x boundary need extra care, and because a pickle refers to classes by module and class name, renaming or restructuring your code, or upgrading a library, can make old save files unreadable.

  3. High security risk. **Never deserialize pickle data from untrusted sources!** Loading a pickle executes a stream of instructions on pickle's own virtual machine, and those instructions can import and call arbitrary Python callables. A maliciously constructed pickle file can therefore run arbitrary system commands: delete your files, steal private data, or even take control of your computer.
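
The security risk can be seen with a harmless demonstration. This is a minimal sketch: the Sneaky class is hypothetical, and it deliberately names the harmless builtin sorted where an attacker would name something destructive.

```python
import pickle


class Sneaky:
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sorted([3, 1, 2])".
        # An attacker would name a dangerous callable here instead.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Sneaky())

# Unpickling does not rebuild a Sneaky instance: it calls the callable
# named in the pickle stream and returns whatever that call returns.
result = pickle.loads(payload)
print(result)  # [1, 2, 3]
```

The call happens inside pickle.loads itself, before your code has any chance to inspect the result, which is why checking the data after loading it offers no protection.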


3. JSON serialization

JSON (JavaScript Object Notation) is currently the most widely used cross-language text serialization format. Python, JavaScript, Java, Go, and other mainstream languages all have mature JSON support, and because JSON is plain text it is also easy for humans to read.

3.1 Data type correspondence table

JSON is a lightweight format that supports only six basic types. Python's standard library json module maps between these types automatically during serialization and deserialization:

| JSON type | Python type |
| --- | --- |
| {} (object) | dict |
| [] (array) | list / tuple (always a list after deserialization) |
| "string" (string) | str |
| 1234.56 (number) | int or float |
| true / false | True / False |
| null | None |
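
A round trip makes the mapping concrete. The sample dictionary below is made up for illustration. It also shows one behavior the table does not list: because JSON object keys must be strings, the json module converts an int key to a str key.

```python
import json

original = {
    "point": (1, 2),     # tuple: becomes a JSON array
    1: "integer key",    # int key: becomes the string "1"
    "ratio": 0.5,
    "missing": None,     # None: becomes null
}

text = json.dumps(original)
restored = json.loads(text)

print(text)
print(restored)
```

The restored dictionary is therefore not equal to the original: the tuple comes back as a list and the key 1 comes back as "1".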

3.2 Basic usage

The json module's API is designed much like pickle's, with the same four core functions, except that it works with text (str) rather than bytes:

import json

# A Python dictionary that follows JSON's rules
api_data = {
    "code": 200,
    "msg": "success",
    "data": {
        "user_id": 10086,
        "username": "Alice",
        "favorites": ["Python", "读书", "旅行"]
    }
}

# 1. Serialize to a JSON string (text)
json_str = json.dumps(api_data)
print(f"Serialized JSON text: {json_str}")

# 2. Pretty-print: indent sets the number of spaces per level
pretty_json = json.dumps(api_data, indent=2)
print(f"Formatted JSON:\n{pretty_json}")

# 3. Write directly to a local file (text mode 'w'; the default encoding
#    depends on the platform, so pass encoding="utf-8" explicitly)
with open("api_response.json", "w", encoding="utf-8") as f:
    json.dump(api_data, f, indent=2, ensure_ascii=False)  # ensure_ascii is explained below

# 4. Deserialize a JSON string back into Python objects
loaded_data = json.loads(json_str)
print(f"Loaded user ID: {loaded_data['data']['user_id']}")

# 5. Deserialize from the local file
with open("api_response.json", "r", encoding="utf-8") as f:
    loaded_from_file = json.load(f)

3.3 Handling Chinese characters (a practical tip)

By default, json.dumps escapes non-ASCII characters (such as Chinese) into the \uXXXX form. Programs on both the front end and the back end parse this without any problem, but it is hard for humans to read. Passing ensure_ascii=False keeps the original Chinese characters. In addition, always specify encoding="utf-8" explicitly when reading and writing files to avoid garbled text.

chinese_data = {"name": "小红", "address": "北京市朝阳区"}

# Default behavior: Chinese characters are escaped
print(json.dumps(chinese_data))
# Output: {"name": "\u5c0f\u7ea2", "address": "\u5317\u4eac\u5e02\u671d\u9633\u533a"}

# Human-readable version that keeps the Chinese characters
print(json.dumps(chinese_data, ensure_ascii=False))
# Output: {"name": "小红", "address": "北京市朝阳区"}

4. Serialize custom objects

By default, the json module cannot serialize instances of custom classes; we have to supply our own "object → dictionary" conversion logic. There are two common approaches.

4.1 Simple method: use __dict__ directly

If your class is a plain data container with no complex inheritance and no private attributes that you want to keep out of the output, you can use the __dict__ attribute that Python instances already have: it holds all of the instance's attributes as a dictionary. Note that this includes private ones; a double-underscore attribute appears there under its mangled name, such as _ClassName__attr.

class SimpleStudent:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

s1 = SimpleStudent("Bob", 20, 88)

# Serialize: the default parameter supplies the conversion function
simple_json = json.dumps(s1, default=lambda obj: obj.__dict__, ensure_ascii=False)
print(simple_json)  # {"name": "Bob", "age": 20, "score": 88}

# Deserialize: the object_hook parameter supplies the restore function
def dict_to_simple_student(d):
    return SimpleStudent(d["name"], d["age"], d["score"])

loaded_s1 = json.loads(simple_json, object_hook=dict_to_simple_student)
print(f"Loaded student: {loaded_s1.name}, score {loaded_s1.score}")
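
One caveat of object_hook is worth seeing before relying on it: the decoder calls the hook for every JSON object in the input, innermost first, not only for the top-level one. The recording hook and sample text below are hypothetical and exist only to make that order visible.

```python
import json

calls = []


def recording_hook(d):
    # Record the keys of every dict the decoder hands to the hook.
    calls.append(sorted(d.keys()))
    return d


json.loads(
    '{"student": {"name": "Bob", "age": 20}, "class_id": 3}',
    object_hook=recording_hook,
)

print(calls)  # [['age', 'name'], ['class_id', 'student']]
```

A hook that assumes every dictionary has name, age, and score keys would therefore raise KeyError on any input that contains other objects.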

4.2 A more flexible method: dedicated to_dict and from_dict methods

When your class has private attributes or attributes inherited from a parent class, or when you want to tag the serialized data with class information so that a single hook can restore it anywhere, it is better to implement the conversion methods inside the class itself.

class FlexibleStudent:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score
        # Private attribute; __dict__ stores it under the mangled name
        # _FlexibleStudent__secret
        self.__secret = "My dream is to become a programmer"

    def to_dict(self):
        # Build the dictionary to serialize by hand; the __class__ key
        # tags it with type information
        return {
            "__class__": "FlexibleStudent",
            "name": self.name,
            "age": self.age,
            "score": self.score,
            # Private attributes can be exposed selectively if needed
            # "secret": self.__secret
        }

    @classmethod
    def from_dict(cls, d):
        if d.get("__class__") == "FlexibleStudent":
            return cls(d["name"], d["age"], d["score"])
        # Not our class: return the dictionary unchanged so that other
        # data is restored as usual
        return d

s2 = FlexibleStudent("小红", 19, 92)

# Serialize
flexible_json = json.dumps(s2, default=lambda obj: obj.to_dict(), ensure_ascii=False)
print(flexible_json)

# Deserialize: this one class method is the only hook needed
loaded_s2 = json.loads(flexible_json, object_hook=FlexibleStudent.from_dict)
print(f"Type of the loaded student: {type(loaded_s2)}")  # <class '__main__.FlexibleStudent'>
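
The __class__ tag pays off once several classes are involved. The following is a sketch under stated assumptions, not part of the json module: REGISTRY, register, encode, decode, and the Point and Line classes are all hypothetical names introduced for illustration.

```python
import json

# Hypothetical registry: maps the "__class__" tag to the class that owns it.
REGISTRY = {}


def register(cls):
    REGISTRY[cls.__name__] = cls
    return cls


@register
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_dict(self):
        return {"__class__": "Point", "x": self.x, "y": self.y}

    @classmethod
    def from_dict(cls, d):
        return cls(d["x"], d["y"])


@register
class Line:
    def __init__(self, start, end):
        self.start, self.end = start, end

    def to_dict(self):
        # The nested Point objects are passed to `default` again by json.dumps.
        return {"__class__": "Line", "start": self.start, "end": self.end}

    @classmethod
    def from_dict(cls, d):
        return cls(d["start"], d["end"])


def encode(obj):
    if hasattr(obj, "to_dict"):
        return obj.to_dict()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def decode(d):
    # Untagged dictionaries, and unknown tags, are returned unchanged.
    cls = REGISTRY.get(d.get("__class__"))
    return cls.from_dict(d) if cls else d


text = json.dumps(Line(Point(0, 0), Point(3, 4)), default=encode)
line = json.loads(text, object_hook=decode)
print(type(line).__name__, line.end.x, line.end.y)  # Line 3 4
```

Because the decoder runs the hook on inner objects first, the two Point dictionaries are already Point instances by the time the Line dictionary reaches decode.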

5. Security and Best Practices

No matter which serialization scheme you choose, keep the following points in mind:

  1. Never use pickle on untrusted data. This cannot be emphasized enough. Use pickle only in scripts, local caches, or internal pipelines that you fully control.

  2. Validate the structure of JSON input strictly. Deserialized data may well have missing fields or unexpected types. Check it by hand, or use a validation library such as pydantic, so that the rest of the code can rely on its shape.

  3. Always handle exceptions. A missing file, insufficient permissions, or malformed JSON can occur at any time. Example:

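import json

malformed_json = '{"name": "Bob", age: 20}'  # invalid: the key age is not in double quotes

# Malformed text raises json.JSONDecodeError
try:
    data = json.loads(malformed_json)
except json.JSONDecodeError as e:
    print(f"JSON parsing failed at position {e.pos}: {e.msg}")

# Reading from disk can fail before parsing even starts
# ("missing_file.json" is a placeholder name for a file that does not exist)
try:
    with open("missing_file.json", "r", encoding="utf-8") as f:
        data = json.load(f)
except FileNotFoundError:
    print("File not found")
except json.JSONDecodeError as e:
    print(f"JSON parsing failed at position {e.pos}: {e.msg}")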
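
For point 2 above, a manual structure check can be as small as the sketch below. It uses only the standard library; the REQUIRED_FIELDS schema and the parse_student helper are hypothetical, and a library such as pydantic would replace most of this code in a real project.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"name": str, "age": int, "score": (int, float)}


def parse_student(text):
    """Parse JSON text and check that it has the expected shape."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {field}")
    return data


print(parse_student('{"name": "Bob", "age": 20, "score": 88.5}'))

bad_inputs = (
    '{"name": "Bob", "age": "20", "score": 88}',  # age is a string
    '{"name": "Bob"}',                            # fields missing
    '[1, 2]',                                     # not an object
)
for bad in bad_inputs:
    try:
        parse_student(bad)
    except ValueError as e:
        print(f"rejected: {e}")
```

json.JSONDecodeError is a subclass of ValueError, so a caller that catches ValueError handles malformed text and structurally wrong data in one place.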

6. Performance optimization (big data scenario)

When the data volume reaches the gigabyte range, or when JSON is encoded and decoded constantly under high-concurrency network load, the standard library json module may become a performance bottleneck. In that case, consider the following alternatives.

6.1 Faster JSON library

  • orjson
    One of the fastest JSON libraries in the Python ecosystem. Install with: pip install orjson
    Its dumps returns bytes instead of str, and it can serialize common types such as datetime and UUID automatically.

  • ujson
    Also considerably faster than the standard library, with an API closer to the standard json module than orjson's (its dumps returns str). Install with: pip install ujson

import orjson

big_data = [{"id": i, "value": f"data_{i}"} for i in range(100_000)]

# Serialization yields bytes, which can be written to a file or sent over the network as-is
orjson_bytes = orjson.dumps(big_data)

# Deserialize
loaded_big_data = orjson.loads(orjson_bytes)
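
For comparison, the standard json module raises TypeError on datetime and UUID values unless a default hook is supplied. The sketch below shows one way to write such a hook; to_jsonable and the sample event are hypothetical, and the string formats chosen here are an assumption that may not match orjson's output exactly.

```python
import json
import uuid
from datetime import datetime, timezone


def to_jsonable(obj):
    """Convert types that the standard json module does not know about."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


event = {
    "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "created_at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
}

text = json.dumps(event, default=to_jsonable)
print(text)
# {"id": "12345678-1234-5678-1234-567812345678", "created_at": "2024-01-02T03:04:05+00:00"}
```

Loading this text back gives plain strings, so restoring datetime and UUID objects remains the reader's job whichever library is used.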

6.2 Binary cross-language format

If parsing speed and payload size matter even more, you can give up plain-text JSON in favor of a binary format:

  • MessagePack
    Has a data model similar to JSON's, but is more compact and faster to parse. Install with: pip install msgpack

  • Protocol Buffers (protobuf)
    A structured binary serialization format from Google. It is very fast and very compact, but you have to write a .proto file in advance to define the data structure, so the cost of getting started is somewhat higher.


7. Summary

When choosing a serialization solution, make trade-offs based on your core needs:

| Solution | Applicable scenarios | Advantages | Disadvantages |
| --- | --- | --- | --- |
| pickle | Temporary data storage inside Python, local caches | Supports almost all Python types, easy to use | Python only, unsafe, fragile compatibility |
| Standard json | Front-end/cross-language APIs, small data storage | Cross-language, safe, human-readable | Average performance, supports only basic types |
| orjson / ujson | High-frequency JSON processing, large-scale data storage or transmission | Cross-language, safe, very fast | Slightly lower compatibility; orjson returns bytes |
| Protocol Buffers | Cross-language, strictly structured, very high-frequency data exchange | Very high performance, very compact | Requires a .proto definition, steeper learning curve |

Finally, one more time: **Safety first. Only unpickle data that you produced yourself!**