Strings and encodings

1. Basics of character encoding

1.1 Why do computers need encoding?

At its core, a computer is a machine that only understands 0 and 1. Even a casually typed string such as Hi 中文😊 must be converted into numbers, and then into binary, before it can be held in memory, written to disk, or sent over a network. Early computers were developed in the United States and settled on 8 bits per byte. One byte can represent 256 values, 0~255, which is more than enough for everyday English text.

1.2 ASCII: The first universal but “exclusive” standard

ASCII (American Standard Code for Information Interchange) was first published in 1963 and substantially revised in 1967. It directly solved the problem of mapping English text to numbers:

  • It uses the 128 numbers 0~127 to represent uppercase and lowercase English letters, Arabic numerals, punctuation marks, and control characters such as line feed.
  • For example: uppercase A = 65, lowercase z = 122, space = 32.

The limitation is just as obvious: **ASCII has no room at all for Chinese, Japanese, Korean, or other non-Latin scripts.**

1.3 Multi-language melee: the root of garbled text

To let computers process their own languages, different countries and regions developed their own encoding standards:

  • China: GB2312 (simplified characters, double bytes) → GBK (extended support for traditional Chinese) → GB18030;
  • Japan: Shift_JIS;
  • South Korea: EUC-KR.

The problem: **the same bytes are interpreted as completely different characters under different encoding standards.** For example, a byte sequence that reads "你好" when opened as GBK may turn into meaningless symbols when opened as Shift_JIS. This is where garbled text comes from.
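
As a minimal sketch of this effect, the snippet below decodes one byte sequence three ways. The sample text and the use of Latin-1 as a third encoding are illustrative choices, not part of the text above; `errors='replace'` is passed for Shift_JIS only so the demo cannot raise on an unassigned byte pair.

```python
# The same four bytes, interpreted under three different encodings.
data = '你好'.encode('gbk')           # b'\xc4\xe3\xba\xc3'

as_gbk = data.decode('gbk')           # '你好': the encoding the bytes were written in
as_latin1 = data.decode('latin-1')    # 'ÄãºÃ': every byte maps to some Latin-1 character
as_sjis = data.decode('shift_jis', errors='replace')   # different, unrelated characters
```

None of the three calls is "wrong" from the decoder's point of view; the bytes carry no record of which encoding produced them, so the reader has to know it in advance.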

1.4 Unicode: unified global "character ID card"

In 1991, the Unicode Consortium released the Unicode standard, which assigns a unique number, called a code point, to every character and symbol in the world's writing systems (even emoji such as 😊). Modern operating systems and programming languages generally represent text in memory using Unicode or one of its encoding forms.

1.5 UTF-8: The "King of Cost-Effectiveness" for Storage and Transmission

UTF-16, one encoding of Unicode, has a drawback: even ASCII letters take 2 bytes, which wastes space for text that is mostly English. UTF-8 addresses this. It is a variable-length encoding that balances compatibility and storage efficiency:

  • English letters/ASCII characters: 1 byte (exactly the same as ASCII);
  • Commonly used Chinese characters/Japanese kana: 3 bytes;
  • Rare characters/emoji: 4 bytes (the current standard, RFC 3629, caps UTF-8 at 4 bytes per code point).

Today, UTF-8 is the dominant encoding on the web, used by the overwhelming majority of web pages, and it is the default for most modern software and file formats.
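
The byte counts in this list can be checked directly. This is a small sketch with arbitrarily chosen sample characters; `utf-16-le` is used so that no byte-order mark is added to the UTF-16 result.

```python
samples = ['A', '中', 'あ', '😊']

# Bytes needed per character in UTF-8
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
# {'A': 1, '中': 3, 'あ': 3, '😊': 4}

# Bytes needed per character in UTF-16 (little-endian, no BOM)
utf16_lengths = {ch: len(ch.encode('utf-16-le')) for ch in samples}
# {'A': 2, '中': 2, 'あ': 2, '😊': 4}
```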


2. The encoding workflow in practice

Remember this core principle: "Use Unicode in memory; use UTF-8 for storage and transmission."

  1. When opening a file or receiving text: the UTF-8 byte stream is decoded into Unicode and held in memory for you to edit;
  2. When saving the file or sending a message: the Unicode text in memory is encoded back to UTF-8, then written to disk or sent over the network.

For example, an HTML page may contain <meta charset="UTF-8">, which tells the browser: "Please use UTF-8 to decode this content."
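
The decode-on-input, encode-on-output cycle can be sketched in a few lines of Python. The variable names and sample bytes are illustrative; a real program would get `incoming` from a file or socket.

```python
incoming = b'Hi \xe4\xb8\xad\xe6\x96\x87'   # bytes as they arrive from disk or the network
text = incoming.decode('utf-8')              # 'Hi 中文': Unicode text in memory
edited = text + '😊'                          # edit freely while it is text
outgoing = edited.encode('utf-8')            # bytes again, ready to store or send
```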


3. Python string processing (Python 3 only)

Python 3 thoroughly reworked strings and encodings: the default string type, str, is Unicode, which resolves the long-standing confusion in Python 2 caused by mixing str and unicode.

3.1 Native support for multiple languages

Just write it directly without any prefixes or escaping:

print('包含中文、日文「こんにちは」、表情😊的Python字符串')

3.2 Conversion between characters and Unicode code points

  • ord(char): get the decimal Unicode code point of a single character;
  • chr(code_point): convert a decimal code point back to a character.
ord('A')     # 65
ord('中')    # 20013
ord('😊')    # 128522
chr(66)      # 'B'
chr(25991)   # '文'
chr(128512)  # '😀'

3.3 Hexadecimal writing of code points

You can also write code points directly in a string literal with \u (followed by 4 hexadecimal digits) or \U (followed by 8 hexadecimal digits):

'\u4e2d\u6587'    # '中文'
'\U0001f600'      # '😀'
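
To connect the decimal values from ord() with the hexadecimal escapes, here is a short sketch. The `U+XXXX` label format is the conventional way to write code points; the variable names are illustrative.

```python
code_point = ord('中')                # 20013
hex_form = hex(code_point)            # '0x4e2d'
label = f'U+{code_point:04X}'         # 'U+4E2D'

# chr() with a hex literal, a \u escape, and the literal character are all the same string
same_char = chr(0x4e2d) == '\u4e2d' == '中'   # True

emoji_label = f'U+{ord("😀"):04X}'     # 'U+1F600', which is why \U0001f600 needs 8 digits
```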

3.4 The bytes type: for binary data

Unicode strings cannot be sent over a network or written to disk directly; they must first be converted to a sequence of bytes (the bytes type):

  • A bytes literal is written with the b prefix, and each element is an integer in 0~255;
  • When displayed, printable ASCII bytes are shown as characters, and all other bytes are shown as \x followed by two hexadecimal digits.

x = b'ABC'   # bytes object: x[0] = 65, x[1] = 66, x[2] = 67
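
A brief sketch of how a bytes object behaves as a sequence of integers, and how non-ASCII bytes are displayed. The variable names are illustrative.

```python
x = b'ABC'
first = x[0]                        # 65: indexing yields an int, not a 1-character string
as_list = list(x)                   # [65, 66, 67]
rebuilt = bytes([65, 66, 67])       # b'ABC'

# Bytes outside printable ASCII are displayed as \x escapes
shown = repr('中'.encode('utf-8'))   # the text b'\xe4\xb8\xad'
```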

3.5 Converting between str and bytes

  • str.encode(encoding): Unicode string → bytes (encoding);
  • bytes.decode(encoding): bytes → Unicode string (decoding).

The default is UTF-8, but it is strongly recommended to specify it explicitly to avoid cross-environment garbled characters:

# Encode
'ABC'.encode('ascii')       # b'ABC'
'中文😊'.encode('utf-8')    # b'\xe4\xb8\xad\xe6\x96\x87\xf0\x9f\x98\x8a'

# Decode
b'ABC'.decode('ascii')                               # 'ABC'
b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')          # '中文'
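
Encoding can also fail when the target encoding has no representation for a character. The sketch below wraps that case in a small helper; `try_encode` is a hypothetical name introduced only for this illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_ok = try_encode('ABC', 'ascii')     # b'ABC'
ascii_fail = try_encode('中文', 'ascii')  # None: ASCII has no codes for these characters
gbk_ok = try_encode('中文', 'gbk')        # b'\xd6\xd0\xce\xc4'
```

The same text produces different bytes under GBK and UTF-8, which is one more reason to name the encoding explicitly on both the encoding and the decoding side.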

3.6 Graceful handling of decoding errors

If the data is partially corrupted, the default strict mode raises a UnicodeDecodeError. You can change this behavior with the errors parameter:

# Ignore invalid bytes
b'\xe4\xb8\xad\xff\xe6\x96\x87'.decode('utf-8', errors='ignore')   # '中文'

# Replace invalid bytes with �
b'\xe4\xb8\xad\xff\xe6\x96\x87'.decode('utf-8', errors='replace')  # '中�文'
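
For comparison, this sketch shows what the default strict mode does with the same damaged bytes, and one possible way to fall back. Falling back to `errors='replace'` is an illustrative policy, not a recommendation for every program.

```python
damaged = b'\xe4\xb8\xad\xff\xe6\x96\x87'
bad_offset = None

try:
    text = damaged.decode('utf-8')           # errors='strict' is the default
except UnicodeDecodeError as exc:
    bad_offset = exc.start                   # 3: index of the invalid byte 0xff
    text = damaged.decode('utf-8', errors='replace')   # '中�文'
```

Silently ignoring or replacing bytes loses information, so it is usually worth logging the offset before falling back.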

3.7 When calculating length, distinguish between characters and bytes

  • len(str): counts characters (code points);
  • len(bytes): counts bytes.

len('ABC')                       # 3 (characters)
len('中文😊')                     # 3 (characters)
len('中文😊'.encode('utf-8'))     # 3+3+4 = 10 (bytes)

3.8 Encoding conventions for Python source code

If your Python file contains non-ASCII characters such as Chinese or Japanese, do the following:

  1. Save the file as UTF-8 (the default in modern editors);
  2. Optionally declare the encoding in the file header. Python 3 already assumes UTF-8 for source files, so the declaration is not required; it mainly helps editors and older tools.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

print('这段中文在Python文件里没问题哦!')

4. Python string formatting (3 mainstream methods)

4.1 % Formatting: traditional and highly compatible

Python's earliest formatting method uses placeholders such as %s, %d, and %f:

'Hello, %s!' % 'world'                      # 'Hello, world!'
'Hi, %s, 你有 %d 元余额。' % ('小明', 1000000)   # multiple arguments go in a tuple
'%2d-%02d' % (3, 1)                         # ' 3-01' (space padding / zero padding)
'%.2f' % 3.1415926                           # '3.14' (two decimal places)
'增长率: %d %%' % 7                           # '增长率: 7 %' (%% prints a literal percent sign)

4.2 format() method: flexible and readable

Introduced in Python 2.6/3.0, str.format() uses {} placeholders and accepts arguments by position or by name:

# By position
'Hello, {0}, 成绩提升了 {1:.1f}%'.format('小明', 17.125)   # 'Hello, 小明, 成绩提升了 17.1%'

# By name
'Hello, {name}, 成绩提升了 {rate:.1f}%'.format(name='小明', rate=17.125)

4.3 f-string: concise and fast (Python 3.6+)

Prefix the string with f and write variable names or expressions directly inside the placeholders. All of the format() formatting syntax is supported:

name = '小明'
s1 = 72
s2 = 85
rate = (s2 - s1) / s1 * 100

f'Hello, {name}, 成绩提升了 {rate:.1f}%'  # 'Hello, 小明, 成绩提升了 18.1%'
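
As a quick side-by-side, the sketch below formats the same value with all three methods and then shows a few format specifiers inside an f-string. The values are arbitrary examples.

```python
rate = 18.0555

old_style = '%.1f%%' % rate              # '18.1%'
format_style = '{:.1f}%'.format(rate)    # '18.1%'
f_style = f'{rate:.1f}%'                 # '18.1%'

padded = f'{3:02d}-{7:>4}'               # '03-   7': zero padding and right alignment
percent = f'{(85 - 72) / 72:.1%}'        # '18.1%': expression plus percent formatting
```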

5. Best practices for avoiding pitfalls

  1. Embrace UTF-8 everywhere. Whether it is text files, source code, network interfaces, or databases, choose UTF-8 whenever you have the choice.

  2. Prefer f-strings for formatting. In a Python 3.6+ environment, f-strings are usually the best choice for conciseness, readability, and performance.

  3. Always specify the encoding explicitly for file I/O. Do not rely on the system default encoding (often GBK on Chinese-locale Windows, usually UTF-8 on macOS/Linux); otherwise garbled text easily appears across platforms.

    # Read a file
    with open('test.txt', 'r', encoding='utf-8') as f:
        content = f.read()

    # Write a file
    with open('test.txt', 'w', encoding='utf-8') as f:
        f.write('中文内容')

  4. Don't mix str and bytes in Python 3. The two cannot be concatenated directly, and comparing them with == always yields False; convert explicitly before operating on them.
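
A minimal sketch of what goes wrong when str and bytes are mixed, and the explicit conversion that fixes it. The variable names are illustrative.

```python
text = '中文'
data = text.encode('utf-8')

equal = (text == data)                  # False: a str never equals a bytes object

concat_failed = False
try:
    text + data                         # raises TypeError
except TypeError:
    concat_failed = True

joined = text + data.decode('utf-8')    # '中文中文': decode first, then combine
```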


6. Small exercise

Use Python 3’s f-string to calculate Xiao Ming’s performance improvement rate:

s1 = 72  # previous score
s2 = 85  # current score
rate = (s2 - s1) / s1 * 100
print(f'小明的成绩提升了 {rate:.1f}%')  # Output: 小明的成绩提升了 18.1%

7. Summary

  • Computers can only process binary, so text must be converted into numbers through an encoding;
  • Unicode is the "universal ID card" of characters, used in memory; UTF-8 is the most common encoding for transmission and storage;
  • In Python 3, str is Unicode by default and handles multiple languages well; bytes is for binary data;
  • For string formatting, prefer f-strings in Python 3.6+;
  • Always specify UTF-8 explicitly, and say goodbye to garbled text!