IO programming

Imagine you are writing a crawler and it stalls after a few hundred pages, or you are processing log files several GB in size and the program runs out of memory and crashes. Most of the pitfalls encountered early in back-end development come down to an IO bottleneck. This article goes from basic concepts to hands-on Python practice, covering the underlying logic and practical techniques of IO programming.


🔍 1. Basic concepts of IO

IO (Input/Output) is essentially the movement of data between the CPU/memory and slower external devices (disks, network cards, keyboards, and so on). Because the CPU can be a million times faster than these peripherals, it either sits idle or handles other tasks (depending on the IO mode) while the data is being moved, which is why IO is so often the limiting factor in program performance.

1.1 The most intuitive IO classification

  • Input: data is moved from a peripheral into memory (reading a local file, receiving an HTTP request)
  • Output: data is moved from memory to a peripheral (saving a picture, sending a WeChat message)

1.2 Core abstraction of stream

The stream is the most classic abstraction in IO: think of data as water flowing continuously through a pipe. You do not need to care whether the data source is a disk or a network; everything is handled through the same read/write stream interface:

  • Input stream: One end of the water pipe is connected to the peripheral and the other end is connected to the program. It can only read but not write.
  • Output stream: One end of the water pipe is connected to the program and the other end is connected to the peripheral. It can only be written but not read.
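
To make the stream idea concrete, here is a minimal sketch. The helper name `count_lines_in_stream` is made up for this illustration; the point is only that the same function works whether the data comes from memory or from a file on disk.

```python
import io
import os
import tempfile

def count_lines_in_stream(stream) -> int:
    """Count lines from any readable text stream (hypothetical helper)."""
    return sum(1 for _ in stream)

# Source 1: an in-memory stream
memory_count = count_lines_in_stream(io.StringIO("a\nb\nc\n"))

# Source 2: a real file on disk (a temporary file, so nothing is left behind)
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "demo.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("a\nb\nc\n")
    with open(file_path, "r", encoding="utf-8") as f:
        file_count = count_lines_in_stream(f)

print(memory_count, file_count)
```

The function never asks where the data lives; it only relies on the stream being iterable line by line.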

⚡ 2. IO processing mode

This is the main dividing line in IO programming: it determines whether the program sits idle waiting for results or does several things at once.

2.1 Synchronous IO (blocking IO)

Synchronous IO is the easiest mode to understand: after the program initiates an IO request, it must stop and wait for the data to be read/written before the next line of code can be executed.

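# Synchronous IO example: reading a local file
with open('example.txt', 'r', encoding='utf-8') as f:
    content = f.read()  # The program blocks here until the whole file is loaded into memory
    print(content[:100])  # Runs only after the read above has finished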

Features:

  • The programming model is simple, logically linear, and novice-friendly
  • Cannot make use of the time the CPU spends idle while waiting, so performance is poor in IO-intensive scenarios
  • Suitable for single-threaded simple scripts and tool programs with minimal IO operations

2.2 Asynchronous IO (non-blocking IO)

The idea of asynchronous IO is this: after the program initiates an IO request, it immediately moves on to other tasks; once the peripheral has the data ready, the program comes back to process the result via a callback function or an event notification.

⚠️ Note: Python's built-in open does not support await directly. You need the third-party library aiofiles; run pip install aiofiles to install it.

# Asynchronous IO example: reading a file with aiofiles
import asyncio
import aiofiles

async def read_large_file_async():
    async with aiofiles.open('example.txt', 'r', encoding='utf-8') as f:
        content = await f.read()  # Does not block the event loop; other coroutines can run in the meantime
        print(content[:100])

asyncio.run(read_large_file_async())

Features:

  • Makes full use of the time the CPU would otherwise spend waiting, so performance improves significantly in heavily IO-bound scenarios
  • The programming model is complex and requires understanding of concepts such as "coroutines", "event loops" and "callbacks"
  • Suitable for web crawlers, high-concurrency servers, batch processing of multiple files, etc.
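
The benefit described in these bullets can be seen with the standard library alone. The following is a small sketch in which `fake_download` is a hypothetical stand-in for a real network request, and `asyncio.sleep` simulates the IO wait.

```python
import asyncio
import time

async def fake_download(name: str, delay: float) -> str:
    """Stand-in for a network request; asyncio.sleep simulates the IO wait."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> list:
    # gather schedules all three coroutines on the event loop at once
    return await asyncio.gather(
        fake_download("page-1", 0.2),
        fake_download("page-2", 0.2),
        fake_download("page-3", 0.2),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three waits overlap, the elapsed time should be close to one delay (about 0.2s) rather than the sum of all three (0.6s).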

📝 3. IO operations in Python

Python has a complete built-in IO tool library, covering everything from basic file reading and writing to memory simulation.

3.1 Local file IO

Python handles files with the built-in open() function; combined with the with statement (discussed below), resources are managed automatically.

Basic reading and writing examples

# Write a text file
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write('Hello, IO Programming!\n')
    f.writelines(['Line 2\n', 'Line 3\n'])  # Write a list of lines in one call

# Read the whole text
with open('output.txt', 'r', encoding='utf-8') as f:
    all_content = f.read()
    print(all_content)

File Mode Cheat Sheet

Mode   Description
r      Read-only (default; raises an error if the file does not exist)
w      Write-only (truncates an existing file; creates the file if it does not exist)
x      Exclusive creation (raises an error if the file already exists; useful for avoiding accidental overwrites)
a      Append (writes at the end of the file; creates the file if it does not exist)
b      Binary mode (combined with other modes, e.g. rb to read a picture, wb to save an archive)
t      Text mode (default; handles encoding and newline translation automatically)
+      Update mode (combined with other modes, e.g. r+b to read and write in binary)

Tip: modes can be combined flexibly. For example, r+t reads and writes in text mode (without truncating the file), and a+b reads and appends in binary.
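
The differences between x, a and w are easiest to see side by side. This is a small sketch that works in a temporary directory; the file name `modes.txt` is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.txt")

    # 'x' creates the file, and refuses to run if it already exists
    with open(path, "x", encoding="utf-8") as f:
        f.write("first\n")
    try:
        open(path, "x", encoding="utf-8")
        second_create_failed = False
    except FileExistsError:
        second_create_failed = True

    # 'a' keeps the existing content and writes at the end
    with open(path, "a", encoding="utf-8") as f:
        f.write("second\n")
    with open(path, "r", encoding="utf-8") as f:
        after_append = f.read()

    # 'w' truncates the file before writing
    with open(path, "w", encoding="utf-8") as f:
        f.write("third\n")
    with open(path, "r", encoding="utf-8") as f:
        after_overwrite = f.read()

print(second_create_failed)
print(repr(after_append))
print(repr(after_overwrite))
```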

3.2 Memory IO (temporary data storage)

Memory IO reads and writes data in an in-memory buffer without touching the disk. It is very fast and suits scenarios such as temporary formatting and simulating files in unit tests.

from io import StringIO, BytesIO

# String IO: for text
string_buf = StringIO()
string_buf.write('Hello')
string_buf.write(' Memory IO!')
print(string_buf.getvalue())  # Full content is available without closing → Hello Memory IO!

# Bytes IO: for binary data (pictures, fragments of archives)
bytes_buf = BytesIO()
bytes_buf.write(b'binary content')
print(bytes_buf.getvalue())  # → b'binary content'
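
One common use of memory IO is passing a StringIO to code that expects a file, for instance in a unit test. The sketch below assumes a hypothetical `write_report` helper; it also shows that the read position sits at the end after writing, so you need seek(0) before reading back with read() or readline().

```python
import csv
from io import StringIO

def write_report(stream, rows) -> None:
    """Write rows as CSV to any writable text stream (hypothetical helper)."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["name", "size"])
    writer.writerows(rows)

buf = StringIO()
write_report(buf, [["a.txt", 10], ["b.txt", 20]])

# After writing, the position is at the end; rewind before reading back
buf.seek(0)
first_line = buf.readline()

print(repr(first_line))
print(repr(buf.getvalue()))  # getvalue() returns everything regardless of position
```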

3.3 Basic network IO

Python's built-in urllib handles simple HTTP requests without installing any third-party library.

import urllib.request

# Fetch a page's HTML synchronously (print the first 200 characters)
with urllib.request.urlopen('https://www.python.org') as resp:
    html = resp.read().decode('utf-8')
    print(html[:200])

🚀 4. Advanced IO technology

Mastering these technologies can further optimize IO performance and code robustness.

4.1 Context Manager (automatically release resources)

Python's with statement is the go-to tool for resource management. Even if the code throws an exception, it automatically closes the file and releases the handle, which avoids resource leaks.

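# Without with: close() must be called by hand in a finally block, which is easy to forget
f = None
try:
    f = open('risky.txt', 'r')
    data = f.read()
finally:
    if f:  # f is still None if open() itself failed
        f.close()

# With with: concise and safe
with open('safe.txt', 'r') as f:
    data = f.read()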
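
The same guarantee is available for your own resources. As a minimal sketch, `tracked_resource` below is a hypothetical resource built with `contextlib.contextmanager`; it records the order of events so you can see that the release step runs even when the with-body raises.

```python
from contextlib import contextmanager

events = []

@contextmanager
def tracked_resource(name: str):
    """Hypothetical resource that records when it is acquired and released."""
    events.append(f"acquire {name}")
    try:
        yield name
    finally:
        # Runs on normal exit and when the with-body raises
        events.append(f"release {name}")

try:
    with tracked_resource("db-handle") as handle:
        events.append(f"use {handle}")
        raise ValueError("something went wrong")
except ValueError:
    events.append("error handled")

print(events)
```

The release entry appears before the error is handled, which is the behaviour that makes with safer than a hand-written close().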

4.2 Buffered IO (reduce system calls)

System calls (asking the kernel to read or write the disk/network on the program's behalf) are expensive. Python enables buffered IO by default: data is first collected in an in-memory buffer, and the system call is made once a batch has accumulated.

⚠️ Note: buffering=0 (unbuffered) is only available in binary mode; in text mode it raises a ValueError.

# Unbuffered (binary mode only; data is passed to the OS immediately)
with open('no_buffer.bin', 'wb', buffering=0) as f:
    f.write(b'urgent data')

# Line buffering (text mode only; flushed whenever a newline is written)
with open('line_buffer.txt', 'w', encoding='utf-8', buffering=1) as f:
    f.write('line 1 will flush immediately\n')
    f.write('line 2 will wait for next newline')

# Explicit buffer size in bytes (if omitted, Python picks a default based on the device's block size)
with open('custom_buffer.txt', 'w', encoding='utf-8', buffering=4096) as f:
    f.write('data will flush when buffer is full')
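
You can observe the buffer at work by checking the file size on disk before and after a flush. This is a small sketch, assuming CPython's default buffering and a write much smaller than the buffer; the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "buffered.bin")
    with open(path, "wb") as f:  # default buffering
        f.write(b"hello")
        size_before_flush = os.path.getsize(path)  # data is still in Python's buffer
        f.flush()
        size_after_flush = os.path.getsize(path)   # data has been handed to the OS

print(size_before_flush, size_after_flush)
```

Until flush() is called (or the buffer fills, or the file is closed), the small write has not reached the operating system at all.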

4.3 Memory mapped files (random access to large files)

When processing large files of tens of GB, reading everything into memory will crash the program, and sequential line-by-line or chunked reads are inefficient for random access. A memory-mapped file maps part (or all) of the file into the process's address space, so the file can be accessed randomly like an array without frequent system calls.

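📌 Note: Memory-mapped files depend on operating system support. The mapping is limited by the process's virtual address space rather than by physical memory (the OS pages data in on demand), so on a 32-bit system or with very large files you may only be able to map part of the file at a time.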

import mmap

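with open('large_video.mp4', 'r+b') as f:
    # Map the whole file (length 0 means the entire file)
    mm = mmap.mmap(f.fileno(), 0)
    # Read bytes 100-200 at random, as if indexing an array
    print(mm[100:200])
    # Changes made through the mapping are written back to the file on disk
    mm[0:5] = b'HELLO'
    mm.close()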
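
If you do not have a large file at hand, the following self-contained sketch builds a small demo file of fixed-width records (an assumption made for this example) and then uses the mapping for random access, searching, and an in-place overwrite.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "records.bin")
    # Build a small demo file: 1000 fixed-width records of 8 bytes each
    with open(path, "wb") as f:
        for i in range(1000):
            f.write(f"{i:07d}\n".encode("ascii"))

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            record_500 = bytes(mm[500 * 8 : 501 * 8])  # jump straight to record 500
            position = mm.find(b"0000999")              # search inside the mapping
            mm[0:7] = b"CHANGED"                        # same-length in-place overwrite
            mm.flush()                                  # push the change to the file

    with open(path, "rb") as f:
        first_record = f.read(8)

print(record_500, position, first_record)
```

Slice assignment on a mapping must keep the length unchanged; a mapping cannot grow or shrink the file this way.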

💡 5. Modern IO programming practice

5.1 Use pathlib instead of os.path

pathlib, introduced in Python 3.4, is an object-oriented library for file path operations. Compared with traditional os.path string concatenation, it is simpler, safer, and more portable across platforms.

from pathlib import Path

# 1. Join paths (the Windows/Linux path separator is handled automatically)
data_path = Path('.') / 'data' / 'subdir' / 'input.txt'

# 2. Directory operations
data_path.parent.mkdir(parents=True, exist_ok=True)  # Create parent directories; no error if they exist

# 3. Read and write files
data_path.write_text('Hello, pathlib!', encoding='utf-8')
content = data_path.read_text(encoding='utf-8')

# 4. List all txt files in the current directory
txt_files = list(Path('.').glob('*.txt'))
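
For comparison, here is the same path handled in both styles. The sketch uses `PurePosixPath` only so that the printed output is identical on every platform; in real code you would normally use `Path`.

```python
import os.path
from pathlib import PurePosixPath

# os.path style: plain strings and helper functions
old_style = os.path.join("data", "subdir", "input.txt")
old_ext = os.path.splitext(old_style)[1]

# pathlib style: one object with attributes
new_style = PurePosixPath("data") / "subdir" / "input.txt"

print(new_style)
print(new_style.name, new_style.stem, new_style.suffix)
print(new_style.parent)
print(new_style.with_suffix(".csv"))
```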

5.2 High concurrent file/network processing

In addition to asynchronous IO, Python also provides thread pool/process pool to handle concurrent IO tasks.

🌟 Tip: Use a thread pool for IO-intensive tasks (process switching costs much more than thread switching) and a process pool for CPU-intensive tasks.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

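# Count the lines of every txt file in the current directory
def count_lines(file_path: Path) -> int:
    with open(file_path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)  # Count lazily, line by line, without loading the file into memory

# Process the files concurrently with a thread pool
txt_files = list(Path('.').glob('*.txt'))
with ThreadPoolExecutor(max_workers=8) as executor:
    # map returns results in the same order as the input files
    line_counts = executor.map(count_lines, txt_files)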

for file, count in zip(txt_files, line_counts):
    print(f"{file.name}: {count} lines")
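
When iterating over the results of executor.map, the first task that failed raises its exception and the remaining results are lost to the loop. If individual tasks may fail, one option is submit plus as_completed, sketched below. `fake_fetch` is a hypothetical stand-in for an IO-bound call, and the failure of task 3 is hard-coded purely for demonstration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fake_fetch(task_id: int) -> str:
    """Stand-in for an IO-bound call; time.sleep simulates the wait."""
    time.sleep(0.1)
    if task_id == 3:
        raise RuntimeError(f"task {task_id} failed")
    return f"task {task_id} ok"

succeeded, failed = [], []
with ThreadPoolExecutor(max_workers=4) as executor:
    # submit returns a Future per task; the dict maps each Future back to its input
    futures = {executor.submit(fake_fetch, i): i for i in range(1, 5)}
    for future in as_completed(futures):
        task_id = futures[future]
        try:
            succeeded.append(future.result())
        except RuntimeError as exc:
            failed.append((task_id, str(exc)))

print(sorted(succeeded))
print(failed)
```

as_completed yields futures in completion order, not submission order, which is why the successful results are sorted before printing.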

🎯 6. Performance optimization suggestions and frequently asked questions

6.1 Practical performance optimization suggestions

  1. Batch operations (reads and writes of ≥1MB at a time show a significant effect): this reduces the number of system calls
  2. Set the buffer size sensibly: Python's default (chosen from the device's block size, falling back on io.DEFAULT_BUFFER_SIZE) usually does not need adjusting
  3. Prefer asynchronous IO or a thread pool: avoid a purely single-threaded design in IO-intensive scenarios
  4. Avoid many small files: merge small files such as logs and pictures to reduce the overhead of opening and closing handles
  5. Use memory mapping for random access to large files: it can be 10-100 times faster than sequential line-by-line or chunked reading, depending on the access pattern and workload
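
To illustrate how buffering reduces the number of low-level writes, the sketch below replaces the disk with a hypothetical `CountingRaw` stream that only counts how often it is called. Each call to its write method stands in for one system call.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts write calls instead of touching a disk."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total_bytes = 0

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        self.calls += 1
        self.total_bytes += len(data)
        return len(data)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)

for _ in range(1000):
    buffered.write(b"0123456789")  # 1000 small writes of 10 bytes each
buffered.flush()

print(raw.calls, raw.total_bytes)
```

All 10000 bytes arrive at the raw stream, but in only a handful of calls instead of 1000, which is the saving that suggestions 1 and 2 aim for.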

6.2 High-frequency pitfalls and solutions

❌ Problem 1: Processing large files consumes too much memory

✅ Solution: read line by line or in chunks

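# process_line and process_chunk stand for your own handling functions
# Read line by line (suits text files with reasonably uniform line lengths)
with open('large_log.txt', 'r', encoding='utf-8') as f:
    for line in f:
        process_line(line)

# Read in chunks (suits binary files, or text files with very uneven line lengths)
CHUNK_SIZE = 4 * 1024 * 1024  # 4MB chunks
with open('large_video.mp4', 'rb') as f:
    while chunk := f.read(CHUNK_SIZE):  # Walrus operator, Python 3.8+
        process_chunk(chunk)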
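
As one concrete case of chunked processing, the sketch below hashes a file piece by piece, so memory use stays at roughly one chunk regardless of file size. The function name `sha256_of_file` and the 1MB chunk size are choices made for this example.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file of any size while holding at most one chunk in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.bin")
    payload = b"x" * (3 * 1024 * 1024 + 123)  # a little over 3 chunks
    with open(path, "wb") as f:
        f.write(payload)
    chunked_hash = sha256_of_file(path)

# The chunked result matches hashing the whole payload at once
print(chunked_hash == hashlib.sha256(payload).hexdigest())
```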

❌ Problem 2: Data is lost when a sudden power outage or crash happens right after a file is written

✅ Solution: force a flush to disk (only for scenarios with very strict data consistency requirements, since it significantly reduces performance)

import os

with open('critical_data.txt', 'w', encoding='utf-8') as f:
    f.write('very important data')
    f.flush()  # Move data from Python's buffer to the operating system's buffer
    os.fsync(f.fileno())  # Force the operating system to write the data to the physical disk
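
A related technique, not covered above, guards against a crash in the middle of a write leaving a half-written file: write to a temporary file in the same directory, fsync it, then rename it over the target. The sketch below shows one way to do this; `atomic_write_text` is a hypothetical helper name, and the rename is atomic on POSIX only when both paths are on the same filesystem.

```python
import os
import tempfile

def atomic_write_text(path: str, text: str) -> None:
    """Write text to a temp file, fsync it, then rename it over the target (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see either the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "critical_data.txt")
    atomic_write_text(target, "version 1")
    atomic_write_text(target, "version 2")
    with open(target, "r", encoding="utf-8") as f:
        final_content = f.read()
    leftovers = [n for n in os.listdir(tmp_dir) if n.endswith(".tmp")]

print(final_content, leftovers)
```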

📌 Summary

IO programming is one of the areas of programming closest to hardware resources. Choosing the right strategy not only fixes program stalls but also makes full use of the machine's performance potential.

  • Simple scenarios: synchronous IO + pathlib
  • IO-heavy scenarios: asynchronous IO (aiofiles/aiohttp) with coroutines, or a thread pool
  • Random access to large files: memory-mapped files
  • Always use a context manager to manage resources

Find a small project to practice on! For example, write a simple image crawler with aiohttp, or organize your download folder with pathlib.