IO programming
Imagine a crawler that stalls after fetching a few hundred pages, or a script that runs out of memory and crashes while processing several gigabytes of log files. About 90% of the pitfalls like these in early back-end development come down to an IO bottleneck. This article walks from the basic concepts to hands-on Python practice, so that you understand both the underlying logic and the practical techniques of IO programming.
🔍 1. Basic concepts of IO
IO (Input/Output) means input and output: in essence, moving data between the CPU/memory and slower external devices (disks, network cards, keyboards, and so on). Because the CPU is orders of magnitude faster than these peripherals (the gap can exceed a million times), it either sits idle or handles other tasks during the transfer, depending on the IO mode. That is why IO is so often the limiting factor for program performance.
1.1 The most intuitive IO classification
- Input: Data is "moved" from peripherals into memory (such as reading local files, receiving HTTP requests)
- Output: "Move" data from memory to peripherals (such as saving pictures, sending WeChat messages)
1.2 The core abstraction: streams
The stream is the classic design in IO operations: think of data as water flowing continuously through a pipe. You do not need to care whether the data source is a disk or a network, because everything is handled through the same "read/write stream" interface:
- Input stream: one end of the pipe is connected to the peripheral and the other end to the program. The program can only read from it, not write to it.
- Output stream: one end of the pipe is connected to the program and the other end to the peripheral. The program can only write to it, not read from it.
⚡ 2. IO processing mode
This is the key dividing line in IO programming: it determines whether the program sits idle waiting for results or gets several things done at the same time.
2.1 Synchronous IO (blocking IO)
Synchronous IO is the easiest mode to understand: after the program initiates an IO request, it must stop and wait for the data to be read/written before the next line of code can be executed.
Features:
- The programming model is simple, logically linear, and novice-friendly
- Cannot make use of the CPU's idle time while waiting, so performance is poor in IO-intensive scenarios
- Suitable for single-threaded simple scripts and tool programs with minimal IO operations
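The blocking behavior described above can be sketched as follows. This is a minimal illustration, not a benchmark: `fake_download` is a hypothetical stand-in that uses `time.sleep` to simulate a slow peripheral, so the total time is roughly the sum of the individual waits.

```python
import time


def fake_download(name, delay=0.1):
    """Hypothetical stand-in for a blocking IO call; sleep simulates the wait."""
    time.sleep(delay)  # the thread can do nothing else while this runs
    return f"{name} done"


def run_sequentially(names, delay=0.1):
    """Run the blocking calls one after another and time the whole batch."""
    start = time.perf_counter()
    results = [fake_download(name, delay) for name in names]
    elapsed = time.perf_counter() - start
    return results, elapsed


sync_results, sync_elapsed = run_sequentially(["a", "b", "c"])
print(sync_results, f"{sync_elapsed:.2f}s")  # elapsed is about 3 x delay
```

Each call has to finish before the next one starts, which is exactly why this model is easy to read and slow when there are many waits.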
2.2 Asynchronous IO (non-blocking IO)
The idea of asynchronous IO is this: after the program initiates an IO request, it immediately moves on to other tasks. Once the peripheral has the data ready, the program comes back to process the result through a "callback function" or an "event notification".
⚠️ Note: Python's built-in `open()` cannot be awaited directly. For asynchronous file access, use the third-party library `aiofiles`; install it with `pip install aiofiles`.
Features:
- Makes full use of the CPU's idle time, so performance improves significantly when there are many IO-intensive operations
- The programming model is complex and requires understanding of concepts such as "coroutines", "event loops" and "callbacks"
- Suitable for web crawlers, high-concurrency servers, batch processing of multiple files, etc.
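As a rough sketch of the asynchronous model, the example below uses only the standard-library `asyncio` module. `fake_fetch` is a hypothetical coroutine in which `asyncio.sleep` stands in for a real network or disk wait; a real program would await an actual IO library such as `aiofiles` or `aiohttp` instead.

```python
import asyncio
import time


async def fake_fetch(name, delay=0.2):
    """Hypothetical coroutine; asyncio.sleep simulates waiting on a peripheral."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"{name} done"


async def run_concurrently(names, delay=0.2):
    """Start all the waits at once and collect results in input order."""
    return await asyncio.gather(*(fake_fetch(name, delay) for name in names))


start = time.perf_counter()
async_results = asyncio.run(run_concurrently(["a", "b", "c"]))
async_elapsed = time.perf_counter() - start
print(async_results, f"{async_elapsed:.2f}s")  # elapsed is about 1 x delay
```

Because the three waits overlap, the batch takes about as long as the slowest single task rather than the sum of all of them.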
📝 3. IO operations in Python
Python has a complete built-in IO tool library, covering everything from basic file reading and writing to memory simulation.
3.1 Local file IO
Python uses the built-in `open()` function to work with files. Pair it with a `with` statement (discussed below) so that resources are managed automatically.
Basic reading and writing examples
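A minimal sketch of writing, appending, and reading a text file is shown here. The file name `notes.txt` and the temporary directory are assumptions made only so the example does not touch real files.

```python
import os
import tempfile

# A throwaway directory keeps the example away from real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # "w" creates the file (or truncates an existing one) for writing text.
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\n")
        f.write("second line\n")

    # "a" appends to the end without erasing what is already there.
    with open(path, "a", encoding="utf-8") as f:
        f.write("third line\n")

    # "r" reads; iterating over the file yields one line at a time.
    with open(path, "r", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

print(lines)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default text encoding.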
File Mode Cheat Sheet
✨ Tip: Modes can be combined for more flexibility. For example, `r+t` reads and writes in text mode (without truncating the existing file), and `a+b` appends and reads in binary mode.
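The mode characters and a few combinations can be tried out with a short sketch like the one below. The comment table lists the standard `open()` mode characters; the file name `modes.bin` is a hypothetical example.

```python
import os
import tempfile

# Mode characters accepted by open():
#   r  read (default)          w  write, truncating the file first
#   a  append                  x  create, fail if the file already exists
#   t  text (default)          b  binary
#   +  also allow the other direction (read and write)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.bin")  # hypothetical file name

    with open(path, "xb") as f:       # create a new file, binary
        f.write(b"hello world")

    with open(path, "r+b") as f:      # read/write, existing content is kept
        f.write(b"HELLO")             # overwrites only the first 5 bytes
        f.seek(0)
        after_rplus = f.read()

    with open(path, "a+b") as f:      # append and read, binary
        f.write(b"!")                 # appended writes go to the end
        f.seek(0)
        after_aplus = f.read()

    try:
        open(path, "x")               # the file exists, so this must fail
        x_failed = False
    except FileExistsError:
        x_failed = True

print(after_rplus, after_aplus, x_failed)
```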
3.2 Memory IO (temporary data storage)
Memory IO reads and writes data in an in-memory buffer without touching the disk. It is extremely fast and suits scenarios such as temporary formatting or simulating files in unit tests.
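A small sketch of in-memory IO with the standard-library `io.StringIO` (text) and `io.BytesIO` (binary) classes follows. The CSV-like content and the four header bytes are made-up sample data.

```python
import io

# StringIO behaves like a text file that lives entirely in memory.
text_buffer = io.StringIO()
text_buffer.write("name,score\n")
text_buffer.write("alice,90\n")
csv_text = text_buffer.getvalue()  # everything written so far, as one str

# Rewind to the start and read it back line by line, like a real file.
text_buffer.seek(0)
rows = [line.strip().split(",") for line in text_buffer]

# BytesIO is the binary counterpart.
byte_buffer = io.BytesIO(b"\x89PNG-sample-bytes")
header = byte_buffer.read(4)  # read only the first four bytes

print(rows, header)
```

Because both classes expose the same read/write interface as file objects, code that accepts a file object can usually be tested with them without creating anything on disk.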
3.3 Basic network IO
Python's built-in `urllib` handles simple HTTP requests without installing any third-party library.
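The sketch below fetches a page with `urllib.request.urlopen`. To stay self-contained it starts a tiny local server on the loopback interface; `HelloHandler` and `fetch` are hypothetical helpers for this illustration, and in real use you would pass the URL of the site you want to request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that only gives the example something to fetch."""

    def do_GET(self):
        body = b"hello from a local server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet


def fetch(url, timeout=5):
    """Return (status code, decoded body) for a simple GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Ignore any system proxy settings so the loopback request goes direct.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    port = server.server_address[1]
    status, text = fetch(f"http://127.0.0.1:{port}/")
finally:
    server.shutdown()
    server.server_close()

print(status, text)
```

Always pass a `timeout`; without one, a stalled connection can block the program indefinitely.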
🚀 4. Advanced IO technology
Mastering these technologies can further optimize IO performance and code robustness.
4.1 Context Manager (automatically release resources)
Python's `with` statement is the go-to tool for resource management. Even if the code inside it raises an exception, the file is closed and the handle released automatically, which avoids leaking resources such as file handles.
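To illustrate the cleanup guarantee, here is a minimal sketch. `tracked_resource` is a hypothetical context manager built with `contextlib.contextmanager`; it records the order of events so the release step is visible even when an exception is raised.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def tracked_resource(log):
    """Hypothetical resource that records when it is acquired and released."""
    log.append("acquire")
    try:
        yield "resource"
    finally:
        log.append("release")  # runs even if the with-body raises


events = []
try:
    with tracked_resource(events) as res:
        events.append(f"use {res}")
        raise RuntimeError("simulated failure")
except RuntimeError:
    events.append("handled")

# The same guarantee applies to files opened with a with-statement.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("partial")
            raise ValueError("simulated failure")
    except ValueError:
        pass
    file_closed = f.closed

print(events, file_closed)
```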
4.2 Buffered IO (reduce system calls)
System calls (asking the kernel to read or write the disk or network on the program's behalf) are expensive. Python enables buffered IO by default: data is first collected in an in-memory buffer and handed to the operating system in batches, which reduces the number of system calls.
⚠️ Note: `buffering=0` (unbuffered) is only available in binary mode; in text mode it raises an error.
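The `buffering` argument of `open()` can be explored with a sketch like the following. The 64 KiB buffer size and the file name are arbitrary choices for illustration, not recommended values.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "buffered.log")  # hypothetical file name

    # A 64 KiB buffer: small writes are collected in memory first.
    with open(path, "wb", buffering=64 * 1024) as f:
        for _ in range(1000):
            f.write(b"x" * 10)  # 10,000 bytes in total, below the buffer size
        # The data is still waiting in the buffer, not yet in the file.
        size_before_close = os.path.getsize(path)
    size_after_close = os.path.getsize(path)  # closing flushed the buffer

    # buffering=0 is allowed in binary mode: each write goes straight out.
    with open(path, "ab", buffering=0) as f:
        f.write(b"unbuffered")
    final_size = os.path.getsize(path)

    # In text mode, buffering=0 is rejected with a ValueError.
    try:
        open(path, "r", buffering=0)
        text_unbuffered_ok = True
    except ValueError:
        text_unbuffered_ok = False

print(size_before_close, size_after_close, final_size, text_unbuffered_ok)
```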
4.3 Memory mapped files (random access to large files)
When processing files of tens of GB, reading the whole file into memory will crash the program, and sequential line-by-line or chunked reading is inefficient for random access. A memory-mapped file maps part (or all) of the file into the process's address space, so the file can be accessed randomly like an array without a system call for every read.
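📌 Note: Memory-mapped files depend on operating system support. The mapping size is limited by the process's virtual address space (mainly a concern on 32-bit systems) rather than by free physical memory, because pages are loaded on demand; you can also map only part of a file.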
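A minimal sketch with the standard-library `mmap` module is shown below. The file is tiny and generated on the spot purely for illustration; the same calls apply to much larger files.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"0123456789" * 1000)  # 10,000 bytes of sample data

    with open(path, "r+b") as f:
        # Length 0 maps the whole file; pages are loaded on demand.
        with mmap.mmap(f.fileno(), 0) as mm:
            middle = mm[5000:5010]      # random access with slice syntax
            mm[0:4] = b"HEAD"           # in-place write of the same length
            pos = mm.find(b"9", 9990)   # search starting at an offset
            mm.flush()                  # push changes back to the file

    with open(path, "rb") as f:
        first_bytes = f.read(10)

print(middle, pos, first_bytes)
```

Slice assignment must keep the length unchanged, because a mapping cannot grow or shrink the file through ordinary writes.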
💡 5. Modern IO programming practice
5.1 Use pathlib instead of os.path
`pathlib`, introduced in Python 3.4, is an object-oriented library for file path operations. Compared with the traditional string concatenation of `os.path`, it is simpler, safer, and more portable across platforms.
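A short sketch of common `pathlib` operations follows. The directory layout (`reports/2024/summary.txt`) is a made-up example created inside a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)

    # The / operator joins path segments with the right separator.
    report = base / "reports" / "2024" / "summary.txt"  # hypothetical layout
    report.parent.mkdir(parents=True, exist_ok=True)

    # Small files can be written and read in one call each.
    report.write_text("total: 42\n", encoding="utf-8")
    content = report.read_text(encoding="utf-8")

    # Path components are available as attributes.
    name, stem, suffix = report.name, report.stem, report.suffix

    # rglob searches recursively; as_posix gives a platform-neutral string.
    found = sorted(p.relative_to(base).as_posix() for p in base.rglob("*.txt"))
    existed = report.exists()

print(name, stem, suffix, found)
```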
5.2 High concurrent file/network processing
In addition to asynchronous IO, Python also provides thread pool/process pool to handle concurrent IO tasks.
🌟 Tip: Use a thread pool for IO-intensive tasks (processes are much costlier to create and switch between than threads), and use a process pool for CPU-intensive tasks.
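A thread pool for IO-bound work can be sketched with `concurrent.futures.ThreadPoolExecutor`. `fake_io_task` is a hypothetical function in which `time.sleep` simulates waiting on a disk or network, and the pool size of 4 is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(n, delay=0.2):
    """Hypothetical IO-bound task; sleep simulates waiting on a peripheral."""
    time.sleep(delay)
    return n * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map runs the calls in worker threads and keeps results in input order.
    pool_results = list(pool.map(fake_io_task, range(4)))
pool_elapsed = time.perf_counter() - start

print(pool_results, f"{pool_elapsed:.2f}s")  # about 1 x delay, not 4 x delay
```

For CPU-bound functions, `ProcessPoolExecutor` from the same module offers the same interface.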
🎯 6. Performance optimization suggestions and frequently asked questions
6.1 Practical performance optimization suggestions
- Batch your operations (single reads and writes of 1 MB or more show a clear benefit): this reduces the number of system calls
- Size buffers sensibly: 4096 bytes is a common operating-system page size, and Python's default buffer size usually does not need to be adjusted
- Prefer asynchronous IO or a thread pool: do not rely on a single blocking thread in IO-intensive scenarios
- Avoid many small files: merge small files such as logs and images to cut the overhead of opening and closing handles
- Use memory mapping for random access to large files: it can be 10-100 times faster than line-by-line or chunked sequential reading, depending on the access pattern
6.2 High-frequency pitfalls and solutions
❌ Problem 1: Processing a large file exhausts memory
✅ Solution: read the file line by line or in chunks
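Both approaches can be sketched as follows. `iter_chunks` and `count_lines` are hypothetical helper names, and the 4096-byte chunk size in the demo is chosen only to keep the sample small; only one chunk or one line is held in memory at a time.

```python
import os
import tempfile


def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield the file's contents in fixed-size binary chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk


def count_lines(path):
    """Count lines by iterating, which reads one line at a time."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "big.log")  # hypothetical file name
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for i in range(10_000):
            f.write(f"line {i}\n")

    file_size = os.path.getsize(path)
    total_bytes = sum(len(chunk) for chunk in iter_chunks(path, chunk_size=4096))
    chunk_count = sum(1 for _ in iter_chunks(path, chunk_size=4096))
    line_count = count_lines(path)

print(file_size, total_bytes, chunk_count, line_count)
```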
❌ Problem 2: Data is lost when a sudden power outage or crash follows a file write
✅ Solution: force a flush to disk (use this only where data consistency requirements are extremely high, because it significantly reduces performance)
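One way to force data to disk is to flush Python's buffer and then call `os.fsync`. The sketch below assumes a hypothetical helper named `durable_write`; actual durability still depends on the operating system and the storage hardware.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes and ask the OS to push them to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # Python's buffer -> operating system cache
        os.fsync(f.fileno())  # operating system cache -> storage device


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ledger.bin")  # hypothetical file name
    durable_write(path, b"balance=100\n")
    with open(path, "rb") as f:
        saved = f.read()

print(saved)
```

`flush()` alone only hands the data to the operating system; `os.fsync()` is the step that waits for the device, which is where the performance cost comes from.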
📌 Summary
IO programming is one of the areas of programming closest to hardware resources. Choosing the right strategy not only fixes stalls and slowdowns but also lets the program use the machine's full performance potential.
- In simple scenarios, use synchronous IO + pathlib
- For IO-heavy workloads, use asynchronous IO (aiofiles/aiohttp) with coroutines, or a thread pool
- For random access to large files, use memory-mapped files
- Always use a context manager to manage resources
Hurry up and find a small project to practice on! For example, use aiohttp to write a simple image crawler, or use pathlib to organize your download folder~

