Practical combat of reading and writing CSV files in Python

Practical practice of reading and writing CSV files in Python: "Universal Table Protocol" in the era of plain text

1. CSV file: lightweight cross-platform data transfer station

CSV (Comma Separated Values), the full name is "Comma Separated Values", is a plain text data format that has no official mandatory standard, but has wide consensus in the industry. In Daoman Python AI's actual projects, CSV is almost everywhere:

  • Temporary import/export middleware for databases
  • "Data Express" between front-end and back-end and cross-language systems
  • Fast storage and exchange of small and medium-sized machine learning data sets

Its biggest advantage is plain text: whether you use Windows Notepad, macOS Numbers, Linux Vim or Excel, you can directly open and edit it without installing special parsing tools.

The CSV structure convention that everyone follows by default:

  1. Plain text, coding-friendly: UTF-8 is commonly used (preferred for cross-platform), and sometimes GBK is also encountered (such as files exported by some domestic Excels)
  2. Each line is a complete record: For example, one line is "Guan Yu's three courses results"
  3. Clearly separated fields: English commas are used by default,, but tabs can also be used\t, vertical line|etc. as delimiters (especially useful when the data itself contains a large number of commas)
  4. Table header is optional, but highly recommended: The first line states the meaning of each column, so that both humans and machines can understand it at a glance
  5. Use brackets for special content: If a field contains delimiters, newlines, or quotation marks themselves, use double quotation marks."Wrap the entire field to avoid confusion in parsing

2. Use nativecsvModule writing to files: from simple data to custom formats

Python standard library comes withcsvModule, without pip to install any third-party package, can cover about 80% of daily CSV reading and writing needs. Let’s start with the most basic scenario: save the random scores of Guan Yu, Zhang Fei and other five Shu Han generals.scores.csv

Basic scenario: ordinary data + UTF-8 encoding

import csv
import random

# ✅ 必须养成的习惯:
# 1. open 时用 newline='',避免 Windows 平台产生多余空行
# 2. 明确指定 encoding='utf-8',防止中文乱码
with open('scores.csv', 'w', encoding='utf-8', newline='') as f:
    # 创建 writer 写入器
    writer = csv.writer(f)
    
    # 先写入表头,方便后续阅读和处理
    writer.writerow(['姓名', '语文', '数学', '英语'])
    
    # 生成模拟数据并逐行写入
    five_tigers = ['关羽', '张飞', '赵云', '马超', '黄忠']
    for name in five_tigers:
        # 列表推导式快速生成 3 门 50~100 的随机成绩
        scores = [random.randint(50, 101) for _ in range(3)]
        # 把姓名插到成绩列表最前面
        scores.insert(0, name)
        writer.writerow(scores)

✨ After running, try opening it in Excel? Remember to select the "UTF-8 comma separated" format when importing, and Chinese will be displayed normally.

Advanced scenario: Custom separator + bracket

In actual work, you may encounter some "non-mainstream" CSV files:

  • Use vertical lines|delimited (because the data naturally contains a lot of commas)
  • Force double quotes for all fields (a safer, unambiguous common practice)

At this time, just givecsv.writerJust pass in additional parameters:

import csv
import random

with open('custom_scores.csv', 'w', encoding='utf-8', newline='') as f:
    # 🎯 自定义参数:
    # delimiter='|' → 使用竖线作为字段分隔符
    # quoting=csv.QUOTE_ALL → 所有字段都加上双引号保护
    writer = csv.writer(
        f,
        delimiter='|',
        quoting=csv.QUOTE_ALL
    )
    
    writer.writerow(['姓名', '备注', '语文', '数学', '英语'])
    five_tigers = ['关羽', '张飞', '赵云', '马超', '黄忠']
    notes = ['桃园结义二弟', '桃园结义三弟', '长坂坡英雄', '锦马超', '百步穿杨']
    for name, note in zip(five_tigers, notes):
        scores = [random.randint(50, 101) for _ in range(3)]
        row = [name, note] + scores
        writer.writerow(row)

Looking at the generated file, you will find that each column uses|separated, and all fields are wrapped in double quotes, which is clear and safe.


3. Use nativecsvModule reading files: from line-by-line traversal to key-value pair access

When reading CSV, there are two most commonly used methods: list-based reading (simple and direct) and dictionary-based reading (strongly recommended, automatically associates headers with data).

Method 1: List-based reading (csv.reader

import csv

with open('scores.csv', 'r', encoding='utf-8') as f:
    # 创建 reader 迭代器(注意:迭代器只能完整遍历一次!)
    reader = csv.reader(f)
    for row in reader:
        # row 是一个字符串列表,无论原始数据是数字还是中文
        print(f"第 {reader.line_num} 行数据:", row)

If you don't want to rely on indexingrow[0]Take a name, rely onrow[1]Get Chinese language scores,DictReaderwould be a better choice - it automatically uses the table header as the key of the dictionary, and converts each row of data into a dictionary:

import csv

with open('scores.csv', 'r', encoding='utf-8') as f:
    # 创建 DictReader 迭代器
    reader = csv.DictReader(f)
    # 打印表头信息
    print("表头:", reader.fieldnames)
    print("-" * 40)
    for row in reader:
        # 直接通过键名取数据,语义清晰、不易出错
        print(f"{row['姓名']}的成绩:语文{row['语文']},数学{row['数学']},英语{row['英语']}")

📌 Tip: If you are reading the vertical bar-delimited file defined previously, remember to createreaderorDictReaderThe same is passed in whendelimiter='|', otherwise the analysis will fail!


4. Practical pit avoidance guide (a must-read for novices!)

csvThe module seems simple, but in actual use there are still several high-frequency pitfalls. Understanding it in advance can save a lot of debugging time.

Pit 1: Chinese garbled characters

  • Cause: Inconsistent encoding. For example, writing in GBK but reading in UTF-8, or vice versa.
  • SOLVED: When writing and reading, always explicitly specify the sameencoding. Domestic Excel uses GBK by default. In other scenarios, it is recommended to use UTF-8 first.

Pit 2: Extra blank lines on Windows platform

  • Cause: Windows default newline character is\r\n,andcsvThe module will automatically handle line breaks, no needopenThe function's own end-of-line conversion.
  • SOLVED: When writing to a file,open()must be addednewline=''(It is usually added when reading to maintain consistency).

Pit 3: The iterator can only be traversed once

  • reason:csv.readerandcsv.DictReaderWhat is returned is an iterator and repeated traversal is not supported.
  • Solution: If you need to use data multiple times, you can uselist(reader)Convert everything into a list. But be aware that large files (more than hundreds of thousands of lines) are best processed line by line to avoid memory explosion.

5. Summary and Prospect: NativecsvIt's the basics, Pandas is the advanced weapon

NativecsvModule applicable scenarios

  • Fast reading and writing of small and medium-sized CSV files (within tens of thousands of lines)
  • Scenarios that require highly customized formats such as separators and brackets
  • Lightweight projects that do not want to introduce third-party dependencies

The power of advanced artifact Pandas

In the large-scale data analysis and machine learning scenarios involved in Daoman Python AI, we almost no longer write nativecsv, instead use Pandas directly:

  1. Read and write CSV with one line of code:pd.read_csv()anddf.to_csv(), concise and clear
  2. Automatic conversion to DataFrame: A two-dimensional table structure similar to Excel that supports advanced operations such as filtering, sorting, grouping aggregation, and missing value processing. One line of API can replace dozens of lines of native code.
  3. Significant performance advantages: For files with hundreds of thousands or even millions of lines, Pandas reads faster and has better memory management.

In the next article, we will officially enter the world of Pandas reading and writing CSV and experience "nuclear weapon" level data processing efficiency. Stay tuned!