title: Simple TXT plain text file storage description: Python crawler data storage: TXT text storage guide

Python crawler data storage: TXT text storage guide

When you are a newbie practicing crawling, saving intermediate data for temporary projects, and don’t want to worry about dependencies, TXT is the “lightweight savior” you can’t escape - but how to save it quickly and clearly without losing data? This article will tell you everything.

1. Why choose TXT?

Don’t underestimate this format that existed in the last century. It has three irreplaceable core advantages:

Zero threshold to get started: No need to install any libraries (json, csv, pandas are not required), Python comes with itopen()You can work directly.
All platforms available: Windows Notepad, macOS Text Editor, LinuxcatCommand, you can open it and look at it at will, without any environment dependence.
Absolutely lightweight and non-redundant: no schema, no headers, no encoding locks - as long as you remember to forcibly specify UTF-8, there will be no garbled characters.

Of course, the shortcomings are also there:

It is not possible to quickly filter by conditions such as "Rating > 9" and "Science Fiction Movies".
It is difficult to automatically verify data integrity (for example, if the release time is missed, it cannot be seen in the file at all).
There is a lag when opening large files, and the modification efficiency is extremely low.

Suitable scene

✅ Practice crawlers and verify crawling results ✅ Short-term temporary storage of "transitional data" before JSON/CSV ✅ Simple debugging logs and monitoring records ❌ Requires structured retrieval and statistics ❌ Long-term persistence of core data ❌ Massive data processing exceeding one million levels

2. Practical example: Crawl the sample movie website and save TXT

Let’s take the public crawler practice website https://ssr1.scrape.center/ as a demonstration. Crawl the names, genres, release times, and ratings of the 10 movies on the homepage, and then generate an **easy-to-read, delimited, UTF-8 encoded TXT file.

2.1 Complete runnable code

import requests
from bs4 import BeautifulSoup
from typing import List, Dict
from datetime import datetime  # 示例追加模式会用到

def scrape_movies() -> List[Dict]:
    """爬取练习网站首页10部电影的结构化数据"""
    base_url = "https://ssr1.scrape.center/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    try:
        response = requests.get(base_url, headers=headers, timeout=10)
        response.raise_for_status()   # 自动抛出 HTTP 错误
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = []

        for item in soup.select('.el-card'):
            name = item.select_one('h2').text.strip()
            categories = [c.text.strip() for c in item.select('.categories button')]
            # 注意：示例网站的上映时间藏在 .info 中
            published_at = item.select_one('.info:contains("上映")').text.split('：')[-1].strip()
            score = item.select_one('.score').text.strip()

            movies.append({
                'name': name,
                'categories': categories,
                'published_at': published_at,
                'score': score
            })

        return movies

    except requests.exceptions.RequestException as e:
        print(f"请求失败，原因：{e}")
        return []
    except AttributeError as e:
        print(f"解析 HTML 失败，检查选择器：{e}")
        return []

def save_to_txt(data: List[Dict], filename: str = 'movies_output.txt'):
    """将结构化电影数据保存为易读的 TXT 文件"""
    if not data:
        print("没有可保存的数据")
        return

    try:
        with open(filename, 'w', encoding='utf-8') as f:
            for idx, movie in enumerate(data, 1):
                f.write(f"【第{idx}部电影】\n")
                f.write(f"电影名称：{movie['name']}\n")
                f.write(f"类型标签：{', '.join(movie['categories'])}\n")
                f.write(f"上映时间：{movie['published_at']}\n")
                f.write(f"豆瓣评分（示例）：{movie['score']}\n")
                f.write("=" * 60 + "\n\n")   # 分隔线，增强可读性

        print(f"✅ 成功保存{len(data)}部电影数据到：{filename}")

    except IOError as e:
        print(f"❌ 文件保存失败，原因：{e}")

if __name__ == '__main__':
    movies = scrape_movies()
    save_to_txt(movies)

2.2 Disassembly of code highlights

Compared with the "Getting Started TXT Crawler Code" randomly searched on the Internet, this version has made several practical improvements:

Clear type annotations:-> List[Dict]Such annotation makes it easier to pass parameters incorrectly when writing, and the return value structure can be understood at a glance when reading.
Perfect error branch: handle "request failure" and "parse failure" separately, so you don't have to guess where the problem is when debugging.
Force specifying UTF-8 encoding: Completely solve the problem of garbled characters when opening Windows Notepad in GBK by default.
Output with serial numbers, labels, and strong separation lines: You can quickly locate the movie you want to watch by scanning it with the naked eye.
Check data first and then save: Avoid generating empty files and reduce meaningless file residues.

3. Must-know Python TXT file operations

Used in the previous actual combatwWrite mode, but there is much more to TXT operations than just that. The core is to master the three points of opening mode, with statement, and coding standards.

3.1 Core open mode comparison table

Pattern	Full name	Core behavior	Must-know details	Applicable scenarios
`r`	read	Open an existing file for read only	An error will be reported if the file does not exist	Read already saved TXT data
`w`	write	Overwrite writing	If the file does not exist, it will be created; if it exists, the original content will be cleared and rewritten**	Generate a complete structured TXT for the first time
`a`	append	Append writing	If the file does not exist, it will be created; if it exists, it will be added at the end of the file	Incremental crawling and logging
`r+`	read+write	Open existing file for reading and writing	Overwrite starting from the current pointer position (default start)	Modify existing TXT in a small range

⚠️ Tips for newbies: Don’t use itwMode to "modify a certain line of an existing file" - use it directlyr+Or "read all contents → modify → overwrite" is safer.

3.2 Advanced common operations

3.2.1 Incremental append mode (most commonly used when crawlers crawl multiple pages)

def append_to_txt(movie: Dict, filename: str = 'movies_append.txt'):
    """单条追加电影数据到 TXT"""
    try:
        with open(filename, 'a', encoding='utf-8') as f:
            # 加当前时间戳，方便追溯是哪次爬取的
            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            f.write(f"[{timestamp}] ")
            f.write(f"电影：{movie['name']} | 类型：{', '.join(movie['categories'])} | 评分：{movie['score']}\n")
    except IOError as e:
        print(f"追加失败：{e}")

# 调用示例
if movies:
    append_to_txt(movies[0])

3.2.2 Batch writing improves performance

If you want to write tens of thousands of data, do not useforLoop line by linef.write(), should usef.writelines(), write-once performance is higher:

# 先生成所有要写入的字符串列表
lines_to_write = []
for idx, movie in enumerate(movies, 1):
    lines_to_write.append(f"【第{idx}部】{movie['name']}\n")
    lines_to_write.append(f"类型：{', '.join(movie['categories'])}\n")
    lines_to_write.append("-" * 30 + "\n")

# 一次性写入
with open('bulk_movies.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines_to_write)

4. Guide to safe pit avoidance

Although TXT is very simple, there are a few "little tricks" that are easy to step on:

Garbled Thunder: No matter what operating system you are using, whenever you write a text file, please be sure to addencoding='utf-8'。

Path Mine: usepathlib.PathAutomatically adapt to cross-platform paths and stay away from handwriting\or/Troubles:

from pathlib import Path

output_dir = Path("output")
output_dir.mkdir(exist_ok=True)   # 文件夹已存在也不会报错

safe_file = output_dir / "safe_movies.txt"

Permissions: If sensitive data is stored, you can set stricter file permissions on Linux/macOS:

import os

if os.name == "posix":   # 仅对 macOS/Linux 生效
    os.chmod(safe_file, 0o600)   # 只有文件所有者可读可写

5. Summary

TXT is not a panacea, but in the three scenarios of "fast verification, temporary transition, and lightweight logging", it is definitely the most cost-effective choice. When you need structured retrieval later, you can easily convert TXT to JSON/CSV, or import it into a SQLite database - this is what we will talk about in the next article.

What kind of crawler data do you usually use TXT to save? Welcome to communicate in the comment area 😉

#title: Simple TXT plain text file storage description: Python crawler data storage: TXT text storage guide

#Python crawler data storage: TXT text storage guide

#1. Why choose TXT?

#Suitable scene

#2. Practical example: Crawl the sample movie website and save TXT

#2.1 Complete runnable code

#2.2 Disassembly of code highlights

#3. Must-know Python TXT file operations

#3.1 Core open mode comparison table

#3.2 Advanced common operations

#3.2.1 Incremental append mode (most commonly used when crawlers crawl multiple pages)

#3.2.2 Batch writing improves performance

#4. Guide to safe pit avoidance

#5. Summary