title: Simple TXT plain text file storage description: Python crawler data storage: TXT text storage guide

Python crawler data storage: TXT text storage guide

When you are a newbie practicing crawling, saving intermediate data for temporary projects, and don’t want to worry about dependencies, TXT is the “lightweight savior” you can’t escape - but how to save it quickly and clearly without losing data? This article will tell you everything.

1. Why choose TXT?

Don’t underestimate this format that existed in the last century. It has three irreplaceable core advantages:

  • Zero threshold to get started: No need to install any libraries (json, csv, pandas are not required), Python comes with itopen()You can work directly.
  • All platforms available: Windows Notepad, macOS Text Editor, LinuxcatCommand, you can open it and look at it at will, without any environment dependence.
  • Absolutely lightweight and non-redundant: no schema, no headers, no encoding locks - as long as you remember to forcibly specify UTF-8, there will be no garbled characters.

Of course, the shortcomings are also there:

  • It is not possible to quickly filter by conditions such as "Rating > 9" and "Science Fiction Movies".
  • It is difficult to automatically verify data integrity (for example, if the release time is missed, it cannot be seen in the file at all).
  • There is a lag when opening large files, and the modification efficiency is extremely low.

Suitable scene

✅ Practice crawlers and verify crawling results ✅ Short-term temporary storage of "transitional data" before JSON/CSV ✅ Simple debugging logs and monitoring records ❌ Requires structured retrieval and statistics ❌ Long-term persistence of core data ❌ Massive data processing exceeding one million levels


2. Practical example: Crawl the sample movie website and save TXT

Let’s take the public crawler practice website https://ssr1.scrape.center/ as a demonstration. Crawl the names, genres, release times, and ratings of the 10 movies on the homepage, and then generate an **easy-to-read, delimited, UTF-8 encoded TXT file.

2.1 Complete runnable code

import requests
from bs4 import BeautifulSoup
from typing import List, Dict
from datetime import datetime  # 示例追加模式会用到

def scrape_movies() -> List[Dict]:
    """爬取练习网站首页10部电影的结构化数据"""
    base_url = "https://ssr1.scrape.center/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    try:
        response = requests.get(base_url, headers=headers, timeout=10)
        response.raise_for_status()   # 自动抛出 HTTP 错误
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = []

        for item in soup.select('.el-card'):
            name = item.select_one('h2').text.strip()
            categories = [c.text.strip() for c in item.select('.categories button')]
            # 注意:示例网站的上映时间藏在 .info 中
            published_at = item.select_one('.info:contains("上映")').text.split(':')[-1].strip()
            score = item.select_one('.score').text.strip()

            movies.append({
                'name': name,
                'categories': categories,
                'published_at': published_at,
                'score': score
            })

        return movies

    except requests.exceptions.RequestException as e:
        print(f"请求失败,原因:{e}")
        return []
    except AttributeError as e:
        print(f"解析 HTML 失败,检查选择器:{e}")
        return []

def save_to_txt(data: List[Dict], filename: str = 'movies_output.txt'):
    """将结构化电影数据保存为易读的 TXT 文件"""
    if not data:
        print("没有可保存的数据")
        return

    try:
        with open(filename, 'w', encoding='utf-8') as f:
            for idx, movie in enumerate(data, 1):
                f.write(f"【第{idx}部电影】\n")
                f.write(f"电影名称:{movie['name']}\n")
                f.write(f"类型标签:{', '.join(movie['categories'])}\n")
                f.write(f"上映时间:{movie['published_at']}\n")
                f.write(f"豆瓣评分(示例):{movie['score']}\n")
                f.write("=" * 60 + "\n\n")   # 分隔线,增强可读性

        print(f"✅ 成功保存{len(data)}部电影数据到:{filename}")

    except IOError as e:
        print(f"❌ 文件保存失败,原因:{e}")

if __name__ == '__main__':
    movies = scrape_movies()
    save_to_txt(movies)

2.2 Disassembly of code highlights

Compared with the "Getting Started TXT Crawler Code" randomly searched on the Internet, this version has made several practical improvements:

  1. Clear type annotations:-> List[Dict]Such annotation makes it easier to pass parameters incorrectly when writing, and the return value structure can be understood at a glance when reading.
  2. Perfect error branch: handle "request failure" and "parse failure" separately, so you don't have to guess where the problem is when debugging.
  3. Force specifying UTF-8 encoding: Completely solve the problem of garbled characters when opening Windows Notepad in GBK by default.
  4. Output with serial numbers, labels, and strong separation lines: You can quickly locate the movie you want to watch by scanning it with the naked eye.
  5. Check data first and then save: Avoid generating empty files and reduce meaningless file residues.

3. Must-know Python TXT file operations

Used in the previous actual combatwWrite mode, but there is much more to TXT operations than just that. The core is to master the three points of opening mode, with statement, and coding standards.

3.1 Core open mode comparison table

PatternFull nameCore behaviorMust-know detailsApplicable scenarios
rreadOpen an existing file for read onlyAn error will be reported if the file does not existRead already saved TXT data
wwriteOverwrite writingIf the file does not exist, it will be created; if it exists, the original content will be cleared and rewritten**Generate a complete structured TXT for the first time
aappendAppend writingIf the file does not exist, it will be created; if it exists, it will be added at the end of the fileIncremental crawling and logging
r+read+writeOpen existing file for reading and writingOverwrite starting from the current pointer position (default start)Modify existing TXT in a small range

⚠️ Tips for newbies: Don’t use itwMode to "modify a certain line of an existing file" - use it directlyr+Or "read all contents → modify → overwrite" is safer.

3.2 Advanced common operations

3.2.1 Incremental append mode (most commonly used when crawlers crawl multiple pages)

def append_to_txt(movie: Dict, filename: str = 'movies_append.txt'):
    """单条追加电影数据到 TXT"""
    try:
        with open(filename, 'a', encoding='utf-8') as f:
            # 加当前时间戳,方便追溯是哪次爬取的
            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            f.write(f"[{timestamp}] ")
            f.write(f"电影:{movie['name']} | 类型:{', '.join(movie['categories'])} | 评分:{movie['score']}\n")
    except IOError as e:
        print(f"追加失败:{e}")

# 调用示例
if movies:
    append_to_txt(movies[0])

3.2.2 Batch writing improves performance

If you want to write tens of thousands of data, do not useforLoop line by linef.write(), should usef.writelines(), write-once performance is higher:

# 先生成所有要写入的字符串列表
lines_to_write = []
for idx, movie in enumerate(movies, 1):
    lines_to_write.append(f"【第{idx}部】{movie['name']}\n")
    lines_to_write.append(f"类型:{', '.join(movie['categories'])}\n")
    lines_to_write.append("-" * 30 + "\n")

# 一次性写入
with open('bulk_movies.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines_to_write)

4. Guide to safe pit avoidance

Although TXT is very simple, there are a few "little tricks" that are easy to step on:

  1. Garbled Thunder: No matter what operating system you are using, whenever you write a text file, please be sure to addencoding='utf-8'
  2. Path Mine: usepathlib.PathAutomatically adapt to cross-platform paths and stay away from handwriting\or/Troubles:
    from pathlib import Path
    
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)   # 文件夹已存在也不会报错
    
    safe_file = output_dir / "safe_movies.txt"
  3. Permissions: If sensitive data is stored, you can set stricter file permissions on Linux/macOS:
    import os
    
    if os.name == "posix":   # 仅对 macOS/Linux 生效
        os.chmod(safe_file, 0o600)   # 只有文件所有者可读可写

5. Summary

TXT is not a panacea, but in the three scenarios of "fast verification, temporary transition, and lightweight logging", it is definitely the most cost-effective choice. When you need structured retrieval later, you can easily convert TXT to JSON/CSV, or import it into a SQLite database - this is what we will talk about in the next article.

What kind of crawler data do you usually use TXT to save? Welcome to communicate in the comment area 😉