Automating PDF processing with Python

In the workplace, PDF (Portable Document Format) is hard to replace: it looks the same on every platform, which avoids the familiar nightmare of Word or PowerPoint layouts breaking on a different device. In the day-to-day office automation work at Daoman Python AI, whether the task is sorting scanned documents in batches, generating data reports, or encrypting contracts in bulk, Python is the "Swiss Army knife" for processing PDFs.

This article covers 5 common, practical scenarios. Each comes with code that you can adapt and run directly 👇


1. Extract text from PDF

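If you just want to quickly extract the text from a standard, text-based PDF, PyPDF2 is a good first choice: it is lightweight, stable, and very easy to get started with. (PyPDF2 is no longer developed under that name; its maintainers continue the project as pypdf, whose API is largely the same.)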

Installation and basic use

Install it with pip:

pip install PyPDF2

Extracting text from a standard text PDF is straightforward:

import PyPDF2

# Step 1: open the file with PdfReader (note the "rb" binary read mode)
with open("test.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    # Get the total number of pages
    total_pages = len(reader.pages)
    print(f"This document has {total_pages} page(s)\n")

    # Step 2: extract and print the text page by page
    for i, page in enumerate(reader.pages, 1):
        # Guard against None as well as empty strings
        content = (page.extract_text() or "").strip()
        print(f"--- Page {i} ---")
        print(content or "(no standard text on this page, or it is image-only)\n")

✅ The "or" fallback prints a placeholder for empty pages, which makes non-text pages easy to spot.

Frequently Asked Questions and Alternatives

  • Garbled Chinese text: PyPDF2's handling of Chinese font mappings is mediocre. Try PyMuPDF (very fast) or pdfminer.six (layout-focused, preserves text positions).
  • Scanned PDFs / OCR: none of the libraries above will work. You need a dedicated OCR tool such as PaddleOCR or Tesseract.
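
Before sending a whole document to an OCR tool, it can help to find out which pages actually lack a text layer. The sketch below is one possible way to do that with the same PdfReader API used above. The names pages_without_text and scan_pdf are hypothetical helpers, and the rule "no extractable characters means the page is image-only" is an assumption that will not hold for every file.

```python
from typing import Iterable, List, Optional


def pages_without_text(page_texts: Iterable[Optional[str]]) -> List[int]:
    """Return the 1-based numbers of pages whose extracted text is empty."""
    empty_pages = []
    for number, text in enumerate(page_texts, 1):
        # Treat None, "" and whitespace-only strings as "no text layer"
        if not (text or "").strip():
            empty_pages.append(number)
    return empty_pages


def scan_pdf(path: str) -> List[int]:
    """Open a PDF with PyPDF2 and list the pages that probably need OCR."""
    # Imported lazily so pages_without_text() works without PyPDF2 installed.
    import PyPDF2

    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        return pages_without_text(page.extract_text() for page in reader.pages)
```

Calling scan_pdf("test.pdf") would return a list of page numbers, which could then be passed to PaddleOCR or Tesseract instead of processing every page.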

2. Basic PDF "minor surgery": rotation, encryption, merging

PyPDF2's PdfWriter is the core object for creating new PDFs or modified copies of existing ones. Together with PdfReader, it covers most document-level operations.

2.1 Single page/batch rotation

Are your scans sideways, or do you need to adjust the orientation of odd or even pages in bulk? Here is how 👇

import PyPDF2

# Open the input for reading and the output for writing
with open("manual.pdf", "rb") as f_in, open("rotated.pdf", "wb") as f_out:
    reader = PyPDF2.PdfReader(f_in)
    writer = PyPDF2.PdfWriter()

    for i, page in enumerate(reader.pages):
        # Rotation angle: positive is clockwise, negative is counter-clockwise;
        # it must be a multiple of 90
        if i % 2 == 0:  # i is 0-based, so this matches pages 1, 3, 5, ...
            page.rotate(90)
        writer.add_page(page)

    # Finally, write the new file
    writer.write(f_out)
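
Because the angle must be a multiple of 90, it can be worth validating user-supplied values before they reach page.rotate(). The following is a small sketch of such a check; normalize_rotation is a hypothetical helper name, not part of PyPDF2.

```python
def normalize_rotation(angle: int) -> int:
    """Map a rotation angle to 0, 90, 180 or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    # Python's % returns a non-negative result for a positive modulus,
    # so -90 (counter-clockwise) becomes 270 (clockwise).
    return angle % 360
```

For example, page.rotate(normalize_rotation(-90)) would pass 270 to PyPDF2, while an angle of 45 would raise a ValueError before any page is touched.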

2.2 Encrypt sensitive files in batches

Need to add an open password to private documents such as payslips and contracts in bulk? The core logic is a single call to encrypt():

import PyPDF2
from pathlib import Path  # Friendlier path handling, the basis for batch jobs

# Encrypt every PDF in a folder
source_dir = Path("./contracts/")
output_dir = Path("./encrypted_contracts/")
output_dir.mkdir(exist_ok=True)  # Create the output folder if it is missing

for pdf_path in source_dir.glob("*.pdf"):
    with open(pdf_path, "rb") as f_in, open(output_dir / f"{pdf_path.stem}_encrypted.pdf", "wb") as f_out:
        reader = PyPDF2.PdfReader(f_in)
        writer = PyPDF2.PdfWriter()

        # Copy all pages
        for page in reader.pages:
            writer.add_page(page)

        # encrypt(user_password, owner_password): only the user (open) password
        # is set here, so the owner password defaults to the same value
        writer.encrypt("your_contract_password")
        writer.write(f_out)

print("✅ All contracts encrypted!")
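
The section title also mentions merging, which follows the same read-then-write pattern as rotation and encryption. Below is a minimal sketch, assuming the files should be joined in file-name order; collect_pdfs and merge_pdfs are hypothetical helper names, and only PdfReader, PdfWriter.add_page and PdfWriter.write from the examples above are used.

```python
from pathlib import Path
from typing import List


def collect_pdfs(folder: Path) -> List[Path]:
    """Return the PDF files in a folder, sorted by file name."""
    return sorted(p for p in folder.glob("*.pdf") if p.is_file())


def merge_pdfs(inputs: List[Path], output: Path) -> int:
    """Append every page of every input to one output file; return the page count."""
    # Imported lazily so collect_pdfs() can be used without PyPDF2 installed.
    import PyPDF2

    writer = PyPDF2.PdfWriter()
    total = 0
    for pdf_path in inputs:
        # Given a path, PdfReader reads the file contents itself
        reader = PyPDF2.PdfReader(str(pdf_path))
        for page in reader.pages:
            writer.add_page(page)
            total += 1
    with open(output, "wb") as f_out:
        writer.write(f_out)
    return total
```

A call such as merge_pdfs(collect_pdfs(Path("./contracts/")), Path("all_contracts.pdf")) would produce one combined file. Sorting by name is only one possible ordering; pass an explicit list if the order matters.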

3. Batch watermarking: a handy use of page merging

page.merge_page() overlays the content of a watermark PDF page onto each page of the target PDF. The source document's text stays selectable and searchable, which makes this the standard approach for batch-stamping documents with "Internal", "Draft", or a company logo.

Preparation

First create a PDF that contains only the watermark page. You can make it in PowerPoint, Word, or Photoshop: keep only the watermark element and use a transparent background. A white background would be drawn over each page and hide the original text, so transparent is strongly preferred.
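
If you would rather not build the watermark page by hand, it can also be generated in Python. The following is a minimal sketch, assuming reportlab is installed (pip install reportlab). make_watermark and diagonal_angle are hypothetical helper names, and the built-in Helvetica-Bold font only covers Latin text, so a Chinese watermark would need a registered TrueType font.

```python
import math
from typing import Tuple


def diagonal_angle(page_width: float, page_height: float) -> float:
    """Angle in degrees of the diagonal from bottom-left to top-right."""
    return math.degrees(math.atan2(page_height, page_width))


def make_watermark(path: str, text: str,
                   page_size: Tuple[float, float] = (595, 842)) -> None:
    """Write a one-page PDF with semi-transparent diagonal text."""
    # Imported lazily so diagonal_angle() works without reportlab installed.
    from reportlab.pdfgen import canvas

    width, height = page_size
    c = canvas.Canvas(path, pagesize=page_size)
    c.saveState()
    c.translate(width / 2, height / 2)        # move the origin to the page centre
    c.rotate(diagonal_angle(width, height))   # counter-clockwise, in degrees
    c.setFillGray(0.5)                        # mid-grey text
    c.setFillAlpha(0.3)                       # 30% opacity
    c.setFont("Helvetica-Bold", 60)
    c.drawCentredString(0, 0, text)           # centred on the new origin
    c.restoreState()
    c.save()
```

Nothing is painted behind the text, so the page background stays transparent. Calling make_watermark("watermark_template.pdf", "DRAFT") would create a file that can serve as the watermark page.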

Batch watermark code

import PyPDF2
from pathlib import Path

source_dir = Path("./source_docs/")
output_dir = Path("./watermarked_docs/")
output_dir.mkdir(exist_ok=True)
watermark_path = Path("./watermark_template.pdf")  # Transparent watermark page prepared in advance

# Open the watermark file; the watermark page only needs to be read once
with open(watermark_path, "rb") as f_wm:
    wm_reader = PyPDF2.PdfReader(f_wm)
    wm_page = wm_reader.pages[0]

    # Process every PDF in the source folder
    for pdf_path in source_dir.glob("*.pdf"):
        with open(pdf_path, "rb") as f_in, open(output_dir / f"{pdf_path.stem}_watermarked.pdf", "wb") as f_out:
            reader = PyPDF2.PdfReader(f_in)
            writer = PyPDF2.PdfWriter()

            for page in reader.pages:
                # Overlay the watermark first, then add the page to the writer
                page.merge_page(wm_page)
                writer.add_page(page)

            writer.write(f_out)

print("✅ All documents watermarked!")

4. Generate professional PDF reports from scratch

If you need to generate reports dynamically from data, with custom fonts, graphics, and tables, PyPDF2 is not enough. You need reportlab, a "drawing board" for producing industrial-grade PDFs in Python.

Installation and basic report generation

Install dependencies first:

pip install reportlab

Simple report code with Chinese font

reportlab's built-in fonts do not cover Chinese. You need to register a local Chinese font file first (such as the SimSun font that ships with Windows):

from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from pathlib import Path

# Step 1: set the font path (change it to a Chinese font on your machine)
# Common font folder on Windows: C:/Windows/Fonts/
# Common font folder on macOS: /System/Library/Fonts/
font_path = Path("SimSun.ttf")  # Placing the file in the project root also works

# Step 2: create the canvas
c = canvas.Canvas("daoman_ai_report.pdf", pagesize=(595, 842))  # A4: about 595 x 842 points (1 pt = 1/72 inch)

# Step 3: register the Chinese font (passed as a plain string path)
pdfmetrics.registerFont(TTFont("SimSun", str(font_path)))

# Step 4: draw text, shapes, and lines
# The Chinese strings below are kept on purpose to show the font in use
c.setFont("SimSun", 24)
c.drawString(100, 750, "道满科技:202X 年度 AI 自动化趋势报告")

# Add a divider line
c.setStrokeColorRGB(0.2, 0.5, 0.8)  # Blue, as RGB values between 0 and 1
c.line(100, 730, 495, 730)
c.setStrokeColorRGB(0, 0, 0)  # Back to black

# Draw a subheading
c.setFont("SimSun", 16)
c.drawString(100, 690, "一、核心功能覆盖场景")

# Draw a numbered list
c.setFont("SimSun", 12)
c.drawString(120, 660, "1. 批量文本/扫描件文档处理")
c.drawString(120, 630, "2. Excel/Word 模板自动填充")
c.drawString(120, 600, "3. 行业定制化报告生成")

# Step 5: save the canvas
c.save()
print("✅ Report generated successfully!")
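
The list above uses hard-coded y coordinates (660, 630, 600), which gets tedious once the items come from real data. One possible way to compute them is sketched below; line_positions and draw_list are hypothetical helpers, and the 30-point line spacing simply mirrors the example.

```python
from typing import List


def line_positions(top_y: float, count: int, leading: float) -> List[float]:
    """Y coordinates for `count` lines, starting at top_y and moving down."""
    # reportlab's origin is the bottom-left corner, so lower lines have smaller y.
    return [top_y - i * leading for i in range(count)]


def draw_list(c, items: List[str], x: float, top_y: float,
              leading: float = 30) -> None:
    """Draw numbered items on an existing reportlab canvas `c`."""
    positions = line_positions(top_y, len(items), leading)
    for number, (item, y) in enumerate(zip(items, positions), 1):
        c.drawString(x, y, f"{number}. {item}")
```

With this in place, the three drawString calls could be replaced by a single draw_list(c, items, 120, 660), where items is a list of strings. The sketch does not handle page breaks, so long lists would still run off the bottom of the page.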

5. Summary and tool selection

Library      | Core role               | Typical scenarios
-------------|-------------------------|------------------------------------------------------------------
PyPDF2       | Document "surgeon"      | Split/merge/rotate/encrypt, simple text extraction
reportlab    | Report "creator"        | Building professional PDFs from scratch with a custom layout
PyMuPDF      | Performance "bulldozer" | Text extraction from complex layouts, fast PDF-to-image conversion
pdfminer.six | Layout "analyst"        | Extraction that must preserve the original text positions

💡 Office best practice: for contracts and reports with strict formatting requirements, first use Python to fill the Excel data into a Word template, then convert the results to PDF in batches. The template keeps the formatting easy to maintain, and the PDF output is harder to alter casually than an editable Word file.
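
The article does not name a tool for the final Word-to-PDF step. As one possible approach, the sketch below drives LibreOffice in headless mode; it assumes LibreOffice is installed and its soffice executable is on the PATH, and build_convert_command and convert_folder are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(docx_path: Path, output_dir: Path) -> List[str]:
    """Build a LibreOffice command line that converts one file to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(docx_path),
    ]


def convert_folder(source_dir: Path, output_dir: Path) -> None:
    """Convert every .docx file in source_dir to PDF, one process per file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for docx_path in sorted(source_dir.glob("*.docx")):
        # check=True raises CalledProcessError if LibreOffice reports a failure
        subprocess.run(build_convert_command(docx_path, output_dir), check=True)
```

Calling convert_folder(Path("./filled_docs/"), Path("./pdf_out/")) would convert a folder of filled-in templates; both folder names are placeholders.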