Automating PDF Processing with Python
In the workplace, PDF (Portable Document Format) is irreplaceable: its cross-platform, what-you-see-is-what-you-get rendering avoids the nightmare of Word/PPT layouts breaking on different devices. In the day-to-day office automation workflows of Daoman Python AI, whether it is sorting scanned documents, generating data reports, or encrypting contracts in batches, Python is the "Swiss Army knife" for processing PDFs.
This article collects 5 high-frequency, practical scenarios, each with code that can be modified and run directly. You can put it to use right after reading 👇
1. Extract text from PDF
If you just want to quickly pull the text out of a standard text-based PDF, `PyPDF2` is the first choice: lightweight, stable, and extremely easy to get started with.
Installation and basic use
First, install it with pip: `pip install PyPDF2`.
The extraction code for a standard text-based PDF is very intuitive:
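A minimal sketch of such an extraction script, assuming PyPDF2 3.x is installed (its successor `pypdf` exposes the same `PdfReader` names used here). The function names and the placeholder text are illustrative assumptions, not part of the library:

```python
def page_texts(pages, placeholder="[no extractable text]"):
    """Return one string per page, flagging pages that yield no text."""
    results = []
    for number, page in enumerate(pages, start=1):
        # extract_text() may return an empty result, e.g. for scanned pages
        text = (page.extract_text() or "").strip()
        results.append(text or f"{placeholder} (page {number})")
    return results


def extract_pdf_text(pdf_path):
    """Open a PDF with PyPDF2 and return its text page by page."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    reader = PdfReader(pdf_path)
    return page_texts(reader.pages)
```

Calling `extract_pdf_text("report.pdf")` (a hypothetical file name) returns a list with one entry per page; pages without a text layer appear as the placeholder together with their page number.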
✅ Tip: add an `or` fallback for empty results (for example `page.extract_text() or "[no extractable text]"`). The prompt makes "non-text pages" easy to locate, which is more practical.
Frequently Asked Questions and Alternatives
- Garbled Chinese text: `PyPDF2`'s Chinese font mapping is mediocre; try `PyMuPDF` (very fast) or `pdfminer.six` (layout-focused, preserves typesetting).
- Scanned PDFs / OCR needs: none of the libraries above will work; you need a dedicated OCR tool such as `PaddleOCR` or `Tesseract`.
2. Basic PDF "minor surgery": rotation, encryption, merging
`PyPDF2`'s `PdfWriter` is the core object for generating new PDFs and modifying existing ones. Combined with `PdfReader`, it covers most "document-level operations".
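Merging, the third operation named in the heading, shows how the two objects cooperate: `PdfReader` supplies pages and `PdfWriter` collects them. A minimal sketch, assuming PyPDF2 3.x; `merge_pdfs` and its parameters are hypothetical names chosen for illustration:

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate the given PDFs in order; return the total page count."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party: pip install PyPDF2

    writer = PdfWriter()
    total = 0
    for path in input_paths:
        for page in PdfReader(str(path)).pages:
            writer.add_page(page)  # pages keep the order of input_paths
            total += 1
    with open(output_path, "wb") as fh:
        writer.write(fh)
    return total
```

The order of `input_paths` determines the page order of the result, so sort the list first if the file names encode a sequence.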
2.1 Single page/batch rotation
Are your scans crooked, or do you need to adjust the orientation of odd or even pages in batches? Watch this 👇
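One way this could look, assuming PyPDF2 3.x, where `page.rotate()` turns a page clockwise in multiples of 90 degrees. The helper names and the `which` selector ("all", "odd", "even") are assumptions made for this sketch:

```python
def pages_to_rotate(page_count, which="all"):
    """Return zero-based indexes of the pages to rotate.

    "odd" and "even" refer to human page numbers (page 1 is odd).
    """
    if which == "all":
        return list(range(page_count))
    if which == "odd":
        return list(range(0, page_count, 2))
    if which == "even":
        return list(range(1, page_count, 2))
    raise ValueError(f"unknown selection: {which!r}")


def rotate_pdf(input_path, output_path, angle=90, which="all"):
    """Rotate the selected pages clockwise by `angle` degrees and save a copy."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party: pip install PyPDF2

    reader = PdfReader(str(input_path))
    writer = PdfWriter()
    targets = set(pages_to_rotate(len(reader.pages), which))
    for index, page in enumerate(reader.pages):
        if index in targets:
            page.rotate(angle)  # angle must be a multiple of 90
        writer.add_page(page)
    with open(output_path, "wb") as fh:
        writer.write(fh)
```

For example, `rotate_pdf("scan.pdf", "scan_fixed.pdf", angle=180, which="even")` (hypothetical file names) would flip only pages 2, 4, 6 and so on, and leave the rest untouched.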
2.2 Encrypt sensitive files in batches
Need to add access passwords in batches to private documents such as payslips and contracts? The core logic is a single call to `PdfWriter.encrypt()`.
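A minimal batch sketch, assuming PyPDF2 3.x. The folder layout, the function names, and the use of one shared password for every file are assumptions for illustration; real payslips would more likely use a per-recipient password:

```python
from pathlib import Path


def encrypt_pdf(input_path, output_path, password):
    """Copy a PDF and protect the copy with a user password."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party: pip install PyPDF2

    writer = PdfWriter()
    for page in PdfReader(str(input_path)).pages:
        writer.add_page(page)
    writer.encrypt(password)  # the one-line core step
    with open(output_path, "wb") as fh:
        writer.write(fh)


def encrypt_folder(source_dir, target_dir, password):
    """Encrypt every *.pdf in source_dir; return the paths that were written."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        out_path = target / pdf_path.name
        encrypt_pdf(pdf_path, out_path, password)
        written.append(out_path)
    return written
```

Writing the encrypted copies to a separate folder keeps the originals intact, which makes it easy to rerun the batch with a different password.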
3. Batch watermarking: a clever use of page merging
With `page.merge_page()`, the content of a "watermark page PDF" can be overlaid onto each page of the target PDF. The source text stays readable and selectable, which makes this a standard approach for batch-stamping marks such as "internal information", "draft", or a company logo.
Preparation
First, make a PDF that contains only the watermark page. You can use PowerPoint, Word, or Photoshop: keep only the watermark element and set the background to transparent or white. Transparent is preferred, because a white background will cover the original text of the PDF.
Batch watermark code
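A minimal sketch, assuming PyPDF2 3.x and a single-page watermark PDF whose page size matches the target pages. The function names and folder layout are hypothetical:

```python
from pathlib import Path


def add_watermark(input_path, watermark_path, output_path):
    """Overlay the first page of watermark_path onto every page of input_path."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party: pip install PyPDF2

    watermark_page = PdfReader(str(watermark_path)).pages[0]
    writer = PdfWriter()
    for page in PdfReader(str(input_path)).pages:
        # merge_page draws the watermark after (on top of) the page's own content
        page.merge_page(watermark_page)
        writer.add_page(page)
    with open(output_path, "wb") as fh:
        writer.write(fh)


def watermark_folder(source_dir, watermark_path, target_dir):
    """Watermark every *.pdf in source_dir; return the paths that were written."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    watermark = Path(watermark_path).resolve()
    written = []
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        if pdf_path.resolve() == watermark:
            continue  # do not stamp the watermark file itself
        out_path = target / pdf_path.name
        add_watermark(pdf_path, watermark, out_path)
        written.append(out_path)
    return written
```

Because the watermark is drawn on top of the existing content, an opaque background in the watermark file hides the page beneath it, which is why a transparent watermark page matters.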
4. Generate professional PDF reports from scratch
If you want to dynamically generate reports with custom fonts, graphics, and tables based on data, `PyPDF2` is not enough; you need **reportlab**, the "sketchpad" Python uses to generate industrial-grade PDFs.
Installation and basic report generation
Install the dependency first: `pip install reportlab`.
A simple report with a Chinese font
`reportlab` does not support Chinese by default. You need to register a local Chinese font file first (such as the `SimSun.ttf` that comes with Windows):
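A minimal sketch of a text-only report, assuming reportlab is installed. The font path is an assumption that must point at a real font file on your machine (on some Windows installs SimSun ships as `simsun.ttc` instead of `SimSun.ttf`); `layout_lines`, `build_report`, and the margins are illustrative choices:

```python
def layout_lines(lines, top, bottom, line_height):
    """Assign each line a (page_index, y, text) slot, moving to a new page when full."""
    placed = []
    page, y = 0, top
    for line in lines:
        if y < bottom:
            page, y = page + 1, top
        placed.append((page, y, line))
        y -= line_height
    return placed


def build_report(output_path, title, rows, font_path=None, font_name="SimSun"):
    """Write a simple report: a title followed by one "label: value" line per row."""
    from reportlab.lib.pagesizes import A4  # third-party: pip install reportlab
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.pdfgen import canvas

    if font_path:
        # Register the local font so Chinese characters can be rendered
        pdfmetrics.registerFont(TTFont(font_name, font_path))
    else:
        font_name = "Helvetica"  # built-in fallback; cannot render Chinese

    _, page_height = A4
    pdf = canvas.Canvas(str(output_path), pagesize=A4)
    pdf.setFont(font_name, 18)
    pdf.drawString(72, page_height - 72, title)

    pdf.setFont(font_name, 11)
    lines = [f"{label}: {value}" for label, value in rows]
    current_page = 0
    for page, y, text in layout_lines(lines, page_height - 110, 72, 18):
        if page != current_page:
            pdf.showPage()  # finish the page; the font must be set again
            pdf.setFont(font_name, 11)
            current_page = page
        pdf.drawString(72, y, text)
    pdf.save()
```

With a registered font, the title and row labels can be Chinese strings; without `font_path`, the sketch falls back to Helvetica and is only suitable for Latin text. Tables and graphics would build on the same canvas but are beyond this sketch.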
5. Summary and tool selection
💡 Office best practice: for contracts and reports with strict formatting requirements, first use Python to fill the Excel data into a Word template, then batch-convert the results to PDF. The formatting stays easy to maintain, and the final files are harder to alter casually.

