If you regularly work with PDF files for office work, data analysis, report sorting or document management, you are probably familiar with the repetitive chores: merging scattered PDF reports, splitting oversized PDF files by pages, extracting text and tables from reports, adding uniform watermarks to batches of files, or removing sensitive information before sharing files. Done by hand, these operations are time-consuming and prone to human error, especially when you process dozens or hundreds of PDF files at once.
Luckily, Python offers a simple and efficient solution. With a handful of ready-to-use scripts, you can automate most day-to-day PDF workflows. Today, we will break down 5 practical Python scripts for common PDF tasks. All of them run from the command line, support batch processing, and require no complex development skills.
1. Merge & Split PDF Files: Handle Bulk PDFs in One Click
Manually combining multiple PDFs or splitting a large PDF page by page is one of the most common and annoying PDF tasks, especially for files with hundreds of pages or a whole folder of scattered PDF documents.
What this script does
This all-in-one script supports two modes, selected with a subcommand: merge and split.
- Merge mode: Reads all PDFs in a target folder, sorts them by filename (or by a custom order defined in a text file), and combines them into a single PDF. It also copies the metadata of the first file in the merge order to the output.
- Split mode: Splits a single large PDF flexibly. You can split every N pages, by custom page ranges, or at break points (a new part starts before each listed page). Each part is saved as a separate numbered PDF file.
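To make the split options concrete, here is a minimal sketch of how a chunk size or a list of break points can be turned into page segments. The function names chunk_segments and break_segments are illustrative only (they are not part of pypdf), and silently skipping break points that fall outside the document is an assumption of this sketch.

```python
def chunk_segments(total_pages: int, every: int) -> list[tuple[int, int]]:
    """Fixed-size chunks as (start, end) pairs: 0-based start, exclusive end."""
    if every < 1:
        raise ValueError("every must be at least 1")
    return [(start, min(start + every, total_pages))
            for start in range(0, total_pages, every)]


def break_segments(total_pages: int, on_pages: list[int]) -> list[tuple[int, int]]:
    """Start a new segment before each 1-based page number in on_pages."""
    # Assumption: break points outside the document are skipped, not reported.
    inner = {p - 1 for p in on_pages if 1 < p <= total_pages}
    breaks = sorted({0, total_pages} | inner)
    return [(breaks[i], breaks[i + 1]) for i in range(len(breaks) - 1)]


def label(segment: tuple[int, int]) -> str:
    """Human-readable 1-based page range for a segment."""
    start, end = segment
    return f"pages {start + 1}-{end}"


# A 25-page file split every 10 pages, and a 40-page file split before pages 10, 20 and 35.
print([label(s) for s in chunk_segments(25, 10)])
print([label(s) for s in break_segments(40, [10, 20, 35])])
```

Keeping the segments 0-based with an exclusive end means each pair can be passed straight to range() when copying pages, while the labels stay 1-based for people reading the filenames.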
Sample Script
"""
pdf_merge_split.py
Merge multiple PDF files into one, or split a PDF by page ranges or chunk size.

Dependencies: pypdf
Install: pip install pypdf

Usage (merge):
    python pdf_merge_split.py merge --input ./pdfs --output merged.pdf
    python pdf_merge_split.py merge --input ./pdfs --output merged.pdf --order order.txt

Usage (split):
    python pdf_merge_split.py split --input report.pdf --output-dir ./splits --every 10
    python pdf_merge_split.py split --input report.pdf --output-dir ./splits --ranges "1-5,6-12,13-"
    python pdf_merge_split.py split --input report.pdf --output-dir ./splits --on-pages 10 20 35

order.txt format (merge mode):
    One filename per line, in the order they should be merged:
        chapter1.pdf
        chapter2.pdf
        appendix.pdf
"""
import argparse
import sys
from pathlib import Path

from pypdf import PdfReader, PdfWriter

# ── CONFIG ───────────────────────────────────────────────────────────────
INPUT_FOLDER = "./pdfs"
OUTPUT_FILE = "merged.pdf"
INPUT_FILE = "input.pdf"
OUTPUT_DIR = "./splits"
CHUNK_SIZE = 10    # Pages per split file (--every mode)
ORDER_FILE = None  # Text file with filenames in merge order
# ─────────────────────────────────────────────────────────────────────────


def merge(input_folder: str, output_file: str, order_file: str | None) -> None:
    folder = Path(input_folder)
    if not folder.exists():
        sys.exit(f"[ERROR] Input folder not found: {folder}")

    all_pdfs = sorted(folder.glob("*.pdf"))
    if not all_pdfs:
        sys.exit(f"[ERROR] No PDF files found in: {folder}")

    # Apply custom order if provided
    if order_file:
        opath = Path(order_file)
        if not opath.exists():
            sys.exit(f"[ERROR] Order file not found: {opath}")
        names = [line.strip() for line in opath.read_text().splitlines() if line.strip()]
        ordered = []
        for name in names:
            match = folder / name
            if match.exists():
                ordered.append(match)
            else:
                print(f" [WARN] File in order list not found: {name}")
        # Append any files not in order list at the end
        listed = set(ordered)
        for p in all_pdfs:
            if p not in listed:
                ordered.append(p)
                print(f" [WARN] '{p.name}' not in order file, appending at end")
        all_pdfs = ordered

    print(f"Merging {len(all_pdfs)} file(s) into: {output_file}\n")
    writer = PdfWriter()
    total_pages = 0
    for pdf_path in all_pdfs:
        try:
            reader = PdfReader(pdf_path)
            pages = len(reader.pages)
            for page in reader.pages:
                writer.add_page(page)
            total_pages += pages
            print(f" ✓ {pdf_path.name:50s} {pages:>4} pages")
        except Exception as e:
            print(f" ✗ {pdf_path.name:50s} FAILED: {e}")

    # Copy metadata from the first file in the merge order (skipped if unreadable)
    try:
        first_reader = PdfReader(all_pdfs[0])
        if first_reader.metadata:
            writer.add_metadata(dict(first_reader.metadata))
    except Exception:
        pass

    out = Path(output_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "wb") as fh:
        writer.write(fh)
    print(f"\nTotal pages : {total_pages:,}")
    print(f"Output : {out.resolve()}")


def parse_ranges(ranges_str: str, total_pages: int) -> list[tuple[int, int]]:
    """Parse '1-5,6-12,13-' into list of (start, end) tuples (0-based, exclusive end)."""
    segments = []
    for part in ranges_str.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            left, right = part.split("-", 1)
            start = int(left.strip()) - 1 if left.strip() else 0
            end = int(right.strip()) if right.strip() else total_pages
        else:
            start = int(part) - 1
            end = int(part)
        segments.append((max(0, start), min(total_pages, end)))
    return segments


def split(input_file: str, output_dir: str, every: int | None,
          ranges: str | None, on_pages: list[int] | None) -> None:
    src = Path(input_file)
    if not src.exists():
        sys.exit(f"[ERROR] File not found: {src}")

    reader = PdfReader(src)
    total = len(reader.pages)
    stem = src.stem
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    print(f"Splitting: {src} ({total} pages)\n")

    # Build list of (start, end) segments: 0-based, exclusive end
    if ranges:
        segments = parse_ranges(ranges, total)
    elif on_pages:
        # Split ON these page numbers (1-based): break before each listed page.
        # Page numbers outside the document are ignored so no segment runs past the end.
        inner = [p - 1 for p in on_pages if 1 < p <= total]
        breaks = sorted(set([0] + inner + [total]))
        segments = [(breaks[i], breaks[i + 1]) for i in range(len(breaks) - 1)]
    elif every:
        segments = [(i, min(i + every, total)) for i in range(0, total, every)]
    else:
        sys.exit("[ERROR] Specify --every, --ranges, or --on-pages.")

    print(f"Segments: {len(segments)}\n")
    for idx, (start, end) in enumerate(segments, 1):
        if start >= end:
            continue
        writer = PdfWriter()
        for page_idx in range(start, end):
            writer.add_page(reader.pages[page_idx])
        out_name = f"{stem}_part{idx:03d}_pages{start+1}-{end}.pdf"
        out_path = out_dir / out_name
        with open(out_path, "wb") as fh:
            writer.write(fh)
        print(f" Part {idx:03d}: pages {start+1}–{end} → {out_name}")
    print(f"\nOutput directory: {out_dir.resolve()}")


def main():
    parser = argparse.ArgumentParser(description="Merge or split PDF files.")
    sub = parser.add_subparsers(dest="command", required=True)

    # Merge subcommand
    mp = sub.add_parser("merge", help="Merge multiple PDFs into one")
    mp.add_argument("--input", default=INPUT_FOLDER, help="Folder containing PDFs")
    mp.add_argument("--output", default=OUTPUT_FILE, help="Output PDF path")
    mp.add_argument("--order", default=ORDER_FILE, help="Text file with filenames in order")

    # Split subcommand
    sp = sub.add_parser("split", help="Split a PDF into parts")
    sp.add_argument("--input", default=INPUT_FILE, help="PDF file to split")
    sp.add_argument("--output-dir", default=OUTPUT_DIR)
    sp.add_argument("--every", type=int, help="Split every N pages")
    sp.add_argument("--ranges", help="Page ranges, e.g. '1-5,6-12,13-'")
    sp.add_argument("--on-pages", type=int, nargs="+",
                    help="Split before these page numbers (1-based)")

    args = parser.parse_args()
    if args.command == "merge":
        merge(args.input, args.output, args.order)
    else:
        split(args.input, args.output_dir, args.every, args.ranges, args.on_pages)


if __name__ == "__main__":
    main()
The script relies on the pypdf library for all page-level operations. It copies page objects from the reader to the writer without rendering them, so it works for a single small file as well as for a whole folder, and you never need to open a PDF editor.
2. Extract Text & Tables: Get Structured Data from PDFs Quickly
Copying text or tables directly from PDFs often leads to messy formatting, garbled content or missing table data. For work such as data statistics, report sorting and content archiving, manual extraction is slow and error-prone.
What this script does
This script extracts text and tables from a single PDF or a folder of PDFs and exports them to standard structured files:
- Text content is saved as plain text or Markdown, retaining basic layout and marking where each page starts.
- Tables are exported to CSV or Excel. In Excel, each table is stored in its own sheet; in CSV, each table gets its own file.
It also writes a summary file (_summary.csv) that lists, for each PDF, the page count, the number of pages with no extractable text, the number of detected tables, and any error, for easy troubleshooting.
Sample Script
"""
pdf_extractor.py
Extract text and tables from PDF files into structured output files.

Dependencies: pypdf, pdfplumber, pandas, openpyxl
Install: pip install pypdf pdfplumber pandas openpyxl

Usage:
    python pdf_extractor.py --input report.pdf
    python pdf_extractor.py --input report.pdf --mode tables --output-dir ./extracted
    python pdf_extractor.py --input ./pdfs --mode both --table-format xlsx
"""
import argparse
import sys
from pathlib import Path

import pandas as pd
import pdfplumber
from pypdf import PdfReader

# ── CONFIG ───────────────────────────────────────────────────────────────
INPUT_PATH = "input.pdf"  # Single PDF or folder of PDFs
OUTPUT_DIR = "./extracted"
MODE = "both"             # text | tables | both
TEXT_FORMAT = "txt"       # txt | md (for text output)
TABLE_FORMAT = "xlsx"     # csv | xlsx (for table output)
# ─────────────────────────────────────────────────────────────────────────


def extract_text_pypdf(pdf_path: Path) -> str:
    """Fallback text extraction using pypdf."""
    reader = PdfReader(pdf_path)
    parts = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if text.strip():
            parts.append(f"--- Page {i + 1} ---\n{text}")
    return "\n\n".join(parts)


def extract_text_pdfplumber(pdf_path: Path) -> tuple[str, int, int]:
    """Extract text with pdfplumber; returns (text, page_count, empty_page_count)."""
    parts = []
    empty_pages = 0
    with pdfplumber.open(pdf_path) as pdf:
        page_count = len(pdf.pages)
        for i, page in enumerate(pdf.pages):
            text = page.extract_text(x_tolerance=2, y_tolerance=2) or ""
            if text.strip():
                parts.append(f"--- Page {i + 1} ---\n{text.strip()}")
            else:
                empty_pages += 1
    return "\n\n".join(parts), page_count, empty_pages


def extract_tables(pdf_path: Path) -> list[dict]:
    """Extract all tables from a PDF. Returns list of {page, table_index, df}."""
    results = []
    line_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
    }
    # pdfplumber's default strategy is also "lines", so the fallback has to
    # request the "text" strategy explicitly.
    text_settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    }
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            # Try structured line-based detection first
            tables = page.extract_tables(line_settings)
            if not tables:
                # Fall back to text-based detection
                tables = page.extract_tables(text_settings)
            for t_idx, table in enumerate(tables):
                if not table or len(table) < 2:
                    continue
                # Treat the first row as the header
                header = table[0]
                data = table[1:]
                # Clean up: replace None with empty string
                header = [str(h).strip() if h else f"col_{i}" for i, h in enumerate(header)]
                rows = [[str(c).strip() if c is not None else "" for c in row] for row in data]
                # Remove fully empty rows
                rows = [r for r in rows if any(r)]
                if not rows:
                    continue
                df = pd.DataFrame(rows, columns=header)
                results.append({
                    "page": page_num,
                    "table_index": t_idx + 1,
                    "df": df,
                })
    return results


def write_text(text: str, out_path: Path, fmt: str) -> None:
    if fmt == "md":
        # Wrap page separators as markdown headers
        text = text.replace("--- Page ", "## Page ").replace(" ---", "")
        out_path = out_path.with_suffix(".md")
    out_path.write_text(text, encoding="utf-8")


def write_tables(tables: list[dict], out_path: Path, fmt: str) -> None:
    if not tables:
        return
    if fmt == "xlsx":
        with pd.ExcelWriter(out_path.with_suffix(".xlsx"), engine="openpyxl") as writer:
            for t in tables:
                # Excel sheet names are limited to 31 characters
                sheet = f"P{t['page']}_T{t['table_index']}"[:31]
                t["df"].to_excel(writer, sheet_name=sheet, index=False)
    else:
        # One CSV per table
        for t in tables:
            csv_path = out_path.parent / f"{out_path.stem}_p{t['page']}_t{t['table_index']}.csv"
            t["df"].to_csv(csv_path, index=False, encoding="utf-8-sig")


def process_file(pdf_path: Path, out_dir: Path, mode: str,
                 text_fmt: str, table_fmt: str) -> dict:
    stem = pdf_path.stem
    result = {
        "file": pdf_path.name,
        "pages": 0,
        "empty_pages": 0,
        "tables": 0,
        "text_chars": 0,
        "error": "",
    }
    try:
        if mode in ("text", "both"):
            text, page_count, empty_pages = extract_text_pdfplumber(pdf_path)
            if not text.strip():
                # Fall back to pypdf
                text = extract_text_pypdf(pdf_path)
                empty_pages = 0
            result["pages"] = page_count
            result["empty_pages"] = empty_pages
            result["text_chars"] = len(text)
            if text.strip():
                write_text(text, out_dir / f"{stem}_text.txt", text_fmt)
            else:
                result["error"] += "No text extracted (possibly scanned). "
        if mode in ("tables", "both"):
            tables = extract_tables(pdf_path)
            result["tables"] = len(tables)
            if tables:
                write_tables(tables, out_dir / f"{stem}_tables", table_fmt)
    except Exception as e:
        result["error"] += str(e)
    return result


def main():
    parser = argparse.ArgumentParser(description="Extract text and tables from PDF files.")
    parser.add_argument("--input", default=INPUT_PATH, help="PDF file or folder of PDFs")
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    parser.add_argument("--mode", default=MODE, choices=["text", "tables", "both"])
    parser.add_argument("--text-format", default=TEXT_FORMAT, choices=["txt", "md"])
    parser.add_argument("--table-format", default=TABLE_FORMAT, choices=["csv", "xlsx"])
    args = parser.parse_args()

    src = Path(args.input)
    if not src.exists():
        sys.exit(f"[ERROR] Not found: {src}")
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pdfs = sorted(src.glob("*.pdf")) if src.is_dir() else [src]
    if not pdfs:
        sys.exit(f"[ERROR] No PDF files found in: {src}")

    print(f"Processing {len(pdfs)} file(s) | Mode: {args.mode}\n")
    summary = []
    for pdf_path in pdfs:
        print(f" {pdf_path.name}")
        result = process_file(pdf_path, out_dir, args.mode, args.text_format, args.table_format)
        summary.append(result)
        if result["error"]:
            print(f" [WARN] {result['error']}")
        else:
            parts = []
            if args.mode in ("text", "both"):
                parts.append(f"{result['pages']} pages, {result['text_chars']:,} chars")
            if args.mode in ("tables", "both"):
                parts.append(f"{result['tables']} table(s)")
            print(f" {' | '.join(parts)}")

    summary_path = out_dir / "_summary.csv"
    pd.DataFrame(summary).to_csv(summary_path, index=False)
    print(f"\nOutput directory : {out_dir.resolve()}")
    print(f"Summary : {summary_path.name}")


if __name__ == "__main__":
    main()
It combines two mainstream libraries: pdfplumber for layout-aware text extraction and table detection, and pypdf as a fallback when pdfplumber returns no text. The script also cleans the data (it drops empty rows and treats the first row of each table as the header), so the output is directly usable.
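The cleaning step is easy to try in isolation. The sketch below applies the same rules to a hand-written table and writes it out with the standard csv module instead of pandas. The nested-list shape (rows of cells that may be None) mirrors what the sample script iterates over; the sample data and the helper names clean_table and to_csv_text are made up for illustration.

```python
import csv
import io


def clean_table(table):
    """Return (header, rows) for a raw table, or None if nothing usable remains."""
    if not table or len(table) < 2:
        return None
    # First row becomes the header; blank header cells get a positional name.
    header = [str(h).strip() if h else f"col_{i}" for i, h in enumerate(table[0])]
    # Missing cells become empty strings, everything else is trimmed text.
    rows = [[str(c).strip() if c is not None else "" for c in row] for row in table[1:]]
    # Drop rows in which every cell is empty.
    rows = [r for r in rows if any(r)]
    return (header, rows) if rows else None


def to_csv_text(header, rows):
    """Serialize a cleaned table to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical raw table: one blank header cell, one missing value, one empty row.
raw = [
    ["Region", None, "Q2"],
    ["North", "10", None],
    [None, None, None],
    ["South", " 7 ", "9"],
]
header, rows = clean_table(raw)
print(to_csv_text(header, rows))
```

Naming blank header cells by position keeps the column count stable, which matters when the rows are later loaded into a DataFrame or a spreadsheet.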
3. Add Watermarks, Stamps & Page Numbers: Batch Customize PDF Styles
Adding watermarks, official stamps, uniform page numbers or header/footer notes to batches of PDFs is a necessary step before document distribution. Repeating the operation on each file via graphical PDF tools wastes a lot of working time.
What this script does
This script stamps a single PDF or every PDF in a folder in one run. The overlay is configurable:
- Text watermarks (including diagonal watermarks) and text stamps.
- Adjustable position, font size, opacity, color and rotation angle of the stamped text.
- Automatic page numbering (a separate stamp is generated for each page, for example "Page 3 of 12").
The original PDF files will not be modified, and all processed files will be saved as new independent documents.
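Position handling is the part most worth experimenting with. The sample script centres the text on the anchor point for every position, so long text placed in a corner can run past the page edge. One possible adjustment, sketched here, is to return an alignment together with the coordinates. The helper name anchor_for and the 20-point default margin are assumptions of this sketch; reportlab's canvas offers drawString, drawCentredString and drawRightString for the three alignments.

```python
def anchor_for(position: str, page_width: float, page_height: float,
               font_size: float, margin: float = 20.0) -> tuple[float, float, str]:
    """Return (x, y, align) for a named position; align is 'left', 'center' or 'right'."""
    if position == "center":
        return page_width / 2, page_height / 2, "center"

    vertical, _, horizontal = position.partition("-")
    ys = {
        "top": page_height - margin - font_size,  # leave room for the text height
        "bottom": margin,
    }
    xs = {
        "left": (margin, "left"),
        "center": (page_width / 2, "center"),
        "right": (page_width - margin, "right"),
    }
    if vertical not in ys or horizontal not in xs:
        raise ValueError(f"unknown position: {position!r}")
    x, align = xs[horizontal]
    return x, ys[vertical], align


# A4 is roughly 595 x 842 points.
for name in ("center", "top-right", "bottom-left"):
    print(name, anchor_for(name, 595, 842, 10))
```

A caller would then pick the drawing call from the returned alignment, so right-aligned text ends at the margin instead of straddling it.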
Sample Script
"""
pdf_stamper.py
Add text watermarks, stamps, or page numbers to PDF files in batch.

Dependencies: pypdf, reportlab
Install: pip install pypdf reportlab

Usage:
    python pdf_stamper.py --input report.pdf --text "CONFIDENTIAL"
    python pdf_stamper.py --input ./pdfs --text "DRAFT" --angle 45 --opacity 0.15
    python pdf_stamper.py --input report.pdf --mode page-numbers --position bottom-center
    python pdf_stamper.py --input report.pdf --text "INTERNAL USE ONLY" --position top-center --angle 0
"""
import argparse
import io
import sys
from pathlib import Path

from pypdf import PdfReader, PdfWriter
from reportlab.lib.colors import Color, HexColor
from reportlab.pdfgen import canvas

# ── CONFIG ───────────────────────────────────────────────────────────────
INPUT_PATH = "input.pdf"   # Single PDF or folder of PDFs
OUTPUT_DIR = "./stamped"
MODE = "watermark"         # watermark | stamp | page-numbers
STAMP_TEXT = "CONFIDENTIAL"
POSITION = "center"        # center | top-left | top-center | top-right |
                           # bottom-left | bottom-center | bottom-right
ANGLE = 45                 # Rotation angle in degrees
OPACITY = 0.12             # 0.0 (invisible) to 1.0 (fully opaque)
FONT_NAME = "Helvetica-Bold"
FONT_SIZE = 48             # For watermark/stamp
PAGE_NUM_SIZE = 10         # Font size for page numbers
COLOR = "#CCCCCC"          # Hex color for watermark text
STAMP_COLOR = "#CC0000"    # Hex color for stamp text
PAGE_NUM_FMT = "Page {n} of {total}"  # Page number format string
# ─────────────────────────────────────────────────────────────────────────


def hex_to_color(hex_str: str, opacity: float) -> Color:
    base = HexColor(hex_str)
    return Color(base.red, base.green, base.blue, alpha=opacity)


def get_position_coords(page_width: float, page_height: float,
                        position: str, font_size: int) -> tuple[float, float]:
    margin = 20
    cx, cy = page_width / 2, page_height / 2
    positions = {
        "center": (cx, cy),
        "top-left": (margin, page_height - margin - font_size),
        "top-center": (cx, page_height - margin - font_size),
        "top-right": (page_width - margin, page_height - margin - font_size),
        "bottom-left": (margin, margin),
        "bottom-center": (cx, margin),
        "bottom-right": (page_width - margin, margin),
    }
    return positions.get(position, positions["center"])


def make_text_stamp(text: str, page_width: float, page_height: float,
                    position: str, angle: float, opacity: float,
                    font: str, font_size: int, color_hex: str) -> bytes:
    """Generate a single-page PDF stamp in memory."""
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(page_width, page_height))
    c.setFillColor(hex_to_color(color_hex, opacity))
    c.setFont(font, font_size)
    x, y = get_position_coords(page_width, page_height, position, font_size)
    # Move the origin to the anchor point, then rotate around it
    c.saveState()
    c.translate(x, y)
    c.rotate(angle)
    c.drawCentredString(0, 0, text)
    c.restoreState()
    c.save()
    buf.seek(0)
    return buf.read()


def make_page_number_stamp(n: int, total: int, page_width: float, page_height: float,
                           position: str, fmt: str, font_size: int) -> bytes:
    text = fmt.format(n=n, total=total)
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(page_width, page_height))
    c.setFillColor(Color(0, 0, 0, alpha=0.7))
    c.setFont("Helvetica", font_size)
    x, y = get_position_coords(page_width, page_height, position, font_size)
    c.drawCentredString(x, y, text)
    c.save()
    buf.seek(0)
    return buf.read()


def stamp_pdf(pdf_path: Path, out_path: Path, mode: str, text: str,
              position: str, angle: float, opacity: float, font: str,
              font_size: int, color_hex: str, page_num_fmt: str,
              page_num_size: int) -> dict:
    result = {"file": pdf_path.name, "pages": 0, "error": ""}
    try:
        reader = PdfReader(pdf_path)
        writer = PdfWriter()
        total = len(reader.pages)
        result["pages"] = total
        for i, page in enumerate(reader.pages):
            # Size the stamp layer to match this page
            w = float(page.mediabox.width)
            h = float(page.mediabox.height)
            if mode == "page-numbers":
                stamp_bytes = make_page_number_stamp(
                    i + 1, total, w, h, position, page_num_fmt, page_num_size
                )
            else:
                # Stamps use their own color and are never rotated
                stamp_color = STAMP_COLOR if mode == "stamp" else color_hex
                stamp_angle = 0 if mode == "stamp" else angle
                stamp_bytes = make_text_stamp(
                    text, w, h, position, stamp_angle, opacity,
                    font, font_size, stamp_color
                )
            stamp_reader = PdfReader(io.BytesIO(stamp_bytes))
            stamp_page = stamp_reader.pages[0]
            page.merge_page(stamp_page)
            writer.add_page(page)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "wb") as fh:
            writer.write(fh)
    except Exception as e:
        result["error"] = str(e)
    return result


def main():
    parser = argparse.ArgumentParser(description="Add watermarks, stamps, or page numbers to PDFs.")
    parser.add_argument("--input", default=INPUT_PATH, help="PDF file or folder of PDFs")
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    parser.add_argument("--mode", default=MODE, choices=["watermark", "stamp", "page-numbers"])
    parser.add_argument("--text", default=STAMP_TEXT, help="Text to stamp or watermark")
    parser.add_argument("--position", default=POSITION,
                        choices=["center", "top-left", "top-center", "top-right",
                                 "bottom-left", "bottom-center", "bottom-right"])
    parser.add_argument("--angle", type=float, default=ANGLE)
    parser.add_argument("--opacity", type=float, default=OPACITY, help="Text opacity 0.0–1.0")
    parser.add_argument("--font-size", type=int, default=FONT_SIZE)
    parser.add_argument("--color", default=COLOR, help="Hex color for watermark text")
    parser.add_argument("--page-num-fmt", default=PAGE_NUM_FMT,
                        help="Page number format. Use {n} and {total}")
    parser.add_argument("--page-num-size", type=int, default=PAGE_NUM_SIZE)
    args = parser.parse_args()

    src = Path(args.input)
    if not src.exists():
        sys.exit(f"[ERROR] Not found: {src}")
    pdfs = sorted(src.glob("*.pdf")) if src.is_dir() else [src]
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Mode : {args.mode}")
    if args.mode != "page-numbers":
        print(f"Text : {args.text}")
    print(f"Position : {args.position}")
    print(f"Files : {len(pdfs)}\n")

    for pdf_path in pdfs:
        out_path = out_dir / f"{pdf_path.stem}_{args.mode}.pdf"
        result = stamp_pdf(
            pdf_path, out_path, args.mode, args.text, args.position,
            args.angle, args.opacity, FONT_NAME, args.font_size, args.color,
            args.page_num_fmt, args.page_num_size,
        )
        if result["error"]:
            print(f" ✗ {pdf_path.name:50s} ERROR: {result['error']}")
        else:
            print(f" ✓ {pdf_path.name:50s} {result['pages']} pages → {out_path.name}")
    print(f"\nOutput directory: {out_dir.resolve()}")


if __name__ == "__main__":
    main()
It uses reportlab to generate each stamp layer in memory and pypdf to merge that layer onto the page. No temporary files are written to disk.
4. Redact Sensitive Content: Permanently Hide Confidential Information
When sharing PDFs externally, you need to shield sensitive content such as names, phone numbers, email addresses, financial data and addresses. Drawing black boxes manually with PDF editors only covers content visually — the original text can still be extracted by others, bringing data leakage risks.
What this script does
This is an automated redaction tool. You define matching rules in advance: regular expressions, exact text strings, or preset categories such as emails and phone numbers. The script scans every page, locates the matching text, marks each match with a redaction annotation, and then applies the annotations, which removes the underlying text rather than just covering it with a black rectangle.
After execution, a redaction log is generated that records the page number, the original sensitive text and the matching rule for every redaction, which is convenient for audit and review. Because the log contains the original sensitive text, store it as carefully as the source documents.
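If keeping the raw matches in a log is a concern, one option is to mask them before they are recorded. The sketch below runs two of the built-in category patterns over plain text and builds log entries with masked values. The mask helper, the find_matches name and the choice to keep the first two characters are assumptions for illustration; the sketch only covers the matching and logging step and does not touch a PDF.

```python
import re

# Two of the category patterns from the sample script.
PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask(text: str, keep: int = 2) -> str:
    """Keep the first `keep` characters and replace the rest with asterisks."""
    if len(text) <= keep:
        return "*" * len(text)
    return text[:keep] + "*" * (len(text) - keep)


def find_matches(page_text: str, page_num: int) -> list[dict]:
    """Return one log entry per match, with the matched text masked."""
    entries = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(page_text):
            entries.append({
                "page": page_num,
                "pattern": label,
                "matched": mask(match.group()),
            })
    return entries


sample = "Contact jane@example.com, SSN 123-45-6789."
for entry in find_matches(sample, page_num=1):
    print(entry)
```

A masked log still shows which rule fired on which page, which is usually enough for review, while the full value stays out of a second file.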
Sample Script
r"""
pdf_redactor.py
Permanently redact text matching patterns from PDF files.

Dependencies: pymupdf
Install: pip install pymupdf

Usage:
    python pdf_redactor.py --input document.pdf --patterns "John Smith" "ACC-\d+"
    python pdf_redactor.py --input document.pdf --categories email phone
    python pdf_redactor.py --input ./pdfs --patterns "CONFIDENTIAL-\d+" --categories email

Built-in categories (--categories):
    email     Email addresses
    phone     Phone numbers (various formats)
    ssn       US Social Security numbers
    credit    Credit card numbers
    postcode  UK postcodes
    date      Common date formats

NOTE: Always verify redaction output before distributing.
Test on a copy before processing originals.
"""
import argparse
import csv
import re
import sys
from pathlib import Path

import fitz  # pymupdf

# ── CONFIG ───────────────────────────────────────────────────────────────
INPUT_PATH = "input.pdf"
OUTPUT_DIR = "./redacted"
PATTERNS = []             # List of regex patterns or exact strings
CATEGORIES = []           # Built-in pattern categories
REDACT_COLOR = (0, 0, 0)  # RGB fill color (black)
WHOLE_WORD = False        # Match whole words only for exact string patterns
# ─────────────────────────────────────────────────────────────────────────

BUILTIN_PATTERNS = {
    "email": r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
    "phone": r"(\+?\d[\d\s\-().]{7,}\d)",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit": r"\b(?:\d[ -]?){13,16}\b",
    "postcode": r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b",
    "date": r"\b(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4}|\d{4}[\/\-]\d{2}[\/\-]\d{2})\b",
}


def build_patterns(raw_patterns: list[str], categories: list[str],
                   whole_word: bool) -> list[tuple[str, re.Pattern]]:
    compiled = []
    for cat in categories:
        if cat not in BUILTIN_PATTERNS:
            print(f" [WARN] Unknown category: '{cat}'. Available: {list(BUILTIN_PATTERNS)}")
            continue
        compiled.append((f"[{cat}]", re.compile(BUILTIN_PATTERNS[cat], re.IGNORECASE)))
    for pat in raw_patterns:
        try:
            # Check if it's a valid regex; if not, escape it as a literal
            re.compile(pat)
            if whole_word:
                pat = rf"\b{re.escape(pat)}\b"
            compiled.append((pat, re.compile(pat, re.IGNORECASE)))
        except re.error:
            escaped = re.escape(pat)
            compiled.append((pat, re.compile(escaped, re.IGNORECASE)))
    return compiled


def redact_pdf(pdf_path: Path, out_path: Path,
               patterns: list[tuple[str, re.Pattern]]) -> dict:
    result = {
        "file": pdf_path.name,
        "pages": 0,
        "redactions": 0,
        "log": [],
        "error": "",
    }
    try:
        doc = fitz.open(pdf_path)
        result["pages"] = len(doc)
        for page_num, page in enumerate(doc, 1):
            page_text = page.get_text()
            page_redact = 0
            for label, pattern in patterns:
                for match in pattern.finditer(page_text):
                    matched_text = match.group()
                    # Search for all instances of the matched text on the page
                    areas = page.search_for(matched_text, quads=False)
                    for rect in areas:
                        page.add_redact_annot(rect, fill=REDACT_COLOR)
                        page_redact += 1
                        result["log"].append({
                            "page": page_num,
                            "pattern": label,
                            "matched": matched_text,
                        })
            if page_redact > 0:
                # Remove the text under the annotations; images are left untouched
                page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
            result["redactions"] += page_redact
        out_path.parent.mkdir(parents=True, exist_ok=True)
        doc.save(out_path, garbage=4, deflate=True, clean=True)
        doc.close()
    except Exception as e:
        result["error"] = str(e)
    return result


def main():
    parser = argparse.ArgumentParser(description="Permanently redact text from PDF files.")
    parser.add_argument("--input", default=INPUT_PATH, help="PDF file or folder of PDFs")
    parser.add_argument("--output-dir", default=OUTPUT_DIR)
    parser.add_argument("--patterns", nargs="*", default=PATTERNS,
                        help="Regex patterns or exact strings to redact")
    parser.add_argument("--categories", nargs="*", default=CATEGORIES,
                        choices=list(BUILTIN_PATTERNS.keys()),
                        help="Built-in pattern categories")
    parser.add_argument("--whole-word", action="store_true", default=WHOLE_WORD,
                        help="Match whole words only for string patterns")
    args = parser.parse_args()

    if not args.patterns and not args.categories:
        sys.exit("[ERROR] Specify at least one --patterns value or --categories option.")

    src = Path(args.input)
    if not src.exists():
        sys.exit(f"[ERROR] Not found: {src}")
    pdfs = sorted(src.glob("*.pdf")) if src.is_dir() else [src]
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    patterns = build_patterns(args.patterns or [], args.categories or [], args.whole_word)
    print(f"Patterns : {[label for label, _ in patterns]}")
    print(f"Files : {len(pdfs)}\n")
    print("NOTE: Verify all output files before distributing.\n")

    all_logs = []
    for pdf_path in pdfs:
        out_path = out_dir / f"{pdf_path.stem}_redacted.pdf"
        result = redact_pdf(pdf_path, out_path, patterns)
        if result["error"]:
            print(f" ✗ {pdf_path.name:50s} ERROR: {result['error']}")
        else:
            print(f" ✓ {pdf_path.name:50s} {result['redactions']:>4} redaction(s) → {out_path.name}")
        for entry in result["log"]:
            entry["file"] = pdf_path.name
            all_logs.append(entry)

    # Write redaction log with the standard csv module, so pymupdf stays the only dependency
    if all_logs:
        log_path = out_dir / "_redaction_log.csv"
        with open(log_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=["page", "pattern", "matched", "file"])
            writer.writeheader()
            writer.writerows(all_logs)
        print(f"\nRedaction log : {log_path.resolve()}")
    print(f"Output dir : {out_dir.resolve()}")


if __name__ == "__main__":
    main()
Powered by the pymupdf library, the script looks up the bounding box of each match and removes the text from the page content when the redactions are applied. As the script itself notes, always verify the output before distributing it.
5. Extract PDF Metadata & Generate Inventory: Manage PDF Libraries Efficiently
When managing a large number of PDF files, you often need the basic facts about each one: page count, file size, creation time, author, encryption status, and whether the file contains searchable text or only scanned images. Checking this by hand is impractical for large collections.
What this script does
It scans all PDFs in a specified folder in batches, extracts comprehensive metadata of each file, and compiles all information into a unified CSV or Excel inventory file. The extracted information includes:
- Basic attributes: page count, file size, creation & modification time, author and producer.
- Security attributes: encryption status (encrypted files will be specially marked instead of being skipped silently).
- Content attributes: distinguish searchable text PDFs and scanned image PDFs by sampling page content.
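The creation and modification times listed above are stored inside the PDF as date strings such as D:20240115103000+01'00', and everything after the year is optional. The following is a minimal sketch of a tolerant parser using only the standard library. The function name is illustrative, and ignoring the timezone suffix is a simplifying assumption of this sketch.

```python
import re
from datetime import datetime

# Year is required; month, day, hour, minute and second are optional, in that order.
_PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def parse_pdf_date(raw) -> str:
    """Turn a PDF date string into 'YYYY-MM-DD HH:MM', or '' if it cannot be parsed."""
    match = _PDF_DATE.match(str(raw).strip()) if raw else None
    if not match:
        return ""
    # Missing parts fall back to the first month/day and to midnight.
    defaults = (1, 1, 1, 0, 0, 0)
    parts = [int(g) if g else d for g, d in zip(match.groups(), defaults)]
    try:
        return datetime(*parts).strftime("%Y-%m-%d %H:%M")
    except ValueError:
        # Digits were present but do not form a real date (e.g. month 13).
        return ""


for value in ("D:20240115103000+01'00'", "D:20240115", "D:2024", "not a date"):
    print(repr(value), "->", repr(parse_pdf_date(value)))
```

Matching the fields with a regular expression avoids guessing how many characters to slice before handing the string to strptime, which is where partial dates tend to go wrong.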
The final inventory also includes a summary with totals and averages (total files, total size, total pages, average pages per file) to help you quickly assess the overall state of the PDF library.
Sample Script
"""
pdf_inventory.py
Scan a folder of PDF files and extract metadata into a single inventory file.

Dependencies: pypdf, pdfplumber, pandas, openpyxl
Install: pip install pypdf pdfplumber pandas openpyxl

Usage:
    python pdf_inventory.py --input ./documents
    python pdf_inventory.py --input ./documents --output inventory.xlsx --sample-pages 3
"""
import argparse
import sys
from datetime import datetime
from pathlib import Path

import pandas as pd
import pdfplumber
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pypdf import PdfReader
from pypdf.errors import PdfReadError

# ── CONFIG ───────────────────────────────────────────────────────────────
INPUT_FOLDER = "./documents"
OUTPUT_FILE = "pdf_inventory.xlsx"
SAMPLE_PAGES = 3   # Number of pages to sample for text detection
RECURSIVE = False  # Search subdirectories
# ─────────────────────────────────────────────────────────────────────────

HEADER_COLOR = "1F4E79"
ENCRYPT_COLOR = "FFC7CE"
SCAN_COLOR = "FFEB9C"
ALT_ROW_COLOR = "DEEAF1"


def parse_pdf_date(raw: str) -> str:
    """Parse PDF date string (D:YYYYMMDDHHmmSS) to readable format."""
    if not raw:
        return ""
    raw = str(raw).strip().removeprefix("D:")
    # Try the full timestamp first, then the date alone.
    # The slice length is the number of digits in the value (14 or 8), not the
    # length of the format string; any timezone suffix is ignored.
    for fmt, length in (("%Y%m%d%H%M%S", 14), ("%Y%m%d", 8)):
        if len(raw) < length:
            continue
        try:
            return datetime.strptime(raw[:length], fmt).strftime("%Y-%m-%d %H:%M")
        except ValueError:
            continue
    return raw[:16]


def is_scanned(pdf_path: Path, sample_pages: int) -> bool:
    """Return True if the sampled pages contain no extractable text."""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages_to_check = pdf.pages[:sample_pages]
            for page in pages_to_check:
                text = page.extract_text() or ""
                if text.strip():
                    return False
            return True
    except Exception:
        return False


def inspect_pdf(pdf_path: Path, sample_pages: int) -> dict:
    result = {
        "filename": pdf_path.name,
        "path": str(pdf_path.resolve()),
        "size_kb": round(pdf_path.stat().st_size / 1024, 2),
        "pages": None,
        "encrypted": False,
        "scanned": False,
        "title": "",
        "author": "",
        "creator": "",
        "producer": "",
        "created": "",
        "modified": "",
        "pdf_version": "",
        "error": "",
    }
    try:
        reader = PdfReader(pdf_path)
        if reader.is_encrypted:
            result["encrypted"] = True
            # Try empty password
            try:
                reader.decrypt("")
            except Exception:
                result["error"] = "Encrypted: could not open"
                return result
        result["pages"] = len(reader.pages)
        result["pdf_version"] = reader.pdf_header if hasattr(reader, "pdf_header") else ""
        meta = reader.metadata or {}
        result["title"] = str(meta.get("/Title", "")).strip()
        result["author"] = str(meta.get("/Author", "")).strip()
        result["creator"] = str(meta.get("/Creator", "")).strip()
        result["producer"] = str(meta.get("/Producer", "")).strip()
        result["created"] = parse_pdf_date(str(meta.get("/CreationDate", "")))
        result["modified"] = parse_pdf_date(str(meta.get("/ModDate", "")))
        result["scanned"] = is_scanned(pdf_path, sample_pages)
    except PdfReadError as e:
        result["error"] = f"PdfReadError: {e}"
    except Exception as e:
        result["error"] = str(e)
    return result


def style_wb(path: Path, records: list[dict]) -> None:
    wb = load_workbook(path)
    ws = wb.active
    header_fill = PatternFill("solid", fgColor=HEADER_COLOR)
    encrypt_fill = PatternFill("solid", fgColor=ENCRYPT_COLOR)
    scan_fill = PatternFill("solid", fgColor=SCAN_COLOR)
    alt_fill = PatternFill("solid", fgColor=ALT_ROW_COLOR)

    for cell in ws[1]:
        cell.font = Font(bold=True, color="FFFFFF")
        cell.fill = header_fill
        cell.alignment = Alignment(horizontal="center")

    headers = [cell.value for cell in ws[1]]
    enc_col = headers.index("encrypted") + 1 if "encrypted" in headers else None
    scn_col = headers.index("scanned") + 1 if "scanned" in headers else None

    # Highlight encrypted rows first, then scanned rows, then alternate rows
    for row_idx, row in enumerate(ws.iter_rows(min_row=2), start=0):
        is_enc = ws.cell(row=row_idx + 2, column=enc_col).value if enc_col else False
        is_scn = ws.cell(row=row_idx + 2, column=scn_col).value if scn_col else False
        if is_enc:
            fill = encrypt_fill
        elif is_scn:
            fill = scan_fill
        elif row_idx % 2 == 0:
            fill = alt_fill
        else:
            fill = None
        if fill:
            for cell in row:
                cell.fill = fill

    for col_idx in range(1, ws.max_column + 1):
        ws.column_dimensions[get_column_letter(col_idx)].width = 22
    ws.freeze_panes = "A2"

    # Summary sheet
    if "Summary" in wb.sheetnames:
        ws_sum = wb["Summary"]
        for cell in ws_sum[1]:
            cell.font = Font(bold=True, color="FFFFFF")
            cell.fill = header_fill
    wb.save(path)


def main():
    parser = argparse.ArgumentParser(description="Generate a metadata inventory of PDF files.")
    parser.add_argument("--input", default=INPUT_FOLDER, help="Folder containing PDF files")
    parser.add_argument("--output", default=OUTPUT_FILE)
    parser.add_argument("--sample-pages", type=int, default=SAMPLE_PAGES,
                        help="Pages to sample for scanned-image detection")
    parser.add_argument("--recursive", action="store_true", default=RECURSIVE,
                        help="Search subdirectories")
    args = parser.parse_args()

    folder = Path(args.input)
    if not folder.exists():
        sys.exit(f"[ERROR] Folder not found: {folder}")
    glob_pattern = "**/*.pdf" if args.recursive else "*.pdf"
    pdfs = sorted(folder.glob(glob_pattern))
    if not pdfs:
        sys.exit(f"[ERROR] No PDF files found in: {folder}")

    print(f"Found {len(pdfs):,} PDF file(s)\n")
    records = []
    for i, pdf_path in enumerate(pdfs, 1):
        print(f" [{i}/{len(pdfs)}] {pdf_path.name}")
        record = inspect_pdf(pdf_path, args.sample_pages)
        records.append(record)
        flags = []
        if record["encrypted"]:
            flags.append("ENCRYPTED")
        if record["scanned"]:
            flags.append("SCANNED")
        if record["error"]:
            flags.append(f"ERROR: {record['error']}")
        if flags:
            print(f" ⚠ {', '.join(flags)}")

    df = pd.DataFrame(records)

    # Summary stats
    total_size = df["size_kb"].sum()
    summary = pd.DataFrame([
        {"Metric": "Total files", "Value": len(df)},
        {"Metric": "Total size (KB)", "Value": round(total_size, 2)},
        {"Metric": "Total size (MB)", "Value": round(total_size / 1024, 2)},
        {"Metric": "Total pages", "Value": df["pages"].sum()},
        {"Metric": "Avg pages/file", "Value": round(df["pages"].mean(), 1)},
{"Metric": "Encrypted files", "Value": df["encrypted"].sum()}, {"Metric": "Scanned (image) files", "Value": df["scanned"].sum()}, {"Metric": "Files with errors", "Value": (df["error"] != "").sum()}, ]) print(f"\nTotal files : {len(df)}") print(f"Total pages : {df['pages'].sum()}") print(f"Encrypted : {df['encrypted'].sum()}") print(f"Scanned : {df['scanned'].sum()}") out = Path(args.output) with pd.ExcelWriter(out, engine="openpyxl") as writer: df.to_excel(writer, sheet_name="Inventory", index=False) summary.to_excel(writer, sheet_name="Summary", index=False) style_wb(out, records) print(f"\nOutput written to: {out.resolve()}") print(" Yellow rows = scanned/image PDFs (no extractable text)") print(" Red rows = encrypted files") if __name__ == "__main__": main()
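It uses pypdf to read each file's native metadata (title, author, creator, producer, creation and modification dates) and pdfplumber to sample the first few pages for extractable text. A file whose sampled pages contain no text is flagged as scanned. This is a heuristic, not a guarantee: a PDF that opens with image-only pages but contains text later will still be flagged unless you raise --sample-pages.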
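The creation and modification dates stored in PDF metadata are raw strings such as D:20240115103000+05'30', so they need parsing before they are useful in a spreadsheet. The sketch below shows one way to turn such a string into a datetime using only the standard library. The function name parse_pdf_datetime and the regular expression are illustrative assumptions, not part of the script above, and the pattern covers only the common D:YYYYMMDDHHmmSS form with an optional Z or +HH'mm' offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern: "D:" + year, then optional month/day/hour/minute/second,
# then an optional "Z" or a signed offset such as +05'30'.
PDF_DATE_RE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_datetime(raw):
    """Return a datetime for a PDF date string, or None if it cannot be parsed."""
    m = PDF_DATE_RE.match(raw.strip())
    if not m:
        return None
    # Missing fields fall back to the start of the period (January, day 1, 00:00:00).
    year, month, day, hour, minute, second = (
        int(value) if value is not None else default
        for value, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    tz = None
    if m.group(7):                      # literal "Z" means UTC
        tz = timezone.utc
    elif m.group(8):                    # signed offset, minutes optional
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(-offset if m.group(8) == "-" else offset)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:                  # e.g. month 13 or day 40
        return None
```

Returning None for anything unrecognized keeps a malformed date from aborting a batch run; the caller can then fall back to writing the raw string into the inventory.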
Final Thoughts: Start Your PDF Automation Journey
These five Python scripts cover the most common PDF processing scenarios in daily work. They share three core advantages: they run independently on the command line, they support full-folder batch processing, and they always write new files instead of overwriting the original documents, so a failed run cannot damage your source files.
For beginners, you only need to install the libraries each script depends on (pypdf, pdfplumber, reportlab, pymupdf, plus pandas and openpyxl for the inventory script) and adjust the file paths and configuration parameters at the top of each script. We recommend testing with a small number of sample PDFs first to verify the output, then expanding to large-scale file processing.
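One way to set up such a trial run is to copy a handful of files into a separate folder and point the scripts at that folder. The helper below is a minimal sketch under that assumption; copy_sample and the folder names in the usage note are hypothetical and not part of any script in this article.

```python
import shutil
from pathlib import Path


def copy_sample(source_dir, sample_dir, count=5):
    """Copy the first `count` PDFs (sorted by filename) into sample_dir.

    The originals are left untouched; the list of copied paths is returned.
    """
    src = Path(source_dir)
    dst = Path(sample_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(src.glob("*.pdf"))[:count]:
        target = dst / pdf.name
        shutil.copy2(pdf, target)   # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

For example, after calling copy_sample("./documents", "./sample_pdfs"), you could run python pdf_inventory.py --input ./sample_pdfs --output sample_inventory.xlsx and inspect the result before processing the full folder.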
Whether you are an office worker, data analyst, developer or document manager, these automation scripts can cut down hours of repetitive manual work, reduce error rates, and greatly improve overall work efficiency. Stop wasting time on tedious PDF operations — let Python handle the boring work for you.
