ALGOGENE | Stop Manual PDF Work! 5 Powerful Python Scripts to Automate Common PDF Operations


Duval LC



If you regularly work with PDF files for office work, data analysis, report sorting or document management, you will know the endless repetitive operations: merging scattered PDF reports, splitting oversized PDFs by page, extracting text and tables, adding uniform watermarks to batches of files, or redacting sensitive information before sharing. Done by hand, these operations are time-consuming and prone to human error, especially when you process dozens or hundreds of files in a batch.

Luckily, Python offers a simple and efficient solution. With a set of ready-to-use scripts, you can automate most day-to-day PDF workflows. Today, we will break down 5 practical Python scripts for common PDF tasks. All of them run from the command line and support batch processing, and none requires advanced development skills.

1. Merge & Split PDF Files: Handle Bulk PDFs in One Click

Manually combining multiple PDFs or splitting a large PDF page by page is one of the most common and annoying PDF tasks, especially for files with hundreds of pages or a whole folder of scattered PDF documents.

What this script does

This all-in-one script has two modes, selected with a subcommand: merge and split.

Merge mode: Reads all PDFs in a target folder, sorts them by filename (or by a custom order defined in a text file), and combines them into a single PDF. It also copies the metadata (title, author and so on) of the first file into the merged document.
Split mode: Splits a single large PDF flexibly. You can split every N pages, by custom page ranges, or at break points placed before specific pages. Each segment is saved as a separate numbered PDF file.
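
To make the split options concrete, here is a minimal sketch of how two of them could translate into page segments. The helper names (segments_every, segments_on_pages) are illustrative and not part of pypdf; segments are 0-based with an exclusive end, which is the convention the sample script below also uses.

```python
def segments_every(total_pages: int, every: int) -> list[tuple[int, int]]:
    """Fixed-size chunks as (start, end) pairs, 0-based, end exclusive."""
    if every < 1:
        raise ValueError("every must be at least 1")
    return [(i, min(i + every, total_pages)) for i in range(0, total_pages, every)]


def segments_on_pages(total_pages: int, on_pages: list[int]) -> list[tuple[int, int]]:
    """Start a new segment before each listed 1-based page number."""
    # Ignore page numbers outside the document so no segment runs past the end.
    inner = {p - 1 for p in on_pages if 1 < p <= total_pages}
    breaks = sorted({0, total_pages} | inner)
    return [(breaks[i], breaks[i + 1]) for i in range(len(breaks) - 1)]


if __name__ == "__main__":
    # A 25-page document split every 10 pages: pages 1-10, 11-20, 21-25.
    print(segments_every(25, 10))
    # The same document split before pages 10 and 20: pages 1-9, 10-19, 20-25.
    print(segments_on_pages(25, [10, 20]))
```

Each (start, end) pair maps directly onto a loop over range(start, end) that copies pages into a new output file.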

Sample Script

"""
pdf_merge_split.py
Merge multiple PDF files into one, or split a PDF by page ranges or chunk size.

Dependencies: pypdf
Install:      pip install pypdf

Usage — merge:
    python pdf_merge_split.py merge --input ./pdfs --output merged.pdf
    python pdf_merge_split.py merge --input ./pdfs --output merged.pdf --order order.txt

Usage — split:
    python pdf_merge_split.py split --input report.pdf --output-dir ./splits --every 10
    python pdf_merge_split.py split --input report.pdf --output-dir ./splits --ranges "1-5,6-12,13-"
    python pdf_merge_split.py split --input report.pdf --output-dir ./splits --on-pages 10 20 35

order.txt format (merge mode):
    One filename per line, in the order they should be merged:
        chapter1.pdf
        chapter2.pdf
        appendix.pdf
"""

import argparse
import sys
from pathlib import Path

from pypdf import PdfReader, PdfWriter

# ── CONFIG ────────────────────────────────────────────────────────────────────
INPUT_FOLDER = "./pdfs"
OUTPUT_FILE  = "merged.pdf"
INPUT_FILE   = "input.pdf"
OUTPUT_DIR   = "./splits"
CHUNK_SIZE   = 10           # Pages per split file (--every mode)
ORDER_FILE   = None         # Text file with filenames in merge order
# ─────────────────────────────────────────────────────────────────────────────


def merge(input_folder: str, output_file: str, order_file: str | None) -> None:
    folder = Path(input_folder)
    if not folder.exists():
        sys.exit(f"[ERROR] Input folder not found: {folder}")

    all_pdfs = sorted(folder.glob("*.pdf"))
    if not all_pdfs:
        sys.exit(f"[ERROR] No PDF files found in: {folder}")

    # Apply custom order if provided
    if order_file:
        opath = Path(order_file)
        if not opath.exists():
            sys.exit(f"[ERROR] Order file not found: {opath}")
        names = [line.strip() for line in opath.read_text().splitlines() if line.strip()]
        ordered = []
        for name in names:
            match = folder / name
            if match.exists():
                ordered.append(match)
            else:
                print(f"  [WARN] File in order list not found: {name}")
        # Append any files not in order list at the end
        listed = set(ordered)
        for p in all_pdfs:
            if p not in listed:
                ordered.append(p)
                print(f"  [WARN] '{p.name}' not in order file — appending at end")
        all_pdfs = ordered

    print(f"Merging {len(all_pdfs)} file(s) into: {output_file}\n")

    writer = PdfWriter()
    total_pages = 0
    metadata = None  # Metadata of the first readable file that carries any

    for pdf_path in all_pdfs:
        try:
            reader = PdfReader(pdf_path)
            pages  = len(reader.pages)
            for page in reader.pages:
                writer.add_page(page)
            total_pages += pages
            if metadata is None and reader.metadata:
                metadata = dict(reader.metadata)
            print(f"  ✓ {pdf_path.name:50s} {pages:>4} pages")
        except Exception as e:
            print(f"  ✗ {pdf_path.name:50s} FAILED — {e}")

    # Copy the metadata collected above; a bad metadata entry must not abort the merge
    if metadata:
        try:
            writer.add_metadata(metadata)
        except Exception:
            pass

    out = Path(output_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "wb") as fh:
        writer.write(fh)

    print(f"\nTotal pages : {total_pages:,}")
    print(f"Output      : {out.resolve()}")


def parse_ranges(ranges_str: str, total_pages: int) -> list[tuple[int, int]]:
    """Parse '1-5,6-12,13-' into list of (start, end) tuples (0-based, exclusive end)."""
    segments = []
    for part in ranges_str.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            left, right = part.split("-", 1)
            start = int(left.strip()) - 1 if left.strip() else 0
            end   = int(right.strip()) if right.strip() else total_pages
        else:
            start = int(part) - 1
            end   = int(part)
        segments.append((max(0, start), min(total_pages, end)))
    return segments


def split(input_file: str, output_dir: str, every: int | None,
          ranges: str | None, on_pages: list[int] | None) -> None:
    src = Path(input_file)
    if not src.exists():
        sys.exit(f"[ERROR] File not found: {src}")

    reader     = PdfReader(src)
    total      = len(reader.pages)
    stem       = src.stem
    out_dir    = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Splitting: {src} ({total} pages)\n")

    # Build list of (start, end) segments — 0-based, exclusive end
    if ranges:
        segments = parse_ranges(ranges, total)
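    elif on_pages:
        # Split ON these page numbers (1-based): break before each listed page.
        # Page numbers outside 2..total cannot start a new segment, so skip them;
        # otherwise a segment could index past the end of the document.
        inner  = [p - 1 for p in on_pages if 1 < p <= total]
        breaks = sorted(set([0] + inner + [total]))
        segments = [(breaks[i], breaks[i + 1]) for i in range(len(breaks) - 1)]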
    elif every:
        segments = [(i, min(i + every, total)) for i in range(0, total, every)]
    else:
        sys.exit("[ERROR] Specify --every, --ranges, or --on-pages.")

    print(f"Segments: {len(segments)}\n")

    for idx, (start, end) in enumerate(segments, 1):
        if start >= end:
            continue
        writer = PdfWriter()
        for page_idx in range(start, end):
            writer.add_page(reader.pages[page_idx])

        out_name = f"{stem}_part{idx:03d}_pages{start+1}-{end}.pdf"
        out_path = out_dir / out_name
        with open(out_path, "wb") as fh:
            writer.write(fh)
        print(f"  Part {idx:03d}: pages {start+1}–{end}  →  {out_name}")

    print(f"\nOutput directory: {out_dir.resolve()}")


def main():
    parser = argparse.ArgumentParser(description="Merge or split PDF files.")
    sub    = parser.add_subparsers(dest="command", required=True)

    # Merge subcommand
    mp = sub.add_parser("merge", help="Merge multiple PDFs into one")
    mp.add_argument("--input",   default=INPUT_FOLDER, help="Folder containing PDFs")
    mp.add_argument("--output",  default=OUTPUT_FILE,  help="Output PDF path")
    mp.add_argument("--order",   default=ORDER_FILE,   help="Text file with filenames in order")

    # Split subcommand
    sp = sub.add_parser("split", help="Split a PDF into parts")
    sp.add_argument("--input",      default=INPUT_FILE, help="PDF file to split")
    sp.add_argument("--output-dir", default=OUTPUT_DIR)
    sp.add_argument("--every",      type=int, help="Split every N pages")
    sp.add_argument("--ranges",     help="Page ranges, e.g. '1-5,6-12,13-'")
    sp.add_argument("--on-pages",   type=int, nargs="+",
                    help="Split before these page numbers (1-based)")

    args = parser.parse_args()

    if args.command == "merge":
        merge(args.input, args.output, args.order)
    else:
        split(args.input, args.output_dir, args.every, args.ranges, args.on_pages)


if __name__ == "__main__":
    main()

The script relies on the pypdf library for all page-level operations. pypdf is a pure-Python package, so there is nothing else to install, and the script works the same way for a single small file as for a whole folder. You never need to open PDF editing software during the process.
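
One caveat with sorting by filename: plain string sorting places report10.pdf before report2.pdf. If your files are numbered without zero padding, a natural sort key avoids that. The sketch below is an optional variation, not part of the script above, and natural_key is a hypothetical helper name.

```python
import re
from pathlib import Path


def natural_key(path: Path) -> list:
    """Split the filename into text and number chunks so 2 sorts before 10."""
    # re.split with a capture group alternates text, digits, text, ...
    # so every odd index holds a run of digits.
    parts = re.split(r"(\d+)", path.name.lower())
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


if __name__ == "__main__":
    files = [Path("report10.pdf"), Path("report2.pdf"), Path("report1.pdf")]
    print([p.name for p in sorted(files)])                   # plain filename order
    print([p.name for p in sorted(files, key=natural_key)])  # numeric-aware order
```

In the merge function this would amount to passing key=natural_key to the sorted() call that lists the folder.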

2. Extract Text & Tables: Get Structured Data from PDFs Quickly

Copying text or tables directly out of a PDF often produces messy formatting, garbled content or missing table data. For tasks such as data statistics, report sorting and content archiving, manual extraction is simply too slow.

What this script does

This script batch-extracts text and tables from one or more PDFs and exports them to structured files. Plain text is saved as a text or Markdown file, with one section per page.

Extracted tables are exported to CSV files (one file per table) or to an Excel workbook (one sheet per table). The script also writes a summary report that lists, for each PDF, the page count, the number of tables detected, and the number of pages from which no text could be extracted, which makes troubleshooting easier.
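
The summary makes it easy to triage a batch afterwards. As a rough sketch, the snippet below reads a summary CSV with the standard library and lists the files that need a second look. It assumes the column names used by the sample script below (file, pages, empty_pages, tables, text_chars, error); files_needing_review is an illustrative helper name and the sample rows are made up.

```python
import csv
import io


def files_needing_review(summary_csv: str) -> list[tuple[str, str]]:
    """Return (filename, reason) pairs for rows with errors or empty pages."""
    flagged = []
    for row in csv.DictReader(io.StringIO(summary_csv)):
        if row["error"].strip():
            flagged.append((row["file"], row["error"].strip()))
        elif int(row["empty_pages"] or 0) > 0:
            flagged.append((row["file"], f"{row['empty_pages']} page(s) with no text"))
    return flagged


if __name__ == "__main__":
    # Invented sample rows, only to show the shape of the data.
    sample = (
        "file,pages,empty_pages,tables,text_chars,error\n"
        "q1.pdf,12,0,3,18400,\n"
        "scan.pdf,4,4,0,0,No text extracted (possibly scanned).\n"
        "q2.pdf,9,2,1,9100,\n"
    )
    for name, reason in files_needing_review(sample):
        print(f"{name}: {reason}")
```

To use it on a real run, pass in the contents of the _summary.csv file from the output directory.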

Sample Script

"""
pdf_extractor.py
Extract text and tables from PDF files into structured output files.

Dependencies: pypdf, pdfplumber, pandas, openpyxl
Install:      pip install pypdf pdfplumber pandas openpyxl

Usage:
    python pdf_extractor.py --input report.pdf
    python pdf_extractor.py --input report.pdf --mode tables --output-dir ./extracted
    python pdf_extractor.py --input ./pdfs --mode both --table-format xlsx
"""

import argparse
import sys
from pathlib import Path

import pandas as pd
import pdfplumber
from pypdf import PdfReader

# ── CONFIG ────────────────────────────────────────────────────────────────────
INPUT_PATH   = "input.pdf"      # Single PDF or folder of PDFs
OUTPUT_DIR   = "./extracted"
MODE         = "both"           # text | tables | both
TEXT_FORMAT  = "txt"            # txt | md  (for text output)
TABLE_FORMAT = "xlsx"           # csv | xlsx (for table output)
# ─────────────────────────────────────────────────────────────────────────────


def extract_text_pypdf(pdf_path: Path) -> str:
    """Fallback text extraction using pypdf."""
    reader = PdfReader(pdf_path)
    parts  = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if text.strip():
            parts.append(f"--- Page {i + 1} ---\n{text}")
    return "\n\n".join(parts)


def extract_text_pdfplumber(pdf_path: Path) -> tuple[str, int, int]:
    """Extract text with pdfplumber; returns (text, page_count, empty_page_count)."""
    parts       = []
    empty_pages = 0

    with pdfplumber.open(pdf_path) as pdf:
        page_count = len(pdf.pages)
        for i, page in enumerate(pdf.pages):
            text = page.extract_text(x_tolerance=2, y_tolerance=2) or ""
            if text.strip():
                parts.append(f"--- Page {i + 1} ---\n{text.strip()}")
            else:
                empty_pages += 1

    return "\n\n".join(parts), page_count, empty_pages


def extract_tables(pdf_path: Path) -> list[dict]:
    """Extract all tables from a PDF. Returns list of {page, table_index, df}."""
    results = []

    table_settings = {
        "vertical_strategy":   "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance":      3,
    }

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
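            # Try structured line-based detection first
            tables = page.extract_tables(table_settings)
            if not tables:
                # Fall back to text-based detection, which infers cell boundaries
                # from word alignment when the table has no ruling lines.
                # This can report false positives on pages of plain prose.
                tables = page.extract_tables({
                    "vertical_strategy":   "text",
                    "horizontal_strategy": "text",
                })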

            for t_idx, table in enumerate(tables):
                if not table or len(table) < 2:
                    continue

                # Use first row as header if it looks like one
                header = table[0]
                data   = table[1:]

                # Clean up: replace None with empty string
                header = [str(h).strip() if h else f"col_{i}" for i, h in enumerate(header)]
                rows   = [[str(c).strip() if c is not None else "" for c in row] for row in data]

                # Remove fully empty rows
                rows = [r for r in rows if any(r)]
                if not rows:
                    continue

                df = pd.DataFrame(rows, columns=header)
                results.append({
                    "page":        page_num,
                    "table_index": t_idx + 1,
                    "df":          df,
                })

    return results


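def write_text(text: str, out_path: Path, fmt: str) -> None:
    if fmt == "md":
        # Turn page separator lines into markdown headers. Only whole
        # separator lines are rewritten, so " ---" inside body text is kept.
        lines = []
        for line in text.splitlines():
            if line.startswith("--- Page ") and line.endswith(" ---"):
                line = "## " + line[4:-4]
            lines.append(line)
        text = "\n".join(lines)
        out_path = out_path.with_suffix(".md")
    out_path.write_text(text, encoding="utf-8")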


def write_tables(tables: list[dict], out_path: Path, fmt: str) -> None:
    if not tables:
        return

    if fmt == "xlsx":
        with pd.ExcelWriter(out_path.with_suffix(".xlsx"), engine="openpyxl") as writer:
            for t in tables:
                sheet = f"P{t['page']}_T{t['table_index']}"[:31]
                t["df"].to_excel(writer, sheet_name=sheet, index=False)
    else:
        # One CSV per table
        for t in tables:
            csv_path = out_path.parent / f"{out_path.stem}_p{t['page']}_t{t['table_index']}.csv"
            t["df"].to_csv(csv_path, index=False, encoding="utf-8-sig")


def process_file(pdf_path: Path, out_dir: Path, mode: str,
                 text_fmt: str, table_fmt: str) -> dict:
    stem   = pdf_path.stem
    result = {
        "file":        pdf_path.name,
        "pages":       0,
        "empty_pages": 0,
        "tables":      0,
        "text_chars":  0,
        "error":       "",
    }

    try:
        if mode in ("text", "both"):
            text, page_count, empty_pages = extract_text_pdfplumber(pdf_path)
            if not text.strip():
                # Fall back to pypdf
                text        = extract_text_pypdf(pdf_path)
                empty_pages = 0

            result["pages"]       = page_count
            result["empty_pages"] = empty_pages
            result["text_chars"]  = len(text)

            if text.strip():
                write_text(text, out_dir / f"{stem}_text.txt", text_fmt)
            else:
                result["error"] += "No text extracted (possibly scanned). "

        if mode in ("tables", "both"):
            tables = extract_tables(pdf_path)
            result["tables"] = len(tables)
            if tables:
                write_tables(tables, out_dir / f"{stem}_tables", table_fmt)

    except Exception as e:
        result["error"] += str(e)

    return result


def main():
    parser = argparse.ArgumentParser(description="Extract text and tables from PDF files.")
    parser.add_argument("--input",        default=INPUT_PATH,
                        help="PDF file or folder of PDFs")
    parser.add_argument("--output-dir",   default=OUTPUT_DIR)
    parser.add_argument("--mode",         default=MODE,
                        choices=["text", "tables", "both"])
    parser.add_argument("--text-format",  default=TEXT_FORMAT,  choices=["txt", "md"])
    parser.add_argument("--table-format", default=TABLE_FORMAT, choices=["csv", "xlsx"])
    args = parser.parse_args()

    src = Path(args.input)
    if not src.exists():
        sys.exit(f"[ERROR] Not found: {src}")

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pdfs = sorted(src.glob("*.pdf")) if src.is_dir() else [src]
    if not pdfs:
        sys.exit(f"[ERROR] No PDF files found in: {src}")

    print(f"Processing {len(pdfs)} file(s) | Mode: {args.mode}\n")

    summary = []
    for pdf_path in pdfs:
        print(f"  {pdf_path.name}")
        result = process_file(pdf_path, out_dir, args.mode,
                              args.text_format, args.table_format)
        summary.append(result)

        if result["error"]:
            print(f"    [WARN] {result['error']}")
        else:
            parts = []
            if args.mode in ("text", "both"):
                parts.append(f"{result['pages']} pages, {result['text_chars']:,} chars")
            if args.mode in ("tables", "both"):
                parts.append(f"{result['tables']} table(s)")
            print(f"    {' | '.join(parts)}")

    summary_path = out_dir / "_summary.csv"
    pd.DataFrame(summary).to_csv(summary_path, index=False)
    print(f"\nOutput directory : {out_dir.resolve()}")
    print(f"Summary          : {summary_path.name}")


if __name__ == "__main__":
    main()

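It combines two mainstream libraries: pdfplumber handles layout-aware text extraction and table detection, and pypdf serves as a fallback when pdfplumber returns no text. The script also cleans the extracted tables (it drops empty rows and treats the first row of each table as the header), so the output is usable right away. Scanned PDFs that contain only images have no text layer; the script flags them in the summary rather than extracting anything, and they need OCR first.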
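
The cleanup step can be illustrated without any PDF at all. The sketch below applies the same idea to a raw table, represented as a list of rows in which cells may be None. clean_table is a hypothetical helper, and treating the first row as the header is an assumption that will not hold for every table.

```python
def clean_table(table: list[list]) -> tuple[list[str], list[list[str]]]:
    """Return (header, rows) with None cells filled and blank rows dropped."""
    if not table:
        return [], []
    # Name blank header cells col_0, col_1, ... so every column has a label.
    header = [str(h).strip() if h else f"col_{i}" for i, h in enumerate(table[0])]
    rows = [[str(c).strip() if c is not None else "" for c in row] for row in table[1:]]
    # Keep only rows that contain at least one non-empty cell.
    rows = [r for r in rows if any(r)]
    return header, rows


if __name__ == "__main__":
    raw = [
        ["Region", None, " Revenue "],
        ["North", "Q1", "1200"],
        [None, None, None],
        ["South", "Q1", " 950 "],
    ]
    header, rows = clean_table(raw)
    print(header)  # ['Region', 'col_1', 'Revenue']
    print(rows)    # [['North', 'Q1', '1200'], ['South', 'Q1', '950']]
```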

3. Add Watermarks, Stamps & Page Numbers: Batch Customize PDF Styles

Adding watermarks, official stamps, uniform page numbers or header/footer notes to batches of PDFs is a necessary step before document distribution. Repeating the operation on each file via graphical PDF tools wastes a lot of working time.

What this script does

This script stamps every PDF in a folder in a single run. The overlay is fully configurable:

Text watermarks (including diagonal ones) and custom text stamps.
Adjustable position, font size, transparency, color and rotation angle of the stamped text.
Automatic page numbering (a separate stamp is generated for each page).

The original PDF files will not be modified, and all processed files will be saved as new independent documents.
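
For diagonal watermarks, a fixed 45° angle only follows the page diagonal on square pages. One option, shown here as a small sketch rather than something the script does, is to derive the angle from the page size. diagonal_angle is an illustrative name, and page sizes are given in PDF points (1/72 inch).

```python
import math


def diagonal_angle(page_width: float, page_height: float) -> float:
    """Angle in degrees of the bottom-left to top-right page diagonal."""
    return math.degrees(math.atan2(page_height, page_width))


if __name__ == "__main__":
    a4_width, a4_height = 595.28, 841.89  # A4 portrait in points
    print(round(diagonal_angle(a4_width, a4_height), 1))  # portrait:  54.7
    print(round(diagonal_angle(a4_height, a4_width), 1))  # landscape: 35.3
```

The result could then be passed to the script as its --angle value.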

Sample Script

"""
pdf_stamper.py
Add text watermarks, stamps, or page numbers to PDF files in batch.

Dependencies: pypdf, reportlab
Install:      pip install pypdf reportlab

Usage:
    python pdf_stamper.py --input report.pdf --text "CONFIDENTIAL"
    python pdf_stamper.py --input ./pdfs --text "DRAFT" --angle 45 --opacity 0.15
    python pdf_stamper.py --input report.pdf --mode page-numbers --position bottom-center
    python pdf_stamper.py --input report.pdf --text "INTERNAL USE ONLY" --position top-center --angle 0
"""

import argparse
import io
import sys
from pathlib import Path

from pypdf import PdfReader, PdfWriter
from reportlab.lib.colors import Color, HexColor
from reportlab.pdfgen import canvas

# ── CONFIG ────────────────────────────────────────────────────────────────────
INPUT_PATH   = "input.pdf"      # Single PDF or folder of PDFs
OUTPUT_DIR   = "./stamped"
MODE         = "watermark"      # watermark | stamp | page-numbers
STAMP_TEXT   = "CONFIDENTIAL"
POSITION     = "center"         # center | top-left | top-center | top-right |
                                # bottom-left | bottom-center | bottom-right
ANGLE        = 45               # Rotation angle in degrees
OPACITY      = 0.12             # 0.0 (invisible) to 1.0 (fully opaque)
FONT_NAME    = "Helvetica-Bold"
FONT_SIZE    = 48               # For watermark/stamp
PAGE_NUM_SIZE = 10              # Font size for page numbers
COLOR        = "#CCCCCC"        # Hex color for watermark text
STAMP_COLOR  = "#CC0000"        # Hex color for stamp text
PAGE_NUM_FMT = "Page {n} of {total}"  # Page number format string
# ─────────────────────────────────────────────────────────────────────────────


def hex_to_color(hex_str: str, opacity: float) -> Color:
    base = HexColor(hex_str)
    return Color(base.red, base.green, base.blue, alpha=opacity)


def get_position_coords(page_width: float, page_height: float,
                        position: str, font_size: int) -> tuple[float, float]:
    margin = 20
    cx, cy = page_width / 2, page_height / 2

    positions = {
        "center":        (cx, cy),
        "top-left":      (margin, page_height - margin - font_size),
        "top-center":    (cx, page_height - margin - font_size),
        "top-right":     (page_width - margin, page_height - margin - font_size),
        "bottom-left":   (margin, margin),
        "bottom-center": (cx, margin),
        "bottom-right":  (page_width - margin, margin),
    }
    return positions.get(position, positions["center"])


def make_text_stamp(text: str, page_width: float, page_height: float,
                    position: str, angle: float, opacity: float,
                    font: str, font_size: int, color_hex: str) -> bytes:
    """Generate a single-page PDF stamp in memory."""
    buf = io.BytesIO()
    c   = canvas.Canvas(buf, pagesize=(page_width, page_height))
    c.setFillColor(hex_to_color(color_hex, opacity))
    c.setFont(font, font_size)

    x, y = get_position_coords(page_width, page_height, position, font_size)

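    c.saveState()
    c.translate(x, y)
    c.rotate(angle)
    # Anchor the text so left/right positions stay inside the page margins
    if position.endswith("left"):
        c.drawString(0, 0, text)
    elif position.endswith("right"):
        c.drawRightString(0, 0, text)
    else:
        c.drawCentredString(0, 0, text)
    c.restoreState()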
    c.save()

    buf.seek(0)
    return buf.read()


def make_page_number_stamp(n: int, total: int, page_width: float,
                           page_height: float, position: str,
                           fmt: str, font_size: int) -> bytes:
    text = fmt.format(n=n, total=total)
    buf  = io.BytesIO()
    c    = canvas.Canvas(buf, pagesize=(page_width, page_height))
    c.setFillColor(Color(0, 0, 0, alpha=0.7))
    c.setFont("Helvetica", font_size)
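    x, y = get_position_coords(page_width, page_height, position, font_size)
    # Align to the margin for left/right positions; centre the text otherwise
    if position.endswith("left"):
        c.drawString(x, y, text)
    elif position.endswith("right"):
        c.drawRightString(x, y, text)
    else:
        c.drawCentredString(x, y, text)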
    c.save()
    buf.seek(0)
    return buf.read()


def stamp_pdf(pdf_path: Path, out_path: Path, mode: str, text: str,
              position: str, angle: float, opacity: float,
              font: str, font_size: int, color_hex: str,
              page_num_fmt: str, page_num_size: int) -> dict:
    result = {"file": pdf_path.name, "pages": 0, "error": ""}
    try:
        reader = PdfReader(pdf_path)
        writer = PdfWriter()
        total  = len(reader.pages)
        result["pages"] = total

        for i, page in enumerate(reader.pages):
            w = float(page.mediabox.width)
            h = float(page.mediabox.height)

            if mode == "page-numbers":
                stamp_bytes = make_page_number_stamp(
                    i + 1, total, w, h, position, page_num_fmt, page_num_size
                )
            else:
                stamp_color = STAMP_COLOR if mode == "stamp" else color_hex
                stamp_angle = 0 if mode == "stamp" else angle
                stamp_bytes = make_text_stamp(
                    text, w, h, position, stamp_angle, opacity,
                    font, font_size, stamp_color
                )

            stamp_reader = PdfReader(io.BytesIO(stamp_bytes))
            stamp_page   = stamp_reader.pages[0]

            page.merge_page(stamp_page)
            writer.add_page(page)

        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "wb") as fh:
            writer.write(fh)

    except Exception as e:
        result["error"] = str(e)

    return result


def main():
    parser = argparse.ArgumentParser(description="Add watermarks, stamps, or page numbers to PDFs.")
    parser.add_argument("--input",         default=INPUT_PATH,
                        help="PDF file or folder of PDFs")
    parser.add_argument("--output-dir",    default=OUTPUT_DIR)
    parser.add_argument("--mode",          default=MODE,
                        choices=["watermark", "stamp", "page-numbers"])
    parser.add_argument("--text",          default=STAMP_TEXT,
                        help="Text to stamp or watermark")
    parser.add_argument("--position",      default=POSITION,
                        choices=["center", "top-left", "top-center", "top-right",
                                 "bottom-left", "bottom-center", "bottom-right"])
    parser.add_argument("--angle",         type=float, default=ANGLE)
    parser.add_argument("--opacity",       type=float, default=OPACITY,
                        help="Text opacity 0.0–1.0")
    parser.add_argument("--font-size",     type=int,   default=FONT_SIZE)
    parser.add_argument("--color",         default=COLOR,
                        help="Hex color for watermark text")
    parser.add_argument("--page-num-fmt",  default=PAGE_NUM_FMT,
                        help="Page number format. Use {n} and {total}")
    parser.add_argument("--page-num-size", type=int, default=PAGE_NUM_SIZE)
    args = parser.parse_args()

    src = Path(args.input)
    if not src.exists():
        sys.exit(f"[ERROR] Not found: {src}")

    pdfs    = sorted(src.glob("*.pdf")) if src.is_dir() else [src]
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Mode     : {args.mode}")
    if args.mode != "page-numbers":
        print(f"Text     : {args.text}")
    print(f"Position : {args.position}")
    print(f"Files    : {len(pdfs)}\n")

    for pdf_path in pdfs:
        out_path = out_dir / f"{pdf_path.stem}_{args.mode}.pdf"
        result   = stamp_pdf(
            pdf_path, out_path, args.mode, args.text,
            args.position, args.angle, args.opacity,
            FONT_NAME, args.font_size, args.color,
            args.page_num_fmt, args.page_num_size,
        )
        if result["error"]:
            print(f"  ✗ {pdf_path.name:50s} ERROR — {result['error']}")
        else:
            print(f"  ✓ {pdf_path.name:50s} {result['pages']} pages  →  {out_path.name}")

    print(f"\nOutput directory: {out_dir.resolve()}")


if __name__ == "__main__":
    main()

It uses pypdf to merge the stamp onto each page and reportlab to generate the stamp layers. Because each stamp is built in a memory buffer, no temporary files are written to disk.
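
The page-number text comes from an ordinary Python format string with two placeholders, {n} and {total}, so other layouts only need a different --page-num-fmt value. A quick sketch of what a few formats produce (page_labels is an illustrative helper, and the two alternative formats are examples, not script defaults):

```python
def page_labels(fmt: str, total: int) -> list[str]:
    """Render the label for every page of a document with `total` pages."""
    return [fmt.format(n=n, total=total) for n in range(1, total + 1)]


if __name__ == "__main__":
    print(page_labels("Page {n} of {total}", 3))  # the script's default format
    print(page_labels("{n}/{total}", 3))
    print(page_labels("- {n} -", 3))
```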

4. Redact Sensitive Content: Permanently Hide Confidential Information

When sharing PDFs externally, you often need to hide sensitive content such as names, phone numbers, email addresses, financial data and addresses. Drawing black boxes manually in a PDF editor only covers the content visually: the original text is still in the file and can be extracted by anyone, which creates a data-leak risk.

What this script does

This is an automated redaction tool. You define matching rules in advance: regular expressions, exact text strings, or preset categories such as emails and phone numbers. The script scans every page, locates the matching content, removes the underlying text from the file and draws a black rectangle in its place, rather than just covering it.

After execution, a detailed redaction log is generated, recording the page number, the original sensitive text and the matching rule for every redaction, which is convenient for audit and review. Because the log contains the original text, treat it as confidential and do not distribute it with the redacted files.
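
Before redacting real files, it helps to preview what a rule actually matches. The sketch below runs two of the preset patterns (copied from the sample script below) against an invented string, using only the re module. preview_matches is a hypothetical helper and is not part of the script.

```python
import re

PRESETS = {
    "email": r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
    "ssn":   r"\b\d{3}-\d{2}-\d{4}\b",
}


def preview_matches(text: str, rules: dict[str, str]) -> dict[str, list[str]]:
    """Return the unique strings each rule would match, in order of appearance."""
    found = {}
    for label, pattern in rules.items():
        hits = [m.group() for m in re.finditer(pattern, text, re.IGNORECASE)]
        found[label] = list(dict.fromkeys(hits))  # de-duplicate, keep order
    return found


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com (SSN 123-45-6789) or jane.doe@example.com."
    for label, hits in preview_matches(sample, PRESETS).items():
        print(label, hits)
```

Running a preview like this on extracted page text shows over-broad or missing matches before anything is removed from a document.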

Sample Script

r"""
pdf_redactor.py
Permanently redact text matching patterns from PDF files.

Dependencies: pymupdf, pandas (pandas is only used to write the redaction log)
Install:      pip install pymupdf pandas

Usage:
    python pdf_redactor.py --input document.pdf --patterns "John Smith" "ACC-\d+"
    python pdf_redactor.py --input document.pdf --categories email phone
    python pdf_redactor.py --input ./pdfs --patterns "CONFIDENTIAL-\d+" --categories email

Built-in categories (--categories):
    email     Email addresses
    phone     Phone numbers (various formats)
    ssn       US Social Security numbers
    credit    Credit card numbers
    postcode  UK postcodes
    date      Common date formats

NOTE: Always verify redaction output before distributing.
      Test on a copy before processing originals.
"""

import argparse
import re
import sys
from pathlib import Path

import fitz  # pymupdf

# ── CONFIG ────────────────────────────────────────────────────────────────────
INPUT_PATH  = "input.pdf"
OUTPUT_DIR  = "./redacted"
PATTERNS    = []           # List of regex patterns or exact strings
CATEGORIES  = []           # Built-in pattern categories
REDACT_COLOR = (0, 0, 0)  # RGB fill color (black)
WHOLE_WORD  = False        # Match whole words only for exact string patterns
# ─────────────────────────────────────────────────────────────────────────────

BUILTIN_PATTERNS = {
    "email":    r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}",
    "phone":    r"(\+?\d[\d\s\-().]{7,}\d)",
    "ssn":      r"\b\d{3}-\d{2}-\d{4}\b",
    "credit":   r"\b(?:\d[ -]?){13,16}\b",
    "postcode": r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b",
    "date":     r"\b(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4}|\d{4}[\/\-]\d{2}[\/\-]\d{2})\b",
}


def build_patterns(raw_patterns: list[str], categories: list[str],
                   whole_word: bool) -> list[tuple[str, re.Pattern]]:
    compiled = []

    for cat in categories:
        if cat not in BUILTIN_PATTERNS:
            print(f"  [WARN] Unknown category: '{cat}'. Available: {list(BUILTIN_PATTERNS)}")
            continue
        compiled.append((f"[{cat}]", re.compile(BUILTIN_PATTERNS[cat], re.IGNORECASE)))

    for pat in raw_patterns:
        try:
            # Use the pattern as a regex if it compiles
            re.compile(pat)
            source = pat
        except re.error:
            # Otherwise treat it as a literal string
            source = re.escape(pat)
        if whole_word:
            # (?:...) keeps alternations such as "a|b" inside the word boundaries
            source = rf"\b(?:{source})\b"
        compiled.append((pat, re.compile(source, re.IGNORECASE)))

    return compiled


def redact_pdf(pdf_path: Path, out_path: Path,
               patterns: list[tuple[str, re.Pattern]]) -> dict:
    result = {
        "file":       pdf_path.name,
        "pages":      0,
        "redactions": 0,
        "log":        [],
        "error":      "",
    }

    try:
        doc = fitz.open(pdf_path)
        result["pages"] = len(doc)

        for page_num, page in enumerate(doc, 1):
            page_text   = page.get_text()
            page_redact = 0

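            for label, pattern in patterns:
                # De-duplicate first: search_for() already returns every
                # occurrence of a string, so repeated regex matches would
                # otherwise be annotated and logged more than once
                unique = dict.fromkeys(m.group() for m in pattern.finditer(page_text))
                for matched_text in unique:
                    if not matched_text.strip():
                        continue  # Skip empty or whitespace-only matches
                    areas = page.search_for(matched_text, quads=False)
                    for rect in areas:
                        page.add_redact_annot(rect, fill=REDACT_COLOR)
                        page_redact += 1
                        result["log"].append({
                            "page":    page_num,
                            "pattern": label,
                            "matched": matched_text,
                        })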

            if page_redact > 0:
                page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
                result["redactions"] += page_redact

        out_path.parent.mkdir(parents=True, exist_ok=True)
        doc.save(out_path, garbage=4, deflate=True, clean=True)
        doc.close()

    except Exception as e:
        result["error"] = str(e)

    return result


def main():
    parser = argparse.ArgumentParser(description="Permanently redact text from PDF files.")
    parser.add_argument("--input",       default=INPUT_PATH,
                        help="PDF file or folder of PDFs")
    parser.add_argument("--output-dir",  default=OUTPUT_DIR)
    parser.add_argument("--patterns",    nargs="*", default=PATTERNS,
                        help="Regex patterns or exact strings to redact")
    parser.add_argument("--categories",  nargs="*", default=CATEGORIES,
                        choices=list(BUILTIN_PATTERNS.keys()),
                        help="Built-in pattern categories")
    parser.add_argument("--whole-word",  action="store_true", default=WHOLE_WORD,
                        help="Match whole words only for string patterns")
    args = parser.parse_args()

    if not args.patterns and not args.categories:
        sys.exit("[ERROR] Specify at least one --patterns value or --categories option.")

    src = Path(args.input)
    if not src.exists():
        sys.exit(f"[ERROR] Not found: {src}")

    pdfs    = sorted(src.glob("*.pdf")) if src.is_dir() else [src]
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    patterns = build_patterns(args.patterns or [], args.categories or [], args.whole_word)
    print(f"Patterns  : {[label for label, _ in patterns]}")
    print(f"Files     : {len(pdfs)}\n")
    print("NOTE: Verify all output files before distributing.\n")

    all_logs = []

    for pdf_path in pdfs:
        out_path = out_dir / f"{pdf_path.stem}_redacted.pdf"
        result   = redact_pdf(pdf_path, out_path, patterns)

        if result["error"]:
            print(f"  ✗ {pdf_path.name:50s} ERROR — {result['error']}")
        else:
            print(f"  ✓ {pdf_path.name:50s} {result['redactions']:>4} redaction(s)  →  {out_path.name}")

        for entry in result["log"]:
            entry["file"] = pdf_path.name
            all_logs.append(entry)

    # Write redaction log (it records the matched text, so keep it as private as the originals)
    if all_logs:
        import pandas as pd
        log_path = out_dir / "_redaction_log.csv"
        pd.DataFrame(all_logs).to_csv(log_path, index=False)
        print(f"\nRedaction log : {log_path.resolve()}")

    print(f"Output dir    : {out_dir.resolve()}")


if __name__ == "__main__":
    main()

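Built on the PyMuPDF library, the script locates the bounding box of every match and removes the text from the page content itself instead of just covering it with a black rectangle, so the redacted text can no longer be copied or extracted. Image content is left untouched (PDF_REDACT_IMAGE_NONE), which means sensitive text inside scanned images or pictures is not removed.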
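
The script reminds you to verify every output file before distributing it. One way to do that is to re-extract the text of the redacted PDF and run the same patterns over it. The sketch below is a minimal, library-free version of that check. `find_leaks` is a hypothetical helper, not part of the script above, and it assumes you have already extracted the text of each page (for example with PyMuPDF's `page.get_text()`).

```python
import re


def find_leaks(page_texts: list[str], patterns: list[str]) -> list[dict]:
    """Return one entry per pattern match that survived redaction.

    page_texts: text extracted from each page of the *redacted* file,
                in page order (hypothetical input; extraction is up to you).
    patterns:   the same regexes or literal strings used for redaction.
    """
    compiled = []
    for pat in patterns:
        try:
            compiled.append((pat, re.compile(pat, re.IGNORECASE)))
        except re.error:
            # Invalid regex: fall back to a literal match
            compiled.append((pat, re.compile(re.escape(pat), re.IGNORECASE)))

    leaks = []
    for page_num, text in enumerate(page_texts, 1):
        for label, regex in compiled:
            for match in regex.finditer(text):
                if match.group():  # ignore empty matches
                    leaks.append({"page": page_num, "pattern": label,
                                  "matched": match.group()})
    return leaks


if __name__ == "__main__":
    pages = ["Invoice for [removed]", "Contact: jane@example.com"]
    for leak in find_leaks(pages, [r"[\w.+-]+@[\w-]+\.[\w.]+"]):
        print(leak)
```

An empty result only shows that the extractable text layer is clean. It says nothing about text that lives inside images.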

5. Extract PDF Metadata & Generate Inventory: Manage PDF Libraries Efficiently

When managing a large number of PDF files, you often need the basic facts about each one: page count, file size, creation time, author, encryption status, and whether the file contains searchable text or only scanned images. Checking these by hand is impractical at scale.

What this script does

It scans all PDFs in a specified folder in batches, extracts the metadata of each file, and compiles everything into a single Excel inventory workbook. The extracted information includes:

Basic attributes: page count, file size, creation & modification time, author and producer.
Security attributes: encryption status (encrypted files will be specially marked instead of being skipped silently).
Content attributes: distinguish searchable text PDFs and scanned image PDFs by sampling page content.

The workbook also includes a separate Summary sheet with totals and averages, so you can quickly gauge the overall state of the PDF library.
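
The scanned-versus-searchable check comes down to a simple rule: sample the first few pages and call the file scanned only if none of them yields any text. Here is a minimal sketch of that rule, assuming the page text has already been extracted (for example with pdfplumber's `page.extract_text()`). `classify_pages` and its three labels are illustrative names, not part of the script.

```python
from typing import Optional, Sequence


def classify_pages(page_texts: Sequence[Optional[str]], sample_pages: int = 3) -> str:
    """Label a PDF from the extracted text of its pages.

    Returns "searchable" if any sampled page has non-whitespace text,
    "scanned" if none do, and "empty" if there are no pages to sample.
    """
    sample = list(page_texts[:sample_pages])
    if not sample:
        return "empty"
    for text in sample:
        # Extractors commonly return None or "" for image-only pages
        if (text or "").strip():
            return "searchable"
    return "scanned"


if __name__ == "__main__":
    print(classify_pages([None, "", "  \n"]))        # no text on any sampled page
    print(classify_pages(["", "Quarterly report"]))  # text on page 2
    print(classify_pages(["", "", "", "Appendix"]))  # text only beyond the sample
```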

"""
pdf_inventory.py
Scan a folder of PDF files and extract metadata into a single inventory file.

Dependencies: pypdf, pdfplumber, pandas, openpyxl
Install:      pip install pypdf pdfplumber pandas openpyxl

Usage:
    python pdf_inventory.py --input ./documents
    python pdf_inventory.py --input ./documents --output inventory.xlsx --sample-pages 3
"""

import argparse
import sys
from datetime import datetime
from pathlib import Path

import pandas as pd
import pdfplumber
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pypdf import PdfReader
from pypdf.errors import PdfReadError

# ── CONFIG ────────────────────────────────────────────────────────────────────
INPUT_FOLDER  = "./documents"
OUTPUT_FILE   = "pdf_inventory.xlsx"
SAMPLE_PAGES  = 3       # Number of pages to sample for text detection
RECURSIVE     = False   # Search subdirectories
# ─────────────────────────────────────────────────────────────────────────────

HEADER_COLOR   = "1F4E79"
ENCRYPT_COLOR  = "FFC7CE"
SCAN_COLOR     = "FFEB9C"
ALT_ROW_COLOR  = "DEEAF1"


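def parse_pdf_date(raw: str) -> str:
    """Parse a PDF date string (D:YYYYMMDDHHmmSS, optional timezone) to a readable format."""
    if not raw:
        return ""
    # removeprefix() drops only a leading "D:"; lstrip("D:") would strip a character set
    raw = str(raw).strip().removeprefix("D:").replace("'", "")
    # Slice by the number of digits each format expects; any timezone suffix is ignored
    for fmt, width in (("%Y%m%d%H%M%S", 14), ("%Y%m%d%H%M", 12), ("%Y%m%d", 8)):
        chunk = raw[:width]
        if len(chunk) < width or not chunk.isdigit():
            continue
        try:
            return datetime.strptime(chunk, fmt).strftime("%Y-%m-%d %H:%M")
        except ValueError:
            continue
    return raw[:16]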


def is_scanned(pdf_path: Path, sample_pages: int) -> bool:
    """Return True if the sampled pages contain no extractable text."""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages_to_check = pdf.pages[:sample_pages]
            for page in pages_to_check:
                text = page.extract_text() or ""
                if text.strip():
                    return False
        return True
    except Exception:
        return False


def inspect_pdf(pdf_path: Path, sample_pages: int) -> dict:
    result = {
        "filename":      pdf_path.name,
        "path":          str(pdf_path.resolve()),
        "size_kb":       round(pdf_path.stat().st_size / 1024, 2),
        "pages":         None,
        "encrypted":     False,
        "scanned":       False,
        "title":         "",
        "author":        "",
        "creator":       "",
        "producer":      "",
        "created":       "",
        "modified":      "",
        "pdf_version":   "",
        "error":         "",
    }

    try:
        reader = PdfReader(pdf_path)

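        if reader.is_encrypted:
            result["encrypted"] = True
            # Try the empty password; decrypt() returns a falsy value when it fails
            try:
                unlocked = bool(reader.decrypt(""))
            except Exception:
                unlocked = False
            if not unlocked:
                result["error"] = "Encrypted: could not open"
                return result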

        result["pages"]       = len(reader.pages)
        result["pdf_version"] = reader.pdf_header if hasattr(reader, "pdf_header") else ""

        meta = reader.metadata or {}
        result["title"]    = str(meta.get("/Title",    "")).strip()
        result["author"]   = str(meta.get("/Author",   "")).strip()
        result["creator"]  = str(meta.get("/Creator",  "")).strip()
        result["producer"] = str(meta.get("/Producer", "")).strip()
        result["created"]  = parse_pdf_date(str(meta.get("/CreationDate", "")))
        result["modified"] = parse_pdf_date(str(meta.get("/ModDate",      "")))

        result["scanned"] = is_scanned(pdf_path, sample_pages)

    except PdfReadError as e:
        result["error"] = f"PdfReadError: {e}"
    except Exception as e:
        result["error"] = str(e)

    return result


def style_wb(path: Path, records: list[dict]) -> None:
    wb = load_workbook(path)
    ws = wb.active

    header_fill  = PatternFill("solid", fgColor=HEADER_COLOR)
    encrypt_fill = PatternFill("solid", fgColor=ENCRYPT_COLOR)
    scan_fill    = PatternFill("solid", fgColor=SCAN_COLOR)
    alt_fill     = PatternFill("solid", fgColor=ALT_ROW_COLOR)

    for cell in ws[1]:
        cell.font = Font(bold=True, color="FFFFFF")
        cell.fill = header_fill
        cell.alignment = Alignment(horizontal="center")

    headers = [cell.value for cell in ws[1]]
    enc_col = headers.index("encrypted") + 1 if "encrypted" in headers else None
    scn_col = headers.index("scanned")   + 1 if "scanned"   in headers else None

    for row_idx, row in enumerate(ws.iter_rows(min_row=2), start=0):
        is_enc = ws.cell(row=row_idx + 2, column=enc_col).value if enc_col else False
        is_scn = ws.cell(row=row_idx + 2, column=scn_col).value if scn_col else False

        if is_enc:
            fill = encrypt_fill
        elif is_scn:
            fill = scan_fill
        elif row_idx % 2 == 0:
            fill = alt_fill
        else:
            fill = None

        if fill:
            for cell in row:
                cell.fill = fill

    for col_idx in range(1, ws.max_column + 1):
        ws.column_dimensions[get_column_letter(col_idx)].width = 22
    ws.freeze_panes = "A2"

    # Summary sheet
    if "Summary" in wb.sheetnames:
        ws_sum = wb["Summary"]
        for cell in ws_sum[1]:
            cell.font = Font(bold=True, color="FFFFFF")
            cell.fill = header_fill

    wb.save(path)


def main():
    parser = argparse.ArgumentParser(description="Generate a metadata inventory of PDF files.")
    parser.add_argument("--input",        default=INPUT_FOLDER,
                        help="Folder containing PDF files")
    parser.add_argument("--output",       default=OUTPUT_FILE)
    parser.add_argument("--sample-pages", type=int, default=SAMPLE_PAGES,
                        help="Pages to sample for scanned-image detection")
    parser.add_argument("--recursive",    action="store_true", default=RECURSIVE,
                        help="Search subdirectories")
    args = parser.parse_args()

    folder = Path(args.input)
    if not folder.exists():
        sys.exit(f"[ERROR] Folder not found: {folder}")

    glob_pattern = "**/*.pdf" if args.recursive else "*.pdf"
    pdfs = sorted(folder.glob(glob_pattern))
    if not pdfs:
        sys.exit(f"[ERROR] No PDF files found in: {folder}")

    print(f"Found {len(pdfs):,} PDF file(s)\n")

    records = []
    for i, pdf_path in enumerate(pdfs, 1):
        print(f"  [{i}/{len(pdfs)}] {pdf_path.name}")
        record = inspect_pdf(pdf_path, args.sample_pages)
        records.append(record)
        flags = []
        if record["encrypted"]: flags.append("ENCRYPTED")
        if record["scanned"]:   flags.append("SCANNED")
        if record["error"]:     flags.append(f"ERROR: {record['error']}")
        if flags:
            print(f"    ⚠ {', '.join(flags)}")

    df = pd.DataFrame(records)

    # Summary stats ("pages" is None for unreadable files, so coerce it to numbers first)
    pages       = pd.to_numeric(df["pages"], errors="coerce")
    total_pages = int(pages.sum())
    total_size  = df["size_kb"].sum()
    summary = pd.DataFrame([
        {"Metric": "Total files",       "Value": len(df)},
        {"Metric": "Total size (KB)",   "Value": round(total_size, 2)},
        {"Metric": "Total size (MB)",   "Value": round(total_size / 1024, 2)},
        {"Metric": "Total pages",       "Value": total_pages},
        {"Metric": "Avg pages/file",    "Value": round(pages.mean(), 1)},
        {"Metric": "Encrypted files",   "Value": int(df["encrypted"].sum())},
        {"Metric": "Scanned (image) files", "Value": int(df["scanned"].sum())},
        {"Metric": "Files with errors", "Value": int((df["error"] != "").sum())},
    ])

    print(f"\nTotal files : {len(df)}")
    print(f"Total pages : {total_pages}")
    print(f"Encrypted   : {df['encrypted'].sum()}")
    print(f"Scanned     : {df['scanned'].sum()}")

    out = Path(args.output)
    with pd.ExcelWriter(out, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name="Inventory", index=False)
        summary.to_excel(writer, sheet_name="Summary", index=False)

    style_wb(out, records)
    print(f"\nOutput written to: {out.resolve()}")
    print("  Yellow rows = scanned/image PDFs (no extractable text)")
    print("  Red rows    = encrypted files")


if __name__ == "__main__":
    main()

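It uses pypdf to read each file's native metadata and pdfplumber to check whether the sampled pages contain extractable text. Because only the first few pages are sampled (3 by default), a file whose opening pages are scanned but whose later pages contain text is still flagged as scanned; raise --sample-pages for mixed documents.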
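
PDF dates follow the pattern `D:YYYYMMDDHHmmSS`, optionally followed by a timezone such as `+05'30'` or `Z`. The inventory script discards the timezone. If you want to keep it, a sketch along these lines could work. `parse_pdf_datetime` is a hypothetical helper, and it assumes the string is well formed apart from the optional parts.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by an optional Z, +HH'mm' or -HH'mm' offset.
# Every field after the year is treated as optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<oh>\d{2})'?(?:(?P<om>\d{2})'?)?)?)?$"
)


def parse_pdf_datetime(raw: str) -> Optional[datetime]:
    """Parse a PDF date string into a datetime, or return None if it is invalid."""
    m = _PDF_DATE.match(raw.strip())
    if not m:
        return None
    try:
        tz = None
        if m["sign"]:
            # "Z" carries no digits in the common case, which yields a zero offset (UTC)
            offset = timedelta(hours=int(m["oh"] or 0), minutes=int(m["om"] or 0))
            tz = timezone(-offset if m["sign"] == "-" else offset)
        return datetime(
            int(m["year"]), int(m["month"] or 1), int(m["day"] or 1),
            int(m["hour"] or 0), int(m["minute"] or 0), int(m["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:  # e.g. month 13, day 32, or an offset of 24 hours or more
        return None


if __name__ == "__main__":
    print(parse_pdf_datetime("D:20240115123045+05'30'"))
    print(parse_pdf_datetime("D:20240115"))
```

Strings without a timezone produce a naive datetime, so avoid comparing them directly with timezone-aware values.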

Final Thoughts: Start Your PDF Automation Journey

These five Python scripts cover the most common PDF processing scenarios in daily work. They share three core advantages: they run independently from the command line, support full-folder batch processing, and always write new files instead of overwriting the originals.
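
The rule of never overwriting an original can be captured in one small helper. The sketch below is an assumption about how you might enforce it in your own scripts, not code taken from the five scripts: `safe_output_path` is a hypothetical function that appends a suffix to the file stem and adds a counter if that name is already taken.

```python
from pathlib import Path


def safe_output_path(src: Path, out_dir: Path, suffix: str = "_processed") -> Path:
    """Return a path in out_dir that does not exist yet."""
    out_dir.mkdir(parents=True, exist_ok=True)
    candidate = out_dir / f"{src.stem}{suffix}{src.suffix}"
    counter = 1
    # Keep bumping the counter until the name is free
    while candidate.exists():
        candidate = out_dir / f"{src.stem}{suffix}_{counter}{src.suffix}"
        counter += 1
    return candidate


if __name__ == "__main__":
    print(safe_output_path(Path("report.pdf"), Path("./output")))
```

The existence check and the later write are separate steps, so this is suited to single-user batch jobs rather than to several processes writing into the same folder at once.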

Beginners only need to install the required libraries (pypdf, pdfplumber, reportlab and pymupdf, plus pandas and openpyxl for the logs and the inventory) and adjust the file paths and configuration parameters in the scripts. We recommend testing with a few sample PDFs first to verify the output, then scaling up to large batches.
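
Before running the scripts, it can help to confirm that the libraries are importable. The following sketch uses only the standard library. The mapping from pip package names to import names is an assumption based on each package's usual import name (notably, PyMuPDF is installed as `pymupdf` but imported as `fitz` in the redaction script), so adjust it if your versions differ.

```python
import importlib.util

# pip package name -> module name to import (assumed; check against your installed versions)
REQUIRED = {
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "reportlab": "reportlab",
    "pymupdf": "fitz",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages. Install with:")
        print("  pip install " + " ".join(missing))
    else:
        print("All dependencies are available.")
```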

Whether you are an office worker, data analyst, developer or document manager, these automation scripts can cut down hours of repetitive manual work, reduce error rates, and greatly improve overall work efficiency. Stop wasting time on tedious PDF operations — let Python handle the boring work for you.


Posted on : 2026-06-14 16:50:13.051047