abbazs

Posted on Dec 24, 2024

Replace Text in PDFs Using Python

#pdf #pymupdf

Introduction

Manipulating PDFs can be challenging because of their complex structure, but with Python and the PyMuPDF library you can search for text, replace it, and save the modified PDF. In this tutorial, we’ll create a Python CLI tool that finds and replaces text in a PDF. The replacement is drawn with fixed defaults (Helvetica, 7 pt, black), which you can adjust to match the original font, size, and style as closely as possible.

Prerequisites

Before you begin, ensure you have the following installed:

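Python: Version 3.7 or above (recent PyMuPDF releases require a newer Python; check the PyMuPDF installation notes for the current minimum).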
pip: Python's package manager.
PyMuPDF: A Python library for working with PDFs.

Install PyMuPDF using pip:

pip install pymupdf

Additionally, we’ll use the click library to create a user-friendly command-line interface (CLI):

pip install click

Code Walkthrough

Here’s the complete code for our CLI tool:

import click
from pathlib import Path
import fitz  # PyMuPDF

@click.command()
@click.argument("input_pdf", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.argument("output_pdf", type=click.Path(dir_okay=False, writable=True, path_type=Path))
@click.argument("find_text", type=str)
@click.argument("replace_text", type=str)
def replace_text_in_pdf(input_pdf: Path, output_pdf: Path, find_text: str, replace_text: str):
    """
    Replace FIND_TEXT with REPLACE_TEXT in INPUT_PDF and save the result to OUTPUT_PDF.
    """
    # Open the input PDF
    doc = fitz.open(str(input_pdf))

    for page_num, page in enumerate(doc, start=1):
        # Search for occurrences of find_text
        instances = page.search_for(find_text)

        if not instances:
            click.echo(f"No occurrences of '{find_text}' found on page {page_num}.")
            continue

        click.echo(f"Found {len(instances)} occurrences on page {page_num}. Replacing...")

        # Mark every match for redaction, then apply once per page, so that a
        # later redaction cannot erase replacement text inserted earlier.
        for rect in instances:
            page.add_redact_annot(rect)
        page.apply_redactions()

        # Text properties for the replacement (hard-coded, not read from the PDF)
        font = "helv"  # Built-in Helvetica
        font_size = 7.0  # Adjust to match your document
        color = (0, 0, 0)  # Black; PyMuPDF expects RGB components in the range 0 to 1

        for rect in instances:
            # insert_text expects the start of the text baseline, so move up
            # slightly from the bottom edge of the match rectangle.
            baseline = fitz.Point(rect.x0, rect.y1 - 2.2)  # Tune the 2.2 offset as needed

            # Insert the new text at the adjusted position
            page.insert_text(
                baseline,
                replace_text,
                fontsize=font_size,
                fontname=font,
                color=color,
            )
            click.echo(f"Replaced '{find_text}' with '{replace_text}' on page {page_num}.")

    # Save the modified PDF
    doc.save(str(output_pdf))
    click.echo(f"Modified PDF saved to {output_pdf}.")

if __name__ == "__main__":
    replace_text_in_pdf()

How It Works

Searching for Text:
The page.search_for(find_text) method identifies all occurrences of the specified text and returns their bounding rectangles.
Redacting Original Text:
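The page.add_redact_annot(rect) and page.apply_redactions() methods remove the original text from the page content itself rather than just covering it. Any other text that overlaps the rectangle is removed too, and by default the redacted area is filled with white, which can be visible on a colored background.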
Inserting Replacement Text:
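Using page.insert_text(), we add the replacement text at the same location as the original. The font, size, and color come from the script’s defaults, not from the original text, so the result matches visually only if you set them accordingly.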
Saving the PDF:
Finally, the modified document is saved to the specified output file.
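
The point passed to page.insert_text() is the bottom-left position of the first character, which is why the script subtracts a small offset from rect.y1. As a minimal sketch, the offset could be derived from the font size instead of being hard-coded. The helper name baseline_for_rect and the descender_ratio default of 0.2 are assumptions made for illustration, not values taken from PyMuPDF, so tune the ratio against your own documents.

```python
from typing import Tuple


def baseline_for_rect(
    x0: float, y1: float, font_size: float, descender_ratio: float = 0.2
) -> Tuple[float, float]:
    """Return an (x, y) insertion point for text placed in a match rectangle.

    x0 and y1 are the left and bottom edges of the rectangle, in page
    coordinates where y grows downward. descender_ratio is the assumed
    fraction of the font size that hangs below the baseline; 0.2 is a
    guess for illustration, not a PyMuPDF constant.
    """
    if font_size <= 0:
        raise ValueError("font_size must be positive")
    # Move up from the bottom edge by the estimated descender height.
    return (x0, y1 - font_size * descender_ratio)


point = baseline_for_rect(x0=72.0, y1=100.0, font_size=10.0)
print(point)  # (72.0, 98.0)
```

In the script, fitz.Point(*baseline_for_rect(rect.x0, rect.y1, font_size)) could then stand in for the fixed 2.2 offset.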

Running the Tool

Save the code to a file, e.g., replace_text_pdf.py. Then, run it from the terminal as follows:

python replace_text_pdf.py input.pdf output.pdf "find_text" "replace_text"

Example

Suppose you have a PDF named example.pdf with the word Python in it, and you want to replace it with PyMuPDF. Run:

python replace_text_pdf.py example.pdf modified_example.pdf "Python" "PyMuPDF"

Important Notes

Font and Style:
- The tool assumes Helvetica (helv) as the default font.
- You can customize the font, size, and color by extracting them from the text you are replacing, though a perfect match is not guaranteed: the original font may not be available for inserting new text.
PDF Structure:
- PDFs are not inherently designed for text editing. This tool works best with text-based PDFs. It cannot find text in scanned images, and replacement text drawn in Helvetica may look out of place in documents that use other embedded fonts.
Testing:
- Always back up your original PDF before using this tool.
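
To make the font and style note concrete, here is a rough sketch of reading the style of the first text span from the dictionary that PyMuPDF returns for page.get_text("dict"). The function names (srgb_int_to_rgb, builtin_font_for, first_span_style) are hypothetical helpers, the sample dictionary is hand-written, and the font matching is only a name-based heuristic that maps onto PyMuPDF’s built-in font codes such as helv, tiro, and cour.

```python
from typing import Optional, Tuple

RGB = Tuple[float, float, float]


def srgb_int_to_rgb(value: int) -> RGB:
    """Convert a packed 0xRRGGBB integer to RGB floats in the range 0 to 1."""
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    return (red / 255, green / 255, blue / 255)


def builtin_font_for(pdf_font_name: str) -> str:
    """Pick a built-in font code that roughly resembles pdf_font_name.

    Heuristic only: looks for 'times', 'courier', 'bold', and
    'italic'/'oblique' in the name and falls back to Helvetica variants.
    """
    name = pdf_font_name.lower()
    bold = "bold" in name
    italic = "italic" in name or "oblique" in name
    if "times" in name:
        family = "ti"
    elif "courier" in name:
        family = "co"
    else:
        family = "he"
    if bold and italic:
        return family + "bi"
    if bold:
        return family + "bo"
    if italic:
        return family + "it"
    # Regular weights use their own codes.
    return {"ti": "tiro", "co": "cour", "he": "helv"}[family]


def first_span_style(text_dict: dict) -> Optional[Tuple[str, float, RGB]]:
    """Return (font code, size, color) of the first span, or None if there is none."""
    for block in text_dict.get("blocks", []):
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                return (
                    builtin_font_for(span["font"]),
                    span["size"],
                    srgb_int_to_rgb(span["color"]),
                )
    return None


# Hand-written sample in the shape of a get_text("dict") result.
sample = {"blocks": [{"lines": [{"spans": [
    {"font": "TimesNewRomanPS-BoldMT", "size": 11.0, "color": 0xFF0000, "text": "Python"}
]}]}]}
print(first_span_style(sample))  # ('tibo', 11.0, (1.0, 0.0, 0.0))
```

With PyMuPDF, the input would come from page.get_text("dict", clip=rect), called before the redaction is applied, and the three returned values could replace the font, font_size, and color defaults in the script.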

Conclusion

With Python and PyMuPDF, replacing text in a PDF is straightforward and powerful. This tutorial covered a CLI tool that can be extended further to suit specific needs. Try it out, and let us know how it works for you in the comments!

Happy coding! 🚀
