Introduction
Manipulating PDFs can be challenging because of their complex structure, but with Python and the PyMuPDF library you can search for text, replace it, and save the modified PDF. In this tutorial, we’ll create a Python CLI tool that finds and replaces text in a PDF. The replacement is drawn with a default font, size, and color, so it approximates the original styling rather than reproducing it exactly.
Prerequisites
Before you begin, ensure you have the following installed:
- Python: Version 3.7 or above.
- pip: Python's package manager.
- PyMuPDF: A Python library for working with PDFs.
Install PyMuPDF using pip:
pip install pymupdf
Additionally, we’ll use the click library to create a user-friendly command-line interface (CLI):
pip install click
Code Walkthrough
Here’s the complete code for our CLI tool:
import click
from pathlib import Path
import fitz  # PyMuPDF


@click.command()
@click.argument("input_pdf", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.argument("output_pdf", type=click.Path(dir_okay=False, writable=True, path_type=Path))
@click.argument("find_text", type=str)
@click.argument("replace_text", type=str)
def replace_text_in_pdf(input_pdf: Path, output_pdf: Path, find_text: str, replace_text: str):
    """
    Replace FIND_TEXT with REPLACE_TEXT in INPUT_PDF and save the result to OUTPUT_PDF.
    """
    # Open the input PDF
    doc = fitz.open(str(input_pdf))
    for page_num, page in enumerate(doc, start=1):
        # Search for occurrences of find_text; each hit is a bounding rectangle
        instances = page.search_for(find_text)
        if not instances:
            click.echo(f"No occurrences of '{find_text}' found on page {page_num}.")
            continue
        click.echo(f"Found {len(instances)} occurrences on page {page_num}. Replacing...")
        for rect in instances:
            # First, redact (remove) the original text
            page.add_redact_annot(rect)
            page.apply_redactions()
            # Default values for text properties
            font = "helv"  # Default to Helvetica
            font_size = 7.0  # Default size
            color = (0, 0, 0)  # Default to black, given as 0-255 RGB
            # Normalize the color values to the 0 to 1 range that insert_text expects
            normalized_color = tuple(c / 255 for c in color) if isinstance(color, tuple) else (0, 0, 0)
            # Calculate the baseline position for text insertion
            baseline = fitz.Point(rect.x0, rect.y1 - 2.2)  # Adjust the 2.2 offset as needed
            # Insert the new text at the adjusted position
            page.insert_text(
                baseline,
                replace_text,
                fontsize=font_size,
                fontname=font,
                color=normalized_color,
            )
        click.echo(f"Replaced '{find_text}' with '{replace_text}' on page {page_num}.")
    # Save the modified PDF
    doc.save(str(output_pdf))
    click.echo(f"Modified PDF saved to {output_pdf}.")


if __name__ == "__main__":
    replace_text_in_pdf()
How It Works
- Searching for Text: The page.search_for(find_text) method identifies all occurrences of the specified text and returns their bounding rectangles.
- Redacting Original Text: The page.add_redact_annot(rect) and page.apply_redactions() methods remove the original text from the page rather than merely covering it.
- Inserting Replacement Text: Using page.insert_text(), we add the replacement text at the same location as the original, maintaining as much visual similarity as possible.
- Saving the PDF: Finally, the modified document is saved to the specified output file.
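The per-page part of these steps can also be sketched as a small helper. This is a variation on the tool above, not a replacement for it: the name replace_on_page and its parameters are hypothetical, and it assumes it is handed a PyMuPDF page object. The one behavioral difference is that it marks every match first and applies the redactions once per page, instead of once per match.

```python
def replace_on_page(page, find_text, replace_text,
                    fontname="helv", fontsize=7.0, baseline_offset=2.2):
    """Replace every occurrence of find_text on one page.

    `page` is assumed to behave like a PyMuPDF page: it must provide
    search_for, add_redact_annot, apply_redactions and insert_text.
    Returns the number of replacements made.
    """
    rects = page.search_for(find_text)
    if not rects:
        return 0

    # Step 1: mark every match for redaction.
    for rect in rects:
        page.add_redact_annot(rect)

    # Step 2: remove all marked text in a single pass.
    page.apply_redactions()

    # Step 3: draw the replacement slightly above the bottom edge of each
    # original rectangle (the offset is a hand-tuned guess, as in the tool).
    for rect in rects:
        page.insert_text(
            (rect.x0, rect.y1 - baseline_offset),
            replace_text,
            fontname=fontname,
            fontsize=fontsize,
            color=(0, 0, 0),
        )
    return len(rects)
```

Inside the main loop of the tool, a call such as replace_on_page(page, find_text, replace_text) would stand in for the inner for loop.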
Running the Tool
Save the code to a file, e.g., replace_text_pdf.py. Then, run it from the terminal as follows:
python replace_text_pdf.py input.pdf output.pdf "find_text" "replace_text"
Example
Suppose you have a PDF named example.pdf with the word Python in it, and you want to replace it with PyMuPDF. Run:
python replace_text_pdf.py example.pdf modified_example.pdf "Python" "PyMuPDF"
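After a run like this, it can be useful to confirm that the old text is really gone from the output. The following is a minimal sketch of such a check; the function name count_occurrences is hypothetical, and it assumes doc is an opened PyMuPDF document (anything that yields pages with a search_for method will do).

```python
def count_occurrences(doc, text):
    """Return {page_number: hits} for every page where `text` is found.

    Page numbers are 1-based to match the tool's messages. An empty dict
    means the text was not found anywhere in the document.
    """
    counts = {}
    for page_num, page in enumerate(doc, start=1):
        hits = len(page.search_for(text))
        if hits:
            counts[page_num] = hits
    return counts
```

For the example above, opening modified_example.pdf with fitz.open and calling count_occurrences(doc, "Python") should return an empty dict if every occurrence was replaced.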
Important Notes
- Font and Style:
  - The tool assumes Helvetica (helv) as the default font.
  - You can customize the font and style by extracting properties from the PDF, though a perfect match is not guaranteed because of PDF limitations.
- PDF Structure:
  - PDFs are not inherently designed for text editing. This tool works best with text-based PDFs. Scanned images contain no searchable text, and PDFs that use embedded fonts may not look the same once the replacement is drawn in Helvetica.
- Testing:
  - Always back up your original PDF before using this tool.
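The Font and Style note mentions extracting properties from the PDF. One possible way to do that, sketched here under assumptions, is to look at the span information that page.get_text("dict") returns and pick the first span that overlaps a match rectangle. The helper names srgb_int_to_rgb, rects_overlap, and style_at are hypothetical. The reported font name comes from the PDF itself and is not necessarily a name that insert_text accepts, so treat it as a hint rather than a drop-in value.

```python
def srgb_int_to_rgb(color_int):
    """Convert a packed 0xRRGGBB integer to an (r, g, b) tuple of floats in 0..1."""
    r = (color_int >> 16) & 0xFF
    g = (color_int >> 8) & 0xFF
    b = color_int & 0xFF
    return (r / 255, g / 255, b / 255)


def rects_overlap(a, b):
    """Return True if two (x0, y0, x1, y1) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def style_at(text_dict, rect):
    """Return font, size and color of the first text span overlapping `rect`.

    `text_dict` is assumed to have the shape produced by
    page.get_text("dict"): blocks -> lines -> spans. Returns None when
    no span overlaps the rectangle.
    """
    for block in text_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if rects_overlap(span["bbox"], rect):
                    return {
                        "font": span["font"],
                        "size": span["size"],
                        "color": srgb_int_to_rgb(span["color"]),
                    }
    return None
```

In the tool, style_at(page.get_text("dict"), tuple(rect)) would need to be called before the redaction is applied, since the original span no longer exists afterwards. The returned size and color could then replace the hard-coded defaults.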
Conclusion
With Python and PyMuPDF, replacing text in a PDF takes only a short script. This tutorial covered a CLI tool that can be extended further to suit specific needs. Try it out, and let us know how it works for you in the comments!
Happy coding! 🚀