Introducing PdfPig
PdfPig is an open source C# library that lets us extract text and other content from PDF files. It is a port of the Java PDFBox library. You can find more here: https://github.com/UglyToad/PdfPig
The Word Counter
The project is a simple console application: the program source plus a pdf folder that holds the document to analyze.
Creating The Project
With Visual Studio 2022, follow the steps below:
Open Visual Studio 2022
Create New Project
Select Console Application in C#
Set Name and Path
Choose .NET 5.0 as the framework
Now you have the basic structure of a console application. Create a folder called pdf and add a PDF file to it (the code below expects it to be named test.pdf). In this tutorial I used a PDF created from this page: https://en.wikipedia.org/wiki/Cr%C3%AApe
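The program opens this file through a hard-coded path. As a hedged alternative, here is a minimal sketch that builds the path relative to the application's base directory. It assumes the pdf folder is copied next to the executable (for example by setting Copy to Output Directory on the file). PdfLocation and its methods are hypothetical names, not part of PdfPig.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of PdfPig or the original project.
internal static class PdfLocation
{
    // Joins the parts with the platform's directory separator.
    public static string Resolve(string baseDirectory, string folder, string fileName)
    {
        return Path.Combine(baseDirectory, folder, fileName);
    }

    // Assumption: the pdf folder sits next to the compiled executable.
    public static string Default()
    {
        return Resolve(AppContext.BaseDirectory, "pdf", "test.pdf");
    }
}
```

The result of PdfLocation.Default() could then be passed to PdfDocument.Open in place of a literal path.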
To get PdfPig:
In the search bar, type Manage NuGet Packages
Click on Browse
Search for PdfPig
Install it
The Code
Thanks to PdfPig, extracting text from a PDF and counting the occurrences of a word is straightforward. Here is the full code:
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

namespace pdf_pig_word_counter
{
    internal class Program
    {
        static void Main(string[] args)
        {
            string wordToFind = "pancake";
            int numberOfOccurrences = 0;

            // Open the document; the using block disposes it when we are done.
            using (PdfDocument document = PdfDocument.Open(@"YOUR PATH\pdf\test.pdf"))
            {
                foreach (Page page in document.GetPages())
                {
                    // GetWords() returns the words PdfPig detects on the page.
                    foreach (Word word in page.GetWords())
                    {
                        // Case-insensitive substring match, so "Pancakes" also counts.
                        if (word.Text.Contains(wordToFind, StringComparison.OrdinalIgnoreCase))
                            numberOfOccurrences++;
                    }
                }

                Console.WriteLine("Total Occurrences: " + numberOfOccurrences);
            }
        }
    }
}
This program tells us how many words in the PDF contain "pancake". The comparison ignores case and matches substrings, so "Pancakes" is counted too.
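Because the check is a substring match, the count can be higher than the number of exact occurrences. The sketch below isolates the matching step so both behaviours can be compared on plain strings, without a PDF. WordMatching and Count are hypothetical names introduced for illustration; they are not part of PdfPig.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of PdfPig or the original project.
internal static class WordMatching
{
    // Counts the words that match target, ignoring case.
    // wholeWord = false: the word only has to contain target.
    // wholeWord = true: the word has to equal target.
    public static int Count(IEnumerable<string> words, string target, bool wholeWord)
    {
        int count = 0;
        foreach (string word in words)
        {
            bool isMatch = wholeWord
                ? string.Equals(word, target, StringComparison.OrdinalIgnoreCase)
                : word.Contains(target, StringComparison.OrdinalIgnoreCase);

            if (isMatch)
                count++;
        }
        return count;
    }
}
```

For the sample words "Pancake", "pancakes", "crêpe" and "PANCAKE." with the target "pancake", the substring mode counts 3 and the whole-word mode counts 1. Note that the whole-word mode misses "PANCAKE." because of the attached full stop; extracted words may carry punctuation like this, so trimming it first may be worthwhile. With PdfPig, the words could be supplied as page.GetWords().Select(w => w.Text), which needs using System.Linq.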
You can find the project here: https://github.com/CertosinoLab/mediumarticles/tree/pdf_pig_word_counter
Thank you!