Seenivasa Ramadurai

Posted on Feb 17

Microsoft Presidio and LangGraph: Enhancing AI Agents with Robust PII Protection and Data Anonymization

In today's data-driven world, protecting personally identifiable information (PII) is crucial for businesses and organizations. Microsoft Presidio, an open-source Python framework, offers a robust solution for detecting and anonymizing sensitive data in text and images. This blog post explores Presidio's key features and shows how to integrate it with LangGraph in a graph-based anonymization workflow.

Understanding Presidio's Core Components

Presidio's functionality revolves around two main engines:

AnalyzerEngine

The AnalyzerEngine helps identify Personally Identifiable Information (PII) in text. It works by:

Running a variety of PII recognizers, each designed to spot a different type of PII.
Using several detection methods, such as regular expressions, Named Entity Recognition (NER), checksums, and rule-based logic that looks at surrounding context.
Supporting pre-built recognizers and making it easy to add custom ones.

Key features of the AnalyzerEngine include:

A RecognizerRegistry that stores all available entity recognizers.

An NlpEngine that processes the text, extracting useful features like tokens, lemmas, and entities.

Support for multiple languages, with English as the default.
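
To make the registry-and-recognizers idea concrete, here is a minimal pure-Python sketch. It is not Presidio's code or API: `Finding`, `TOY_REGISTRY`, and `toy_analyze` are hypothetical names invented for illustration, and the two regexes are deliberately simplistic. It only shows the general shape of the approach: several recognizers run over the same text, and each match is reported as an entity type, a character span, and a confidence score.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One detected span (hypothetical type, not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical toy registry: entity type -> (regex, fixed confidence score).
TOY_REGISTRY = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.9),
    "PHONE_NUMBER": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.6),
}


def toy_analyze(text: str) -> list[Finding]:
    """Run every registered recognizer and return findings in text order."""
    findings = [
        Finding(entity_type, match.start(), match.end(), score)
        for entity_type, (pattern, score) in TOY_REGISTRY.items()
        for match in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda finding: finding.start)


sample = "Mail sreeni@example.com or call 123-456-7890."
for finding in toy_analyze(sample):
    print(finding.entity_type, sample[finding.start:finding.end], finding.score)
```

Adding a "custom recognizer" in this toy model is just one more entry in the dictionary. The real AnalyzerEngine layers NER, checksums, and context on top of this pattern-matching idea.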

AnonymizerEngine

The AnonymizerEngine takes care of anonymizing any PII found in the text. Here's how it works:

It takes the original text and the PII results from the AnalyzerEngine.
It then applies the chosen anonymization methods to replace, remove, or transform the detected PII.

The AnonymizerEngine provides several ways to anonymize PII:

Replace: Swap out PII with a specified value or with the entity type.
Redact: Completely remove PII from the text.
Hash: Replace PII with a hash of its value, such as SHA-256 or SHA-512. MD5, where a release still offers it, should not be treated as secure.
Mask: Replace some or all characters of the PII with a masking character such as *.
Encrypt: Protect the PII with AES encryption, using a key you supply.
Custom: Create your own anonymization method.
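
The operators listed above can be pictured with a small pure-Python sketch. This is an illustration of the idea only, not Presidio's implementation or API: the function names (`replace`, `redact`, `hash_sha256`, `mask_last_four`, `toy_anonymize`, `span`) and the tuple format for findings are assumptions made for this example. The one design point worth noticing is that substitutions are applied from right to left, so the offsets of earlier findings stay valid while the text changes length.

```python
import hashlib


def replace(value: str, entity_type: str) -> str:
    """Swap the value for a placeholder naming its entity type."""
    return f"<{entity_type}>"


def redact(value: str, entity_type: str) -> str:
    """Remove the value entirely."""
    return ""


def hash_sha256(value: str, entity_type: str) -> str:
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_last_four(value: str, entity_type: str) -> str:
    """Hide the last four characters behind asterisks."""
    return value[:-4] + "*" * min(4, len(value))


def toy_anonymize(text, findings, operators):
    """findings: (entity_type, start, end) tuples; operators: entity_type -> function."""
    # Go right to left so earlier offsets are unaffected by each substitution.
    for entity_type, start, end in sorted(findings, key=lambda f: f[1], reverse=True):
        operator = operators.get(entity_type, replace)  # default: replace with type
        text = text[:start] + operator(text[start:end], entity_type) + text[end:]
    return text


def span(text: str, value: str, entity_type: str):
    """Helper for the demo: locate a known value and build a finding tuple."""
    start = text.index(value)
    return (entity_type, start, start + len(value))


text = "Call 123-456-7890 or mail sreeni@example.com"
findings = [
    span(text, "123-456-7890", "PHONE_NUMBER"),
    span(text, "sreeni@example.com", "EMAIL_ADDRESS"),
]
print(toy_anonymize(text, findings, {"PHONE_NUMBER": mask_last_four}))
```

Choosing a different operator per entity type, with a default for everything else, mirrors the way the list above treats each method as a separate, selectable strategy.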

Together, the AnalyzerEngine and AnonymizerEngine work seamlessly to detect and protect sensitive information in text, making Presidio a great tool for organizations focused on data privacy and compliance.

Microsoft Presidio is a powerful data protection and de-identification SDK that helps organizations manage and govern sensitive data effectively. It provides fast identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, and more.

In this blog post, we'll explore how to use Microsoft Presidio in conjunction with LangGraph to create a robust anonymization pipeline. We'll demonstrate this using a Python script that leverages Presidio's capabilities within a graph-based workflow.

The diagram below illustrates the two main engines of Microsoft Presidio and how they work together to detect and anonymize data.

from typing import TypedDict

from dotenv import load_dotenv
from langgraph.graph import StateGraph, START, END
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

load_dotenv()

# Build the engines once; AnalyzerEngine loads an NLP model, which is slow.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()


class AnonymizerRequest(TypedDict):
    text: str
    anonymized_text: str


def anonymize_text(state: AnonymizerRequest):
    original_text = state["text"]
    # Detect PII, then replace each finding with the default placeholder.
    results = analyzer.analyze(text=original_text, language="en")
    anonymized_text = anonymizer.anonymize(
        text=original_text, analyzer_results=results
    ).text
    # The parent graph shares only the "text" key with this subgraph, so the
    # anonymized result is also appended to "text" to make it visible there.
    # State is updated through the returned dict, not by mutating `state`.
    return {
        "text": original_text + " anonymized_text=" + anonymized_text,
        "anonymized_text": anonymized_text,
    }


subgraph_builder = StateGraph(AnonymizerRequest)
subgraph_builder.add_node("get_anonymized", anonymize_text)
subgraph_builder.add_edge(START, "get_anonymized")
subgraph_builder.add_edge("get_anonymized", END)

subgraph = subgraph_builder.compile()
# draw_ascii() and draw_png() rely on optional dependencies (grandalf, pygraphviz).
print(subgraph.get_graph().draw_ascii())

with open("sreeni_ann.png", "wb") as file:
    file.write(subgraph.get_graph().draw_png())

state = {"text": "I am sreeni my email address is sreeni@outlook.com", "anonymized_text": ""}
print(subgraph.invoke(state))

class ParentState(TypedDict):
    text: str


def text_to_get_anonymized(state: ParentState):
    # Prefix the incoming text before handing it to the subgraph.
    return {"text": "anonymize text: " + state["text"]}


parent_graph_builder = StateGraph(ParentState)
parent_graph_builder.add_node("text_to_anonymize", text_to_get_anonymized)
# A compiled graph can be added as a node; it shares the "text" key with the parent.
parent_graph_builder.add_node("call_anonymize_text_subgraph", subgraph)
parent_graph_builder.add_edge(START, "text_to_anonymize")
parent_graph_builder.add_edge("text_to_anonymize", "call_anonymize_text_subgraph")
parent_graph_builder.add_edge("call_anonymize_text_subgraph", END)

graph = parent_graph_builder.compile()
print(graph.get_graph().draw_ascii())

with open("sreeni_Main_anonymize_subgraph.png", "wb") as file:
    file.write(graph.get_graph().draw_png())

# Run the graph. With subgraphs=True the result is a (namespace, state) tuple,
# as seen in the output below.
result = graph.invoke(
    {
        "text": (
            "I noticed unauthorized transactions on my bank account 123456789012 "
            "and credit card 4111 1111 1111 1111. I am also locked out of my email "
            "sreeni@example.com, and my US, DL D1234567, and phone number 123-456-7890 "
            "may have been compromised. Please take immediate action to secure my "
            "accounts and advise on the next steps.Best,Sreeni Ramadurai "
            "Phone: (123) 456-7890 Email: sreeni@example.com"
        )
    },
    subgraphs=True,
)
print(result)

The output contains both the original input and the anonymized text:

((), {'text': 'anonymize text: I noticed unauthorized transactions on my bank account 123456789012 and credit card 4111 1111 1111 1111. I am also locked out of my email sreeni@example.com, and my US, DL D1234567, and phone number 123-456-7890 may have been compromised. Please take immediate action to secure my accounts and advise on the next steps.Best,Sreeni Ramadurai Phone: (123) 456-7890 Email: sreeni@example.com anonymized_text=anonymize text:

Why did I add it as a subgraph?

A subgraph is a graph that is used as a node in another graph. Some reasons for using subgraphs are:

Building multi-agent systems.
Reusing a set of nodes in multiple graphs, which may share some state: you can define them once in a subgraph and then use them in multiple parent graphs.
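
The "define once, reuse in multiple parent graphs" point, and the role of shared state, can be pictured with a toy model in plain Python. This is a simplified mental model and not LangGraph's implementation: `toy_anonymizer_subgraph` and `run_as_node` are hypothetical names, the email replacement is hard-coded, and the rule that a parent keeps only the keys it declares is my reading of how shared state keys behave, stated here as an assumption.

```python
from typing import Callable

State = dict[str, str]


def toy_anonymizer_subgraph(state: State) -> State:
    """Stand-in for a compiled subgraph: writes one shared key and one private key."""
    redacted = state["text"].replace("sreeni@example.com", "<EMAIL_ADDRESS>")
    return {"text": redacted, "anonymized_text": redacted}


def run_as_node(parent_state: State, subgraph: Callable[[State], State]) -> State:
    """Merge a subgraph's output into the parent, keeping only keys the parent has."""
    output = subgraph(dict(parent_state))
    shared = {key: value for key, value in output.items() if key in parent_state}
    return {**parent_state, **shared}


# Two hypothetical parent graphs reuse the same subgraph definition.
support_ticket = run_as_node({"text": "Mail sreeni@example.com"}, toy_anonymizer_subgraph)
chat_log = run_as_node(
    {"text": "User sreeni@example.com joined", "channel": "web"},
    toy_anonymizer_subgraph,
)
print(support_ticket)
print(chat_log)
```

In this model the private `anonymized_text` key never reaches either parent, which is one way to understand why the script above writes the anonymized result into the shared `text` key.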

Thanks
Sreeni Ramadorai
