Formatting LLM Responses: From Unstructured Text to Structured Outputs

Conversational applications often work well with plain-text responses from Large Language Models (LLMs). However, when we need to generate structured output such as JSON, XML, or CSV, things get more interesting. Structured outputs are crucial when integrating LLMs with other systems, for example to extract data from bills or invoices into database records, or to feed structured data to a frontend application.

Why Structured Outputs?

Imagine processing text data like:

"My Name is Sreeni, I live in Dallas, TX, and I hold a degree in MS Computer Science. I have two cars: Toyota and Lexus. I am originally from India."

With plain text, the information is clear but unstructured. For use cases like creating a new database record or sending this data to another application, we need it in a structured format. This is where libraries like Pydantic come into play.
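
For instance, the record we ultimately want from that sentence might look like the JSON below (the field names are only illustrative; the actual schema is defined with Pydantic in the next section):

{
  "name": "Sreeni",
  "address": "Dallas, TX",
  "degree": "MS Computer Science",
  "cars": ["Toyota", "Lexus"],
  "country_from": "India"
}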


Introducing Pydantic for Data Validation

Pydantic is a powerful Python library that helps you define classes with attributes corresponding to the fields you want to extract. By defining a schema, we can seamlessly transform unstructured text into structured data.
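
As a quick illustration of Pydantic on its own, before any LLM is involved (a minimal sketch; the UserDetails class and the values below are made up for this example), validating input against a model rejects data that does not match the declared types:

from pydantic import BaseModel, ValidationError

class UserDetails(BaseModel):
    name: str
    address: str
    cars: list[str]

# Valid input is parsed into a typed model instance
user = UserDetails(name="Sreeni", address="Dallas, TX", cars=["Toyota", "Lexus"])
print(user.model_dump())  # {'name': 'Sreeni', 'address': 'Dallas, TX', 'cars': ['Toyota', 'Lexus']}

# Input that does not match the schema raises a ValidationError explaining what is wrong
try:
    UserDetails(name="Sreeni", address="Dallas, TX", cars="not-a-list")
except ValidationError as e:
    print(e)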

Code Example: Structured Extraction with LangChain and Pydantic

Here’s how you can use LangChain and Pydantic to extract structured data:

from langchain_openai import ChatOpenAI
# Older LangChain releases shipped their own bundled Pydantic v1; if you are on
# one of those versions, use this import instead of the plain pydantic one:
# from langchain_core.pydantic_v1 import BaseModel
from dotenv import load_dotenv
from pydantic import BaseModel

# Define the schema using Pydantic
class GetUserDetails(BaseModel):
    '''Extract user details from the input text.'''
    name: str
    age: int  # age is deliberately missing from the sample text; because it is a required int, the LLM fills in a default value of 0
    address: str
    cars: list[str]
    degree: str
    country_from: str

# Load environment variables
load_dotenv()

# Initialize the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

# Configure the LLM to output structured data
structured_llm = llm.with_structured_output(GetUserDetails)

# Invoke the LLM with unstructured input
response = structured_llm.invoke(
    "My Name is Sreeni, I live in Dallas, TX, and I hold a degree in MS Computer Science. "
    "I have two cars: Toyota and Lexus. I am originally from India."
)

# Print the extracted structured data
print(response.model_dump())  # Structured output as a dictionary
print(response)  # Human-readable response

Expected Output

When the above code runs, the structured output would look something like this:

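The exact wording can vary between runs, but with the sample sentence above the two print statements produce something along these lines (note that age comes back as 0 because it never appears in the text):

{'name': 'Sreeni', 'age': 0, 'address': 'Dallas, TX', 'cars': ['Toyota', 'Lexus'], 'degree': 'MS Computer Science', 'country_from': 'India'}
name='Sreeni' age=0 address='Dallas, TX' cars=['Toyota', 'Lexus'] degree='MS Computer Science' country_from='India'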

Why This Matters

Structured outputs allow seamless integration between LLMs and other systems. For instance:

  • Database Automation: Create or update records using extracted data (see the sketch after this list).
  • Frontend Applications: Deliver data in formats easily consumed by web or mobile apps.
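
Here is a rough sketch of both ideas (the SQLite table name, its columns, and the use of model_dump/model_dump_json are choices made for this illustration, not part of the original example); it assumes response is the GetUserDetails instance returned above:

import json
import sqlite3

# Turn the extracted model into a plain dictionary
record = response.model_dump()

# Database automation: insert the extracted fields into a (hypothetical) users table
conn = sqlite3.connect("users.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS users (name TEXT, age INTEGER, address TEXT, cars TEXT, degree TEXT, country_from TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?, ?)",
    (
        record["name"],
        record["age"],
        record["address"],
        json.dumps(record["cars"]),  # store the list of cars as a JSON string
        record["degree"],
        record["country_from"],
    ),
)
conn.commit()
conn.close()

# Frontend applications: serialize the same data as JSON for a web or mobile client
payload = response.model_dump_json()
print(payload)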

Final Thoughts

Using Pydantic with LangChain simplifies the journey from unstructured text to structured data. This powerful combination ensures accuracy, clarity, and effortless data integration, enabling you to build robust applications powered by AI.

Thanks
Sreeni Ramadorai
