Stephen Collins
How to Use PydanticAI for Structured Outputs with Multimodal LLMs


As multimodal AI models like OpenAI's GPT-4o become increasingly capable, developers can now process images and text together seamlessly. That power, however, introduces new challenges:

  • How can you ensure structured and predictable outputs?
  • How can you manage workflows cleanly with minimal boilerplate?
  • How do you test and validate AI outputs effectively?

The answer lies in PydanticAI. It combines the strength of Pydantic schemas with agent-based workflows, ensuring data validation, clean structure, and reusability. In this tutorial, you'll learn how to:

  1. Use PydanticAI to extract structured data from multimodal inputs.
  2. Build reusable agents and tools for clean, modular AI workflows.
  3. Pass "conversations" between agents to extend workflows dynamically.
  4. Write robust tests with mock services to simulate real-world scenarios.

What We'll Build

We'll develop a multimodal AI workflow with two agents:

  1. Invoice Processing Agent: Extracts structured details like the total amount, sender, and line items from an invoice image.
  2. Summary Agent: Summarizes the extracted details into a few concise sentences.

Along the way, you'll learn to:

  • Structure outputs using Pydantic models.
  • Integrate tools and dependencies cleanly with PydanticAI.
  • Pass data between agents for extended workflows.
  • Test your agents with mock services and edge cases.

By the end, you'll have a robust, reusable, and testable AI workflow.

The full codebase is available on GitHub at example-pydantic-ai-multi-modal.


Step 1: Defining the Structured Outputs

To ensure clean and predictable outputs, we use Pydantic models to define schemas. This guarantees the LLM's responses match our required structure.

Core Output Models

from pydantic import BaseModel, Field

class LineItem(BaseModel):
    """Structured representation of a line item in an invoice."""
    description: str = Field(description="Description of the line item.")
    quantity: int = Field(description="Quantity of the line item.")
    unit_price: float = Field(description="Unit price of the line item.")
    total_price: float = Field(description="Total price for the line item.")

class InvoiceExtractionResult(BaseModel):
    """Structured response for invoice extraction."""
    total_amount: float = Field(description="The total amount extracted from the invoice.")
    sender: str = Field(description="The sender of the invoice.")
    date: str = Field(description="The date of the invoice.")
    line_items: list[LineItem] = Field(description="The list of line items in the invoice.")

This schema validates that the extracted details include a total amount, sender, date, and line items.
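To see that guarantee in action, here's a quick sketch (with made-up values) showing that well-formed data passes validation while a malformed payload raises a ValidationError instead of silently flowing downstream:

from pydantic import ValidationError

# Well-formed data validates, and compatible types are coerced (e.g. "123.45" -> 123.45).
result = InvoiceExtractionResult(
    total_amount="123.45",
    sender="Acme Corp",
    date="2023-10-01",
    line_items=[LineItem(description="Widget", quantity=2, unit_price=10.0, total_price=20.0)],
)
print(result.total_amount)  # 123.45 as a float

# Malformed data fails fast with a descriptive error.
try:
    InvoiceExtractionResult(total_amount="not a number", sender="Acme Corp", date="2023-10-01", line_items=[])
except ValidationError as exc:
    print(exc)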


Step 2: Building the Multimodal LLM Service

To interact with OpenAI's GPT-4o, we create a reusable service that:

  1. Encodes the image as Base64.
  2. Sends a multimodal request (text + image) to GPT-4o.
  3. Returns structured outputs validated against our Pydantic models.

Service Implementation

import os
import base64
from openai import AsyncOpenAI

class MultimodalLLMService:
    """Service to interact with OpenAI multimodal LLMs."""
    def __init__(self, model: str):
        # Use the async client so perform_task doesn't block the event loop.
        self.client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.model = model

    async def perform_task(self, image_path: str, response_model: type, max_tokens: int = 5000):
        """Send an image and prompt to the LLM and return structured output."""
        # Encode the image as Base64 so it can be embedded in the request.
        with open(image_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode("utf-8")

        messages = [
            {"role": "system", "content": "You are an assistant that extracts details from invoices."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract the details from this invoice."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]}
        ]

        # parse() validates the response against the Pydantic model passed as response_format.
        response = await self.client.beta.chat.completions.parse(
            model=self.model,
            messages=messages,
            max_tokens=max_tokens,
            response_format=response_model
        )
        return response.choices[0].message.parsed

This service is reusable and modular, making it easy to integrate into agents.
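If you want to sanity-check the service on its own before wiring it into an agent, a minimal sketch like the following works (assuming OPENAI_API_KEY is set and an invoice image exists at images/invoice_sample.png):

import asyncio

async def try_service():
    service = MultimodalLLMService(model="gpt-4o-mini")
    # Returns an InvoiceExtractionResult already validated against the schema above.
    invoice = await service.perform_task(
        image_path="images/invoice_sample.png",
        response_model=InvoiceExtractionResult,
    )
    print(invoice.total_amount, invoice.sender)

asyncio.run(try_service())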


Step 3: Creating Agents with PydanticAI

PydanticAI agents orchestrate workflows, using tools to interact with the service and validate outputs.

Invoice Processing Agent

from pydantic_ai import Agent, RunContext
from dataclasses import dataclass

@dataclass
class InvoiceProcessingDependencies:
    llm_service: MultimodalLLMService
    invoice_image_path: str

invoice_processing_agent = Agent(
    "openai:gpt-4o-mini",
    deps_type=InvoiceProcessingDependencies,
    result_type=InvoiceExtractionResult,
    system_prompt="Extract the total amount, sender, date, and line items from the given invoice image."
)

@invoice_processing_agent.tool
async def extract_invoice_details(ctx: RunContext[InvoiceProcessingDependencies]) -> InvoiceExtractionResult:
    """Tool to extract invoice details."""
    return await ctx.deps.llm_service.perform_task(
        image_path=ctx.deps.invoice_image_path,
        response_model=InvoiceExtractionResult
    )

Summary Agent

summary_agent = Agent(
    "openai:gpt-4o-mini",
    result_type=str,
    system_prompt="Summarize the extracted invoice details into a few sentences."
)

The summary agent takes previously extracted details and generates a concise summary.
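Because its result_type is str, the summary agent can also be run on its own against plain text. A quick sketch using run_sync (the invoice text here is invented for illustration):

# run_sync is the blocking counterpart of run, convenient for quick experiments.
summary = summary_agent.run_sync(
    "Invoice from Acme Corp dated 2023-10-01, total $123.45, one line item: Widget x2."
)
print(summary.data)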


Step 4: Passing Data Between Agents

PydanticAI allows you to pass "conversations" (message histories) between agents. This makes workflows extensible and modular.

Main Workflow

import asyncio

async def main():
    deps = InvoiceProcessingDependencies(
        llm_service=MultimodalLLMService(model="gpt-4o-mini"),
        invoice_image_path="images/invoice_sample.png"
    )

    # Step 1: Extract invoice details
    result = await invoice_processing_agent.run(
        "Extract the total amount, sender, date, and line items from this invoice.", deps=deps
    )
    print("Structured Result:", result.data)
    print("=" * 100)

    # Step 2: Summarize extracted details
    summary = await summary_agent.run(
        "Summarize the invoice details in a few sentences.", message_history=result.new_messages()
    )
    print("Summary:", summary.data)

if __name__ == "__main__":
    asyncio.run(main())

Step 5: Testing Agents with Mock Services

Testing ensures reliability. We use mock services to simulate API responses and validate outputs.

Mock Service for Successful Extraction

class MockMultimodalLLMService:
    """Mock of MultimodalLLMService that returns a fixed, valid extraction result."""
    async def perform_task(self, image_path: str, response_model: type, max_tokens: int = 100):
        return response_model(
            total_amount=123.45,
            sender="Test Sender",
            date="2023-10-01",
            line_items=[
                LineItem(description="Item 1", quantity=1, unit_price=100.0, total_price=100.0)
            ]
        )

Example Test Case

from pydantic_ai.models.test import TestModel

async def test_invoice_extraction():
    """Test the invoice processing agent with a mock LLM service."""
    deps = InvoiceProcessingDependencies(
        llm_service=MockMultimodalLLMService(),
        invoice_image_path="invoice_sample.png",
    )

    with invoice_processing_agent.override(
        model=TestModel(custom_result_args={
            "total_amount": 123.45,
            "sender": "Test Sender",
            "date": "2023-10-01",
            "line_items": [
                LineItem(description="Item 1", quantity=1, unit_price=100.0, total_price=100.0),
                LineItem(description="Item 2", quantity=2, unit_price=11.725, total_price=23.45)
            ]
        })
    ):
        result = await invoice_processing_agent.run(
            "Extract the total amount, sender, date, and line items from this invoice.",
            deps=deps
        )

    assert isinstance(result.data, InvoiceExtractionResult)
    assert result.data.total_amount == 123.45
    assert result.data.sender == "Test Sender"
    assert result.data.date == "2023-10-01"
    assert len(result.data.line_items) == 2
    assert result.data.line_items[0].description == "Item 1"
    assert result.data.line_items[1].description == "Item 2"
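Beyond the happy path, the Pydantic models make edge cases cheap to cover: a malformed extraction fails schema validation before it ever reaches the summary agent. A minimal sketch (the bad payload is invented for illustration):

import pytest
from pydantic import ValidationError

def test_rejects_malformed_extraction():
    """A payload with a wrong type and missing line_items should fail validation."""
    with pytest.raises(ValidationError):
        InvoiceExtractionResult(
            total_amount="not a number",  # wrong type
            sender="Test Sender",
            date="2023-10-01",
            # line_items omitted entirely
        )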

Conclusion

By combining PydanticAI with OpenAI's multimodal GPT-4o, you can extract and validate structured outputs effortlessly. Key takeaways include:

  1. Pydantic Models: Guarantee predictable outputs.
  2. Agents and Tools: Modularize workflows for clean code.
  3. Passing Conversations: Extend workflows dynamically between agents.
  4. Testing with Mock Services: Ensure reliability and handle edge cases.

You can start implementing structured AI workflows by checking out the full codebase on GitHub:

example-pydantic-ai-multi-modal.
