A few hours ago, Mistral AI launched Mistral OCR, an Optical Character Recognition (OCR) API backed by a multimodal model that can transform any PDF into a text file, making it easier to ingest into AI models.
Because the model is multimodal, it can extract both text and images from PDFs and output them in Markdown format. An API key is currently available for free at https://console.mistral.ai/api-keys.
To demonstrate, I thought of a simple idea: querying the balance sheet of a company listed on the B3 (the Brazilian stock exchange). I chose Totvs:
Let's move on to the code:
The first step is to check the Mistral OCR API documentation to see what should be passed as the payload in the POST request.
# In a notebook, install the dependency first: !pip install environs
import requests
from environs import Env

# Load the API key from a .env file (expects a line like API_KEY=<your key>)
env = Env()
env.read_env()
API_KEY = env("API_KEY")

URL = "https://api.mistral.ai/v1/ocr"
DOCUMENT_URL = "https://api.mziq.com/mzfilemanager/v2/d/d3be5d49-62e7-4def-a3e1-ab25ff09f153/47e88f32-4521-452c-0fbc-c378770b451c?origin=1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "model": "mistral-ocr-2503",
    "id": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": DOCUMENT_URL,
        "document_name": "balanco_patrimonial_totvs.pdf",
    },
}

response = requests.post(URL, json=payload, headers=headers)

if response.status_code == 200:
    result = response.json()
    print("OCR:")
    print(result)
    # Save page 29 (index 28), where the balance sheet appears, as Markdown
    with open("markdown_file.md", "w", encoding="utf-8") as f:
        f.write(result["pages"][28]["markdown"])
else:
    print(f"Error {response.status_code}: {response.text}")
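The snippet above keeps only page 29 (index 28), where the balance sheet appears in this PDF. If the page number isn't known in advance, a small variation (a sketch, assuming the same response structure: a "pages" list whose items carry "index" and "markdown" fields) can dump every page into a single file:

# Sketch: concatenate the Markdown of every OCR'd page into one file.
# Assumes `result` has the structure returned above, with a "pages" list
# whose items carry "index" and "markdown" fields.
with open("balanco_completo.md", "w", encoding="utf-8") as f:
    for page in result.get("pages", []):
        f.write(f"\n\n<!-- page {page.get('index')} -->\n\n")
        f.write(page.get("markdown", ""))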
Here is the result:
The goal of obtaining the information in Markdown aligns with the growth of GenAI, since this format has been widely used to train LLMs. Furthermore, Markdown provides flexibility for generating reports that can be viewed directly or converted into other forms of documentation.
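As a quick illustration of that flexibility, here is a minimal sketch that converts the extracted Markdown into an HTML report with the Python-Markdown package (it assumes the markdown_file.md written earlier and requires pip install markdown):

# Sketch: turn the OCR'd Markdown into a standalone HTML report.
import markdown

with open("markdown_file.md", "r", encoding="utf-8") as f:
    md_text = f.read()

# The "tables" extension renders the balance sheet tables as HTML tables
html = markdown.markdown(md_text, extensions=["tables"])

with open("report.html", "w", encoding="utf-8") as f:
    f.write(html)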
Companies and developers will likely combine Mistral OCR with a RAG system (Retrieval-Augmented Generation) to feed multimodal documents into an LLM, as in the rough sketch below. There are many potential use cases, don't you agree? Leave your comment.
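The sketch below is only an illustration of the idea, not an official recipe: chunk the OCR'd Markdown, embed the chunks, and retrieve the ones most similar to a question before passing them to an LLM. The https://api.mistral.ai/v1/embeddings endpoint and the mistral-embed model name are assumptions based on Mistral's embeddings API, so check the current documentation. It reuses requests, headers and result from the code above:

# Rough RAG sketch: embed OCR'd Markdown chunks and retrieve the most relevant
# ones for a question. Reuses `requests`, `headers` and `result` from above.
# The embeddings endpoint and "mistral-embed" model name are assumptions here.
import numpy as np

EMB_URL = "https://api.mistral.ai/v1/embeddings"

def embed(texts):
    resp = requests.post(
        EMB_URL,
        headers=headers,
        json={"model": "mistral-embed", "input": texts},
    )
    resp.raise_for_status()
    return np.array([item["embedding"] for item in resp.json()["data"]])

# Split each OCR'd page roughly by paragraph
chunks = []
for page in result.get("pages", []):
    chunks.extend(p for p in page.get("markdown", "").split("\n\n") if p.strip())

chunk_vectors = embed(chunks)

question = "Qual o total do ativo da Totvs?"
q_vec = embed([question])[0]

# Cosine similarity between the question and every chunk
scores = chunk_vectors @ q_vec / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

# The top chunks would then go into the prompt of a chat model to answer the question
print(top_chunks)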
If this article helped you or you enjoyed it, consider contributing: