Four Data Preprocessing Techniques for Invoice OCR Using Generative AI
By leveraging generative AI, it has become easier to implement OCR that can handle flexible formats. However, to ensure accuracy, data preprocessing is crucial. In this article, we will introduce preprocessing techniques for invoice OCR using generative AI.
Benefits of OCR Using Generative AI
Traditional OCR primarily relies on predefined algorithms, rule-based approaches, or machine learning models to recognize character patterns when extracting text from images. However, it commonly struggles with challenges such as “unexpected layouts,” “special formats,” and “noise.”
In contrast, OCR powered by generative AI interprets the image as a whole, which helps it extract text more effectively. This lets it handle “complex layouts,” “handwritten text,” and “documents with mixed tables and diagrams,” areas where traditional OCR often falls short. Generative AI-based OCR is also more flexible and can correct errors and interpret content, which can improve overall accuracy.
Improving the Accuracy of OCR Using Generative AI
While OCR powered by generative AI can handle diverse formats, it may sometimes be less stable in terms of accuracy compared to traditional OCR.
This is where data preprocessing becomes crucial.
In this article, we will introduce data preprocessing techniques that use Python libraries and existing OCR tools to enhance OCR performance.
Trimming Images Using the Pillow Library in Python
When passing images to generative AI, removing as much irrelevant information as possible can significantly improve accuracy.
For structured documents like invoices, where the format is somewhat standardized, you can incorporate an automatic image trimming process using Python’s Pillow library. This helps eliminate unnecessary margins or background noise, ensuring that the OCR focuses only on the essential text areas.
https://pypi.org/project/pillow/
Furthermore, instead of directly applying OCR to the trimmed image, converting the trimmed image into Markdown using generative AI has shown improvements in accuracy.
from PIL import Image
# Open the image.
image = Image.open('invoice_sample.jpg')
# Trimming area.
crop_area = (100,2400,1962,3100)
# Execute trimming.
cropped_image = image.crop(crop_area)
# Save the trimmed image.
cropped_image.save('invoice_sample_cropped.jpg')
# Display the trimmed image.
cropped_image.show()
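The crop box above is hard-coded for one layout. Where margins vary between scans, one possible way to automate trimming is to compare the image against a plain background and crop to the bounding box of whatever differs. The following is a minimal sketch, assuming a near-white background; `trim_margins`, `background`, and `tolerance` are illustrative names and values, not part of the original workflow.

```python
from PIL import Image, ImageChops


def trim_margins(image, background=(255, 255, 255), tolerance=10):
    """Crop away borders that match the background color (assumed near-white)."""
    rgb = image.convert("RGB")
    # A solid image of the assumed background color, same size as the input.
    solid = Image.new("RGB", rgb.size, background)
    # Per-pixel absolute difference between the page and the background.
    diff = ImageChops.difference(rgb, solid).convert("L")
    # Ignore small differences (for example JPEG noise) below the tolerance.
    mask = diff.point(lambda value: 255 if value > tolerance else 0)
    # getbbox() returns the box around non-zero pixels, or None if there are none.
    bbox = mask.getbbox()
    return rgb.crop(bbox) if bbox else rgb


# Usage (file name taken from the example above):
# trimmed = trim_margins(Image.open('invoice_sample.jpg'))
```

This only removes uniform outer margins. Selecting a specific region such as the usage table still needs a fixed crop box or some other layout knowledge.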
Converting the trimmed image into Markdown with generative AI produces the table below. The numeric values are converted correctly.
| Description | Units | Rate (USD) | Amount (USD) |
|--------------------------|-------------|------------|--------------|
| Basic Service Charge | 1 month | 850 | 850 |
| Electricity Usage | 320 kWh | 25/kWh | 8,000 |
| Fuel Adjustment Charge | 320 kWh | 1.5/kWh | 480 |
| Renewable Energy Fee | 320 kWh | 2.2/kWh | 704 |
| Environmental Fee | Flat rate | 150 | 150 |
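Because a Markdown table is plain text, its numbers can also be checked mechanically. As a rough sketch, the snippet below parses a table like the one above and verifies that units multiplied by rate equals the amount on each row. The helper names (`leading_number`, `find_mismatched_rows`) and the rule that a row without a numeric unit counts as quantity 1 are assumptions made for this illustration.

```python
import re
from decimal import Decimal

MARKDOWN_TABLE = """\
| Description | Units | Rate (USD) | Amount (USD) |
|---|---|---|---|
| Basic Service Charge | 1 month | 850 | 850 |
| Electricity Usage | 320 kWh | 25/kWh | 8,000 |
| Fuel Adjustment Charge | 320 kWh | 1.5/kWh | 480 |
| Renewable Energy Fee | 320 kWh | 2.2/kWh | 704 |
| Environmental Fee | Flat rate | 150 | 150 |
"""


def leading_number(cell, default=None):
    """Return the first number in a cell as a Decimal, or default if none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", cell)
    if match is None:
        return default
    return Decimal(match.group().replace(",", ""))


def find_mismatched_rows(markdown):
    """List descriptions of rows where units x rate does not equal amount."""
    lines = [line for line in markdown.strip().splitlines() if line.strip()]
    mismatches = []
    for line in lines[2:]:  # skip the header row and the separator row
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        description, units, rate, amount = cells
        # Assumption: a unit such as "Flat rate" means a quantity of 1.
        quantity = leading_number(units, default=Decimal(1))
        if quantity * leading_number(rate) != leading_number(amount):
            mismatches.append(description)
    return mismatches


print(find_mismatched_rows(MARKDOWN_TABLE))
```

An empty list means every row is arithmetically consistent. A non-empty list points at rows worth re-checking against the image.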
Adjusting Image Contrast Using OpenCV in Python
By using the OpenCV library to adjust image contrast, it is possible to enhance readability and preprocess images for better OCR accuracy.
Manually correcting each image would be a tedious task, so an automated optimization approach can be used to adjust the image’s average brightness or median value to a target level. For example, a suitable gamma value can be determined based on the principle of bringing the average brightness of the image closer to 128 (the midpoint of the 0–255 range).
By adjusting the settings according to the characteristics of the images being processed, the accuracy of text recognition can be further improved.
import cv2
import numpy as np
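# Load the image (cv2.imread returns None if the file cannot be read).
image = cv2.imread('invoice_sample_dark.png')
if image is None:
    raise FileNotFoundError('invoice_sample_dark.png')
# Brightness offset added to every pixel value (fixed for this sample).
brightness = 70
# Adjust the brightness: alpha=1.0 leaves contrast unchanged, beta shifts brightness.
bright_image = cv2.convertScaleAbs(image, alpha=1.0, beta=brightness)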
# Save the image.
cv2.imwrite('invoice_sample_bright.png', bright_image)
# Display the original and adjusted image.
cv2.imshow('Original Image', image)
cv2.imshow('Brightness Adjusted Image', bright_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
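The sample above applies a fixed brightness offset. The automated idea described earlier, choosing a gamma value that moves the average brightness toward 128, could be sketched roughly as follows. This is one possible formulation, not a definitive method: the function names are hypothetical, and the gamma is derived by solving (mean/255)^gamma = target/255, so the resulting average lands only approximately on the target because gamma correction is non-linear.

```python
import numpy as np


def gamma_for_target_mean(image, target=128.0):
    """Gamma that maps the current mean brightness onto the target value."""
    # Clip the mean to avoid log(0) or division by zero for extreme images.
    mean = float(np.clip(image.mean(), 1.0, 254.0))
    return float(np.log(target / 255.0) / np.log(mean / 255.0))


def apply_gamma(image, gamma):
    """Apply gamma correction to a uint8 image through a 256-entry lookup table."""
    table = (np.linspace(0.0, 1.0, 256) ** gamma * 255.0).round().astype(np.uint8)
    # Indexing the table with the image remaps every pixel;
    # cv2.LUT(image, table) performs the same remapping.
    return table[image]


# Usage with the image loaded by cv2.imread in the sample above:
# gamma = gamma_for_target_mean(image)
# corrected = apply_gamma(image, gamma)
```

A gamma below 1 brightens a dark image and a gamma above 1 darkens a bright one, so the same code handles both cases.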
Combining Open-Source OCR with Python’s pytesseract Library
pytesseract is an open-source Python wrapper for the Tesseract OCR engine that extracts text from images. With it, image-based text can be converted into machine-readable text.
Passing this extracted text along with the image to generative AI can help reduce numerical errors, such as incorrect decimal places.
Although recognizing handwritten text can be challenging, pytesseract is free to use, making it an easy way to experiment with improving OCR accuracy.
import pytesseract
from PIL import Image
# For Windows only, specify the Tesseract installation path (adjust the installation path accordingly).
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Load the image.
image = Image.open('invoice_sample.jpg')
# Execute OCR to extract text.
text = pytesseract.image_to_string(image, lang='eng')
# Display the extracted text.
print(text)
# Save the text to a file (optional).
with open('invoice_sample_text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
OCR result text file
—J+- AZU—yyayk 2025-03-07 13.42.53
Sample Invoice - Power Company
ABC Power Company
123 Energy Street, Tokyo, Japan
Phone: 03-1234-5678
nail sbcpower.co.jp
456 Residential Ave, Shibuya, Tokyo 150-0002
Account Number: 000123456789
Invoice Number: INV-20250307-00123
Billing Date: March 7, 2025 1 i i
Billing Period: February 1, 2025 — February 28, 2025
Usage Summary:
Description Units Rate (USD) Amount (USD) | \ iil |
Basic Service Charge 1 month 850 850. AAT |
Electricity Usage 320 kWh 25/kWh 8,000
Fuel Adjustment Charge 320 kWh 1.5/kKWh 480
Renewable Energy Fee 320 kWh 2.2/kKWh 704
Environmental Fee Flat rate 150 |) eo
Subtotal: $10,184
Consumption Tax (10%): $1,018
Total Amount Due: $11,202
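To pass this text to generative AI together with the image, the transcript can be embedded in the prompt. The sketch below only builds the prompt string; the prompt wording and the function name `build_extraction_prompt` are illustrative assumptions, and the call to an actual model is omitted because it depends on the service being used.

```python
def build_extraction_prompt(ocr_text):
    """Build a prompt that pairs an invoice image with its rough OCR transcript."""
    return (
        "You are given an invoice image and a rough OCR transcript of it.\n"
        "The transcript may contain misread characters. Use it as a second\n"
        "source to cross-check amounts and decimal places in the image.\n"
        "Return the line items as a Markdown table.\n\n"
        "OCR transcript:\n"
        "<<<\n"
        f"{ocr_text.strip()}\n"
        ">>>"
    )


# Usage with the text extracted by pytesseract above:
# prompt = build_extraction_prompt(text)
# Send `prompt` together with the image to the model of your choice.
```

Delimiting the transcript keeps the noisy OCR lines clearly separated from the instructions.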
Combining OCR Services (Azure Vision AI)
If pytesseract does not successfully extract text, you can also leverage existing OCR services.
For example, Azure Vision AI’s OCR service can accurately extract text from images. The extracted text is returned in JSON format via API. However, since this method only retrieves text data, table structures and other visual information from the image may be lost.
By having generative AI supplement the missing image information and combining it with the accurately extracted text, it becomes possible to extract specific information with higher accuracy.
Here is the result using Azure Vision Studio’s “Extract text from images” feature.
SampleInvoice-PowerCompany
ABCPowerCompany
123EnergyStreet,Tokyo,Japan
Phone:03-1234-5678
Email:billing@abcpower.co.jp
Invoiceto:
JohnDoe
456ResidentialAve,Shibuya,Tokyo150-0002
AccountNumber:000123456789
InvoiceNumber:INV-20250307-00123
BillingDate:March7,2025
BillingPeriod:February1,2025-February28,2025
UsageSummary:
Description
Unit
Rate(USD)
Amount(USD)
BasicServiceCharge
1month
850
850
ElectricityUsage
320kWh
25/kWh
8,000
FuelAdjustmentCharge
320kWh
1.5/kWh
480
+
RenewableEnergyFee
320kWh
2.2/kWh
704
EnvironmentalFee
Flatrate
150
150
Subtotal:$10,184
ConsumptionTax(10%):$1,018
TotalAmountDue:$11,202
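Accurately extracted text like this can also serve as a sanity check on the final figures. As a hedged sketch, the snippet below pulls the subtotal, tax, and total out of the text and confirms that they agree with each other. The helper names, the 10% rate passed as a parameter, and the tolerance of less than one currency unit for tax rounding are assumptions for this example.

```python
import re
from decimal import Decimal

EXTRACTED_TEXT = """\
Subtotal:$10,184
ConsumptionTax(10%):$1,018
TotalAmountDue:$11,202
"""


def money_after(label, text):
    """Return the first dollar amount that follows the label on the same line."""
    match = re.search(re.escape(label) + r".*?\$([\d,]+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"{label} not found")
    return Decimal(match.group(1).replace(",", ""))


def totals_consistent(text, tax_rate=Decimal("0.10")):
    """Check subtotal + tax == total and that tax matches the rate after rounding."""
    subtotal = money_after("Subtotal", text)
    tax = money_after("ConsumptionTax", text)
    total = money_after("TotalAmountDue", text)
    tax_matches = abs(subtotal * tax_rate - tax) < 1
    return subtotal + tax == total and tax_matches


print(totals_consistent(EXTRACTED_TEXT))
```

If the check fails on values returned by generative AI, that is a signal to re-run the extraction or flag the invoice for review.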
Summary: Easily Integrate OCR into Workflows with Generative AI
As we have seen, data preprocessing makes it easier to achieve accurate OCR with generative AI.
Python also offers various powerful libraries. By leveraging the strengths of Python and cloud services, we can effectively utilize generative AI in real-world applications.