Building a Lightweight OCR-Powered Receipt Parser

#webdev #beginners #python #ai

Extracting structured data such as dates, items, and prices from scanned receipts is tricky because fonts, layouts, and image quality vary so much. ReceiptLogger is a lightweight, local-first application that uses PaddleOCR to read scanned receipts. A Tkinter GUI drives the workflow: the app extracts text from each image, parses the relevant details, and logs them as structured rows in a Google Sheet. The goal is to make receipt digitization quick and practical, even on low-resource machines.

Requirements

ReceiptLogger runs on macOS (tested on an M1 Mac) with Python 3.12.9. It uses PaddleOCR version 2.9.1 for text extraction and Tcl/Tk version 8 for the GUI. Make sure these dependencies are installed before running the app.
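
A quick way to confirm the environment before launching is to print the versions the interpreter can actually see. The sketch below is only one way to check and is not part of ReceiptLogger; report_versions is a hypothetical helper. It reads the installed package metadata rather than importing PaddleOCR, so it never loads the OCR models.

```python
import sys
from importlib import metadata


def report_versions():
    """Collect the versions of Python, Tk, and PaddleOCR visible to this interpreter."""
    versions = {'python': '.'.join(str(part) for part in sys.version_info[:3])}

    # tkinter ships with most Python builds but can be missing on minimal installs.
    try:
        import tkinter
        versions['tk'] = str(tkinter.TkVersion)
    except ImportError:
        versions['tk'] = None

    # Read the distribution metadata instead of importing paddleocr itself.
    try:
        versions['paddleocr'] = metadata.version('paddleocr')
    except metadata.PackageNotFoundError:
        versions['paddleocr'] = None

    return versions


if __name__ == '__main__':
    for name, version in report_versions().items():
        print(f'{name}: {version or "not installed"}')
```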

App Structure

The app has two main components: the main script, which contains the Tkinter class for the GUI (main.py), and a helper script that extracts the data from the OCR response (process_data.py).

main.py

main.py handles the core workflow of the app, from receiving images to extracting and storing data. It initializes a folder on the Desktop to hold uploaded receipt images, processes each image with PaddleOCR to extract text and relevant details, and finally uploads the structured data to Google Sheets. Below are the two main parts of the class: the initialization of key variables and the creation of the UI components.

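import os

from paddleocr import PaddleOCR


class ReceiptLogger:
    def __init__(self, root):
        self.initialize_variables(root)
        self.create_ui()

    def initialize_variables(self, root):
        # English OCR model with the text-angle classifier enabled
        self.ocr = PaddleOCR(use_angle_cls=True, lang='en')
        self.root = root
        self.root.title('ReceiptLogger')
        # receipt images are read from this folder on the user's Desktop
        self.receipts_folder = os.path.join(os.path.expanduser('~'), 'Desktop', '🧾 RECEIPTS HERE')
        self.image_refs = []
        self.receipt_data_refs = []  # parsed receipts, read later by upload_data

    def create_ui(self):
        pass  # widget setup omitted in this excerpt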
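
The excerpt above builds the path to the receipts folder but does not show the folder being created. A minimal sketch of that step, assuming it is done with os.makedirs at startup; ensure_receipts_folder and its base_dir parameter are hypothetical names, not taken from the project.

```python
import os


def ensure_receipts_folder(base_dir=None, name='🧾 RECEIPTS HERE'):
    """Create the receipts folder if it is missing and return its path."""
    if base_dir is None:
        # Default to the current user's Desktop, as in initialize_variables.
        base_dir = os.path.join(os.path.expanduser('~'), 'Desktop')
    folder = os.path.join(base_dir, name)
    # exist_ok=True makes repeated launches safe: no error if it already exists.
    os.makedirs(folder, exist_ok=True)
    return folder
```

Calling it on every launch is harmless, and the returned path could be stored in self.receipts_folder.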

PaddleOCR is an open-source Optical Character Recognition (OCR) tool built on PaddlePaddle, a deep learning framework. It is designed for extracting text from images and supports multiple languages and text orientations. In ReceiptLogger, PaddleOCR reads the text from each receipt image, and process_data.py then turns that text into structured data such as store names, dates, and item details.

    def extract_receipts(self):
        for widget in self.scroll_frame.winfo_children():
            widget.destroy()

        self.image_refs.clear()

        if not os.path.exists(self.receipts_folder):
            self.status_label.configure(text='❌ Receipts folder not found')
            return

        image_files = [os.path.join(self.receipts_folder, f) for f in os.listdir(self.receipts_folder) if f.lower().endswith('.png')]
        self.status_label.configure(text=f'✅ Found {len(image_files)} receipts')

        for img_path in image_files:
            try:
                ocr_output = self.ocr.ocr(img_path)  # detect and recognize the text in the image
                receipt_data = process(ocr_output)  # parse the OCR output into structured receipt data
            except Exception as e:
                self.status_label.configure(text=f'❌ Error processing receipts: {str(e)}')
                return
            self.display(img_path, receipt_data)
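
process_data.py indexes the OCR result as response[0][0][1][0], which implies one list per image whose entries look like [box, (text, confidence)]. The sketch below flattens a result of that shape into plain text lines. The sample data is invented for illustration, and ocr_lines with its min_confidence threshold is a hypothetical helper rather than part of ReceiptLogger.

```python
def ocr_lines(response, min_confidence=0.0):
    """Return the recognized text lines from the first image of an OCR result."""
    # An empty result, or a first image with no detections, yields no lines.
    if not response or not response[0]:
        return []
    lines = []
    for _box, (text, confidence) in response[0]:
        if confidence >= min_confidence:
            lines.append(text)
    return lines


# Invented sample in the shape described above: one image, two detections.
sample = [[
    [[[10, 10], [120, 10], [120, 30], [10, 30]], ('HomeGoods', 0.98)],
    [[[10, 40], [90, 40], [90, 60], [10, 60]], ('02/14/25', 0.61)],
]]

print(ocr_lines(sample))                      # ['HomeGoods', '02/14/25']
print(ocr_lines(sample, min_confidence=0.9))  # ['HomeGoods']
```

Filtering on confidence is a design choice, not something the app is known to do; it trades missed lines for fewer misreads on blurry scans.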

The app uploads extracted receipt data to Google Sheets using the Google Sheets API. It authenticates with a service account, formats the extracted data into rows, and appends them to a specified worksheet. This allows easy access and organization of receipt records in a structured format.

    def upload_data(self):
        if not self.receipt_data_refs:
            self.status_label.configure(text='⚠️ Extract receipts before uploading')
            return

        self.status_label.configure(text='📤 Uploading to Google Sheets...')

        # authenticate and prepare data to append
        credentials = os.getenv('GOOGLE_KEY')
        sheet_id = os.getenv('SPREADSHEET_ID')
        worksheet_name = os.getenv('WORKSHEET_NAME')
        if not credentials:
            self.status_label.configure(text='❌ Google Service Key not found in .env')
            return
        try:
            creds = Credentials.from_service_account_file(credentials, scopes=['https://www.googleapis.com/auth/spreadsheets'])
            sheet = gspread.authorize(creds).open_by_key(sheet_id)
            worksheet = sheet.worksheet(worksheet_name)

            print(f'✅ Successfully connected to Google Sheet: {sheet.title}')
            self.status_label.configure(text=f'✅ Connected to Google Sheets "{sheet.title}"')

            rows_to_append = []
            for receipt in self.receipt_data_refs:
                store = receipt['store']
                date = datetime.strptime(receipt['date'], '%m/%d/%Y' if len(receipt['date']) == 10 else '%m/%d/%y').strftime('%m/%d/%y')
                tax_rate = receipt['tax_rate']
                item_map = {}

                for item in receipt['items']:
                    sku = item['sku']
                    if sku not in item_map:
                        item_map[sku] = {
                            'name': item['name'],
                            'quantity': 1,
                            'price': item['price'], 
                            'taxed': item['taxed']
                        }
                    else:
                        item_map[sku]['quantity'] += 1

                rows_to_append.extend([
                    [
                        store,
                        date,
                        sku,
                        data['quantity'],
                        data['name'],
                        data['price'],
                        tax_rate,
                        data['taxed']
                    ]
                    for sku, data in item_map.items()
                ])

            # append to google sheets
            if rows_to_append:
                next_empty_row = len(worksheet.get_all_values()) + 1
                worksheet.insert_rows(rows_to_append, row=next_empty_row, value_input_option='USER_ENTERED')

                print(f'✅ Successfully added {len(rows_to_append)} rows to Google Sheets')
                self.status_label.configure(text=f'✅ Uploaded {len(rows_to_append)} rows to Google Sheets')
            else:
                print('⚠️ No data to upload')
                self.status_label.configure(text='⚠️ No data to upload')

        except Exception as e:
            # this block wraps authentication, sheet lookup, and the upload itself
            print(f'❌ Google Sheets error: {e}')
            self.status_label.configure(text=f'❌ Google Sheets upload failed: {e}')
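
The row-building step inside upload_data can also be read on its own: normalize the date, count repeated SKUs as a quantity, and emit one row per unique SKU. The sketch below pulls that logic into a standalone function so it can be tried without Google credentials. build_rows is a hypothetical name, and the example receipt (including the value types for price and tax_rate) is invented; only the dictionary keys come from the code above.

```python
from datetime import datetime


def build_rows(receipt):
    """Turn one parsed receipt into spreadsheet rows, one row per unique SKU."""
    raw_date = receipt['date']
    # Accept MM/DD/YYYY (10 characters) or MM/DD/YY and normalize to MM/DD/YY.
    fmt = '%m/%d/%Y' if len(raw_date) == 10 else '%m/%d/%y'
    date = datetime.strptime(raw_date, fmt).strftime('%m/%d/%y')

    grouped = {}
    for item in receipt['items']:
        # The first occurrence of a SKU seeds the entry; every occurrence adds one.
        entry = grouped.setdefault(item['sku'], {**item, 'quantity': 0})
        entry['quantity'] += 1

    return [
        [receipt['store'], date, sku, entry['quantity'], entry['name'],
         entry['price'], receipt['tax_rate'], entry['taxed']]
        for sku, entry in grouped.items()
    ]


# Invented receipt for illustration; the keys match those read by upload_data.
example = {
    'store': 'Ross',
    'date': '02/14/2025',
    'tax_rate': 0.0825,
    'items': [
        {'sku': '4001', 'name': 'MUG', 'price': 3.99, 'taxed': True},
        {'sku': '4001', 'name': 'MUG', 'price': 3.99, 'taxed': True},
        {'sku': '7310', 'name': 'TOWEL', 'price': 7.99, 'taxed': True},
    ],
}

for row in build_rows(example):
    print(row)
```

Here the two MUG lines collapse into a single row with a quantity of 2, and the four-digit year is shortened to 02/14/25.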

process_data.py

process_data.py extracts and processes structured receipt data from the OCR output. It identifies the store, extracts relevant details, and calculates tax for each item before returning the final structured data.

import re


def extract_data(response=None):
    # an empty result, or a first image with no detections, means nothing to parse
    if not response or not response[0]:
        print('🚨 No data to extract')
        return {}

    stores = {
        'homegoods': 'HomeGoods',
        'marshalls': 'Marshalls',
        'marshalls homegoods': 'Marshalls-HomeGoods',
        'ross': 'Ross',
        't.j.maxx': 'T.J.Maxx',
        'tjmaxx': 'T.J.Maxx'
    }
    store = ''

    # fast path: the first recognized line is exactly a known store name
    first_line = response[0][0][1][0].lower()
    if first_line in stores:
        store = stores[first_line]
    else:
        # otherwise scan every line for a store name as a whole word
        for item in response[0]:
            possible_text = item[1][0].lower()
            for key in stores:
                if re.search(rf'\b{re.escape(key)}\b', possible_text):
                    store = stores[key]
                    break
            if store:
                break  # stop at the first line that names a store

    if store in ['HomeGoods', 'Marshalls', 'Marshalls-HomeGoods', 'T.J.Maxx']:
        receipt_data = parse_tjx_receipt(response[0])
    elif store == 'Ross':
        receipt_data = parse_ross_receipt(response[0])
    else:
        print('🚨 Not included in the list of stores')
        return {}

    receipt_data['store'] = store
    return receipt_data
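
Because the fallback search walks the stores keys in insertion order, a line such as "Welcome to Marshalls HomeGoods" matches 'homegoods' before 'marshalls homegoods' is ever tried. One possible refinement, offered as a sketch and not as the project's method, is to try longer keys first. detect_store is a hypothetical name and the sample lines are invented.

```python
import re

STORES = {
    'homegoods': 'HomeGoods',
    'marshalls': 'Marshalls',
    'marshalls homegoods': 'Marshalls-HomeGoods',
    'ross': 'Ross',
    't.j.maxx': 'T.J.Maxx',
    'tjmaxx': 'T.J.Maxx',
}

# Longest keys first, so 'marshalls homegoods' is tried before its substrings.
_KEYS_BY_LENGTH = sorted(STORES, key=len, reverse=True)


def detect_store(text_lines):
    """Return the display name of the first known store found, or '' if none."""
    for line in text_lines:
        lowered = line.lower()
        for key in _KEYS_BY_LENGTH:
            # \b keeps 'ross' from matching inside words such as 'across'.
            if re.search(rf'\b{re.escape(key)}\b', lowered):
                return STORES[key]
    return ''


print(detect_store(['Welcome to MARSHALLS HOMEGOODS', '02/14/25']))  # Marshalls-HomeGoods
print(detect_store(['Thank you for shopping at Ross']))              # Ross
```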

The helper functions parse_tjx_receipt and parse_ross_receipt handle the store-specific receipt formats. Each one is tailored to its store's layout and extracts the date, items, and prices from it, which keeps parsing accurate for the supported stores.
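
The store-specific parsers are not shown in this post. As a rough illustration of the kind of pattern matching they might rely on, the sketch below pulls the first date out of a list of OCR text lines. Everything in it is an assumption: find_date is a hypothetical helper, the sample lines are invented, and the MM/DD/YY or MM/DD/YYYY pattern is inferred from the two formats that upload_data accepts.

```python
import re

# Two-digit month and day, then a two- or four-digit year, not glued to other digits.
DATE_PATTERN = re.compile(r'(?<!\d)(\d{2}/\d{2}/(?:\d{4}|\d{2}))(?!\d)')


def find_date(text_lines):
    """Return the first MM/DD/YY or MM/DD/YYYY string found, or None."""
    for line in text_lines:
        match = DATE_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


print(find_date(['ROSS DRESS FOR LESS', '02/14/2025 03:12 PM']))  # 02/14/2025
print(find_date(['TOTAL 12.34', 'DATE 02/14/25']))                # 02/14/25
print(find_date(['no date here']))                                # None
```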

Running the App via Automator

To package ReceiptLogger into a macOS app using Automator, create a new Application in Automator and add a Run Shell Script action. Inside the script, add the following three lines:

cd /path/to/ReceiptLogger
source venv/bin/activate
python -m app.main

Save the Automator workflow as an application, then place it in your Applications folder or on your desktop. Clicking it will launch ReceiptLogger without needing to open a terminal.

Resources

Tkinter Documentation – Official Python Tkinter library documentation.
PaddleOCR Quick Start Guide – Getting started with PaddleOCR for text extraction.
The full source code is available on GitHub: ReceiptLogger.
