Lönard

Transforming Unstructured Data into a Clean Database using OpenAI API

Introduction

Working with legacy databases can be a nightmare. Over the years, data inconsistencies, unvalidated entries, and incorrect formatting creep into the system, making it chaotic and difficult to manage. When migrating such data into a new structured database, a manual cleanup can be very time-consuming depending on the size of the database. Fortunately, we can use AI advancements to automate the transformation process efficiently.

In this article, I’ll walk you through my experience transforming unstructured product data from an outdated database into a well-structured format using OpenAI's API. I'll cover the entire process, from data preparation and API communication to safe execution and error handling, ensuring a smooth and efficient transformation.

Challenges in the Legacy Database

The old database contained multiple issues:

  • Incorrect column usage – Data stored in irrelevant fields
  • Lack of validation – Inconsistent formats, e.g., in product codes and descriptions
  • Redundant fields – Unnecessary columns accumulated over time
  • Unstructured text – Values mixed into the wrong columns, typos
  • Multiple sources – Data spread across various CSV files

Steps to Clean and Transform the Data

1. Defining a Structured Data Model

Before processing the data, I defined a structured schema using Python’s Pydantic library to ensure consistency in the transformed data. This structure only contained columns that I wanted the AI to generate. Each class represented one row in a table, with attributes and a log for transformation tracking.

from pydantic import BaseModel
from typing import List, Optional

class TransformLog(BaseModel):
    needs_review: bool
    lost_data: List[str]
    modified_data: List[str]
    other_data_modifications: List[str]
    comment: str

class Product(BaseModel):
    name: Optional[str]
    description: Optional[str]
    location: List[int]
    category: Optional[str]

Each table entry is then wrapped in a scheme class together with its transformation log, and another class represents the list of products returned for a chunk.

class ProductScheme(BaseModel):
    product: Product
    log: TransformLog

class ProductList(BaseModel):
    items: List[ProductScheme]

This model ensured that any inconsistencies in the OpenAI API response could be validated and logged.
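
To illustrate what this buys us, here is a minimal sketch with a made-up, deliberately malformed payload: Pydantic rejects a response in which the model put a shelf label into location instead of a list of integers, so the problem surfaces immediately instead of slipping into the database.

from pydantic import ValidationError

# Made-up, malformed payload: "location" should be a list of integers
raw_response = """
{
  "items": [
    {
      "product": {"name": "Bolt M8", "description": null, "location": "A3", "category": "Hardware"},
      "log": {"needs_review": false, "lost_data": [], "modified_data": [], "other_data_modifications": [], "comment": ""}
    }
  ]
}
"""

try:
    ProductList.model_validate_json(raw_response)
except ValidationError as e:
    print(e)  # flags items.0.product.location as not a valid list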

2. Preprocessing CSV Data

The next step was loading the old database, exported as a CSV file, with pandas.

import pandas as pd

def get_preprocessed_data(filename="data.csv"):
    return pd.read_csv(filename, delimiter=';')

3. Chunking Data for API Requests

To avoid repeating the system prompt for every single row and driving up token usage, the data was split into smaller chunks before being sent. In my case, 15 rows per batch turned out to be optimal.

CHUNK_SIZE = 15

def chunk_dataframe(df, chunk_size=CHUNK_SIZE):
    for i in range(0, len(df), chunk_size):
        yield df.iloc[i:i + chunk_size]

4. Preparing Data for the API

Since AI is not 100% accurate and API calls cost money, data that was already structured (e.g., IDs or boolean fields) was removed before being sent to the API and added back later on.

def remove_columns_for_api(chunk):
    columns_to_remove = ["Product_ID", "is_active"]
    return chunk.drop(columns=columns_to_remove)

5. Sending Data to OpenAI API

Each prepared chunk was sent to OpenAI’s API, with a structured prompt instructing it to format the data correctly.

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Transform the following data into a structured product format..."

def query_api(data_chunk):
    # Serialize each row as "column: value" pairs, skipping empty cells
    data_bundle = '\n'.join([
        ', '.join([f"{col}: {row[col]}" for col in data_chunk.columns if pd.notna(row[col])])
        for _, row in data_chunk.iterrows()
    ])

    # The parse endpoint validates the response directly against the Pydantic model
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Transform the following {len(data_chunk)} rows:\n{data_bundle}"}
        ],
        temperature=0.2,
        response_format=ProductList
    )
    return response.choices[0].message.parsed

This returns a ProductList object with one ProductScheme entry per input row.
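
Because every entry carries its TransformLog, it is straightforward to flag rows for manual review afterwards. A minimal sketch (the helper below is my own illustration, not part of the repository):

def collect_review_flags(product_list: ProductList):
    # Entries the model marked as needing a human look, or where data was lost
    flagged = [
        entry for entry in product_list.items
        if entry.log.needs_review or entry.log.lost_data
    ]
    for entry in flagged:
        print(f"Review '{entry.product.name}': {entry.log.comment}")
    return flagged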

6. Supplementing Data

After receiving the response, we supplemented the data for each row. Here, we also added a foreign key to ensure data consistency across tables.

import json
from uuid import uuid4

def supplement_data(old, new):
    # Convert the validated Pydantic entry back to a plain dict
    entry_json = json.loads(new.model_dump_json())
    new_id = str(uuid4())
    entry_json["product"]["id"] = new_id
    entry_json["product"]["is_active"] = bool(old.get("is_active"))
    return entry_json

7. Safe Execution & Error Handling

To avoid data loss due to crashes, the script saves processed data after each batch and supports safe exits with a SIGINT handler.

import signal

safe_exit = [False]

def signal_handler(*_):
    # Only set a flag so the current batch can finish and be saved before exiting
    safe_exit[0] = True

signal.signal(signal.SIGINT, signal_handler)

Additionally, the last processed index is saved to a file to resume execution if interrupted.

def save_index(index, file_name="index.txt"):
    with open(file_name, 'w') as f:
        f.write(str(index))

def load_index(file_name="index.txt"):
    try:
        with open(file_name, 'r') as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def save_new_data(added_data, file_name="new_data.json"):
    # Read whatever has been saved so far; start fresh if the file is missing or empty
    try:
        with open(file_name, 'r', encoding='utf-8') as f:
            content = f.read()
            file_data = json.loads(content) if content.strip() else []
    except FileNotFoundError:
        file_data = []

    file_data += added_data

    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(file_data, f, ensure_ascii=False, indent=4)


Finally, I added error handling for problems such as the API returning the wrong number of products for a chunk. If that happens, the batch is retried once; a second failure stops the run, as sketched below.
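
Roughly, the retry wrapper looks like this (a sketch assuming query_api returns the parsed ProductList; the exact exception handling in the repository may differ):

def query_api_with_retry(data_chunk, max_attempts=2):
    for attempt in range(1, max_attempts + 1):
        result = query_api(data_chunk)
        # The API must return exactly one product per input row
        if len(result.items) == len(data_chunk):
            return result
        print(f"Attempt {attempt}: expected {len(data_chunk)} items, got {len(result.items)}")
    raise RuntimeError("API returned the wrong number of products twice, aborting.")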

8. Running the Transformation Process

The main script iterates through the data in chunks, transforming, supplementing, and saving the output while handling errors.
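
In outline, the main loop looks roughly like this, a simplified sketch stitching together the helpers above (the real script in the repository is more elaborate):

def main():
    df = get_preprocessed_data()
    start_index = load_index()

    for index, chunk in enumerate(chunk_dataframe(df)):
        if index < start_index:
            continue  # resume where the last run stopped
        if safe_exit[0]:
            print("Safe exit requested, stopping after the last saved batch.")
            break

        api_chunk = remove_columns_for_api(chunk)
        product_list = query_api_with_retry(api_chunk)

        # Re-attach the columns that were stripped before the API call
        supplemented = [
            supplement_data(old_row, new_entry)
            for (_, old_row), new_entry in zip(chunk.iterrows(), product_list.items)
        ]

        save_new_data(supplemented)
        save_index(index + 1)

if __name__ == "__main__":
    main()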

I have uploaded example code to my GitHub:

lenardflx / ai-data-migration

A Python-based tool that uses OpenAI's API to transform unstructured legacy data into a structured format. Supports safe execution, error handling, and automated data validation.

Conclusion

Using AI to structure legacy product data allowed us to merge everything into our new database with minimal manual cleanup effort. If you’re facing similar challenges, consider using AI-powered data migration. If you have any questions, feel free to reach out!
