Introduction
Working with legacy databases can be a nightmare. Over the years, data inconsistencies, unvalidated entries, and incorrect formatting creep into the system, making it chaotic and difficult to manage. When migrating such data into a new, structured database, manual cleanup can be very time-consuming depending on the size of the database. Fortunately, recent advances in AI make it possible to automate large parts of the transformation.
In this article, I’ll walk you through my experience transforming unstructured product data from an outdated database into a well-structured format using OpenAI's API. I'll cover the entire process, from data preparation and API communication to safe execution and error handling, ensuring a smooth and efficient transformation.
Challenges in the Legacy Database
The old database contained multiple issues:
- Incorrect column usage – Data stored in irrelevant fields
- Lack of validation – Inconsistent formats, e.g., product codes and descriptions
- Redundant fields – Unnecessary columns accumulating over time
- Unstructured text – Data mixed up in wrong columns, typos
- Multiple sources – Data spread across various CSV files
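To make these issues concrete, here is a purely hypothetical example of what a single legacy row might have looked like next to the target shape. The column names and values below are illustrative and not taken from the real database:

# Hypothetical legacy row: the category hides in the description, the location
# sits as free text in a catch-all column, and the name has typos and odd casing.
legacy_row = {
    "Product_ID": "1042",
    "name": "bolt m8 GALVANIZD",
    "description": "category: fasteners",
    "misc": "shelf 3, aisle 12",
    "is_active": "1",
}

# Illustrative target shape after transformation (matching the Product model defined below):
structured_row = {
    "name": "Bolt M8 Galvanized",
    "description": None,
    "location": [3, 12],
    "category": "Fasteners",
}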
Steps to Clean and Transform the Data
1. Defining a Structured Data Model
Before processing the data, I defined a structured schema using Python’s Pydantic library to ensure consistency in the transformed data. This structure only contained columns that I wanted the AI to generate. Each class represented one row in a table, with attributes and a log for transformation tracking.
from pydantic import BaseModel
from typing import List, Optional

class TransformLog(BaseModel):
    needs_review: bool
    lost_data: List[str]
    modified_data: List[str]
    other_data_modifications: List[str]
    comment: str

class Product(BaseModel):
    name: Optional[str]
    description: Optional[str]
    location: List[int]
    category: Optional[str]
Each table entry is then wrapped in a scheme class together with its transformation log, and another class represents the list of products returned for a whole chunk.
class ProductScheme(BaseModel):
    product: Product
    log: TransformLog

class ProductList(BaseModel):
    items: List[ProductScheme]
This model ensured that any inconsistencies in the OpenAI API response could be validated and logged.
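As a quick illustration (not part of the migration script itself), this is how a malformed entry gets rejected, assuming Pydantic v2; the payload below is made up:

from pydantic import ValidationError

# Hypothetical response entry where the location came back as free text instead of integers.
bad_entry = {
    "product": {
        "name": "Bolt M8",
        "description": None,
        "location": ["aisle 12"],
        "category": "Fasteners",
    },
    "log": {
        "needs_review": True,
        "lost_data": [],
        "modified_data": [],
        "other_data_modifications": [],
        "comment": "location was free text in the source",
    },
}

try:
    ProductScheme.model_validate(bad_entry)
except ValidationError as error:
    print(error)  # e.g. "product.location.0 ... Input should be a valid integer"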
2. Preprocessing CSV Data
The next step involved loading the old database export as a CSV file with pandas.
import pandas as pd

def get_preprocessed_data(filename="data.csv"):
    return pd.read_csv(filename, delimiter=';')
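Since the legacy data was spread across several CSV files (one of the challenges listed above), a merge step can slot in here. The sketch below assumes all exports share the same delimiter and column layout; the glob pattern and helper name are my own, not from the original script:

import glob

import pandas as pd

def load_all_sources(pattern="legacy_exports/*.csv"):
    # Read every export and stack them into a single DataFrame,
    # dropping exact duplicates that appear in more than one file.
    frames = [pd.read_csv(path, delimiter=';') for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True).drop_duplicates()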
3. Chunking Data for API Requests
To avoid sending the system prompt with every single row and driving up token usage, the data was split into smaller chunks before being sent. In my case, a batch size of 15 rows turned out to be optimal.
CHUNK_SIZE = 15

def chunk_dataframe(df, chunk_size=CHUNK_SIZE):
    for i in range(0, len(df), chunk_size):
        yield df.iloc[i:i + chunk_size]
4. Preparing Data for the API
Since the model is not 100% accurate and API calls cost money, columns that were already structured (e.g., IDs and boolean fields) were removed before sending the data to the API and added back later on.
def remove_columns_for_api(chunk):
    columns_to_remove = ["Product_ID", "is_active"]
    return chunk.drop(columns=columns_to_remove)
5. Sending Data to OpenAI API
Each prepared chunk was sent to OpenAI’s API, with a structured prompt instructing it to format the data correctly.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "Transform the following data into a structured product format..."

def query_api(data_chunk):
    # Serialize each row as "column: value" pairs, skipping empty cells.
    data_bundle = '\n'.join([
        ', '.join([f"{col}: {row[col]}" for col in data_chunk.columns if pd.notna(row[col])])
        for _, row in data_chunk.iterrows()
    ])
    # The parse helper validates the response against the ProductList schema.
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Transform the following {len(data_chunk)} rows:\n{data_bundle}"}
        ],
        temperature=0.2,
        response_format=ProductList
    )
    return response.choices[0].message.parsed
This would return a ProductList object.
6. Supplementing Data
After receiving the response, we supplemented the data for each row. Here, we also added a foreign key to ensure data consistency across tables.
import json
from uuid import uuid4

def supplement_data(old, new):
    # Convert the validated Pydantic entry back into a plain dict
    entry_json = json.loads(new.model_dump_json())
    new_id = str(uuid4())
    entry_json["product"]["id"] = new_id
    entry_json["product"]["is_active"] = bool(old.get("is_active"))
    return entry_json
7. Safe Execution & Error Handling
To avoid data loss due to crashes, the script saves processed data after each batch and supports safe exits with a SIGINT handler.
import signal

safe_exit = [False]

def signal_handler(*_):
    # Only set a flag here; the main loop finishes the current batch before exiting.
    safe_exit[0] = True

signal.signal(signal.SIGINT, signal_handler)
Additionally, the last processed index is saved to a file to resume execution if interrupted.
def save_index(index, file_name="index.txt"):
    with open(file_name, 'w') as f:
        f.write(str(index))

def load_index(file_name="index.txt"):
    try:
        with open(file_name, 'r') as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0
def save_new_data(added_data, file_name="new_data.json"):
    # Load what has been written so far (an empty list on the first run),
    # append the new batch, and write everything back.
    try:
        with open(file_name, 'r', encoding='utf-8') as f:
            content = f.read()
        file_data = json.loads(content) if content.strip() else []
    except FileNotFoundError:
        file_data = []
    file_data += added_data
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(file_data, f, ensure_ascii=False, indent=4)
Finally, I added error handling for problems such as the API returning a product list with the wrong number of entries. If that happens, the batch is retried once; a second failure stops the script.
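The retry logic itself isn't shown above; a minimal sketch of the idea, assuming query_api returns a ProductList as defined earlier, could look like this (the helper name is mine):

def transform_chunk_with_retry(chunk, max_attempts=2):
    # Ask the API again once if the returned list does not match the chunk size;
    # give up after the second failure so the run stops instead of writing bad data.
    for attempt in range(1, max_attempts + 1):
        result = query_api(chunk)
        if len(result.items) == len(chunk):
            return result
        print(f"Attempt {attempt}: expected {len(chunk)} items, got {len(result.items)}")
    raise RuntimeError("API returned an inconsistent product list twice, aborting.")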
8. Running the Transformation Process
The main script iterates through the data in chunks, transforming, supplementing, and saving the output while handling errors along the way.
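The full script is linked below; as a rough sketch, the main loop could wire the pieces together like this (it reuses the retry helper sketched in step 7 and assumes the functions defined above are in scope):

def main():
    df = get_preprocessed_data()
    start = load_index()

    for batch_number, chunk in enumerate(chunk_dataframe(df)):
        if batch_number < start:
            continue  # skip batches already processed in a previous run

        result = transform_chunk_with_retry(remove_columns_for_api(chunk))
        added = [
            supplement_data(old_row, new_entry)
            for (_, old_row), new_entry in zip(chunk.iterrows(), result.items)
        ]

        save_new_data(added)
        save_index(batch_number + 1)

        if safe_exit[0]:
            print("SIGINT received, stopping after the current batch.")
            break

if __name__ == "__main__":
    main()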
I have uploaded an example of the code to my GitHub:
lenardflx / ai-data-migration
A Python-based tool that uses OpenAI's API to transform unstructured legacy data into a structured format. Supports safe execution, error handling, and automated data validation.
Conclusion
Using AI to structure legacy product data allowed us to merge everything into our new database with minimal manual cleanup effort. If you're facing similar challenges, consider an AI-powered approach to data migration. If you have any questions, feel free to reach out!