DEV Community

harshit-lyzr

Generative AI Dataset Generator App with Streamlit and Lyzr

In today’s data-driven world, generating realistic datasets is essential for testing, training machine learning models, and conducting meaningful analysis. To streamline this process, we present a Streamlit app that leverages the power of Lyzr Automata, a framework that simplifies building and managing AI-driven workflows. This blog will guide you through creating a Dataset Generator app using Lyzr Automata, OpenAI’s GPT models, and Streamlit.

Problem Statement
Creating datasets manually can be time-consuming and prone to errors, especially when the data needs to be diverse and realistic. Automating dataset generation ensures consistency, saves time, and allows data engineers to focus on more complex tasks. This app aims to solve the problem of manual dataset creation by providing an easy-to-use interface where users can specify the format, fields, and number of entries for the dataset.

Solution
Our Streamlit-based Dataset Generator app leverages Lyzr Automata to automate the creation of datasets. Users can input their dataset format (CSV or Table), define the fields they need, and specify the number of entries. The app then generates a dataset that meets these criteria using an AI model.

Why Lyzr Automata?
Lyzr Automata is used for its advanced capabilities in creating and managing AI agents and workflows, particularly in the context of Generative AI. Here are some key reasons why Lyzr Automata is beneficial:

Ease of Integration: Lyzr Automata can be easily integrated into existing systems and workflows, making it convenient to implement AI-driven solutions without a complete overhaul of current processes.
Automation: It helps automate repetitive tasks, reducing the manual effort required and increasing efficiency. This is particularly useful in tasks such as data preprocessing, content generation, and workflow management.
Customization: Lyzr Automata offers a high degree of customization, allowing users to tailor AI agents to specific needs and requirements. This flexibility ensures that the solutions are aligned with business goals and objectives.
Scalability: The platform is designed to scale seamlessly, accommodating increasing workloads and expanding as the business grows. This makes it suitable for both small-scale projects and large enterprise applications.
Performance Optimization: Lyzr Automata includes tools for monitoring and optimizing the performance of AI agents, ensuring they operate efficiently and effectively.
Support for Generative AI: It is particularly strong in supporting generative AI applications, such as creating text, images, and other content types, making it a valuable tool for businesses looking to leverage generative AI capabilities.

How the App Works
User Interface: The app uses Streamlit for its user-friendly interface. Users can enter their OpenAI API key, specify the format (CSV or Table), define the fields, and set the number of entries for the dataset.
Lyzr Automata Workflow: The app defines a workflow using Lyzr Automata, where an agent powered by OpenAI’s GPT-4 generates the dataset based on user inputs.
Dataset Generation: The specified format, fields, and number of entries are used to create a realistic and diverse dataset. The generated dataset is displayed within the app.

Setting Up the Environment
Imports:

Install the required packages, then import Streamlit and the components needed from lyzr_automata:

pip install lyzr_automata streamlit

import streamlit as st
from lyzr_automata import Agent, Task
from lyzr_automata.ai_models.openai import OpenAIModel
from lyzr_automata.pipelines.linear_sync_pipeline import LinearSyncPipeline
# InputType and OutputType are used by the Task definition later in the app
from lyzr_automata.tasks.task_literals import InputType, OutputType

Sidebar Configuration

# Ask for the key in the sidebar; type="password" masks the input
api = st.sidebar.text_input("Enter your OpenAI API key here", type="password")
if api:
    # Configure the OpenAI model that the Lyzr Automata task will call
    openai_model = OpenAIModel(
        api_key=api,
        parameters={
            "model": "gpt-4-turbo-preview",
            "temperature": 0.2,
            "max_tokens": 1500,
        },
    )
else:
    st.sidebar.error("Please enter your OpenAI API key")

if api: Checks whether an API key has been entered.
openai_model = OpenAIModel(): If a key is entered, creates an OpenAIModel object with the provided API key and the model parameters (model gpt-4-turbo-preview, temperature 0.2, max_tokens 1500).
else: If no key is entered, displays an error message in the sidebar.
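
As a possible variation on the key check above, the sketch below falls back to an environment variable when the sidebar field is empty. The helper name resolve_api_key is hypothetical and not part of the original app; OPENAI_API_KEY is the variable name the OpenAI SDK conventionally reads. Treat it as one way to do this, not the app's actual behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(user_input: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the key typed by the user, else the environment value, else None.

    Hypothetical helper for illustration; the original app only reads the sidebar field.
    """
    key = user_input.strip() or env.get("OPENAI_API_KEY", "").strip()
    return key or None


# The typed value wins over the environment; whitespace is trimmed.
typed = resolve_api_key("  sk-test  ", {"OPENAI_API_KEY": "sk-env"})
# With nothing typed, the environment value is used.
from_env = resolve_api_key("", {"OPENAI_API_KEY": "sk-env"})
# With neither source available, the result is None.
missing = resolve_api_key("", {})
```

In the app, the result could replace the plain `if api:` test, so the error message only appears when neither source provides a key.
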
dataset_generation Function:

def dataset_generation(format, fields, entries):
    # The agent sets the persona and role the model adopts
    dataset_agent = Agent(
        prompt_persona="You are a Data Engineer with over 10 years of experience. You care about data integrity and believe in the importance of realistic datasets for meaningful analysis.",
        role="Data Engineer",
    )

    # The task combines the model, the agent, and the user-specific instructions
    dataset = Task(
        name="Dataset generation",
        output_type=OutputType.TEXT,
        input_type=InputType.TEXT,
        model=openai_model,
        agent=dataset_agent,
        log_output=True,
        instructions=f"""
    Please generate a dataset in {format} format with the following fields:
    {fields}

    The dataset should contain {entries} entries. Each entry should be unique and provide a diverse representation across all fields.
    Ensure the entries are realistic and diverse.

    Accuracy is important, so ensure that {fields} are plausible and realistic. If using fictional data, maintain consistency and coherence within the dataset.
    Please provide the generated dataset in the specified format.

    [!Important] Only generate the dataset, nothing apart from it.
    """,
    )

    # Run the single-task pipeline and return the text produced by the task
    output = LinearSyncPipeline(
        name="Dataset Generation",
        completion_message="Dataset Generated!",
        tasks=[
            dataset
        ],
    ).run()
    return output[0]['task_output']

def dataset_generation(format, fields, entries): Defines a function named dataset_generation that takes three arguments: format (CSV or Table), fields (comma-separated list of dataset fields), and entries (number of entries to generate).
dataset_agent = Agent(): Creates an Agent object defining the prompt persona and role ("Data Engineer").
dataset = Task(): Creates a Task object specifying details about the dataset generation task.
name: Sets the task name to “Dataset generation”.
output_type: Sets the expected output type as text.
input_type: Sets the input type for the task as text.
model: Assigns the openai_model object (if API key is provided).
agent: Assigns the dataset_agent object.
log_output: Sets logging for the task output to True.
instructions: Defines a multi-line string containing instructions for the AI model. The instructions specify the desired format, fields, number of entries, data characteristics (unique, diverse, realistic), and output format.
output = LinearSyncPipeline(): Creates a LinearSyncPipeline object named "Dataset Generation" with a completion message and assigns the dataset task to it.
return output[0]['task_output']: Runs the pipeline, retrieves the task output from the first element (index 0) of the results, and returns it.
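
To see what the model actually receives, it can help to build the instruction text outside the Task. The sketch below is a simplified, standalone version of the f-string used in dataset_generation; build_instructions is a hypothetical helper name and the wording is shortened, so it is an illustration of the interpolation rather than the app's exact prompt.

```python
def build_instructions(format: str, fields: str, entries: int) -> str:
    """Interpolate the three user inputs into the instruction text.

    Hypothetical helper; the original app builds this string inline in Task().
    """
    return (
        f"Please generate a dataset in {format} format with the following fields:\n"
        f"{fields}\n\n"
        f"The dataset should contain {entries} entries. "
        "Each entry should be unique, realistic, and diverse.\n"
        "Only generate the dataset, nothing apart from it."
    )


prompt = build_instructions("CSV", "Name: Customer Name, Age: Customer Age", 10)
print(prompt)
```

Printing the prompt this way makes it easy to check that the format, fields, and entry count land where expected before spending tokens on a real request.
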

User Input:

specify_format = st.selectbox("Enter format", ["CSV", "Table"])
specify_fields = st.text_area("Enter Fields", placeholder="Name: Customer Name, Age: Customer Age", height=300)
# Integer min_value, value, and step make number_input return an int (default 10)
no_entries = st.number_input("Enter number of entries", min_value=1, value=10, step=1)

specify_format = st.selectbox(): Creates a dropdown menu named “Enter format” with options “CSV” and “Table” for users to select the desired dataset format.

specify_fields = st.text_area(): Creates a multi-line text area named “Enter Fields” where users can input a comma-separated list of dataset fields (e.g., Name: Customer Name, Age: Customer Age).

no_entries = st.number_input(): Creates a number input field named “Enter number of entries” where users can specify the desired number of entries for the generated dataset.
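
The text area returns one raw string, so a small parser can be useful if you want to validate the fields before sending them to the model. The sketch below assumes the "Name: Description" convention from the placeholder, with entries separated by commas or newlines; parse_fields is a hypothetical helper that the original app does not include.

```python
from typing import List, Tuple


def parse_fields(raw: str) -> List[Tuple[str, str]]:
    """Split 'Name: Description' entries separated by commas or newlines.

    Hypothetical helper for illustration. A description that itself contains
    a comma would be split incorrectly by this simple approach.
    """
    fields = []
    for chunk in raw.replace("\n", ",").split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces such as trailing commas
        name, _, description = chunk.partition(":")
        fields.append((name.strip(), description.strip()))
    return fields


parsed = parse_fields("Name: Customer Name, Age: Customer Age")
print(parsed)  # [('Name', 'Customer Name'), ('Age', 'Customer Age')]
```

An empty result is a reasonable signal to show a warning instead of calling the model.
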

Generate Button and Output Display:

if st.button("Generate"):
    if not api:
        st.error("Please enter your OpenAI API key in the sidebar first.")
    else:
        solution = dataset_generation(specify_format, specify_fields, no_entries)
        st.markdown(solution)

if st.button("Generate"): Creates a button labeled "Generate". If the button is clicked, the following code block executes.
if not api: Shows an error instead of calling the model when no API key was entered, because openai_model is only created once a key is provided.
solution = dataset_generation(): Calls the dataset_generation function with the user-selected format, entered fields, and number of entries.
st.markdown(solution): Displays the generated dataset output as markdown formatted text on the app.
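
Because the model returns plain text, you may want to check a CSV result before trusting it, for example that it parses and contains the requested number of rows. The sketch below uses only the standard library; parse_csv_output is a hypothetical helper, and the fence-stripping step is an assumption about how models sometimes wrap their output, not something the original app handles.

```python
import csv
import io
from typing import Dict, List

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes add


def parse_csv_output(text: str) -> List[Dict[str, str]]:
    """Parse model output as CSV, ignoring any Markdown fence lines.

    Hypothetical helper for illustration; values are returned as strings.
    """
    lines = [ln for ln in text.strip().splitlines() if not ln.strip().startswith(FENCE)]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return [dict(row) for row in reader]


sample = "Name,Age\nAlice,34\nBob,27\n"
rows = parse_csv_output(sample)
print(len(rows), rows[0]["Name"])  # 2 Alice
```

Comparing len(rows) with the requested number of entries gives a quick sanity check that the model followed the instructions.
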

Running the App
Finally, run the app using the following command in your terminal:

streamlit run app.py

Try it now: https://github.com/harshit-lyzr/dataset_generator

For more information, explore the website: Lyzr

Contribute to our project: https://github.com/LyzrCore/lyzr-automata
