Sophia Parafina

Text-to-SQL: Creating Embeddings with Nebius AI Studio (part 1)

Nebius AI Studio is an inference service offering state-of-the-art open-source large language models, including text generation, embedding, and vision models. This is the first article in a series demonstrating how to build a text-to-SQL Retrieval-Augmented Generation (RAG) system.

Embeddings are an essential part of machine learning and artificial intelligence systems. They convert high-dimensional data such as text, sound, and images into numerical vectors that capture the meaning and relationships of the data.
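
To make that concrete, here is a minimal sketch of how two embedding vectors can be compared: texts with similar meaning produce vectors with higher cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: closer to 1.0 means more similar
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# toy "embeddings" for illustration only
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.7]

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # low: unrelated concepts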

As a first step, we will use the Nebius AI Studio API to create embeddings for a database. We can use the Northwind Traders database, a sample database of a fictional company's transactions. This project uses PostgreSQL, and there's a GitHub repository with a script to create the database. The Data Definition Language (DDL) in the SQL script will populate the RAG database, and we can instantiate the database itself to test the SQL generated by the LLM.

Converting SQL to Text

The first step is to create the text that will populate the vector database. We could use the raw SQL, but adding context will improve results.

Here is the DDL for the customers table.

CREATE TABLE customers (
    customer_id character varying(5) NOT NULL,
    company_name character varying(40) NOT NULL,
    contact_name character varying(30),
    contact_title character varying(30),
    address character varying(60),
    city character varying(15),
    region character varying(15),
    postal_code character varying(10),
    country character varying(15),
    phone character varying(24),
    fax character varying(24)
);

We can rewrite the SQL in markdown, providing context about the table and the columns.

Table 1: DB.NORTHWIND.CUSTOMERS (Stores customer information)

This table contains customer information.

- customer_id character varying(5) [Primary Key, NOT NULL] - Unique identifier for customers
- company_name character varying(40) [NOT NULL] - Company where the contact works
- contact_name character varying(30) - Name of the customer
- contact_title character varying(30) - Title of the customer
- address character varying(60) - Physical address of the customer
- city character varying(15) - City where customer is located
- region character varying(15) - Region where customer is located
- postal_code character varying(10) - Customer's postal code
- country character varying(15) - Country where customer is located
- phone character varying(24) - Customer's phone number
- fax character varying(24) - Customer's fax number

All 11 Northwind tables are annotated and converted to markdown in the same way.

Creating Embeddings

With the markdown data prepped, we can create vectors using the Nebius embedding models. You'll need an API key from Nebius: select API Keys in the menu bar and follow the instructions.

[Screenshot: Nebius AI Studio menu bar]

Set the API key as an environment variable in Linux/macOS:

export NEBIUS_API_KEY="xxxxxxxxxxxxxxxx...xxxxxxxxx"

Set the API key as an environment variable in a Windows PowerShell session:

$env:NEBIUS_API_KEY="xxxxxxxxxxxxxxxx...xxxxxxxxx"

Let's dive into the code.
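
All of the snippets below share a few imports. The original post doesn't show them, so here is the import block they imply; the package names are openai, psycopg, and pgvector on PyPI.

# pip install openai "psycopg[binary]" pgvector
import os

import psycopg
from openai import OpenAI
from pgvector.psycopg import register_vector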

Create the database

Assuming that PostgreSQL is running, this function creates a new database called rag, enables the pgvector extension if it isn't already enabled, and creates a table called items.

If you use Postgres.app on macOS or a hosted provider, pgvector comes pre-installed. Otherwise, see your database's installation instructions.

dimensions = 4096

def database_config():
    # create the rag database, then enable the pgvector extension
    conn = psycopg.connect(dbname='postgres', user='postgres', autocommit=True)
    conn.execute('CREATE DATABASE rag')
    conn = psycopg.connect(dbname='rag', user='postgres', autocommit=True)
    conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
    register_vector(conn)

    # create the items table with a vector column sized to the model output
    conn.execute('DROP TABLE IF EXISTS items')
    conn.execute(f'CREATE TABLE items (id bigserial, chunk text, embedding vector({dimensions}))')
    return conn

Create vectors and database records

The embedding process takes the markdown documents and splits the text into smaller chunks. The chunks are passed to the embedding service, which converts them into vectors. Each vector is combined with its text chunk to make a record.

API_KEY = os.environ.get('NEBIUS_API_KEY')

# Nebius exposes an OpenAI-compatible API, so we can use the OpenAI client
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=API_KEY,
)

def create_embeddings(files, docs_dir):
    data = []
    for filename in files:
        file_path = os.path.join(docs_dir, filename)
        with open(file_path, 'r', errors="replace") as file:
            document_text = file.read()
            chunks = recursive_character_splitter_chunking(document_text)

            for chunk in chunks:
                # strip newlines and NUL bytes before embedding and storage
                chunk = chunk.replace('\n', ' ')
                chunk = chunk.replace("\x00", "")
                embedding = create_vectors(chunk)
                data.append([chunk, embedding])

    return data

def create_vectors(text):
    embedding = client.embeddings.create(
        model="BAAI/bge-en-icl",
        input=text,
        dimensions=1024,
        encoding_format="float",
    ).data[0].embedding
    return str(embedding)
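
The chunking helper isn't shown in the original snippet. Here is a minimal sketch of a recursive character splitter matching the call above; the function name follows the code, but the chunk size, separators, and logic are illustrative assumptions.

def recursive_character_splitter_chunking(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ")):
    # return the text as a single chunk once it fits
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # recurse into any piece that is still too large
            return [piece
                    for chunk in chunks
                    for piece in recursive_character_splitter_chunking(chunk, chunk_size, separators)]
    # no separator found; fall back to a hard split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]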

Insert the embeddings

This function adds the embeddings to the items table.

def upsert(embeddings, conn):
    # insert the chunk/vector records into the database
    for record in embeddings:
        chunk = str(record[0])
        vector = record[1]
        conn.execute('INSERT INTO items (chunk, embedding) VALUES (%s, %s)', (chunk, vector))
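
To run the whole pipeline, a small driver can wire the functions together. This is a sketch: the docs directory name and the assumption that each annotated table lives in its own markdown file are mine, not the original post's.

# hypothetical driver: embed every markdown file in docs/ and load the records
docs_dir = 'docs'
files = [f for f in os.listdir(docs_dir) if f.endswith('.md')]

conn = database_config()
records = create_embeddings(files, docs_dir)
upsert(records, conn)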

Typically, we would create an index to improve retrieval. However, the Nebius embedding service only returns vectors at the model's full dimensionality, e.g., 4096, regardless of the dimensions parameter. pgvector's HNSW indexes support at most 2,000 dimensions, except for bit vectors. While you can create a binary quantized index, recall can be negatively impacted. I've added sample code to create the index, but given the size of this dataset, an index is not necessary.

# create index
# database_conn.execute('CREATE INDEX ON items USING hnsw ((binary_quantize(embedding)::bit(4096)) bit_hamming_ops);')
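
As a quick check that the records landed, you can run a nearest-neighbor query with pgvector's <-> (L2 distance) operator. This is only a sketch; the retrieval client proper is the subject of the next article.

# hypothetical sanity check: print the stored chunks nearest to a question
question_vector = create_vectors("Which table stores customer contact details?")
rows = conn.execute(
    'SELECT chunk FROM items ORDER BY embedding <-> %s::vector LIMIT 3',
    (question_vector,),
).fetchall()
for (chunk,) in rows:
    print(chunk[:80])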

Next steps

This article demonstrated how to create embeddings from the Northwind database DDL and insert them into PostgreSQL with the pgvector extension. The next article shows how to build the client that generates SQL from a prompt.
