In this guide, we'll explore how to effectively store, index, and query embeddings generated by OpenAI's `text-embedding-3-small` model using PostgreSQL's `pg_vector` extension. This approach is particularly powerful for building semantic search and similarity-matching systems.
## Why PostgreSQL with pg_vector?

- **Native Vector Operations**: `pg_vector` lets PostgreSQL handle vector operations natively
- **Efficient Indexing**: Supports fast approximate similarity search using IVFFlat indexes
- **Production Ready**: Scales well with large datasets
- **Cost-Effective**: Often cheaper than a specialized vector database
## Setting Up PostgreSQL with pg_vector
First, let's set up our database with the necessary extension and schema:
```sql
-- Enable the vector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for our embeddings
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536), -- dimension size for text-embedding-3-small
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Create an IVFFlat index for faster similarity searches
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
## TypeScript Implementation
Here's a complete implementation showing how to store and query embeddings:
```typescript
import { Pool } from "pg";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

interface Document {
  id?: number;
  content: string;
  embedding?: number[];
  metadata?: Record<string, any>;
}

class VectorStore {
  private pool: Pool;
  private embeddings: OpenAIEmbeddings;

  constructor() {
    this.pool = new Pool({
      host: "localhost",
      database: "your_database",
      user: "your_user",
      password: "your_password",
    });
    this.embeddings = new OpenAIEmbeddings({
      modelName: "text-embedding-3-small",
    });
  }

  async addDocuments(documents: Document[]) {
    const client = await this.pool.connect();
    try {
      await client.query("BEGIN");
      for (const doc of documents) {
        // Generate embedding for the document
        const [embedding] = await this.embeddings.embedDocuments([doc.content]);

        // Insert document and embedding. The embedding must be serialized as
        // a bracketed literal ("[0.1,0.2,...]") — node-postgres would
        // otherwise send a JS array as "{...}", which pg_vector rejects.
        await client.query(
          `INSERT INTO documents (content, embedding, metadata)
           VALUES ($1, $2, $3)`,
          [doc.content, JSON.stringify(embedding), doc.metadata || {}]
        );
      }
      await client.query("COMMIT");
    } catch (error) {
      await client.query("ROLLBACK");
      throw error;
    } finally {
      client.release();
    }
  }

  async similaritySearch(query: string, k: number = 5) {
    // Generate embedding for the query
    const [queryEmbedding] = await this.embeddings.embedDocuments([query]);

    // Perform similarity search (same vector-literal serialization as above)
    const result = await this.pool.query(
      `SELECT
         content,
         metadata,
         1 - (embedding <=> $1) AS similarity
       FROM documents
       ORDER BY embedding <=> $1
       LIMIT $2`,
      [JSON.stringify(queryEmbedding), k]
    );
    return result.rows;
  }
}

// Usage Example
async function main() {
  const vectorStore = new VectorStore();

  // Adding documents
  await vectorStore.addDocuments([
    {
      content: "TypeScript is a strongly typed programming language.",
      metadata: { category: "programming", language: "typescript" },
    },
    {
      content: "PostgreSQL is a powerful open-source database.",
      metadata: { category: "database", type: "relational" },
    },
  ]);

  // Performing similarity search
  const results = await vectorStore.similaritySearch(
    "What programming languages are typed?"
  );
  console.log(results);
}

main().catch(console.error);
```
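One detail that trips people up: node-postgres serializes a JavaScript array as a Postgres array literal (`{0.1,0.2}`), which the `vector` type rejects — `pg_vector` expects a bracketed literal such as `[0.1,0.2]`. `JSON.stringify` happens to produce exactly that format for a flat number array; as a sketch, a small helper makes the intent explicit (`toVectorLiteral` is a name invented here, not part of pg or pg_vector):

```typescript
// Hypothetical helper: format a JS number array as a pg_vector text literal,
// e.g. [0.1, 0.2, 0.3] -> "[0.1,0.2,0.3]". Equivalent to JSON.stringify for
// flat numeric arrays, but the name documents why the conversion exists.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

console.log(toVectorLiteral([0.1, 0.2, 0.3])); // "[0.1,0.2,0.3]"
```

Either form can be passed as a query parameter wherever a `vector` value is expected; Postgres infers the vector type from the column or operator.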
## Understanding the Implementation

### 1. Table Structure
```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
```
- `content`: Stores the original text
- `embedding`: The 1536-dimension vector representation from OpenAI
- `metadata`: Flexible JSONB field for additional information
- `created_at`: Timestamp for tracking when each document was added
### 2. Adding Documents
```typescript
async addDocuments(documents: Document[]) {
  // ... transaction handling ...
  for (const doc of documents) {
    // Generate embedding for the document
    const [embedding] = await this.embeddings.embedDocuments([doc.content]);

    // Insert document and embedding (serialized as a bracketed vector literal)
    await client.query(
      `INSERT INTO documents (content, embedding, metadata)
       VALUES ($1, $2, $3)`,
      [doc.content, JSON.stringify(embedding), doc.metadata || {}]
    );
  }
}
```
This method:

- Wraps all inserts in a single transaction, rolling back on any failure for data integrity
- Generates an embedding for each document using the OpenAI API
- Stores the content, embedding, and metadata together in one row
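The loop calls the embeddings API once per document. For large imports, you can instead embed documents in batches (`embedDocuments` accepts multiple texts per call) and insert several rows per statement. A minimal, dependency-free sketch of the batching step — the batch size of 100 is an arbitrary choice for illustration, not an API limit:

```typescript
// Split an array into fixed-size batches so each embeddings API call
// (and each multi-row INSERT) covers several documents at once.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// e.g. chunk(docs, 100).map(batch => embeddings.embedDocuments(
//   batch.map(d => d.content)))
console.log(chunk([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```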
### 3. Similarity Search
```typescript
async similaritySearch(query: string, k: number = 5) {
  const [queryEmbedding] = await this.embeddings.embedDocuments([query]);
  const result = await this.pool.query(
    `SELECT
       content,
       metadata,
       1 - (embedding <=> $1) AS similarity
     FROM documents
     ORDER BY embedding <=> $1
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), k]
  );
  return result.rows;
}
```
Key points:

- `<=>` is `pg_vector`'s cosine distance operator
- `1 - (embedding <=> $1)` converts cosine distance into a similarity score (1 means identical direction, 0 means orthogonal)
- Ordering by the operator lets the query use the IVFFlat index for efficient approximate search
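To build intuition for what the query computes, here is a plain TypeScript sketch of cosine distance and the distance-to-similarity conversion. This only mirrors the operator's math for illustration; `pg_vector` does the real work server-side:

```typescript
// Cosine distance = 1 - cosine similarity, matching pg_vector's <=> operator.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Same direction => distance 0 => similarity 1
console.log(1 - cosineDistance([1, 0], [2, 0])); // 1
// Orthogonal vectors => distance 1 => similarity 0
console.log(1 - cosineDistance([1, 0], [0, 1])); // 0
```

Note that cosine distance ignores vector magnitude, which is why `[1, 0]` and `[2, 0]` score as identical — a useful property for embeddings, where direction carries the meaning.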
## Performance Optimization Tips

- **Batch Insertions**: When adding many documents, use `COPY` or multi-row batch inserts rather than one `INSERT` per document
- **Index Tuning**: Adjust the `lists` parameter based on your dataset size:

  ```sql
  -- Rule of thumb: lists = sqrt(number_of_rows)
  CREATE INDEX ON documents
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
  ```

- **Connection Pooling**: Use connection pooling for better performance
- **Regular VACUUM**: Schedule regular `VACUUM` operations to maintain index efficiency
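The `lists` rule of thumb can be turned into a tiny helper. As a caveat, pgvector's own guidance suggests `rows / 1000` for tables up to about a million rows and `sqrt(rows)` beyond that, so treat this sketch (and the `suggestLists` name, which is invented here) as a starting point to benchmark against your data:

```typescript
// Hypothetical helper: suggest an IVFFlat "lists" value from the row count,
// following pgvector's guidance (rows/1000 small, sqrt(rows) above ~1M rows).
function suggestLists(rowCount: number): number {
  const lists = rowCount <= 1_000_000
    ? rowCount / 1000
    : Math.sqrt(rowCount);
  return Math.max(1, Math.round(lists));
}

console.log(suggestLists(10_000));    // 10
console.log(suggestLists(4_000_000)); // 2000
```

Remember to rebuild the index after bulk loads so the IVFFlat clusters reflect the actual data distribution.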
## Example Usage
```typescript
// Initialize the vector store
const vectorStore = new VectorStore();

// Add documents
await vectorStore.addDocuments([
  {
    content: "TypeScript adds optional static types to JavaScript",
    metadata: { topic: "programming" },
  },
]);

// Search for similar content
const results = await vectorStore.similaritySearch(
  "What is static typing?",
  5
);
console.log(results);
```
## Conclusion

Using PostgreSQL with `pg_vector` provides a robust, scalable solution for storing and querying embeddings. The implementation above gives you a foundation for building semantic search systems, recommendation engines, or any application requiring vector similarity search.
Remember to:
- Monitor index performance
- Tune parameters based on your dataset
- Keep your PostgreSQL and pg_vector versions updated
- Consider batch operations for large datasets
This setup provides an excellent balance of performance, cost, and maintainability for production systems working with embeddings.