Akintola Abiodun
Building a simplified Retrieval Augmented Generation System with Supabase Storage and OpenAI Embeddings in Next.js

First off, please forgive me for the ridiculously long title. 😅 I’ll be the first to admit, I’m terrible at naming things. Well… except for variables, of course! (Shoutout to all my beautifully named data1 and data2 variables. 💀)

Okay, enough about my ironic talent for naming things. Let’s dive into today’s topic: building a simplified RAG system.

RAG (Retrieval-Augmented Generation) is an incredibly useful technology with a wide range of applications. The ability to query your organization’s data using AI is a game-changer! It’s a breath of fresh air for anyone dealing with large amounts of information.

If you’re new to this, you might wonder how it works or how to start building your own RAG system. Well, today is your lucky day! We will explore the key concepts and techniques for developing one.

Imagine a hypothetical scenario in which a school owner manages a database filled with student records, tuition payments, and salary entries for staff. Searching for specific information can be tedious and strenuous, often requiring complex queries.

This is where a RAG-based solution comes in! Instead of manually searching or writing complex SQL queries, the school could simply use an application or chatbot that lets them type what they need, like "show all pending tuition payments", and get instant results from their database. That's a huge relief compared to traditional manual searches and complicated queries.

With this in mind, let’s dive into building a basic RAG system using Supabase and OpenAI embeddings!

We will essentially create a system that:

  • Uploads and processes different types of files (PDF, DOCX, XLSX, TXT)
  • Converts document content into embeddings (more on this later!)
  • Stores files and their embeddings (we will use these embeddings to generate the most probable response for the RAG chatbot we are creating)
  • Finds relevant content using semantic search (using the mathematical cosine similarity to compare embeddings)
  • Powers a smart chatbot that can answer questions about your documents

Before we dive in, make sure you have:

  • A Next.js project set up
  • A Supabase account and project
  • An OpenAI API key
  • A basic understanding of TypeScript

First, let's install the necessary packages:

npm install @supabase/supabase-js @ai-sdk/openai ai xlsx pdf-parse mammoth tesseract.js
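The code later in this post reads a few environment variables. Here's a minimal .env.local sketch with the names the server-side snippets expect (the placeholder values are obviously not real; adjust the names if yours differ):

# .env.local
OPENAI_API_KEY=sk-...
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_ANON_KEY=your-anon-key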

Implementation

Step 1: File Upload and Storage

First, let's create a component to handle file uploads:

'use client'

import { useState, useCallback } from 'react'
import { useDropzone } from 'react-dropzone'
import { uploadMedia } from '@/lib/upload-media'

// The Artifact will be the organization file that is being uploaded
interface Artifact {
  id?: number
  file?: File
  name: string
  description: string
  uploaded: boolean
}

export default function FileUpload() {
  const [isUploading, setIsUploading] = useState(false)

  // Handle file upload and embedding generation
  const uploadArtifact = async (artifact: Artifact) => {
    if (!artifact.file) return
    setIsUploading(true)

    try {
      // First, upload file to Supabase storage
      const mediaPath = await uploadMedia(artifact.file)

      if (!mediaPath) {
        throw new Error('Failed to upload file')
      }

      // Generate embeddings via API route
      const formData = new FormData()
      formData.append("mediaPath", mediaPath)

      const response = await fetch("/api/generate-embeddings", {
        method: "POST",
        body: formData,
      })

      const data = await response.json()

      if (!data.success) {
        throw new Error('Failed to generate embeddings')
      }

      // Success! You can now store the embeddings in your database
      console.log('Embeddings generated:', data.embedding)
      //please store your generated embeddings in your database :)
    } catch (error) {
      console.error('Error:', error)
    } finally {
      setIsUploading(false)
    }
  }

  // Set up drag & drop
  const onDrop = useCallback((acceptedFiles: File[]) => {
    const validFileTypes = [
      'text/plain',
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      'application/pdf'
    ]

    const validFiles = acceptedFiles.filter(file => validFileTypes.includes(file.type))

    if (validFiles.length > 0) {
      // Process each valid file
      validFiles.forEach(file => {
        uploadArtifact({
          file,
          name: file.name,
          description: '',
          uploaded: false
        })
      })
    }
  }, [])

  const { getRootProps, getInputProps } = useDropzone({
    onDrop,
    accept: {
      'text/plain': ['.txt'],
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
      'application/pdf': ['.pdf']
    }
  })

  return (
    <div>
      <div {...getRootProps()} className="border-2 border-dashed p-8 text-center">
        <input {...getInputProps()} />
        <p>Drop files here or click to select</p>
      </div>
      {isUploading && <p>Processing files...</p>}
    </div>
  )
}

Now let us create the upload function that handles the Supabase storage:

// lib/upload-media.ts
import { supabase } from '@/lib/utils'

export async function uploadMedia(file: File) {
  // Upload file to Supabase storage with a unique name
  const { data, error } = await supabase.storage
    .from('media')
    .upload(`${Date.now()}-${file.name}`, file)

  if (error) {
    console.error('Error uploading file:', error)
    return null
  }

  return data.path
}

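The uploadMedia helper imports a shared supabase client from @/lib/utils, which I haven't shown above. Here's a minimal sketch of what that file could look like. Note that this client runs in the browser, so I'm assuming NEXT_PUBLIC_-prefixed variables and an existing 'media' bucket in your Supabase project:

// lib/utils.ts (sketch; assumes a 'media' storage bucket already exists)
import { createClient } from '@supabase/supabase-js'

// Browser-safe client: uses the public project URL and anon key
export const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!
)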

Step 2: Understanding Embeddings

OK, before we dive deeper, we should understand what embeddings are, right?

Think of embeddings as a way to convert words or sentences into numbers while keeping their meaning intact. Imagine you are organizing a library, or let's say your personal bookshelf at home, but instead of sorting books alphabetically, you arrange them based on how similar their content is. That's exactly what embeddings do with text.

A Simple Example:
Let's say we have two sentences:

  • "I love to play football"
  • "Football is my favourite sport"

These two sentences mean almost the same thing, right? Even though the exact words aren't identical, the meaning is very similar. Embeddings capture that similarity and assign them closely related numbers in a mathematical space.

Now, compare these:

-"Walahi I would love to eat noodles right now"

  • "I love to play football"

Ignoring my subtle hint that I'm currently craving noodles while writing this, these sentences have completely different meanings, so their embeddings would be very far apart in that space.

How Does This Help?
Now, you might be wondering, why does this even matter? Well, this is the magic behind how AI understands and finds similar content, even when the exact words don’t match word by word!

Let's go back to our initial example: imagine you run a school database where teachers and staff need to look up student records. Instead of typing the exact name of a document, they can search naturally:

"Show me unpaid tuition fees"

Even if the actual database entry is labeled as "Pending student invoices", the system can still match the two because their embeddings are similar. No need for exact keywords: just ask in your own words and the AI will figure it out. Kinda wild, if you ask me.

Why Does This Matter for RAG?
In our RAG (Retrieval-Augmented Generation) system, we'll use embeddings to:

  • Store important information as numerical representations
  • Compare new queries (that is, the user's input) against existing stored data
  • Retrieve the most relevant content for the AI to generate the best response

So basically, embeddings help AI "think" more like humans by recognizing meaning instead of just matching words. Yh yh I know, mind-blowing, lol.
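If you want to see this for yourself, here's a tiny sketch that embeds the football sentences and the noodles one, then prints their similarity scores using the same embed helper and the cosineSimilarity function we'll write in Step 5. The exact numbers will vary, but the football pair should score noticeably higher:

// scripts/embedding-demo.ts (a quick sketch, not part of the app itself)
import { openai } from '@ai-sdk/openai'
import { embed } from 'ai'
import { cosineSimilarity } from '@/lib/similarity'

async function demo() {
  const sentences = [
    'I love to play football',
    'Football is my favourite sport',
    'Walahi I would love to eat noodles right now',
  ]

  // Embed each sentence with the same model we use everywhere else
  const embeddings = await Promise.all(
    sentences.map(async (value) => {
      const { embedding } = await embed({
        model: openai.embedding('text-embedding-3-small'),
        value,
      })
      return embedding
    })
  )

  // Similar meanings => higher cosine similarity
  console.log('football vs football:', cosineSimilarity(embeddings[0], embeddings[1]))
  console.log('football vs noodles:', cosineSimilarity(embeddings[0], embeddings[2]))
}

demo()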

Moving on, here's our API route for generating embeddings:

// app/api/generate-embeddings/route.ts
import { createEmbedding } from '@/lib/create-embeddings'
import { type NextRequest, NextResponse } from 'next/server'

export const maxDuration = 60

export async function POST(req: NextRequest) {
  try {
    const formData = await req.formData()
    const filePath = formData.get('mediaPath') as string

    if (!filePath) {
      return NextResponse.json(
        { error: 'Missing file path' },
        { status: 400 }
      )
    }
    // Generate embeddings for the file content
    const embedding = await createEmbedding(filePath)
    return NextResponse.json({ success: true, embedding })
  } catch (error) {
    console.error('Error:', error)
    return NextResponse.json(
      { error: 'Internal server error' },
      { status: 500 }
    )
  }
}
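The route above just returns the chunk embeddings; remember the comment in the upload component asking you to store them. As one possible approach (a sketch, assuming you've created an artifacts table with article_name, content, and embedding columns, where embedding is a jsonb/float array column or a pgvector column if you've enabled that extension), you could persist them right after generating them:

// lib/store-embeddings.ts (sketch; the table and column names are assumptions)
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
)

interface ChunkWithEmbedding {
  content: string
  embedding: number[]
}

export async function storeEmbeddings(
  articleName: string,
  chunks: ChunkWithEmbedding[]
) {
  // One row per chunk, keeping the text right next to its embedding
  const rows = chunks.map((chunk) => ({
    article_name: articleName,
    content: chunk.content,
    embedding: chunk.embedding,
  }))

  const { error } = await supabase.from('artifacts').insert(rows)
  if (error) {
    console.error('Error storing embeddings:', error)
    throw new Error('Failed to store embeddings')
  }
}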

Step 3: Processing Different File Types

Different file types need different handling: we can't extract content from PDFs the same way we extract it from XLSX or DOCX files. Here's how we extract text from various formats:

// lib/create-embeddings.ts
import { createOpenAI } from '@ai-sdk/openai';
import { createClient } from '@supabase/supabase-js';
import { embed } from 'ai';
import { execSync } from 'child_process';
import fs from 'fs';
import mammoth from 'mammoth';
import os from 'os';
import path from 'path';
import PdfParse from 'pdf-parse';
import Tesseract from 'tesseract.js';
import * as XLSX from 'xlsx';
import { chunkDocumentWithOverlap } from './chunkDocument';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Main function to create embeddings
export async function createEmbedding(filePath: string) {
  const { data, error } = await supabase.storage
    .from('media')
    .download(filePath)
  if (error || !data) throw new Error('Failed to download file')
  // Extract text from the file
  const text = await extractTextFromFile(data, filePath)

  // Split text into chunks for better processing
  const chunks = chunkDocumentWithOverlap(text)

  // Generate embeddings for each chunk
  const chunksWithEmbeddings = await Promise.all(
    chunks.map(async (chunk) => {
      const { embedding } = await embed({
        model: openai.embedding('text-embedding-3-small'),
        value: chunk,
      })
      return { embedding, content: chunk }
    })
  )
  return chunksWithEmbeddings
}

async function extractTextFromFile(
  fileData: Blob,
  filePath: string
): Promise<string> {
  const fileExtension = path.extname(filePath).toLowerCase();
  const tempFilePath = path.join(os.tmpdir(), `temp_file${fileExtension}`);

  console.log(`Temp file path: ${tempFilePath}`);

  // Write the blob to a temporary file
  try {
    await fs.promises.writeFile(
      tempFilePath,
      Buffer.from(await fileData.arrayBuffer())
    );
    console.log('File written successfully.');
  } catch (err) {
    console.error('Error writing file:', err);
    throw new Error('Failed to write temporary file.');
  }

  // Check if the file exists
  const fileExists = fs.existsSync(tempFilePath);
  if (!fileExists) {
    console.error('Temp file was not created.');
    throw new Error('File was not created successfully.');
  }

  console.log('File exists:', fileExists);

  try {
    switch (fileExtension) {
      case '.pdf':
        return await extractTextFromPDF(tempFilePath);
      case '.docx':
        return await extractTextFromDOCX(tempFilePath);
      case '.xlsx':
        return extractTextFromXLSX(tempFilePath);
      case '.txt':
        console.log('Reading .txt file...');
        const textContent = await fs.promises.readFile(tempFilePath, 'utf-8');
        console.log('Extracted text:', textContent);
        return textContent;
      default:
        throw new Error('Unsupported file type');
    }
  } catch (err) {
    console.error('Error extracting text:', err);
    throw new Error('Failed to extract text.');
  } finally {
    // Clean up the temporary file
    try {
      await fs.promises.unlink(tempFilePath);
      console.log('Temp file deleted.');
    } catch (err) {
      console.error('Error deleting temp file:', err);
    }
  }
}

//function to clean up the text
const cleanText = (text : string) => {
  return text
    .replace(/\x00/g, '') // Remove NULL bytes
    .replace(/[^\x20-\x7E\n]/g, ''); // Remove non-ASCII characters (optional)
};

// extract the text from pdf
async function extractTextFromPDF(filePath: string): Promise<string> {
  try {
    console.log('Trying pdf-parse...');
    const dataBuffer = await fs.promises.readFile(filePath);
    const data = await PdfParse(dataBuffer);

    let text = data.text.replace(/\uFFFD/g, '').trim(); // Remove invalid characters

    if (text && text.length > 20) {
      console.log('Extracted text successfully using pdf-parse.');
      return cleanText(text);
    }

    throw new Error('Extracted text is empty or invalid.');
  } catch (error) {
    console.warn('pdf-parse failed:', error);
  }

  // Fallback: Using pdftotext (requires poppler-utils installed)
  try {
    console.log('Trying pdftotext...');
    const extractedText = execSync(`pdftotext -layout "${filePath}" -`, {
      encoding: 'utf-8',
    }).trim();
    if (extractedText.length > 20) {
      console.log('Extracted text successfully using pdftotext.');
      return extractedText;
    }

    throw new Error('pdftotext extraction returned empty text.');
  } catch (error) {
    console.warn('pdftotext failed:', error);
  }

  // Final Fallback: OCR with Tesseract.js (for scanned PDFs)
  try {
    console.log('Trying OCR (Tesseract.js)...');
    const {
      data: { text },
    } = await Tesseract.recognize(filePath, 'eng', {
      logger: (m) => console.log(m), // Log OCR progress
    });

    if (text.length > 20) {
      console.log('Extracted text successfully using OCR.');
      return text;
    }

    throw new Error('OCR extraction returned empty text.');
  } catch (error) {
    console.error('OCR extraction failed:', error);
  }

  throw new Error('Failed to extract text from PDF using all methods.');
}

//extract text from docx files
async function extractTextFromDOCX(filePath: string): Promise<string> {
  try {
    const result = await mammoth.extractRawText({ path: filePath });
    return result.value;
  } catch (error) {
    console.error('Error parsing DOCX:', error);
    throw new Error('Failed to extract text from DOCX');
  }
}

//extract text from xlsx files
function extractTextFromXLSX(filePath: string): string {
  try {
    const workbook = XLSX.readFile(filePath);
    let text = '';
    workbook.SheetNames.forEach((sheetName) => {
      const sheet = workbook.Sheets[sheetName];
      text += XLSX.utils.sheet_to_csv(sheet) + '\n\n';
    });
    return text;
  } catch (error) {
    console.error('Error parsing XLSX:', error);
    throw new Error('Failed to extract text from XLSX');
  }
}


Step 4: Chunking Large Documents

For large documents, we split them into smaller chunks with some overlap so that context is preserved across chunk boundaries:


// lib/chunkDocument.ts
const MAX_CHUNK_SIZE = 2000;
const OVERLAP = 200;

export function chunkDocumentWithOverlap(text: string): string[] {
  const chunks: string[] = [];
  let startIndex = 0;

  while (startIndex < text.length) {
    let endIndex = Math.min(startIndex + MAX_CHUNK_SIZE, text.length);

    console.log(
      'Start:',
      startIndex,
      'End:',
      endIndex,
      'Total Length:',
      text.length,
      'Chunks Count:',
      chunks.length
    );

    if (endIndex < text.length) {
      // Try to find a space to break at, moving left from `endIndex`
      let breakPoint = endIndex;
      while (breakPoint > startIndex && text[breakPoint] !== ' ') {
        breakPoint--;
      }

      // If no space was found, keep the original `endIndex`
      if (breakPoint > startIndex) {
        endIndex = breakPoint + 1; // Include the space
      }
    }

    chunks.push(text.slice(startIndex, endIndex));

    // Move `startIndex` forward with overlap, but make sure we always progress
    startIndex =
      endIndex >= text.length
        ? endIndex
        : Math.max(endIndex - OVERLAP, startIndex + 1);
  }

  return chunks;
}
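As a quick sanity check, here's how you might exercise the chunker on a long dummy string; with the constants above, consecutive chunks should share roughly 200 characters of overlap:

// A small sketch to sanity-check the chunker
import { chunkDocumentWithOverlap } from '@/lib/chunkDocument'

const longText = 'lorem ipsum '.repeat(500) // roughly 6,000 characters of dummy text
const chunks = chunkDocumentWithOverlap(longText)

chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i}: ${chunk.length} characters`)
})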

Step 5: Finding Similar Content

When a user asks a question, we need to find the most similar content among the stored embeddings. We'll turn the user's input into an embedding, then compare it against the stored ones using cosine similarity. Here's how we do it:

// lib/similarity.ts
import { openai } from '@ai-sdk/openai';
import { embed } from 'ai';

// Make sure OPENAI_API_KEY is set in your env

// Shape of a stored chunk: its text plus the embedding we generated for it
export interface Chunk {
  articleName?: string
  content: string
  embedding: number[]
  similarity?: number
}

// Calculate similarity between two vectors using cosine similarity
export function cosineSimilarity(a: number[], b: number[]): number {
  // Calculate dot product
  const dotProduct = a.reduce((sum, _, i) => sum + a[i] * b[i], 0)

  // Calculate magnitudes
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0))
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0))

  // Return similarity score
  return dotProduct / (magnitudeA * magnitudeB)
}

// Find most similar documents
export async function findMostSimilarArtifacts(
  input: string,
  artifacts: Chunk[],
  count: number
): Promise<Chunk[]> {
  // Get embedding for user input
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: input,
  });


  // Calculate similarity scores and sort
  return artifacts
    .map((artifact) => ({
      ...artifact,
      similarity: cosineSimilarity(embedding, artifact.embedding),
    }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, count)
}
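findMostSimilarArtifacts expects the stored chunks as input, so the chat API route we'll build shortly needs a way to load them. Here's a sketch of that lookup, assuming the artifacts table from the storage sketch earlier:

// lib/get-artifacts.ts (sketch; assumes the 'artifacts' table used when storing embeddings)
import { createClient } from '@supabase/supabase-js'
import type { Chunk } from '@/lib/similarity'

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
)

export async function getStoredArtifacts(): Promise<Chunk[]> {
  const { data, error } = await supabase
    .from('artifacts')
    .select('article_name, content, embedding')

  if (error || !data) {
    console.error('Error loading artifacts:', error)
    return []
  }

  // Map database rows to the Chunk shape used by findMostSimilarArtifacts.
  // If you store embeddings in a pgvector column, you may need to parse them back into number[].
  return data.map((row) => ({
    articleName: row.article_name,
    content: row.content,
    embedding: row.embedding as number[],
  }))
}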

Building the AI Chatbot Interface

Now let's create an interactive chatbot that can answer questions about your documents using the AI SDK's useChat hook. This will essentially tie everything together!

// components/DocumentChat.tsx
'use client'

import { useChat } from 'ai/react'
import { Button } from "@/components/ui/button"
import { Card, CardContent, CardFooter, CardHeader, CardTitle } from "@/components/ui/card"
import { Input } from "@/components/ui/input"
import { ScrollArea } from "@/components/ui/scroll-area"
import { Send } from 'lucide-react'

export default function DocumentChat() {
  // Initialize the chat hook with our API endpoint
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: '/api/chat',
    // Initialize with a helpful system message
    initialMessages: [
      {
        id: 'welcome',
        role: 'system',
        content: "I'm your document assistant. Ask me anything about your uploaded documents!",
      },
    ],
  })

  return (
    <Card className="w-full max-w-2xl mx-auto">
      <CardHeader>
        <CardTitle>Document Assistant</CardTitle>
      </CardHeader>

      <CardContent>
        <ScrollArea className="h-[600px] pr-4">
          <div className="flex flex-col gap-4">
            {messages.map((message) => (
              <div
                key={message.id}
                className={`flex ${
                  message.role === 'user' ? 'justify-end' : 'justify-start'
                }`}
              >
                <div
                  className={`rounded-lg px-4 py-2 max-w-[80%] ${
                    message.role === 'user'
                      ? 'bg-primary text-primary-foreground'
                      : 'bg-muted'
                  }`}
                >
                  {message.content}
                </div>
              </div>
            ))}
            {isLoading && (
              <div className="flex justify-start">
                <div className="rounded-lg px-4 py-2 bg-muted">
                  Thinking...
                </div>
              </div>
            )}
          </div>
        </ScrollArea>
      </CardContent>

      <CardFooter>
        <form onSubmit={handleSubmit} className="flex w-full gap-2">
          <Input
            value={input}
            onChange={handleInputChange}
            placeholder="Ask a question about your documents..."
            disabled={isLoading}
          />
          <Button type="submit" disabled={isLoading}>
            <Send className="h-4 w-4" />
          </Button>
        </form>
      </CardFooter>
    </Card>
  )
}
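To actually use the two components, you can drop them into a page. Here's a minimal sketch; the route, layout, and the FileUpload file path are my assumptions, so adapt them to your app:

// app/documents/page.tsx (sketch; pick whatever route fits your app)
import FileUpload from '@/components/FileUpload'
import DocumentChat from '@/components/DocumentChat'

export default function DocumentsPage() {
  return (
    <main className="container mx-auto flex flex-col gap-8 py-8">
      {/* Upload documents first, then chat about them */}
      <FileUpload />
      <DocumentChat />
    </main>
  )
}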

Now, let's create the API route that powers our chatbot:

// app/api/chat/route.ts
import { createOpenAI } from '@ai-sdk/openai'
import { streamText } from 'ai'
import { type NextRequest } from 'next/server'
import { findMostSimilarArtifacts } from '@/lib/similarity'

// Initialize OpenAI
const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})

export async function POST(req: NextRequest) {
  try {
    const { messages } = await req.json()
    const lastMessage = messages[messages.length - 1]

    // Get your storedArtifacts (chunks + embeddings) from your database first
    // (e.g. with a helper like the getStoredArtifacts sketch above),
    // then find the most similar artifacts to be used as context.
    const relevantDocs = await findMostSimilarArtifacts(
      lastMessage.content,
      storedArtifacts,
      3
    )

    // Build the context string from the most relevant chunks
    const context = relevantDocs
      .map((a) => `${a.articleName} (${a.content})`)
      .join('\n\n')

    const result = streamText({
      model: openai('gpt-4o-mini'),
      messages,
      temperature: 0.7,
      maxTokens: 1000,
      system: `You are an AI assistant helping with retrieval augmentation. Provide concise and relevant information based on the provided context. The context: ${context}`,
    });

    return result.toDataStreamResponse();
  } catch (error) {
    console.error('Error in chat route:', error);
    return new Response('An error occurred while processing your request', {
      status: 500,
    });
  }
}

So there you go, voilà: your own working RAG system that answers questions based on your document data.

How It All Works Together

So, from the top:

  1. User uploads a document
  2. Document is stored in Supabase Storage
  3. Text is extracted and split into chunks
  4. Each chunk gets converted to an embedding
  5. Embeddings are stored with their text
  6. When a user asks a question:
     • The question is converted to an embedding
     • The system finds similar chunks using cosine similarity
     • The relevant content is used to generate an informed answer
And that's a wrap! You now have the foundation to build a powerful RAG system using Supabase and OpenAI embeddings. With this setup, you can store and retrieve knowledge efficiently, making your chatbot smarter and more helpful.

The journey definitely shouldn't stop here: keep experimenting, optimizing, and pushing the limits of what AI can do!

Aja Aja Fighting! 💪
