DEV Community

Manjunath
Vector Search & Code Embeddings: Building a Smart Knowledge Base with LangChain and FAISS

Hey dev.to community! 👋

For the last few months, I've been working on an open-source project called Intervo.ai, a voice agent platform for building interactive voice experiences.

Early on, I faced a common issue: How can I build a smart, queryable knowledge base from tons of unstructured data? Enter the world of Vector Search, Embeddings, LangChain, and FAISS.

In this detailed guide, I'll share exactly how I built this, some mistakes I made (hint: starting with JavaScript wasn't my brightest idea), and comprehensive Python code you can use immediately. We'll go from basic setup all the way through to advanced usage.

A diagram depicting how a vector database works

What is LangChain?

LangChain is a robust framework designed to simplify developing AI-powered applications. It handles complex workflows involving language models, embeddings, context management, and integration with various vector databases. Essentially, LangChain removes the headache of manually wiring up components so you can focus on the interesting part—building your application logic.

I initially started with LangChain.js, hoping to leverage my JavaScript expertise. But I soon realized that there is limited documentation and fewer features compared to its Python counterpart. I then decided to switch to Python. This turned out to be the right decision—Python's ecosystem around LangChain is richer, better maintained, and supported by extensive community examples.

Quick Primer: Understanding Vectors and Embeddings

Vector embeddings are numerical representations of data (e.g., code snippets, documents, user queries) that capture the semantic meaning. These vectors position similar data points closer in vector space, making similarity searches efficient and accurate.
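"Closer in vector space" can be made concrete with cosine similarity. Here's a toy sketch with hand-made 4-dimensional vectors — the numbers are purely illustrative (real embeddings like ada-002's have 1,536 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means pointing in the same direction; near 0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- invented values just to show the geometry
cat = np.array([0.9, 0.8, 0.1, 0.0])
kitten = np.array([0.85, 0.75, 0.2, 0.05])
car = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, car))     # low: unrelated concepts
```

A similarity search is then just "find the stored vectors with the highest similarity (or smallest distance) to the query vector."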

FAISS (Facebook AI Similarity Search) is an optimized vector database library designed for high-performance similarity searches. It efficiently manages millions of vectors, making it ideal for both prototyping and scaling production apps.

Step-by-Step Setup

Step 1: Setting up Your Python Environment

First, create and activate your Python environment. Then, install the necessary packages:

pip install langchain faiss-cpu openai python-dotenv

Package Breakdown:

  • langchain: Manages chaining of language models, embedding generation, and simplifies integration with vector stores.
  • faiss-cpu: Provides lightning-fast similarity search capabilities optimized for CPU.
  • openai: Enables easy interaction with OpenAI's API to generate embeddings.
  • python-dotenv: Conveniently manages environment variables like API keys.

Your requirements.txt:

langchain
faiss-cpu
openai
python-dotenv

Step 2: Create an Environment File

Save your OpenAI API key securely in a .env file:

OPENAI_API_KEY=your_openai_api_key_here

Building Your RAG Service

Understanding the Components:

  • Trainer: Prepares your data by splitting it into semantic chunks, embedding these chunks, and storing them in FAISS.
  • Query: Retrieves the most relevant data chunks based on user queries by leveraging similarity search.

Chunking Strategy

Effective chunking ensures optimal results from vector searches. LangChain's RecursiveCharacterTextSplitter splits text intelligently without breaking the semantic context, making it ideal for this task.

Here's the complete implementation:

import os
from dotenv import load_dotenv
# Note: on recent LangChain releases these classes live in langchain_community
# (e.g. langchain_community.vectorstores.FAISS) and langchain_openai --
# adjust the imports if you see deprecation warnings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document

load_dotenv()

class RagService:
    def __init__(self, embedding_model="text-embedding-ada-002"):
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        self.embeddings = OpenAIEmbeddings(model=embedding_model, openai_api_key=os.getenv("OPENAI_API_KEY"))
        self.vectorstore = None

    def train_from_string(self, input_string):
        document = Document(page_content=input_string)
        chunks = self.text_splitter.split_documents([document])
        self.vectorstore = FAISS.from_documents(chunks, self.embeddings)
        self.vectorstore.save_local("faiss_index")

    def query(self, query_text, top_k=5):
        if not self.vectorstore:
            # Newer LangChain versions require explicitly opting in to
            # pickle deserialization when loading a locally saved index
            self.vectorstore = FAISS.load_local(
                "faiss_index", self.embeddings,
                allow_dangerous_deserialization=True,
            )

        results = self.vectorstore.similarity_search(query_text, k=top_k)
        return results

Practical Example Usage

Here's how you'd practically integrate and run the above class:

from rag_service import RagService

# Initialize the service
rag_service = RagService()

# Train your model with a detailed string input
training_data = """
React is a popular JavaScript library for building user interfaces. It manages state efficiently using hooks like useState, useEffect, and useReducer. Global state management can be handled through libraries like Redux, MobX, Zustand, or the built-in Context API.
"""
rag_service.train_from_string(training_data)

# Perform a query
query_result = rag_service.query("How do you manage global state in React?")

print("Relevant results:")
for result in query_result:
    print("-", result.page_content)

Running Your Example

Save the RagService class in rag_service.py, put the example script above in run_example.py, and run:

python run_example.py

Personal Reflection

Initially, I anticipated building a RAG service to be challenging, but the combination of LangChain and FAISS dramatically streamlined the process. Switching to Python from JavaScript was a pivotal moment that highlighted the importance of selecting the right tool ecosystem for your needs.

Through this journey, I realized the immense potential of embeddings and vector databases in creating responsive, intelligent systems that feel genuinely "smart".

A Subtle Plug for Intervo

If you’re excited about creating smart conversational systems or voice-enabled experiences, check out Intervo. It’s a soon-to-be-released open-source project aimed at simplifying the development of voice assistants and interactive voice-based applications.

Thank you for sticking with me through this detailed guide! I'm curious—how are you using vector databases or LangChain in your projects? I'd love to chat about your experiences or answer any questions you might have. 🚀
