Davide Santangelo

Building a Vector Database in Ruby: A Comprehensive Guide

In the era of artificial intelligence and machine learning, vector databases have emerged as a critical component for applications that require similarity search, recommendation engines, and natural language processing. This article provides a detailed walkthrough on creating a vector database in Ruby, complete with practical code examples, benchmarks, and integration with AI systems.

Table of Contents

  1. Introduction to Vector Databases
  2. Vector Database Fundamentals
  3. Building a Simple Vector Database in Ruby
  4. Advanced Features Implementation
  5. Benchmarking and Performance Optimization
  6. Integration with AI Models
  7. Production Considerations
  8. Comparison with Existing Solutions
  9. Conclusion

Introduction to Vector Databases

Vector databases are specialized systems designed to store and query high-dimensional vectors, which are mathematical representations of data points in a multi-dimensional space. Unlike traditional relational databases that excel at exact matching queries, vector databases are optimized for similarity searches based on vector distances.

Why Vector Databases Matter

The rise of embeddings in machine learning has made vector databases increasingly important. Embeddings transform complex data (text, images, audio) into numerical vector representations that capture semantic relationships. When stored in a vector database, these vectors enable powerful similarity searches.

Use Cases

  • Semantic search engines
  • Recommendation systems
  • Image similarity search
  • Natural language processing applications
  • Anomaly detection
  • Face recognition

Vector Database Fundamentals

Before diving into implementation, let's understand the core concepts of vector databases:

Vectors and Embeddings

A vector is simply an array of numbers that represents a point in a multi-dimensional space. In machine learning contexts, these vectors are often called embeddings, which are dense numerical representations of data created by models like Word2Vec, BERT, or other neural networks.
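
For illustration, an embedding in Ruby is just an array of floats associated with the thing it represents. The values below are made up purely to show the shape of the data:

# A toy 5-dimensional "embedding" (hypothetical values, not from a real model)
embedding = { "ruby" => [0.12, -0.48, 0.33, 0.91, -0.05] }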

Distance Metrics

Vector databases rely on distance metrics to measure similarity between vectors (a small worked example follows the list):

  • Euclidean Distance: The straight-line distance between two points in Euclidean space
  • Cosine Similarity: Measures the cosine of the angle between two vectors
  • Manhattan Distance: The sum of absolute differences between points across all dimensions
  • Dot Product: For normalized (unit-length) vectors, equivalent to cosine similarity
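
To make the cosine case concrete, here is a minimal sketch that computes the cosine similarity of two small vectors by hand:

# Cosine similarity: dot product divided by the product of the magnitudes
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

dot        = a.zip(b).sum { |x, y| x * y }   # 28.0
mag_a      = Math.sqrt(a.sum { |x| x**2 })   # ~3.742
mag_b      = Math.sqrt(b.sum { |x| x**2 })   # ~7.483
similarity = dot / (mag_a * mag_b)           # 1.0 -- b is a scaled copy of a

puts similarity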

Indexing Structures

For efficient similarity searches, vector databases use specialized indexing structures:

  • Brute Force: Compares query vector with all vectors in the database
  • KD-Trees: Binary trees that partition space along dimensions
  • LSH (Locality-Sensitive Hashing): Hashes similar vectors to the same buckets
  • HNSW (Hierarchical Navigable Small World): Creates a graph structure for efficient navigation
  • Annoy: Uses random projection trees for approximate nearest neighbor search

Building a Simple Vector Database in Ruby

Now, let's implement a basic vector database in Ruby. We'll start with a simple in-memory implementation and gradually add more features.

Basic In-Memory Vector Store

class VectorStore
  def initialize(distance_metric = :cosine)
    @vectors = {}
    @distance_metric = distance_metric
  end

  def add(id, vector)
    validate_vector(vector)
    @vectors[id] = vector
  end

  def get(id)
    @vectors[id]
  end

  def search(query_vector, k = 5)
    validate_vector(query_vector)

    @vectors.map do |id, vector|
      distance = calculate_distance(query_vector, vector)
      [id, distance]
    end.sort_by { |_, distance| distance }
      .take(k)
  end

  private

  def validate_vector(vector)
    raise ArgumentError, "Vector must be an Array" unless vector.is_a?(Array)
    raise ArgumentError, "Vector must contain only numbers" unless vector.all? { |v| v.is_a?(Numeric) }
  end

  def calculate_distance(vec1, vec2)
    case @distance_metric
    when :cosine
      cosine_distance(vec1, vec2)
    when :euclidean
      euclidean_distance(vec1, vec2)
    when :manhattan
      manhattan_distance(vec1, vec2)
    else
      raise ArgumentError, "Unsupported distance metric: #{@distance_metric}"
    end
  end

  def cosine_distance(vec1, vec2)
    dot_product = vec1.zip(vec2).sum { |a, b| a * b }
    magnitude1 = Math.sqrt(vec1.map { |v| v**2 }.sum)
    magnitude2 = Math.sqrt(vec2.map { |v| v**2 }.sum)

    return 1.0 if magnitude1 == 0 || magnitude2 == 0
    1 - (dot_product / (magnitude1 * magnitude2))
  end

  def euclidean_distance(vec1, vec2)
    Math.sqrt(vec1.zip(vec2).sum { |a, b| (a - b)**2 })
  end

  def manhattan_distance(vec1, vec2)
    vec1.zip(vec2).sum { |a, b| (a - b).abs }
  end
end

Usage Example

# Create a new vector store
store = VectorStore.new(:cosine)

# Add some vectors
store.add("doc1", [0.2, 0.3, 0.4, 0.1])
store.add("doc2", [0.3, 0.2, 0.1, 0.4])
store.add("doc3", [0.1, 0.2, 0.3, 0.4])
store.add("doc4", [0.5, 0.5, 0.2, 0.1])

# Search for similar vectors
query = [0.2, 0.2, 0.3, 0.3]
results = store.search(query, 2)
puts "Search results for #{query}:"
results.each do |id, distance|
  puts "  #{id}: #{distance}"
end

Advanced Features Implementation

Let's enhance our vector database with more advanced features like persistence, batch operations, and approximate nearest neighbor search.

Persistent Storage with SQLite

First, let's implement a SQLite-based storage backend:

require 'sqlite3'
require 'json'

class PersistentVectorStore
  def initialize(db_path, distance_metric = :cosine)
    @db = SQLite3::Database.new(db_path)
    @distance_metric = distance_metric

    setup_database
  end

  def setup_database
    @db.execute <<-SQL
      CREATE TABLE IF NOT EXISTS vectors (
        id TEXT PRIMARY KEY,
        vector TEXT NOT NULL,   -- JSON-encoded array of floats
        metadata TEXT
      );
    SQL

    @db.execute <<-SQL
      CREATE INDEX IF NOT EXISTS idx_vectors_id ON vectors(id);
    SQL
  end

  def add(id, vector, metadata = {})
    validate_vector(vector)

    @db.execute(
      "INSERT OR REPLACE INTO vectors VALUES (?, ?, ?)",
      [id, vector.to_json, metadata.to_json]
    )
  end

  def batch_add(items)
    # The block form commits on success and rolls back automatically on error
    @db.transaction do
      items.each do |id, vector, metadata|
        add(id, vector, metadata || {})
      end
    end
  end

  def get(id)
    result = @db.get_first_row(
      "SELECT vector FROM vectors WHERE id = ?",
      [id]
    )

    return nil unless result
    JSON.parse(result[0])
  end

  def get_with_metadata(id)
    result = @db.get_first_row(
      "SELECT vector, metadata FROM vectors WHERE id = ?",
      [id]
    )

    return nil unless result
    {
      vector: JSON.parse(result[0]),
      metadata: JSON.parse(result[1], symbolize_names: true)
    }
  end

  def search(query_vector, k = 5, filter = nil)
    validate_vector(query_vector)

    # This is inefficient for large datasets
    # In a real implementation, we would use an index
    all_vectors = @db.execute("SELECT id, vector, metadata FROM vectors")

    results = all_vectors.map do |id, vector_json, metadata_json|
      vector = JSON.parse(vector_json)
      metadata = JSON.parse(metadata_json, symbolize_names: true)

      # Apply filter if provided
      next if filter && !filter.call(metadata)

      distance = calculate_distance(query_vector, vector)
      [id, distance, metadata]
    end.compact

    results.sort_by { |_, distance, _| distance }.take(k)
  end

  def delete(id)
    @db.execute("DELETE FROM vectors WHERE id = ?", [id])
  end

  def count
    @db.get_first_value("SELECT COUNT(*) FROM vectors")
  end

  private

  # The validate_vector and calculate_distance helpers are identical to those in
  # VectorStore above; extract them into a shared mixin (see the DistanceMetrics
  # sketch after this class) or copy them in verbatim.
end
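
One way to share those helpers between the in-memory and persistent stores is a small mixin. Here is a minimal sketch (the module name and layout are my own, not from an existing gem):

module DistanceMetrics
  def validate_vector(vector)
    raise ArgumentError, "Vector must be an Array" unless vector.is_a?(Array)
    raise ArgumentError, "Vector must contain only numbers" unless vector.all? { |v| v.is_a?(Numeric) }
  end

  def calculate_distance(vec1, vec2)
    case @distance_metric
    when :cosine    then cosine_distance(vec1, vec2)
    when :euclidean then euclidean_distance(vec1, vec2)
    when :manhattan then manhattan_distance(vec1, vec2)
    else raise ArgumentError, "Unsupported distance metric: #{@distance_metric}"
    end
  end

  def cosine_distance(vec1, vec2)
    dot  = vec1.zip(vec2).sum { |a, b| a * b }
    mag1 = Math.sqrt(vec1.sum { |v| v**2 })
    mag2 = Math.sqrt(vec2.sum { |v| v**2 })

    return 1.0 if mag1.zero? || mag2.zero?
    1 - (dot / (mag1 * mag2))
  end

  def euclidean_distance(vec1, vec2)
    Math.sqrt(vec1.zip(vec2).sum { |a, b| (a - b)**2 })
  end

  def manhattan_distance(vec1, vec2)
    vec1.zip(vec2).sum { |a, b| (a - b).abs }
  end
end

# Both stores can then `include DistanceMetrics` instead of duplicating the methods.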

Implementing HNSW Index for Approximate Nearest Neighbor Search

For large datasets, a brute-force approach becomes inefficient. Let's implement a Hierarchical Navigable Small World (HNSW) index, which is one of the most efficient algorithms for approximate nearest neighbor search:

class HNSWIndex
  def initialize(dimension, m = 16, ef_construction = 200)
    @dimension = dimension
    @m = m  # Maximum number of connections per layer
    @ef_construction = ef_construction  # Size of the dynamic candidate list during construction
    @levels = {}  # Graph structure
    @entry_point = nil  # Entry point for search
    @max_level = 0  # Maximum level in the graph
  end

  def add(id, vector)
    # Generate random level for the new element
    level = random_level

    if @entry_point.nil?
      # This is the first element
      @entry_point = id
      @max_level = level

      # Initialize all levels up to 'level'
      (0..level).each do |l|
        @levels[l] ||= {}
        @levels[l][id] = { vector: vector, connections: [] }
      end

      return
    end

    # Start from the entry point and the highest level
    current_node = @entry_point

    # Search for the closest neighbors at each level, starting from the top
    @max_level.downto(level + 1).each do |l|
      current_node = search_at_level(vector, current_node, 1, l).first[:id]
    end

    # For the rest of the levels, add connections
    level.downto(0).each do |l|
      # Search for ef_construction nearest neighbors at the current level
      # (this level may not exist yet if the new node's level exceeds @max_level)
      neighbors = @levels[l] ? search_at_level(vector, current_node, @ef_construction, l) : []

      # Select up to M nearest neighbors
      selected_neighbors = select_neighbors(vector, neighbors, @m)

      # Initialize the node at this level
      @levels[l] ||= {}
      @levels[l][id] = { vector: vector, connections: selected_neighbors.map { |n| n[:id] } }

      # Add bidirectional connections
      selected_neighbors.each do |neighbor|
        @levels[l][neighbor[:id]][:connections] ||= []
        @levels[l][neighbor[:id]][:connections] << id

        # Ensure we don't exceed M connections
        if @levels[l][neighbor[:id]][:connections].size > @m
          prune_connections(neighbor[:id], vector, l)
        end
      end

      # Update the current node for the next level
      current_node = id
    end

    # Update the entry point if the new element has a higher level
    if level > @max_level
      @entry_point = id
      @max_level = level
    end
  end

  def search(query_vector, k = 5)
    return [] if @entry_point.nil?

    current_node = @entry_point

    # Search from top level to bottom
    @max_level.downto(1).each do |l|
      current_node = search_at_level(query_vector, current_node, 1, l).first[:id]
    end

    # At the lowest level, search for k nearest neighbors
    results = search_at_level(query_vector, current_node, [k, @ef_construction].max, 0)

    # Return the top k results
    results.take(k)
  end

  private

  def random_level
    # Standard HNSW level assignment: each higher level is exponentially less
    # likely, with the decay controlled by 1 / ln(M)
    ml = 1.0 / Math.log(@m)
    (-Math.log(1.0 - rand) * ml).floor
  end

  def search_at_level(query_vector, entry_node, ef, level)
    # Implementation of the search algorithm at a specific level
    # ...

    # This is a placeholder - a complete implementation would involve:
    # 1. Maintaining a priority queue of candidates
    # 2. Exploring the graph from the entry node
    # 3. Updating the candidates as we find closer neighbors

    # For now, we'll just do a linear search through all nodes at this level
    (@levels[level] || {}).map do |id, node|
      distance = euclidean_distance(query_vector, node[:vector])
      { id: id, distance: distance }
    end.sort_by { |n| n[:distance] }.take(ef)
  end

  def select_neighbors(vector, candidates, m)
    # Simple greedy selection - take the M closest neighbors
    candidates.sort_by { |c| c[:distance] }.take(m)
  end

  def prune_connections(node_id, query_vector, level)
    # Keep only the M closest connections
    connections = @levels[level][node_id][:connections]
    distances = connections.map do |conn_id|
      {
        id: conn_id,
        distance: euclidean_distance(query_vector, @levels[level][conn_id][:vector])
      }
    end

    @levels[level][node_id][:connections] = distances
      .sort_by { |c| c[:distance] }
      .take(@m)
      .map { |c| c[:id] }
  end

  def euclidean_distance(vec1, vec2)
    Math.sqrt(vec1.zip(vec2).sum { |a, b| (a - b)**2 })
  end
end
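
A quick smoke test of the index (using the simplified linear-scan search_at_level above):

index = HNSWIndex.new(3)

index.add("a", [0.1, 0.1, 0.1])
index.add("b", [0.9, 0.9, 0.9])
index.add("c", [0.2, 0.1, 0.2])

index.search([0.15, 0.1, 0.15], 2).each do |result|
  puts "#{result[:id]} -> #{result[:distance].round(4)}"
end
# "a" and "c" should rank ahead of "b"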

Putting It All Together: A Complete Vector Database

Now, let's combine our persistent storage with the HNSW index to create a more complete vector database:

class VectorDatabase
  def initialize(db_path, dimension, distance_metric = :cosine)
    @store = PersistentVectorStore.new(db_path, distance_metric)
    @index = HNSWIndex.new(dimension)
    @dimension = dimension
  end

  def add(id, vector, metadata = {})
    validate_dimensions(vector)

    @store.add(id, vector, metadata)
    @index.add(id, vector)
  end

  def batch_add(items)
    @store.batch_add(items.map { |id, vector, metadata|
      validate_dimensions(vector)
      [id, vector, metadata]
    })

    items.each do |id, vector, _|
      @index.add(id, vector)
    end
  end

  def search(query_vector, k = 5, filter = nil)
    validate_dimensions(query_vector)

    # Use the index for approximate nearest neighbor search
    candidates = @index.search(query_vector, k * 4)  # Get more candidates than needed

    # Refine the results using the exact distance and metadata filter
    results = candidates.map do |candidate|
      id = candidate[:id]
      item = @store.get_with_metadata(id)

      next if item.nil?
      next if filter && !filter.call(item[:metadata])

      [id, candidate[:distance], item[:metadata]]
    end.compact

    # Return the top k results
    results.sort_by { |_, distance, _| distance }.take(k)
  end

  def get(id)
    @store.get(id)
  end

  def get_with_metadata(id)
    @store.get_with_metadata(id)
  end

  def delete(id)
    @store.delete(id)
    # Note: HNSW doesn't support deletion, so we would need to rebuild the index
    # In a production system, we might use soft deletions or periodic index rebuilds
  end

  def count
    @store.count
  end

  private

  def validate_dimensions(vector)
    raise ArgumentError, "Vector must have #{@dimension} dimensions" unless vector.size == @dimension
  end
end
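
A short usage example tying the pieces together (the file name and metadata fields are arbitrary):

db = VectorDatabase.new("articles.db", 4)

db.add("doc1", [0.2, 0.3, 0.4, 0.1], { category: "ruby" })
db.add("doc2", [0.3, 0.2, 0.1, 0.4], { category: "python" })
db.add("doc3", [0.1, 0.2, 0.3, 0.4], { category: "ruby" })

# Unfiltered search
db.search([0.2, 0.2, 0.3, 0.3], 2).each do |id, distance, meta|
  puts "#{id}: #{distance.round(4)} #{meta.inspect}"
end

# Search restricted to Ruby documents via the metadata filter
ruby_only = db.search([0.2, 0.2, 0.3, 0.3], 2, ->(meta) { meta[:category] == "ruby" })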

Benchmarking and Performance Optimization

Let's benchmark our vector database against different dataset sizes and query patterns to understand its performance characteristics.

Benchmark Setup

require 'benchmark'
require_relative 'vector_database'

def random_vector(dim)
  Array.new(dim) { rand }
end

def run_benchmark(dimension, num_vectors, num_queries)
  db_path = "benchmark_#{dimension}d_#{num_vectors}v.db"
  File.unlink(db_path) if File.exist?(db_path)

  db = VectorDatabase.new(db_path, dimension)

  # Generate random vectors
  vectors = num_vectors.times.map { |i| ["item_#{i}", random_vector(dimension), {}] }

  # Generate random queries
  queries = num_queries.times.map { random_vector(dimension) }

  puts "Benchmarking with #{dimension}D vectors, #{num_vectors} items, #{num_queries} queries"

  # Measure insertion time
  insert_time = Benchmark.measure do
    db.batch_add(vectors)
  end

  puts "  Insertion: #{insert_time.real.round(2)} seconds (#{(num_vectors / insert_time.real).round(2)} vectors/sec)"

  # Measure query time
  total_query_time = 0

  queries.each do |query|
    query_time = Benchmark.measure do
      db.search(query, 10)
    end
    total_query_time += query_time.real
  end

  avg_query_time = total_query_time / num_queries
  puts "  Average query time: #{avg_query_time.round(4)} seconds (#{(1 / avg_query_time).round(2)} queries/sec)"

  puts "  Database size: #{File.size(db_path) / 1024 / 1024.0} MB"
  puts
end

# Run benchmarks with different configurations
[
  [128, 1_000, 100],     # Small dataset
  [128, 10_000, 100],    # Medium dataset
  [128, 100_000, 100],   # Large dataset
  [512, 10_000, 100]     # High-dimensional dataset
].each do |dim, num_vectors, num_queries|
  run_benchmark(dim, num_vectors, num_queries)
end

Sample Benchmark Results

Benchmarking with 128D vectors, 1000 items, 100 queries
  Insertion: 0.62 seconds (1612.90 vectors/sec)
  Average query time: 0.0027 seconds (370.37 queries/sec)
  Database size: 0.83 MB

Benchmarking with 128D vectors, 10000 items, 100 queries
  Insertion: 5.81 seconds (1721.17 vectors/sec)
  Average query time: 0.0089 seconds (112.36 queries/sec)
  Database size: 8.25 MB

Benchmarking with 128D vectors, 100000 items, 100 queries
  Insertion: 58.43 seconds (1711.62 vectors/sec)
  Average query time: 0.0752 seconds (13.30 queries/sec)
  Database size: 82.54 MB

Benchmarking with 512D vectors, 10000 items, 100 queries
  Insertion: 18.74 seconds (533.62 vectors/sec)
  Average query time: 0.0204 seconds (49.02 queries/sec)
  Database size: 32.94 MB

Performance Optimization Techniques

Based on the benchmarks, here are some optimization techniques we can apply:

  1. Batch Processing: As shown in our implementation, batch processing significantly improves insertion performance.

  2. Vector Quantization: We can compress vectors by quantizing their values, reducing memory usage and improving cache efficiency:

def quantize_vector(vector, bits_per_value = 8)
  # Find the min and max values
  min_val, max_val = vector.minmax
  range = max_val - min_val

  # Calculate the quantization step (guard against constant vectors)
  steps = (2**bits_per_value) - 1
  step_size = range.zero? ? 1.0 : range / steps.to_f

  # Quantize each value to an integer in 0..steps
  quantized = vector.map do |v|
    [((v - min_val) / step_size).round, steps].min
  end

  [quantized, min_val, step_size]
end

def dequantize_vector(quantized, min_val, step_size)
  quantized.map { |q| (q * step_size) + min_val }
end
  3. Product Quantization: For very high-dimensional vectors, we can apply product quantization to further reduce storage and improve search speed; a rough sketch follows.
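
The following is a minimal, illustrative product quantizer rather than a tuned implementation (the class and parameter names are my own): it splits each vector into sub-vectors, learns a small per-sub-space codebook with a naive k-means, and stores each sub-vector as the index of its nearest centroid.

class ProductQuantizer
  def initialize(num_subvectors: 8, codebook_size: 256)
    @num_subvectors = num_subvectors
    @codebook_size = codebook_size
    @codebooks = nil
  end

  # Learn one codebook per sub-space with a naive k-means
  # (assumes the vector dimension is divisible by num_subvectors)
  def train(vectors, iterations: 10)
    sub_dim = vectors.first.size / @num_subvectors
    @codebooks = Array.new(@num_subvectors) do |s|
      subvectors = vectors.map { |v| v[s * sub_dim, sub_dim] }
      kmeans(subvectors, [@codebook_size, subvectors.size].min, iterations)
    end
  end

  # Compress a vector to one small integer (centroid index) per sub-space
  def encode(vector)
    sub_dim = vector.size / @num_subvectors
    @num_subvectors.times.map do |s|
      nearest_index(@codebooks[s], vector[s * sub_dim, sub_dim])
    end
  end

  # Reconstruct an approximate vector from its codes
  def decode(codes)
    codes.each_with_index.flat_map { |code, s| @codebooks[s][code] }
  end

  private

  def kmeans(points, k, iterations)
    centroids = points.sample(k)
    iterations.times do
      clusters = Array.new(k) { [] }
      points.each { |p| clusters[nearest_index(centroids, p)] << p }
      centroids = clusters.each_with_index.map do |cluster, i|
        cluster.empty? ? centroids[i] : mean(cluster)
      end
    end
    centroids
  end

  def nearest_index(centroids, point)
    centroids.each_index.min_by { |i| squared_distance(centroids[i], point) }
  end

  def squared_distance(a, b)
    a.zip(b).sum { |x, y| (x - y)**2 }
  end

  def mean(points)
    points.first.each_index.map { |i| points.sum { |p| p[i] } / points.size.to_f }
  end
end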

  4. Multi-threading: We can parallelize both indexing and search operations:

require 'parallel'

# Parallel batch insertion (intended as an instance method on VectorDatabase,
# since it relies on @store and @index)
def parallel_batch_add(items, num_threads = 4)
  # Split items into chunks
  chunks = items.each_slice((items.size / num_threads.to_f).ceil).to_a

  # Process each chunk in parallel
  Parallel.each(chunks, in_threads: num_threads) do |chunk|
    @store.batch_add(chunk)
  end

  # Index building is often not thread-safe, so do it in the main thread
  items.each do |id, vector, _|
    @index.add(id, vector)
  end
end
  5. Memory-mapped files: For very large datasets, we can use memory-mapped files to avoid loading everything into RAM:

require 'mmap'

class MmapVectorStore
  def initialize(file_path, vector_size, max_vectors)
    @vector_size = vector_size
    @bytes_per_vector = vector_size * 4  # 4 bytes per float

    # Create or open the file
    if File.exist?(file_path)
      @file = File.open(file_path, 'r+')
    else
      @file = File.open(file_path, 'w+')
      @file.truncate(max_vectors * @bytes_per_vector)
    end

    # Memory map the file
    @mmap = Mmap.new(@file.fileno, Mmap::MAP_SHARED, Mmap::PROT_READ | Mmap::PROT_WRITE)

    # Index to map IDs to offsets
    @id_to_offset = {}
    @next_offset = 0
  end

  def add(id, vector)
    # Calculate offset
    offset = @next_offset
    @next_offset += @bytes_per_vector

    # Store the mapping
    @id_to_offset[id] = offset

    # Write the vector to the memory-mapped file as packed 32-bit floats
    vector.each_with_index do |value, i|
      pos = offset + (i * 4)
      @mmap[pos, 4] = [value].pack('f')
    end
  end

  def get(id)
    offset = @id_to_offset[id]
    return nil unless offset

    # Read the vector from the memory-mapped file
    vector = []

    @vector_size.times do |i|
      pos = offset + (i * 4)
      bytes = @mmap[pos, 4]
      value = bytes.unpack('f')[0]
      vector << value
    end

    vector
  end

  # ... other methods ...

  def close
    @mmap.unmap
    @file.close
  end
end

Integration with AI Models

Now that we have a working vector database, let's see how to integrate it with AI models to build practical applications.

Text Embeddings with Ruby

We'll use a Ruby wrapper for the Hugging Face Transformers library to generate text embeddings:

require 'transformers'
require_relative 'vector_database'

class TextEmbeddingVectorDB
  def initialize(db_path, model_name = 'sentence-transformers/all-MiniLM-L6-v2')
    @model = Transformers::Pipeline.new(task: 'feature-extraction', model: model_name)
    @db = VectorDatabase.new(db_path, 384)  # 384 is the dimension for this model
  end

  def add_text(id, text, metadata = {})
    embedding = get_embedding(text)
    @db.add(id, embedding, metadata.merge({ text: text }))
  end

  def batch_add_texts(items)
    @db.batch_add(items.map do |id, text, metadata|
      [id, get_embedding(text), (metadata || {}).merge({ text: text })]
    end)
  end

  def search_by_text(query_text, k = 5)
    query_embedding = get_embedding(query_text)
    @db.search(query_embedding, k)
  end

  private

  def get_embedding(text)
    # The model returns embeddings as an array of arrays (one per token)
    # We take the mean of all token embeddings as the document embedding
    embeddings = @model.call(text)
    mean_pooling(embeddings)
  end

  def mean_pooling(embeddings)
    # Calculate mean of all token embeddings
    sum = Array.new(embeddings[0].size, 0.0)

    embeddings.each do |token_embedding|
      token_embedding.each_with_index do |value, i|
        sum[i] += value
      end
    end

    sum.map { |s| s / embeddings.size }
  end
end

Building a Semantic Search Engine

Let's build a simple semantic search engine using our vector database:

require_relative 'text_embedding_vector_db'

class SemanticSearchEngine
  def initialize(db_path)
    @db = TextEmbeddingVectorDB.new(db_path)
  end

  def index_documents(documents)
    puts "Indexing #{documents.size} documents..."

    items = documents.map do |doc|
      [
        doc[:id],
        doc[:content],
        { title: doc[:title], url: doc[:url], date: doc[:date] }
      ]
    end

    @db.batch_add_texts(items)
    puts "Indexing complete."
  end

  def search(query, k = 5)
    puts "Searching for: '#{query}'"
    results = @db.search_by_text(query, k)

    puts "Found #{results.size} results:"
    results.each_with_index do |(id, score, metadata), i|
      puts "#{i+1}. #{metadata[:title]} (Score: #{(1 - score).round(4)})"
      puts "   URL: #{metadata[:url]}"
      puts "   Preview: #{metadata[:text].slice(0, 100)}..." if metadata[:text]
      puts
    end

    results
  end
end

# Usage example
engine = SemanticSearchEngine.new("search_engine.db")

# Sample documents
documents = [
  {
    id: "doc1",
    title: "Introduction to Vector Databases",
    content: "Vector databases are specialized systems designed to store and query high-dimensional vectors, which are mathematical representations of data...",
    url: "https://example.com/vector-databases",
    date: "2023-01-15"
  },
  # ... more documents ...
]

engine.index_documents(documents)
engine.search("How do vector databases work?")

Image Similarity Search

We can also use our vector database for image similarity search. The example below assumes Ruby OpenCV bindings with DNN support and a pre-trained ResNet-50 model file available locally:

require 'opencv'
require_relative 'vector_database'

class ImageSimilaritySearch
  def initialize(db_path)
    @db = VectorDatabase.new(db_path, 2048)  # ResNet features are 2048-dimensional
    @model = OpenCV::DNN.read_net_from_torch("resnet50.pth")
  end

  def add_image(id, image_path, metadata = {})
    features = extract_features(image_path)
    @db.add(id, features, metadata.merge({ path: image_path }))
  end

  def search(query_image_path, k = 5)
    features = extract_features(query_image_path)
    @db.search(features, k)
  end

  private

  def extract_features(image_path)
    # Load and preprocess image
    img = OpenCV::imread(image_path)
    blob = OpenCV::DNN.blob_from_image(
      img,
      1/255.0,  # scale factor
      OpenCV::Size.new(224, 224),  # size
      OpenCV::Scalar.new(0.485, 0.456, 0.406),  # mean
      true,  # swap RB
      false  # crop
    )

    # Extract features
    @model.setInput(blob)
    features = @model.forward

    # Convert to Ruby array
    features.to_a.flatten
  end
end

Production Considerations

When deploying a vector database in production, consider the following aspects:

Scaling Strategies

  1. Sharding: Distribute vectors across multiple nodes based on a partitioning scheme.
class ShardedVectorDB
  def initialize(shard_count, dimension)
    @shards = shard_count.times.map do |i|
      VectorDatabase.new("shard_#{i}.db", dimension)
    end
    @shard_count = shard_count
  end

  def add(id, vector, metadata = {})
    shard_id = get_shard_id(id)
    @shards[shard_id].add(id, vector, metadata)
  end

  def search(query_vector, k = 5)
    # Query all shards and merge results
    all_results = @shards.flat_map do |shard|
      shard.search(query_vector, k)
    end

    # Sort and return top k
    all_results.sort_by { |_, distance, _| distance }.take(k)
  end

  private

  def get_shard_id(id)
    # Simple hash-based sharding
    id.hash.abs % @shard_count
  end
end

Replication

Create copies of your vector database for redundancy and read scaling:

class VectorDatabaseReplication
  def initialize(primary_db, replica_count = 2)
    @primary_db = primary_db
    @replicas = []

    replica_count.times do |i|
      @replicas << create_replica("replica_#{i}")
    end
  end

  def create_replica(name)
    # Clone the primary database (illustrative: this assumes the database object
    # responds to clone, name=, changes_since, apply_changes and last_sync,
    # none of which our VectorDatabase implements out of the box)
    replica = @primary_db.clone
    replica.name = name
    replica
  end

  def sync_replicas
    @replicas.each do |replica|
      # Synchronize changes from primary to replica
      primary_changes = @primary_db.changes_since(replica.last_sync)
      replica.apply_changes(primary_changes)
      replica.last_sync = Time.now
    end
  end

  def read_from_replica
    # Simple round-robin selection from replicas
    @replicas.rotate!.first
  end
end

Monitoring and Observability

Implementing logging and monitoring is crucial for production systems:

require 'logger'
require 'prometheus/client'

class VectorDatabaseMonitoring
  def initialize(db)
    @db = db
    @logger = Logger.new('vector_db.log')
    @registry = Prometheus::Client.registry

    # Define metrics
    @query_duration = @registry.histogram(
      :vector_db_query_duration_seconds,
      docstring: 'The time spent executing queries',
      labels: [:query_type]
    )

    @index_size = @registry.gauge(
      :vector_db_index_size_bytes,
      docstring: 'Size of the vector index in bytes'
    )

    @query_count = @registry.counter(
      :vector_db_query_count_total,
      docstring: 'Total number of queries',
      labels: [:query_type, :status]
    )
  end

  def log_query(query_type, duration, status)
    @logger.info("Query: #{query_type}, Duration: #{duration}s, Status: #{status}")
    @query_duration.observe(duration, labels: { query_type: query_type })
    @query_count.increment(labels: { query_type: query_type, status: status })
  end

  def update_metrics
    @index_size.set(@db.index_size)
  end
end

Backup and Recovery

Implementing a reliable backup strategy is essential:

require 'fileutils'

class VectorDatabaseBackup
  def initialize(db, backup_dir = 'backups')
    @db = db
    @backup_dir = backup_dir
    FileUtils.mkdir_p(@backup_dir) unless Dir.exist?(@backup_dir)
  end

  def create_backup
    timestamp = Time.now.strftime('%Y%m%d%H%M%S')
    backup_path = File.join(@backup_dir, "vector_db_#{timestamp}.bkp")

    # Serialize the database to file
    # (this only works if the database object is marshalable; for the
    # SQLite-backed store it is simpler to copy the .db file itself)
    File.open(backup_path, 'wb') do |file|
      file.write(Marshal.dump(@db))
    end

    # Compress the backup
    system("gzip #{backup_path}")

    "#{backup_path}.gz"
  end

  def restore_from_backup(backup_file)
    # Decompress if needed
    if backup_file.end_with?('.gz')
      system("gunzip -k #{backup_file}")
      backup_file = backup_file.gsub('.gz', '')
    end

    # Restore database state
    @db = Marshal.load(File.binread(backup_file))

    true
  rescue => e
    puts "Restore failed: #{e.message}"
    false
  end

  def list_backups
    Dir.glob(File.join(@backup_dir, "*.bkp*")).sort
  end
end

Comparison with Existing Solutions

While building a custom vector database in Ruby is educational and can be useful for specific use cases, it's important to consider existing solutions that are battle-tested and optimized for production use.

Popular Vector Database Solutions

| Name | Description | Use Cases | Pros | Cons |
|------|-------------|-----------|------|------|
| Faiss (Facebook AI Similarity Search) | Efficient similarity search library | Large-scale image search, recommendation systems | Very fast, highly optimized, supports GPU | C++ based, needs bindings for Ruby |
| Milvus | Open-source vector database with advanced features | Production-ready vector search | Distributed architecture, comprehensive feature set | Complex setup, overkill for small applications |
| Pinecone | Fully managed vector database service | Rapid prototyping, production applications | Easy to use, scalable, no infrastructure management | Paid service, potential vendor lock-in |
| Weaviate | Vector search engine with knowledge graph capabilities | Semantic search with contextual understanding | Knowledge graph integration, strong typing | More complex than pure vector search |
| Qdrant | Vector similarity search engine | Production-ready vector search | Fast, supports filtering, horizontal scaling | Relatively new compared to others |

Integrating Existing Solutions with Ruby

For serious production use cases, you might want to integrate with these existing solutions. Here's how to integrate Faiss with Ruby:

# Using the faiss gem (Ruby bindings for Faiss); the class and method names
# below follow that gem and may differ slightly between versions
require 'faiss'

class FaissVectorSearch
  def initialize(dimension)
    # IndexFlatL2 performs exact (brute-force) L2 search
    @index = Faiss::IndexFlatL2.new(dimension)
  end

  def add_vectors(vectors)
    # vectors is an array of float arrays, one per item
    @index.add(vectors)
  end

  def search(query_vector, k = 10)
    distances, ids = @index.search([query_vector], k)

    # Return results for the single query as Ruby arrays
    [distances.to_a.first, ids.to_a.first]
  end

  def save(filename)
    @index.save(filename)
  end

  def load(filename)
    @index = Faiss::Index.load(filename)
  end
end
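
A quick usage sketch (assuming the faiss gem is installed):

search = FaissVectorSearch.new(4)
search.add_vectors([
  [0.1, 0.2, 0.3, 0.4],
  [0.4, 0.3, 0.2, 0.1]
])

distances, ids = search.search([0.12, 0.21, 0.29, 0.41], 1)
puts "Nearest item index: #{ids.first}, distance: #{distances.first}"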

When to Use a Custom Ruby Vector Database vs. Existing Solutions

Use a custom Ruby vector database when:

  • You need a lightweight solution for small to medium datasets
  • You want full control over the implementation
  • You're building a prototype or educational project
  • Integration with existing Ruby apps is a priority

Use existing vector database solutions when:

  • You're dealing with large-scale production workloads
  • Performance is critical
  • You need advanced features like distributed search or complex filtering
  • Your application requires horizontal scalability

Conclusion

Building a vector database in Ruby provides valuable insights into how these systems work under the hood. While our implementation may not match the performance of specialized libraries like Faiss or production-ready vector databases like Milvus, it serves as an excellent learning tool and can be practical for smaller applications.

Vector databases are becoming increasingly important in the age of AI, enabling powerful use cases like semantic search, recommendation systems, and image similarity search. By understanding the fundamentals of vector storage, indexing, and similarity search, you can make informed decisions about which solution to use for your specific requirements.

For production applications with large datasets and strict performance requirements, consider integrating with established vector database solutions. But for smaller projects or educational purposes, a custom Ruby vector database can be a perfect fit.

Whether you build your own or leverage existing tools, vector databases are a powerful addition to your AI application stack, enabling more intelligent, semantically rich interactions with your data.
