Implementation library: https://github.com/prestonvasquez/vectormock
Problem. Integration testing vector databases with live embedding models can be tricky since results may not always be deterministic. For example, three consecutive similarity searches for "South American Authors" might return:
{text: "Gabriel García Márquez", score: 0.80} # Search 1
{text: "Gabriel García Márquez", score: 0.82} # Search 2
{text: "Gabriel García Márquez", score: 0.78} # Search 3
Suppose we expect the above similarity search to result in a score of 0.80. A common instinct for resolving the false negatives (i.e. the 0.82 and 0.78 cases) would be to define an acceptable range for similarity scores:
import unittest

def is_score_in_range(doc, lower, upper):
    return lower <= doc["score"] <= upper

class TestSimilarityScore(unittest.TestCase):
    def test_doc_scores(self):
        docs = [{"score": 0.80}, {"score": 0.82}, {"score": 0.78}]
        for doc in docs:
            # If we expect 0.80, then the delta is 0.02.
            self.assertTrue(is_score_in_range(doc, 0.78, 0.82))
Of course, we are met with a difficult question: how do you set the delta? Large deltas increase false positives; small deltas increase false negatives:
# Wouldn't pass with delta=0.02
{text: "Gabriel García Márquez", score: 0.779}
# Would pass with delta=0.40
{text: "Gabriel García Márquez", score: 0.41}
Additionally, embedding models may have rate limits, quotas, or per-usage fees, all of which add to the difficulty of maintaining robust tests.
Goal. This article shows how to overcome these obstacles by mocking an embedding model for MongoDB Atlas, making integration tests more predictable and reliable. For example, say we want the query "South American Authors" to always return the following three documents (without variation) for a vector database integration test:
{text: "Gabriel García Márquez", score: 0.80}
{text: "Gabriela Mistral", score: 0.94}
{text: "Miguel de Cervantes", score: 0.07}
This guide is specifically focused on similarity scores derived through the dot product.
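To make the goal concrete, here is roughly the kind of assertion we want to be able to write once the embedder is mocked. The vector_store.similarity_search_with_score call and doc.page_content attribute below stand in for whatever search API your test actually uses (the names follow LangChain conventions but are illustrative here, not prescribed by this guide):

def test_south_american_authors(vector_store):
    # With a mocked embedder, scores are exact and repeatable.
    expected = {
        "Gabriel García Márquez": 0.80,
        "Gabriela Mistral": 0.94,
        "Miguel de Cervantes": 0.07,
    }
    results = vector_store.similarity_search_with_score("South American Authors", k=3)
    for doc, score in results:
        assert abs(score - expected[doc.page_content]) < 1e-6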
To simplify things, we'll focus on LangChain's definition of an embedding model, which requires the implementation of two key functions: embed_query and embed_documents.
Deriving embed_query. The goal of an embedding model (embedder) is to transform queries into $n$-dimensional vectors (embeddings) such that the relationships between different inputs are preserved in the embedding space.

So let $T$ be a mapping from a query vector $q$ to a set of similar vectors defined as the column space of $D$, up to the similarity score algorithm (e.g. dot product).

We want to define $T$ such that the following holds for all $v \in T(q)$:

$$\mathrm{score}(q, v) = S_v \quad (1)$$
where $S_v$ is derived from normalizing the similarity score. For example, Atlas Vector Search uses the following algorithm to normalize the score when using dot product:

$$\mathrm{score}(q, v) = \frac{1 + q \cdot v}{2}$$

Therefore, $q \cdot v = 2S_v - 1$.
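In code, converting between an Atlas dot-product score and the raw dot product is a one-liner in each direction (small helpers written for this guide, not part of any library):

def score_from_dot_product(dp):
    # Atlas Vector Search normalization for the dotProduct similarity.
    return (1 + dp) / 2

def dot_product_from_score(score):
    # The dot product we must engineer to obtain a desired score.
    return 2 * score - 1

assert abs(dot_product_from_score(0.80) - 0.60) < 1e-12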
Using our example, let $q$ be the query vector for "South American Authors". For simplicity, embeddable documents are just a tuple $(t, S)$ where $t$ is the textual representation of a vector in $D$ and $S$ is the desired score as it relates to the query vector. If we call $T$ the mock transformer, then let $v_1, v_2, v_3$ be the embeddings for Márquez, Mistral, and Cervantes respectively. Without loss of generality:

$$T(q) = \{v_1, v_2, v_3\}$$

And to complete the mock embedding, we want:

$$q \cdot v_1 = 2(0.80) - 1 = 0.60$$
$$q \cdot v_2 = 2(0.94) - 1 = 0.88$$
$$q \cdot v_3 = 2(0.07) - 1 = -0.86$$
Looking at libraries like LangChain, $T$ is analogous to similarity_search, and the mapping that produces the vectors in $D$ is embed_documents. $D$ is a subset of the vector store (e.g. a MongoDB collection), and if it was embedded by text-embedding-3-small, for example, $n$ would be 1536. Additionally, since $q$ is fixed, embed_query is simply an accessor:
def embed_query(emb, text):
    # The query vector is fixed, so just return the one stored on the mock embedder.
    return emb["query_vector"]
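Here emb is the mock embedder's state. A minimal sketch of that state for our example (the field names are chosen for this guide, not taken from any library) might look like the following, with the query vector generated once using Algorithm A1 below:

emb = {
    # Fixed embedding returned for every call to embed_query.
    "query_vector": new_normalized_vector(3),
    # Desired similarity score for each embeddable document.
    "docs": {
        "Gabriel García Márquez": {"score": 0.80},
        "Gabriela Mistral": {"score": 0.94},
        "Miguel de Cervantes": {"score": 0.07},
    },
    # Cache of document text -> generated embedding.
    "doc_vectors": {},
}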
The next step is to define an algorithm for deriving the document vectors that make up $T(q)$. But before we get into that, enjoy this xkcd: https://xkcd.com/1838/
Deriving embed_documents. Generally, embed_documents maps a document $t_i$ to a vector $v_i$ such that $v_i \in T(q)$:

$$E(t_i) = v_i, \quad v_i \in T(q) \subset \mathbb{R}^n \quad (2)$$

In other words, $E$ is used to generate the vectors that compose $T(q)$ from (eq. 1). By definition (eq. 2), since $|T(q)| \le n$ and the dimension of each $v_i$ is also $n$, the vectors in $T(q)$ can be chosen to be linearly independent. Therefore defining $E$ can be simplified into two conditions that must both hold:
- All vectors in the set $\{q, v_1, \dots, v_k\}$ are linearly independent (Algorithm A2).
- $q \cdot v_i = 2S_i - 1$, where $S_i$ is the desired score for document $i$ (Algorithm A3).
Without loss of generality, to satisfy condition 1 we need at least one index $j$ such that the following does not hold:

$$v_{i,j} = c\, q_j \quad \text{for a single scalar } c$$

This ensures that $v_i$ is not a scalar multiple of $q$. To guarantee condition 2 is met, we update the last element $v_{i,n}$ so that

$$\sum_{j=1}^{n-1} q_j v_{i,j} + q_n v_{i,n} = 2S_i - 1 \quad (3)$$

which gives

$$v_{i,n} = \frac{2S_i - 1 - \sum_{j=1}^{n-1} q_j v_{i,j}}{q_n} \quad (4)$$
One very important note: this method will not guarantee that the final element falls in the range $[-1, 1]$. This is not a requirement in MongoDB Atlas, which is why we don't worry about normalizing it in this guide.
Finally, it follows that the algorithm for mocking embed_documents becomes this:
def embed_documents(emb, texts):
    vectors = []
    for text in texts:
        # If the text doesn't exist in the document set, return a zero vector.
        if text not in emb["docs"]:
            vectors.append([0.0] * len(emb["query_vector"]))
            continue

        # If the vector already exists, return the existing one.
        if text in emb["doc_vectors"]:
            vectors.append(emb["doc_vectors"][text])
            continue

        # Otherwise, generate a new vector orthogonal to the query vector and
        # to every previously generated document vector (Algorithm A2).
        basis = [emb["query_vector"]] + list(emb["doc_vectors"].values())
        new_vector_basis = new_orthogonal_vector(len(emb["query_vector"]), *basis)

        # Update the last element of the vector so that q . v = 2S - 1 (Algorithm A3).
        doc = emb["docs"][text]
        new_vector = new_score_vector(doc["score"], emb["query_vector"], new_vector_basis)

        # Store and return the new vector.
        vectors.append(new_vector)
        emb["doc_vectors"][text] = new_vector

    return vectors
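As a quick sanity check (using the emb state sketched earlier, dot_product from Algorithm A2, and the Atlas normalization), the generated vectors reproduce the requested scores up to floating-point error:

texts = ["Gabriel García Márquez", "Gabriela Mistral", "Miguel de Cervantes"]
vectors = embed_documents(emb, texts)

for text, v in zip(texts, vectors):
    score = (1 + dot_product(emb["query_vector"], v)) / 2  # Atlas dotProduct normalization
    assert abs(score - emb["docs"][text]["score"]) < 1e-9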
Example. Let's generate $v_1$ (the Márquez vector) for our "South American Authors" example. For simplicity, say $n = 3$. First we generate the query vector using Algorithm A1; since A1 is random, suppose for illustration it returns

$$q = (0.5, -0.5, 0.25)$$

Using new_orthogonal_vector from Algorithm A2, create a vector orthogonal to $q$ for Márquez, say

$$v_1 = (0.6, 0.4, -0.4)$$

From eq. 3 it follows that we need to update $v_{1,3}$ such that $q \cdot v_1 = 2(0.80) - 1 = 0.60$. Using eq. 4 we get that

$$v_{1,3} = \frac{2S_1 - 1 - (q_1 v_{1,1} + q_2 v_{1,2})}{q_3}$$

Substituting the known values:

$$v_{1,3} = \frac{0.60 - (0.5 \cdot 0.6 + (-0.5) \cdot 0.4)}{0.25}$$

Solving for $v_{1,3}$:

$$v_{1,3} = \frac{0.60 - 0.10}{0.25} = 2.0$$

Thus, the updated vector is

$$v_1 = (0.6, 0.4, 2.0)$$

which gives $q \cdot v_1 = 0.30 - 0.20 + 0.50 = 0.60$, and therefore a normalized score of 0.80. (Note that $v_{1,3}$ landed outside $[-1, 1]$, as discussed above.)
See here for a working example using LangChainGo. Note, this will require a MongoDB Atlas Cluster with a vector search index. Check out the tutorial!
Algorithms. Here is a list of algorithms used to mock an embedding model.
A1. Create a vector of normalized 32-bit floats
import random

def new_normalized_float32():
    # Generate a random integer in the range [0, 2^24), the largest power of
    # two whose integers are all exactly representable as 32-bit floats, and
    # normalize it to the range [-1, 1).
    max_value = 1 << 24
    n = random.randint(0, max_value - 1)
    return 2.0 * (float(n) / max_value) - 1.0

def new_normalized_vector(n):
    # Build an n-dimensional vector of normalized floats.
    return [new_normalized_float32() for _ in range(n)]
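For example, a query vector matching the default dimensionality of text-embedding-3-small would be generated like this; every component lands in [-1, 1):

query_vector = new_normalized_vector(1536)
assert len(query_vector) == 1536
assert all(-1.0 <= x < 1.0 for x in query_vector)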
A2. Create linearly independent n-dimensional normalized vectors
# Compute the dot product between two lists (vectors) of floats.
def dot_product(v1, v2):
    total = 0.0
    for a, b in zip(v1, v2):
        total += a * b
    return total

# Use Gram-Schmidt to return a vector orthogonal to the basis,
# assuming the vectors in the basis are linearly independent.
def new_orthogonal_vector(dim, *basis):
    candidate = new_normalized_vector(dim)
    for b in basis:
        # Subtract the projection of the candidate onto each basis vector.
        dp = dot_product(candidate, b)
        basis_norm = dot_product(b, b)
        for i in range(len(candidate)):
            candidate[i] -= (dp / basis_norm) * b[i]
    return candidate

# Make n linearly independent vectors of size dim.
def new_linearly_independent_vectors(n, dim):
    vectors = []
    for _ in range(n):
        v = new_orthogonal_vector(dim, *vectors)
        vectors.append(v)
    return vectors
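Because each new vector is orthogonalized against the previously generated ones, which are themselves mutually orthogonal, the resulting set is pairwise orthogonal and therefore linearly independent. A quick check:

vectors = new_linearly_independent_vectors(3, 5)
for i in range(len(vectors)):
    for j in range(i):
        assert abs(dot_product(vectors[i], vectors[j])) < 1e-9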
A3. Update a vector so that its dot product with a second, linearly independent vector equals a desired value.
# Two vectors are linearly dependent iff one is a scalar multiple of the
# other, i.e. the Cauchy-Schwarz inequality holds with equality.
def linearly_independent(v1, v2):
    dp = dot_product(v1, v2)
    return abs(dp * dp - dot_product(v1, v1) * dot_product(v2, v2)) > 1e-12

# Update the basis vector such that qvector . basis = 2S - 1.
def new_score_vector(S, qvector, basis):
    total = 0.0
    # Accumulate the dot product over the first dim-1 elements.
    for i in range(len(qvector) - 1):
        total += qvector[i] * basis[i]

    # Solve for the last element so that qvector . basis = 2S - 1.
    basis[-1] = (2 * S - 1 - total) / qvector[-1]

    # If the result happens to be a scalar multiple of qvector,
    # regenerate the first dim-1 elements and try again.
    if not linearly_independent(qvector, basis):
        basis[:-1] = new_normalized_vector(len(basis) - 1)
        return new_score_vector(S, qvector, basis)

    return basis
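Putting A1-A3 together for a single document: generate a query vector, build a vector orthogonal to it, then adjust the last element so the dot product yields a desired score of 0.80:

q = new_normalized_vector(4)
basis = new_orthogonal_vector(4, q)
v = new_score_vector(0.80, q, basis)

# q . v == 2S - 1 = 0.60, so the Atlas-normalized score is (1 + 0.60) / 2 = 0.80.
assert abs(dot_product(q, v) - 0.60) < 1e-9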