Preston Vasquez

Mocking an LLM Embedder Targeting MongoDB Atlas

Implementation library: https://github.com/prestonvasquez/vectormock

Problem. Integration testing vector databases with live embedding models can be tricky since results may not always be deterministic. For example, three consecutive similarity searches for "South American Authors" might return:

{text: "Gabriel García Márquez", score: 0.80} # Search 1
{text: "Gabriel García Márquez", score: 0.82} # Search 2
{text: "Gabriel García Márquez", score: 0.78} # Search 3

Suppose we expect the above similarity search to result in a score of 0.8. A common instinct to resolve false negatives (i.e. the 0.82 and 0.78 cases) would be to define an acceptable range for similarity scores:

import unittest

def is_score_in_range(doc, lower, upper):
    return lower <= doc["score"] <= upper

class TestSimilarityScore(unittest.TestCase):
    def test_doc_scores(self):
        docs = [{"score": 0.80}, {"score": 0.82}, {"score": 0.78}]
        for doc in docs:
            # If we expect 0.80, then the delta is 0.02.
            self.assertTrue(is_score_in_range(doc, 0.78, 0.82))

Of course, we are met with a difficult question: how do you set the delta? Large deltas increase false positives; small deltas increase false negatives:

# Wouldn't pass with delta=0.02
{text: "Gabriel García Márquez", score: 0.779}

# Would pass with delta=0.40
{text: "Gabriel García Márquez", score: 0.41}
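In unittest terms this is just assertAlmostEqual with a delta argument, and the trade-off shows up no matter which delta you pick (a minimal sketch using the scores above):

import unittest

class TestDeltaTradeoff(unittest.TestCase):
    def test_large_delta_accepts_poor_matches(self):
        # delta=0.40 lets the 0.41 result pass even though it is a poor match.
        self.assertAlmostEqual(0.41, 0.80, delta=0.40)

    def test_small_delta_rejects_near_matches(self):
        # delta=0.02 rejects 0.779, a perfectly reasonable result.
        with self.assertRaises(AssertionError):
            self.assertAlmostEqual(0.779, 0.80, delta=0.02)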

Additionally, embedding models may have rate limits, quotas, or per-usage fees, all of which add to the difficulty of maintaining robust tests.

Goal. This article shows how to overcome these obstacles by mocking an embedding model for MongoDB Atlas, making integration tests more predictable and reliable. For example, say we want the query "South American Authors" to always return the following three documents (without variation) for a vector database integration test:

{text: "Gabriel García Márquez", score: 0.80}
{text: "Gabriela Mistral", score: 0.94}       
{text: "Miguel de Cervantes", score: 0.07}   
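One way to express that target as plain test data; the dict layout here is only a convention used by the sketches later in this article:

query = "South American Authors"

expected_docs = {
    "Gabriel García Márquez": {"score": 0.80},
    "Gabriela Mistral": {"score": 0.94},
    "Miguel de Cervantes": {"score": 0.07},
}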

This guide is specifically focused on similarity scores derived through the dot product.

To simplify things, we'll focus on LangChain's definition of an embedding model, which requires the implementation of two key functions: embed_query and embed_documents.
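As a rough sketch of where we're headed, the mock could be a small class built on langchain_core's Embeddings base class; the constructor arguments and attribute names below are illustrative, not part of LangChain, and the two derivations that follow fill in the method bodies:

from langchain_core.embeddings import Embeddings

class MockEmbeddings(Embeddings):
    """Deterministic embedder: the query vector and the desired per-document
    scores are fixed at construction time."""

    def __init__(self, query_vector, docs):
        self.query_vector = query_vector  # fixed embedding of the test query
        self.docs = docs                  # {text: {"score": desired_score}}
        self.doc_vectors = {}             # cache of generated document vectors

    def embed_query(self, text):
        ...  # derived in "Deriving embed_query"

    def embed_documents(self, texts):
        ...  # derived in "Deriving embed_documents"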

Deriving embed_query. The goal of an embedding model (embedder) is to transform $N$ queries to $N$ $n$-dimensional vectors (embeddings) such that the relationships between different inputs are preserved in the embedding space.


So let $q$ be a mapping from a query vector $\mathbf{v}^{*}$ to a set of similar vectors, defined as the column space of $\mathbf{E} \in \mathbb{R}^{n\times N}$, up to the similarity score algorithm (e.g. dot product):

\begin{equation} q : \mathbf{v}^* \rightarrow \text{C}(\mathbf{E}) \end{equation}

We want to define $\mathbf{v}^*$ such that the following holds for all $\mathbf{v} \in \text{C}(\mathbf{E})$:

\begin{equation} \mathbf{v}^{*} \cdot \mathbf{v} = \hat{S} \end{equation}

where $\hat{S}$ is derived from the desired similarity score by inverting the score normalization. For example, Atlas Vector Search uses the following formula to normalize the dot-product score:

\begin{equation*} S = \frac{1}{2}\left(1 + \mathbf{v}^{*} \cdot \mathbf{v}\right) \end{equation*}

Therefore, $\hat{S} = 2S - 1$.
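As a tiny illustration (the helper name is made up for this sketch), this is the conversion from a desired Atlas score to the raw dot product the mock must produce:

# Invert Atlas's dot-product normalization S = (1 + v* . v) / 2.
def target_dot_product(score):
    return 2 * score - 1

print(target_dot_product(0.80))  # ~0.6: the raw dot product to aim for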

Using our example, let $\mathbf{v}_A^*$ be the query vector for "South American Authors". For simplicity, embeddable documents are just a tuple $\left(P, S\right)$ where $P$ is the textual representation of a vector in $\text{C}(\mathbf{E})$ and $S$ is the desired score as it relates to the query vector. If we call $\phi$ the mock transformer, then let $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3$ be the embeddings for Márquez, Mistral, and Cervantes respectively. Without loss of generality:

\begin{equation*} \phi\left(\text{M\'{a}rquez}, 0.80\right) \rightarrow \mathbf{v}_1 \quad \text{such that} \quad \mathbf{v}_A^{*} \cdot \mathbf{v}_1 = 0.6 \end{equation*}

And to complete the mock embedding, we want:

\begin{equation*} q\left(\mathbf{v}_A^{*}\right) \to \left[\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3 \right] \end{equation*}

Looking at libraries like LangChain, $q$ is analogous to similarity_search and $\phi$ is embed_documents. $\text{C}(\mathbf{E})$ is a subset of the vector store (e.g. a MongoDB collection), and if it were embedded by text-embedding-3-small, for example, $n$ would be $1536$. Additionally, since $\mathbf{v}_A^*$ is fixed, embed_query is simply an accessor:

def embed_query(self, text):
    # The query vector is fixed when the mock is constructed,
    # so embed_query just returns it.
    return self.query_vector

The next step is to define an algorithm for deriving $\phi$. But before we get into that, enjoy this (https://xkcd.com/1838/).


Deriving embed_documents. Generally, $\phi^{\mathbf{v}^{*}}$ maps a document $\left(P, S\right)$ to a vector $\mathbf{v}$ such that $\mathbf{v}^{*} \cdot \mathbf{v} = 2S - 1$:

\begin{equation} \phi^{\mathbf{v}^{*}}: \left(P, S\right) \rightarrow \mathbf{v} \quad \text{such that} \quad \mathbf{v}^* \cdot \mathbf{v} = 2S - 1 \end{equation}

In other words, $\phi^{\mathbf{v}^{*}}$ is used to generate the vectors that compose $\mathbf{E}$ from (eq. 1). By definition (eq. 2), since $\text{Rank}(\mathbf{E}) = N$ and the dimension of $\text{C}(\mathbf{E})$ is also $N$, the columns of $\mathbf{E}$ must be linearly independent. Therefore defining $\phi^{\mathbf{v}^{*}}$ can be simplified into two conditions that must both hold:

  1. All vectors in the set $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N, \mathbf{v}^*\}$ are linearly independent (Algorithm A2).
  2. $\mathbf{v} \cdot \mathbf{v}^* = 2S_{\mathbf{v}} - 1$, where $\mathbf{v} \in \text{C}(\mathbf{E})$ (Algorithm A3).

Without loss of generality, to satisfy condition 1, we need at least one component $v_i$ such that the following does not hold:

\begin{equation*} \frac{v_1}{v_1^*} = \frac{v_2}{v_2^*} = \ldots = \frac{v_{n-1}}{v_{n-1}^*} \end{equation*}

This ensures that $\mathbf{v}$ is not a scalar multiple of $\mathbf{v}^*$. To guarantee condition 2 is met, we update $v_n$:

\begin{equation} v_n = \frac{1}{v_n^*} \left(2S - 1 - \sum_{i=1}^{n-1} v_i^* v_i \right) \end{equation}

One very important note: this method does not guarantee that the final element falls in the range $[-1, 1]$. This is not a requirement in MongoDB Atlas, which is why we don't worry about normalizing it in this guide.

Finally, it follows that the algorithm for mocking embed_documents becomes this:

def embed_documents(emb, texts):
    vectors = []

    for text in texts:
        # If the text doesn't exist in the document set, return a zero vector.
        if text not in emb["docs"]:
            vectors.append([0.0] * len(emb["query_vector"]))
            continue

        # If the vector already exists, return the existing one.
        if text in emb["doc_vectors"]:
            vectors.append(emb["doc_vectors"][text])
            continue

        # Otherwise, generate a new vector that is linearly independent of the
        # query vector and every previously generated document vector.
        basis = [emb["query_vector"], *emb["doc_vectors"].values()]
        new_vector_basis = new_orthogonal_vector(len(emb["query_vector"]), *basis)

        # Update the last element of the vector so that
        # v^* . v = 2S - 1.
        doc = emb["docs"][text]
        new_vector = new_score_vector(doc["score"], emb["query_vector"], new_vector_basis)

        # Store and return the new vector.
        vectors.append(new_vector)
        emb["doc_vectors"][text] = new_vector

    return vectors
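Wiring this together with the earlier test data might look like the following; the emb dict layout and the 1536-dimension choice are assumptions of this sketch, and new_normalized_vector and dot_product come from Algorithms A1 and A2 below:

emb = {
    "query_vector": new_normalized_vector(1536),  # Algorithm A1
    "docs": expected_docs,                        # desired score per text
    "doc_vectors": {},                            # cache of generated vectors
}

vectors = embed_documents(emb, ["Gabriel García Márquez", "Miguel de Cervantes"])

# Each returned vector dots with the query vector to 2S - 1:
# ~0.6 for Márquez (S=0.80) and ~-0.86 for Cervantes (S=0.07).
for text, v in zip(["Gabriel García Márquez", "Miguel de Cervantes"], vectors):
    print(text, dot_product(emb["query_vector"], v))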

Example. Let's generate $\mathbf{E}$ for our "South American Authors" example. For simplicity, say $n = 3$. First we generate the query vector using Algorithm A1:

\begin{equation*} \mathbf{v}_A^{*} = (-0.5, 0.25, 0.71) \end{equation*}

Using new_orthogonal_vector from Algorithm A2, create a vector orthogonal to $\mathbf{v}_A^{*}$ for Márquez:

\begin{equation*} \mathbf{v}_1 = (0.6, -0.8, v_{13}) \end{equation*}

From eq. 3 it follows that we need to update $v_{13}$ such that $\mathbf{v}_A^{*} \cdot \mathbf{v}_1 = 0.6$. Using eq. 4 we get

\begin{equation*} \mathbf{v}_A^{*} \cdot \mathbf{v}_1 = (-0.5)(0.6) + (0.25)(-0.8) + (0.71)(v_{13}) \end{equation*}

Substituting the known values:

\begin{equation*} -0.3 - 0.2 + 0.71\, v_{13} = 0.6 \end{equation*}

Solving for $v_{13}$:

\begin{equation*} 0.71\, v_{13} = 1.1 \implies v_{13} = \frac{1.1}{0.71} \approx 1.549 \end{equation*}

Thus, the updated vector $\mathbf{v}_1$ is:

\begin{equation*} \mathbf{v}_1 = (0.6, -0.8, 1.549) \end{equation*}
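A quick numeric check of the example above:

# Values copied from the worked example.
v_query = [-0.5, 0.25, 0.71]
v_marquez = [0.6, -0.8, 1.1 / 0.71]

dot = sum(a * b for a, b in zip(v_query, v_marquez))
score = 0.5 * (1 + dot)  # Atlas's dot-product normalization

print(round(dot, 3), round(score, 3))  # 0.6 0.8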

See the vectormock repository (linked above) for a working example using LangChainGo. Note that it requires a MongoDB Atlas cluster with a vector search index. Check out the tutorial!
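For reference, an end-to-end Python test might look roughly like the sketch below. It assumes the MockEmbeddings sketch from earlier (with embed_query and embed_documents filled in), langchain_mongodb's MongoDBAtlasVectorSearch, a MONGODB_URI environment variable, and a pre-built dotProduct vector search index named vector_index; treat the parameter and index names as assumptions and check the current langchain-mongodb API:

import os

from pymongo import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["test_db"]["authors"]

store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=MockEmbeddings(new_normalized_vector(1536), expected_docs),
    index_name="vector_index",
    relevance_score_fn="dotProduct",
)

# Seed the collection; embed_documents generates the deterministic vectors.
store.add_texts(list(expected_docs))

# Scores are now stable across runs.
for doc, score in store.similarity_search_with_score("South American Authors", k=3):
    print(doc.page_content, round(score, 2))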

Algorithms. Here is a list of algorithms used to mock an embedding model.

A1. Create a vector of normalized 32-bit floats

import random

# Generate a random float normalized to the range [-1, 1].
def new_normalized_float32():
    # Generate a random integer in the range [0, 2^24).
    max_value = 1 << 24
    n = random.randint(0, max_value - 1)

    # Normalize it to the range [-1, 1].
    return 2.0 * (float(n) / max_value) - 1.0

# Build an n-dimensional vector of normalized floats.
def new_normalized_vector(n):
    return [new_normalized_float32() for _ in range(n)]
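For example, the query vector from the worked example could be drawn like this (the seed is optional and only makes the generated vector itself reproducible):

random.seed(42)

v_query = new_normalized_vector(3)
assert all(-1.0 <= x <= 1.0 for x in v_query)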

A2. Create $N$ linearly independent $n$-dimensional vectors

# Compute the dot product between two lists (vectors) of floats.
def dot_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

# Use Gram-Schmidt to return a vector orthogonal to the basis,
# assuming the vectors in the basis are linearly independent.
def new_orthogonal_vector(dim, *basis):
    candidate = new_normalized_vector(dim)

    for b in basis:
        dp = dot_product(candidate, b)
        basis_norm = dot_product(b, b)

        # Subtract the projection of the candidate onto b.
        for i in range(len(candidate)):
            candidate[i] -= (dp / basis_norm) * b[i]

    return candidate

# Make n linearly independent vectors of size dim.
def new_linearly_independent_vectors(n, dim):
    vectors = []

    for _ in range(n):
        v = new_orthogonal_vector(dim, *vectors)
        vectors.append(v)

    return vectors
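A small check of A2 (the dimensions are arbitrary): vectors produced this way are pairwise orthogonal, and therefore linearly independent:

vectors = new_linearly_independent_vectors(3, 1536)

for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        # Pairwise dot products should be ~0 (orthogonal up to floating point).
        assert abs(dot_product(vectors[i], vectors[j])) < 1e-6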

A3. Update a vector so that its dot product with the query vector is a desired value.

# Check whether two vectors are linearly independent, i.e. neither is a
# scalar multiple of the other.
def linearly_independent(v1, v2):
    for i in range(len(v1)):
        for j in range(i + 1, len(v1)):
            if v1[i] * v2[j] != v1[j] * v2[i]:
                return True
    return False

# Overwrite the last element of basis such that qvector * basis = 2S - 1.
def new_score_vector(S, qvector, basis):
    sum_value = 0.0

    # Sum the contribution of the first dim-1 elements.
    for i in range(len(qvector) - 1):
        sum_value += qvector[i] * basis[i]

    # Solve for the last element so that qvector * basis = 2S - 1.
    basis[-1] = (2 * S - 1 - sum_value) / qvector[-1]

    # If overwriting the last element made the vectors linearly dependent,
    # regenerate the first dim-1 elements and try again.
    if not linearly_independent(qvector, basis):
        basis[:-1] = new_normalized_vector(len(basis) - 1)
        return new_score_vector(S, qvector, basis)

    return basis
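And a check of A3, reusing the helpers above and the Márquez target score of 0.80:

qvec = new_normalized_vector(3)
candidate = new_orthogonal_vector(3, qvec)

scored = new_score_vector(0.80, qvec, candidate)

# The raw dot product should equal 2S - 1 = 0.6 (up to floating point).
print(dot_product(qvec, scored))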
