Implementation library: https://github.com/prestonvasquez/vectormock
Problem. Integration testing vector databases with live embedding models can be tricky since results may not always be deterministic. For example, three consecutive similarity searches for "South American Authors" might return:
{text: "Gabriel García Márquez", score: 0.80} # Search 1
{text: "Gabriel García Márquez", score: 0.82} # Search 2
{text: "Gabriel García Márquez", score: 0.78} # Search 3
Suppose we expect the above similarity search to result in a score of 0.80. A common instinct for resolving the false negatives (i.e. the 0.82 and 0.78 cases) would be to define an acceptable range for similarity scores:
import unittest

def is_score_in_range(doc, lower, upper):
    return lower <= doc["score"] <= upper

class TestSimilarityScore(unittest.TestCase):
    def test_doc_scores(self):
        docs = [{"score": 0.80}, {"score": 0.82}, {"score": 0.78}]
        for doc in docs:
            # If we expect 0.80, then the delta is 0.02.
            self.assertTrue(is_score_in_range(doc, 0.78, 0.82))
Of course, we are met with a difficult question: how do you set the delta? Large deltas increase false positives; small deltas increase false negatives:
# Wouldn't pass with delta=0.02
{text: "Gabriel García Márquez", score: 0.779}
# Would pass with delta=0.40
{text: "Gabriel García Márquez", score: 0.41}
Additionally, embedding models may have rate limits, quotas, or per-usage fees, all of which add to the difficulty of maintaining robust tests.
Goal. This article shows how to overcome these obstacles by mocking an embedding model for MongoDB Atlas, making integration tests more predictable and reliable. For example, say we want the query "South American Authors" to always return the following three documents (without variation) for a vector database integration test:
{text: "Gabriel García Márquez", score: 0.80}
{text: "Gabriela Mistral", score: 0.94}
{text: "Miguel de Cervantes", score: 0.07}
This guide is specifically focused on similarity scores derived through the dot product.
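To make the goal concrete, here is roughly the kind of assertion we want to be able to write once the embedder is mocked. The vector_store.similarity_search_with_score call and doc.page_content attribute below stand in for whatever search API your test actually uses (the names follow LangChain conventions but are illustrative here, not prescribed by this guide):

def test_south_american_authors(vector_store):
    # With a mocked embedder, scores are exact and repeatable.
    expected = {
        "Gabriel García Márquez": 0.80,
        "Gabriela Mistral": 0.94,
        "Miguel de Cervantes": 0.07,
    }
    results = vector_store.similarity_search_with_score("South American Authors", k=3)
    for doc, score in results:
        assert abs(score - expected[doc.page_content]) < 1e-6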
To simplify things, we'll focus on LangChain's definition of an embedding model, which requires the implementation of two key functions: embed_query and embed_documents.
Deriving embed_query. The goal of an embedding model (embedder) is to transform queries into $n$-dimensional vectors (embeddings) such that the relationships between different inputs are preserved in the embedding space.

So let $T$ be a mapping from a query vector $q$ to a set of similar vectors defined as the column space of $D$, up to the similarity score algorithm (e.g. dot product).

We want to define $T$ such that the following holds for all $v \in T(q)$:

$$\mathrm{score}(q, v) = S_v \quad (1)$$
where $S_v$ is derived from normalizing the similarity score. For example, Atlas Vector Search uses the following algorithm to normalize the score when using dot product:

$$\mathrm{score}(q, v) = \frac{1 + q \cdot v}{2}$$

Therefore, $q \cdot v = 2S_v - 1$.
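In code, converting between an Atlas dot-product score and the raw dot product is a one-liner in each direction (small helpers written for this guide, not part of any library):

def score_from_dot_product(dp):
    # Atlas Vector Search normalization for the dotProduct similarity.
    return (1 + dp) / 2

def dot_product_from_score(score):
    # The dot product we must engineer to obtain a desired score.
    return 2 * score - 1

assert abs(dot_product_from_score(0.80) - 0.60) < 1e-12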
Using our example, let $q$ be the query vector for "South American Authors". For simplicity, embeddable documents are just a tuple $(t, S)$ where $t$ is the textual representation of a vector in $D$ and $S$ is the desired score as it relates to the query vector. If we call $T$ the mock transformer, then let $v_1, v_2, v_3$ be the embeddings for Márquez, Mistral, and Cervantes respectively. Without loss of generality:

$$T(q) = \{v_1, v_2, v_3\}$$

And to complete the mock embedding, we want:

$$q \cdot v_1 = 2(0.80) - 1 = 0.60$$
$$q \cdot v_2 = 2(0.94) - 1 = 0.88$$
$$q \cdot v_3 = 2(0.07) - 1 = -0.86$$
Looking at libraries like LangChain, $T$ is analogous to similarity_search, and the mapping that produces the vectors in $D$ is embed_documents. $D$ is a subset of the vector store (e.g. a MongoDB collection), and if it was embedded by text-embedding-3-small, for example, $n$ would be 1536. Additionally, since $q$ is fixed, embed_query is simply an accessor:
def embed_query(emb, text):
    # The query vector is fixed, so just return the one stored on the mock embedder.
    return emb["query_vector"]
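Here emb is the mock embedder's state. A minimal sketch of that state for our example (the field names are chosen for this guide, not taken from any library) might look like the following, with the query vector generated once using Algorithm A1 below:

emb = {
    # Fixed embedding returned for every call to embed_query.
    "query_vector": new_normalized_vector(3),
    # Desired similarity score for each embeddable document.
    "docs": {
        "Gabriel García Márquez": {"score": 0.80},
        "Gabriela Mistral": {"score": 0.94},
        "Miguel de Cervantes": {"score": 0.07},
    },
    # Cache of document text -> generated embedding.
    "doc_vectors": {},
}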
The next step is to define an algorithm for deriving the document vectors that make up $T(q)$. But before we get into that, enjoy this xkcd: https://xkcd.com/1838/
Deriving embed_documents. Generally, embed_documents maps a document $t_i$ to a vector $v_i$ such that $v_i \in T(q)$:

$$E(t_i) = v_i, \quad v_i \in T(q) \subset \mathbb{R}^n \quad (2)$$

In other words, $E$ is used to generate the vectors that compose $T(q)$ from (eq. 1). By definition (eq. 2), since $|T(q)| \le n$ and the dimension of each $v_i$ is also $n$, the vectors in $T(q)$ can be chosen to be linearly independent. Therefore defining $E$ can be simplified into two conditions that must both hold:
- All vectors in the set $\{q, v_1, \dots, v_k\}$ are linearly independent (Algorithm A2).
- $q \cdot v_i = 2S_i - 1$, where $S_i$ is the desired score for document $i$ (Algorithm A3).
Without loss of generality, to satisfy condition 1 we need at least one index $j$ such that the following does not hold:

$$v_{i,j} = c\, q_j \quad \text{for a single scalar } c$$

This ensures that $v_i$ is not a scalar multiple of $q$. To guarantee condition 2 is met, we update the last element $v_{i,n}$ so that

$$\sum_{j=1}^{n-1} q_j v_{i,j} + q_n v_{i,n} = 2S_i - 1 \quad (3)$$

which gives

$$v_{i,n} = \frac{2S_i - 1 - \sum_{j=1}^{n-1} q_j v_{i,j}}{q_n} \quad (4)$$
One very important note: this method will not guarantee that the final element falls in the range $[-1, 1]$. This is not a requirement in MongoDB Atlas, which is why we don't worry about normalizing it in this guide.
Finally, it follows that the algorithm for mocking embed_documents becomes this:
def embed_documents(emb, texts):
    vectors = []
    for text in texts:
        # If the text doesn't exist in the document set, return a zero vector.
        if text not in emb["docs"]:
            vectors.append([0.0] * len(emb["query_vector"]))
            continue

        # If the vector already exists, return the existing one.
        if text in emb["doc_vectors"]:
            vectors.append(emb["doc_vectors"][text])
            continue

        # Otherwise, generate a new vector orthogonal to the query vector and
        # to every previously generated document vector (Algorithm A2).
        basis = [emb["query_vector"]] + list(emb["doc_vectors"].values())
        new_vector_basis = new_orthogonal_vector(len(emb["query_vector"]), *basis)

        # Update the last element of the vector so that q . v = 2S - 1 (Algorithm A3).
        doc = emb["docs"][text]
        new_vector = new_score_vector(doc["score"], emb["query_vector"], new_vector_basis)

        # Store and return the new vector.
        vectors.append(new_vector)
        emb["doc_vectors"][text] = new_vector

    return vectors
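As a quick sanity check (using the emb state sketched earlier, dot_product from Algorithm A2, and the Atlas normalization), the generated vectors reproduce the requested scores up to floating-point error:

texts = ["Gabriel García Márquez", "Gabriela Mistral", "Miguel de Cervantes"]
vectors = embed_documents(emb, texts)

for text, v in zip(texts, vectors):
    score = (1 + dot_product(emb["query_vector"], v)) / 2  # Atlas dotProduct normalization
    assert abs(score - emb["docs"][text]["score"]) < 1e-9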
Example. Let's generate $v_1$ (the Márquez vector) for our "South American Authors" example. For simplicity, say $n = 3$. First we generate the query vector using Algorithm A1; since A1 is random, suppose for illustration it returns

$$q = (0.5, -0.5, 0.25)$$

Using new_orthogonal_vector from Algorithm A2, create a vector orthogonal to $q$ for Márquez, say

$$v_1 = (0.6, 0.4, -0.4)$$

From eq. 3 it follows that we need to update $v_{1,3}$ such that $q \cdot v_1 = 2(0.80) - 1 = 0.60$. Using eq. 4 we get that

$$v_{1,3} = \frac{2S_1 - 1 - (q_1 v_{1,1} + q_2 v_{1,2})}{q_3}$$

Substituting the known values:

$$v_{1,3} = \frac{0.60 - (0.5 \cdot 0.6 + (-0.5) \cdot 0.4)}{0.25}$$

Solving for $v_{1,3}$:

$$v_{1,3} = \frac{0.60 - 0.10}{0.25} = 2.0$$

Thus, the updated vector is

$$v_1 = (0.6, 0.4, 2.0)$$

which gives $q \cdot v_1 = 0.30 - 0.20 + 0.50 = 0.60$, and therefore a normalized score of 0.80. (Note that $v_{1,3}$ landed outside $[-1, 1]$, as discussed above.)
See here for a working example using LangChainGo. Note, this will require a MongoDB Atlas Cluster with a vector search index. Check out the tutorial!
Algorithms. Here is a list of algorithms used to mock an embedding model.
A1. Create a vector of normalized 32-bit floats
import random

def new_normalized_float32():
    # Generate a random integer in the range [0, 2^24), the largest power of
    # two whose integers are all exactly representable as 32-bit floats, and
    # normalize it to the range [-1, 1).
    max_value = 1 << 24
    n = random.randint(0, max_value - 1)
    return 2.0 * (float(n) / max_value) - 1.0

def new_normalized_vector(n):
    # Build an n-dimensional vector of normalized floats.
    return [new_normalized_float32() for _ in range(n)]
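For example, a query vector matching the default dimensionality of text-embedding-3-small would be generated like this; every component lands in [-1, 1):

query_vector = new_normalized_vector(1536)
assert len(query_vector) == 1536
assert all(-1.0 <= x < 1.0 for x in query_vector)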
A2. Create linearly independent n-dimensional normalized vectors
# Compute the dot product between two lists (vectors) of floats.
def dot_product(v1, v2):
    total = 0.0
    for a, b in zip(v1, v2):
        total += a * b
    return total

# Use Gram-Schmidt to return a vector orthogonal to the basis,
# assuming the vectors in the basis are linearly independent.
def new_orthogonal_vector(dim, *basis):
    candidate = new_normalized_vector(dim)
    for b in basis:
        # Subtract the projection of the candidate onto each basis vector.
        dp = dot_product(candidate, b)
        basis_norm = dot_product(b, b)
        for i in range(len(candidate)):
            candidate[i] -= (dp / basis_norm) * b[i]
    return candidate

# Make n linearly independent vectors of size dim.
def new_linearly_independent_vectors(n, dim):
    vectors = []
    for _ in range(n):
        v = new_orthogonal_vector(dim, *vectors)
        vectors.append(v)
    return vectors
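Because each new vector is orthogonalized against the previously generated ones, which are themselves mutually orthogonal, the resulting set is pairwise orthogonal and therefore linearly independent. A quick check:

vectors = new_linearly_independent_vectors(3, 5)
for i in range(len(vectors)):
    for j in range(i):
        assert abs(dot_product(vectors[i], vectors[j])) < 1e-9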
A3. Update a vector so that its dot product with a second, linearly independent vector equals a desired value.
# Two vectors are linearly dependent iff one is a scalar multiple of the
# other, i.e. the Cauchy-Schwarz inequality holds with equality.
def linearly_independent(v1, v2):
    dp = dot_product(v1, v2)
    return abs(dp * dp - dot_product(v1, v1) * dot_product(v2, v2)) > 1e-12

# Update the basis vector such that qvector . basis = 2S - 1.
def new_score_vector(S, qvector, basis):
    total = 0.0
    # Accumulate the dot product over the first dim-1 elements.
    for i in range(len(qvector) - 1):
        total += qvector[i] * basis[i]

    # Solve for the last element so that qvector . basis = 2S - 1.
    basis[-1] = (2 * S - 1 - total) / qvector[-1]

    # If the result happens to be a scalar multiple of qvector,
    # regenerate the first dim-1 elements and try again.
    if not linearly_independent(qvector, basis):
        basis[:-1] = new_normalized_vector(len(basis) - 1)
        return new_score_vector(S, qvector, basis)

    return basis
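Putting A1-A3 together for a single document: generate a query vector, build a vector orthogonal to it, then adjust the last element so the dot product yields a desired score of 0.80:

q = new_normalized_vector(4)
basis = new_orthogonal_vector(4, q)
v = new_score_vector(0.80, q, basis)

# q . v == 2S - 1 = 0.60, so the Atlas-normalized score is (1 + 0.60) / 2 = 0.80.
assert abs(dot_product(q, v) - 0.60) < 1e-9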