The cover is generated by DALL-E.
In recent years, my research has focused on algorithms for approximate nearest neighbor (ANN) search in high-dimensional spaces, which is also known as vector search. This area has become increasingly significant due to its applications in fields like vector databases, large language models (LLMs), and retrieval-augmented generation (RAG). High-dimensional spaces, however, are full of counterintuitive phenomena that present unique challenges for algorithm design. Through my research journey, I have learned that these phenomena can either pose obstacles or offer valuable opportunities, depending on how well they are understood and applied.
In this post, I will share a famous counterintuitive phenomenon in high-dimensional spaces, explain its underlying principles, and demonstrate how it can be utilized to improve the accuracy of quantization algorithms. This discussion aims to provide intuitive insights and complementary explanations to our research on RaBitQ and its extensions, which introduce optimized approaches to binary and scalar quantization, respectively. The code for RaBitQ and its extensions has been publicly available since earlier this year (2024).
A Counterintuitive Phenomenon
Designing algorithms for high-dimensional spaces presents unique challenges. One major bottleneck is that the intuition based on our everyday experience in three-dimensional space often fails in higher dimensions. High-dimensional probability theory offers valuable insights into the structure of these spaces, but the heavy reliance on complex mathematics can sometimes make it difficult to grasp intuitively. In this post, rather than diving straight into the math, let us explore some simple examples to build an intuitive understanding of the counterintuitive facts in high-dimensional spaces.
Let us begin by considering an arbitrary unit vector (a vector with a length of 1), denoted as $x = (x_1, x_2, \ldots, x_D)$. Suppose the following scenario: we are interested in the value of its projection onto a certain vector (let's take the first coordinate vector for simplicity), i.e., the value $x_1$. However, for some reason, we do not have direct access to this value and instead seek a rough estimate of its range without performing any computation. Since the vector has a length of 1, and without any additional information, the only conclusion we can draw is that $x_1$ must lie within the range $[-1, 1]$.
Next, consider a random unit vector uniformly distributed on the unit sphere, meaning it has equal probability of appearing at any point on the sphere's surface. To gain insight, let us generate some random vectors following this distribution and examine the value of $x_1$. The empirical distribution of $x_1$ is plotted as follows.
- On the left side, we present the case where the vector has 3 dimensions. This scenario appears intuitive: for a unit vector (i.e., with a length of 1), $x_1$ can take any value within the range $[-1, 1]$.
- On the right side, however, when the vector has 1000 dimensions, the situation becomes quite unusual. While the possible range of $x_1$ remains $[-1, 1]$, the figure reveals a striking phenomenon: the actual range of $x_1$ is narrowly concentrated around 0.
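This behavior is easy to reproduce. Below is a minimal sketch (the function and variable names are mine, not from the original experiments) that samples uniform random unit vectors and measures how spread out the first coordinate is for $D = 3$ versus $D = 1000$.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_unit_vectors(n, d):
    # Normalizing i.i.d. Gaussian vectors yields points uniformly
    # distributed on the unit sphere in d dimensions.
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for d in (3, 1000):
    x1 = sample_unit_vectors(100_000, d)[:, 0]  # first coordinate of each sample
    frac_large = np.mean(np.abs(x1) > 0.5)      # how often |x1| exceeds 0.5
    print(f"D = {d:4d}: std(x1) = {x1.std():.4f}, |x1| > 0.5 in {frac_large:.2%} of samples")
```

For $D = 3$, the first coordinate spreads over essentially the whole range $[-1, 1]$ (its standard deviation is $1/\sqrt{3} \approx 0.58$), while for $D = 1000$ it hugs 0 (standard deviation $1/\sqrt{1000} \approx 0.03$) and values beyond $\pm 0.5$ essentially never occur.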
This unexpected behavior underscores a fundamental difference between low-dimensional and high-dimensional spaces: in high-dimensional spaces, randomness can give rise to surprising certainty.
But Why?
This phenomenon is explained by the Concentration of Measure, a fundamental principle in high-dimensional geometry and probability. To better understand this behavior, let’s continue exploring the example and develop some intuitive explanations for the phenomenon.
Recall that $x$ is a unit vector, meaning its length is 1. The length is computed using the formula:

$$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_D^2} = 1.$$

Here $D$ is the dimensionality of the vector.
When $D = 3$:
The squared length of the vector equals the sum of three non-negative terms. On average, each term contributes $1/3$ of the total. It is not surprising if one term contributes a slightly larger proportion (e.g., more than 50%) of the total length, while the remaining two terms contribute less.
When $D = 1000$:
The squared length now equals the sum of 1000 terms, with each term contributing, on average, $1/1000$. If a single term were to contribute a disproportionately large fraction of the length (e.g., over 50%), this would require many other terms to contribute far less than their average share, corresponding to a scenario which has an exceedingly low probability.
Thus, in high-dimensional spaces, the value $x_1$ is unlikely to deviate from 0 significantly. Formally, based on the seminal Johnson-Lindenstrauss (JL) Lemma, the value of a coordinate is unlikely to deviate from 0 by more than $O(1/\sqrt{D})$ with sufficiently high probability. For a more detailed and rigorous explanation, please refer to the JL Lemma.
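For readers who prefer a quantitative statement, one standard form of this concentration bound (stated here for intuition; the exact constants are not essential) says that for a vector $x$ uniformly distributed on the unit sphere in $D$ dimensions,

$$\Pr\big[\,|x_1| \ge t\,\big] \le 2\,e^{-D t^2 / 2},$$

so $|x_1| \le \sqrt{2\ln(2/\delta)/D}$ holds with probability at least $1 - \delta$, i.e., the deviation is $O(1/\sqrt{D})$ up to a logarithmic factor in the failure probability.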
Concentration? So What?
The concentration phenomenon in high-dimensional spaces leads to some intriguing implications. Let’s revisit the example:
- When $x$ has 3 dimensions, the only information we have about $x_1$ is that it lies within $[-1, 1]$, which is not very informative.
- When $x$ has 1000 dimensions, the concentration phenomenon tells us that $x_1$ lies within $[-O(1/\sqrt{D}), O(1/\sqrt{D})]$ with high probability. Since $D$ is large (for $D = 1000$, $1/\sqrt{D} \approx 0.03$), this represents a much tighter bound on $x_1$ in high-dimensional spaces.
This is quite surprising, because we did not access a single bit of data or perform any computation to reach this conclusion. Yet, the uncertainty about $x_1$ significantly decreases: its range shrinks from $[-1, 1]$ to $[-O(1/\sqrt{D}), O(1/\sqrt{D})]$. In other words, we gain valuable information without doing any computation!
This insight opens up opportunities for improvement in algorithms by leveraging this "free information". One area where this can be particularly beneficial is quantization. By utilizing this phenomenon effectively, we can potentially achieve improvements without additional costs.
RaBitQ
Quantization generally refers to a family of methods developed for vector compression. In vector search, the compressed vectors are used to estimate distances or inner products. Our RaBitQ method and its extensions can basically be regarded as optimized approaches to binary and scalar quantization, respectively.
In particular, RaBitQ achieves promising performance by effectively leveraging the "free information" discussed earlier. In this post, we will focus on explaining how this "free information" is utilized, leaving the implementation details aside. For those interested in a deeper dive into the technical aspects, we highly recommend referring to our papers and code repos.
The workflow of quantization in general includes two phases:
- Index Phase: In this phase, a quantization codebook is constructed. Each vector in the database is then assigned to its nearest vector in the codebook, which serves as its quantized representation.
- Query Phase: In this phase, the quantized vectors are used to approximate metrics such as Euclidean distances, inner products, or cosine similarity.
Let us use the Euclidean distance as an example in the following discussion; other metrics can be easily supported with similar derivations.
Index Phase
Step 1: Normalization
Let $o_r$ be a data vector and $q_r$ a query vector, respectively. Our target is to estimate their Euclidean distance. Let $c$ be the centroid of the data vectors. To ease the question of distance estimation, we first reduce it to the estimation of the inner product between unit vectors. We consider normalizing the vectors with respect to the centroid, i.e., we take

$$o := \frac{o_r - c}{\|o_r - c\|}, \qquad q := \frac{q_r - c}{\|q_r - c\|}.$$
Then, using the following equation, we can reduce the question of distance estimation to that of inner product estimation over unit vectors:

$$\|o_r - q_r\|^2 = \|o_r - c\|^2 + \|q_r - c\|^2 - 2\,\|o_r - c\|\,\|q_r - c\|\,\langle o, q\rangle.$$
Here $\|o_r - c\|$ can be precomputed in the index phase, and $\|q_r - c\|$ can be computed once per query and shared by all data vectors. Thus, the question is reduced to that of estimating $\langle o, q\rangle$. Intuitively, normalization puts a cluster of vectors at the center of the space and further aligns each vector onto the unit sphere. Thus, this operation spreads the data vectors evenly on the unit sphere.
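As a quick numerical sanity check of this reduction (a self-contained sketch; the variable names are mine and nothing beyond the identity above is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128
data = rng.standard_normal((1000, D))
c = data.mean(axis=0)                        # centroid of the data vectors

o_r, q_r = data[0], rng.standard_normal(D)   # a data vector and a query vector
o = (o_r - c) / np.linalg.norm(o_r - c)      # normalized data vector (unit length)
q = (q_r - c) / np.linalg.norm(q_r - c)      # normalized query vector (unit length)

# Both sides of the identity agree up to floating-point error.
lhs = np.linalg.norm(o_r - q_r) ** 2
rhs = (np.linalg.norm(o_r - c) ** 2 + np.linalg.norm(q_r - c) ** 2
       - 2 * np.linalg.norm(o_r - c) * np.linalg.norm(q_r - c) * np.dot(o, q))
print(abs(lhs - rhs))
```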
Step 2: Codebook Construction
Given that the raw data vectors have been converted into unit vectors that spread evenly on the unit sphere, intuitively, we should also construct a codebook which spreads evenly on the unit sphere. Here we take a natural construction: a hypercube nested in the unit sphere, which is illustrated as follows.
Recall that we target to utilize the "free information" which comes from randomness. Here we inject some randomness into the codebook by randomly rotating it.
Then, for the data vector $o$, we find its nearest vector $\bar{o}$ in the codebook. Because $\bar{o}$ corresponds to a vertex of the (rotated) hypercube, it can be stored with 1 bit per dimension.
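Here is a minimal sketch of this construction, assuming the codebook consists of the randomly rotated vertices of the hypercube $\{-1/\sqrt{D}, +1/\sqrt{D}\}^D$ as described above; it illustrates the idea rather than the optimized implementation in the RaBitQ repositories.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128

# A random rotation: orthogonalize a Gaussian matrix via QR decomposition.
P, _ = np.linalg.qr(rng.standard_normal((D, D)))

def quantize(o):
    """Map a unit vector o to its nearest codeword in the rotated hypercube codebook."""
    # The nearest codeword maximizes <o, P x> = <P^T o, x> over x in {-1, +1}^D / sqrt(D),
    # so the code is simply the sign pattern of P^T o -- exactly 1 bit per dimension.
    bits = (P.T @ o) > 0
    x = np.where(bits, 1.0, -1.0) / np.sqrt(D)
    return bits, P @ x        # bits are stored; P @ x is the quantized vector o_bar

o = rng.standard_normal(D)
o /= np.linalg.norm(o)
bits, o_bar = quantize(o)
print(bits.size, "bits,", "<o, o_bar> =", round(float(np.dot(o, o_bar)), 3))
```

For this codebook, $\langle o, \bar{o}\rangle$ typically lands around $\sqrt{2/\pi} \approx 0.8$, a quantity that will reappear in the estimator below.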
Query Phase: Distance Estimation
Now it's time to construct an estimator for the inner product based on the quantized vector. To achieve this, let us first analyze the geometric relationship among the vectors $o$, $\bar{o}$, and $q$.
In particular, we find that although $o$, $\bar{o}$, and $q$ are vectors in a high-dimensional space, to estimate $\langle o, q\rangle$ we only need to focus on the 2-dimensional plane which hosts $o$ and $\bar{o}$, which is illustrated as follows.
Here $e_1$ is a unit vector which is orthogonal to $o$ and lies in the plane. By analyzing the geometric relationship of the vectors on the plane, we derive the following equation regarding the inner products among the vectors:

$$\langle \bar{o}, q\rangle = \langle \bar{o}, o\rangle \cdot \langle o, q\rangle + \langle \bar{o}, e_1\rangle \cdot \langle e_1, q\rangle.$$
That is to say, if we know the values of the variables other than $\langle o, q\rangle$ in the equation, we can exactly compute the value of $\langle o, q\rangle$ by solving the equation.
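Rearranging the equation to isolate the target makes this explicit:

$$\langle o, q\rangle = \frac{\langle \bar{o}, q\rangle - \langle \bar{o}, e_1\rangle\,\langle e_1, q\rangle}{\langle \bar{o}, o\rangle}.$$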
Here, $\langle \bar{o}, o\rangle$ can be precomputed during indexing because it is independent of the query (and so can $\langle \bar{o}, e_1\rangle$, for the same reason). Computing $\langle \bar{o}, q\rangle$ can be done during querying with some tricky implementations, for which we refer readers to our original papers and code repos. However, the value of $\langle e_1, q\rangle$ poses intrinsic hardness in the computation: it depends on both the data vector and the query vector, so it can neither be precomputed during indexing nor computed efficiently during querying without accessing the original data vector.
Recall that, during indexing, we injected some randomness into the codebook, and this randomness may bring "free information" in high-dimensional spaces. Would that be the case here? We sample many different random rotation matrices and investigate the empirical distribution of the variable $\langle e_1, q\rangle$.
In the figure above, each red point represents the projection of $q$ onto the plane under one sampled rotation, and the vertical coordinate of a red point equals the value $\langle e_1, q\rangle$. Note that, in this case, $q$ is just a unit vector, so the possible region of its projection on the plane is the whole green disk. However, this empirical study shows that the actual region of its projection is around the red point cloud, which is much smaller. Concentration happens.
This result demonstrates that the variable $\langle e_1, q\rangle$ is highly concentrated around 0. As a result, treating it as 0 in the computation of our target $\langle o, q\rangle$ will produce promising accuracy even though we perform no computation on $\langle e_1, q\rangle$.
Based on this result, we derive the following estimator for our target:

$$\langle o, q\rangle \approx \frac{\langle \bar{o}, q\rangle}{\langle \bar{o}, o\rangle}.$$
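To make this concrete, the following self-contained sketch (names and parameters are mine) quantizes random unit vectors with the rotated-hypercube codebook and measures the error of this estimator against the true inner products; it is an illustration of the idea, not the optimized implementation from the RaBitQ repositories.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 1024, 2000

# Random rotation shared by all vectors (fixed at indexing time).
P, _ = np.linalg.qr(rng.standard_normal((D, D)))

def rand_unit(shape):
    v = rng.standard_normal(shape)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

o = rand_unit((n, D))                          # normalized data vectors
q = rand_unit(D)                               # normalized query vector

codes = np.where((o @ P) > 0, 1.0, -1.0)       # 1-bit codes: signs of P^T o for each vector
o_bar = (codes / np.sqrt(D)) @ P.T             # quantized vectors, back in the original space

est = (o_bar @ q) / np.sum(o_bar * o, axis=1)  # <o_bar, q> / <o_bar, o>
true = o @ q
print(f"mean |error| = {np.abs(est - true).mean():.4f}  (1/sqrt(D) = {1/np.sqrt(D):.4f})")
```

The observed error shrinks as $D$ grows, consistent with the $O(1/\sqrt{D})$ behavior discussed next.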
By rigorously analyzing the extent of the concentration, we prove that the error of the estimation is only $O(1/\sqrt{D})$, which achieves asymptotic optimality in theory. Recalling the "free information" discussion above, this error bound exactly matches the one we presented in the beginning, indicating that the algorithm fully utilizes the "free information" gained through the concentration phenomenon in high-dimensional spaces. This is the core reason that RaBitQ achieves a great accuracy gain.
More About RaBitQ
So far, we’ve explored the key insights behind RaBitQ. However, this is just the beginning. The RaBitQ paper and its accompanying code repositories include more innovations in implementation, optimization, and applications. Note that the original RaBitQ only supports binary quantization. Our extended version further provides support for quantization with 2 bits and beyond, which can be regarded as a more optimized approach to scalar quantization. Besides leveraging the "free information", intuitively, this extension introduces a new strategy: it performs scalar quantization by trying different parameters on a per-vector basis, computes the quantization error produced under each parameter, and selects the parameter and codebook vector that minimize the error. It turns out that (1) at the same compression rates (e.g., 2-bit, 3-bit, and 4-bit), it achieves significantly better accuracy, and (2) its distance computation can be implemented with exactly the same routines as scalar quantization, so it can replace scalar quantization seamlessly.
RaBitQ has been adopted in several real-world systems, demonstrating its broad applicability and impact. For example, Tensorchord has developed a highly cost-effective vector search solution in which RaBitQ is one of the components. They also provide an in-depth blog post illustrating how to optimize this algorithm step by step in Rust. Additionally, Elastic incorporated our RaBitQ algorithm into their BBQ feature, and their evaluation further validates the breakthrough performance of RaBitQ compared with classical Product Quantization. Three months ago (Sep 2024), we released an extension of RaBitQ that enhances scalar quantization, and we hope to see its adoption continue to drive performance gains in production systems.