Introduction
When working with large RAG (Retrieval-Augmented Generation) systems, getting the right vector embeddings is key to quickly finding useful bits of text. If you’re using float-based embeddings, which usually take up 32 bits for each dimension, your storage needs can grow really fast as your dataset expands. For example, breaking down a 100 MB collection into about 100,000 parts could lead to hundreds of megabytes just for those embeddings. This extra storage can slow you down, especially if you’re aiming for quick searches and a budget-friendly setup.
Binary Embedding flips the usual float-based representation into bits, potentially shrinking your storage footprint by around 95%. Although a purely binary approach can reduce accuracy, combining it with reranking or hybrid embeddings can recover most of the lost precision. Below, we’ll explore how Binary Embedding works, why it matters, and what you can expect in a real-world RAG setup.
Why Binary Embedding is a big deal
Most embeddings come from neural models that output high-dimensional float vectors. For example, OpenAI’s text-embedding-3-small yields 1,536 dimensions, each stored as a 32-bit float (4 bytes). That’s about 6 KB per vector. Chunk a 100 MB document into roughly 100,000 segments, and you’ll need around 600 MB of embedding data, six times the size of your original text.
Binary Embedding addresses this by mapping each dimension into a single bit (0 or 1). This can slash 6 KB down to just 192 bytes (1,536 bits). Multiply that across your entire corpus, and your storage overhead plunges by about 96%. If your infrastructure costs or query latency are dominated by large vector stores, it can be a real game-changer.
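Here’s a minimal NumPy sketch of that math, using random vectors as stand-ins for real embeddings; the sign-based binarization rule and the corpus size are assumptions borrowed from the numbers above, not the exact pipeline:

```python
import numpy as np

# 100,000 chunks x 1,536 dimensions of float32 embeddings (~600 MB).
rng = np.random.default_rng(0)
float_embeddings = rng.standard_normal((100_000, 1_536)).astype(np.float32)

# Binarize: dimensions >= 0 become 1, everything else 0,
# then pack 8 bits per byte -> 1,536 bits = 192 bytes per vector.
binary_embeddings = np.packbits(float_embeddings >= 0, axis=1)

print(float_embeddings.nbytes / 1e6)   # ~614 MB
print(binary_embeddings.nbytes / 1e6)  # ~19 MB, about 1/32 of the original
```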
Basic RAG Workflow and the Embedding Stage
1. Receive User Question – For example, “What was the significance of event X?”
2. Search for Relevant Info – The query’s embedding is matched against stored embeddings of pre-chunked text.
3. Combine Question + Relevant Chunks – The chosen chunks are appended to the user’s question and passed to an LLM for an answer.
Step 2 is where embeddings and vector search come into play. Once you’re dealing with millions of chunks, efficient storage and retrieval can be critical to performance.
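For orientation, here is a sketch of steps 2 and 3 in plain NumPy; `embed()` is a hypothetical helper standing in for whatever embedding model you call, and the stored chunk embeddings are assumed to be L2-normalized floats:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical helper: call your embedding model (e.g. text-embedding-3-small)."""
    raise NotImplementedError

def retrieve(query: str, chunk_embeddings: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    """Step 2: match the query embedding against stored chunk embeddings."""
    q = embed(query)
    scores = chunk_embeddings @ q          # cosine similarity if vectors are unit-norm
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Step 3: append the retrieved chunks to the question before calling the LLM."""
    context = "\n\n".join(retrieved)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```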
Experimental setup
- Data: wiki40b/en corpus, tested at three scales: 1K, 10K, and 100K entries.
- Embedding Model: text-embedding-3-small (1,536 dimensions).
- Stored Representations:
  - Binary Embedding: Convert float values ≥ 0 to 1, else 0.
  - Int8 Embedding: Multiply each float by 127 and convert to the range -128..127.
  - Binary + Rerank: Perform binary-based retrieval, then rerank top candidates using the original float vectors.
- Vector Database: Faiss.
- Search Query: “What is [title]?” for each chunk.
- Accuracy Metrics (see the sketch after this list):
  - Precision: Percentage of queries where the top result was correct.
  - MRR (Mean Reciprocal Rank): Evaluates correctness within the top 10 results.
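For reference, the two metrics can be computed along these lines; this is a sketch, and `ranked_ids` / `correct_id` are hypothetical names for each query’s retrieved chunk IDs and its expected chunk:

```python
def precision_at_1(results: list[tuple[list[int], int]]) -> float:
    """Fraction of queries whose top-1 result is the expected chunk."""
    hits = sum(1 for ranked_ids, correct_id in results if ranked_ids[0] == correct_id)
    return hits / len(results)

def mrr_at_10(results: list[tuple[list[int], int]]) -> float:
    """Mean reciprocal rank of the expected chunk within the top 10 results (0 if absent)."""
    total = 0.0
    for ranked_ids, correct_id in results:
        top10 = ranked_ids[:10]
        if correct_id in top10:
            total += 1.0 / (top10.index(correct_id) + 1)
    return total / len(results)
```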
Methods in detail
Binary Embedding
Each float dimension is converted to a single bit: 1 if the value is ≥ 0, otherwise 0. These bits are then packed into bytes for storage. Search uses Hamming distance.
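A minimal sketch of that pipeline with Faiss’s binary flat index; the random stand-in data and the 1,536-bit dimension follow the setup above, but this is an illustration rather than the experiment’s exact script:

```python
import faiss
import numpy as np

def to_binary(float_vectors: np.ndarray) -> np.ndarray:
    """1 if a dimension is >= 0, else 0, packed 8 bits per byte (uint8)."""
    return np.packbits(float_vectors >= 0, axis=1)

d = 1536                                   # bits per vector (matches the embedding dimension)
rng = np.random.default_rng(0)
chunk_embeddings = rng.standard_normal((10_000, d)).astype(np.float32)  # stand-in data

index = faiss.IndexBinaryFlat(d)           # exhaustive search using Hamming distance
index.add(to_binary(chunk_embeddings))

query = rng.standard_normal((1, d)).astype(np.float32)
distances, ids = index.search(to_binary(query), 10)   # distances are Hamming distances
```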
Int8 Embedding
Each float dimension is multiplied by 127 and truncated to a signed 8-bit integer. This retains more nuance than a single bit but still reduces storage relative to 32-bit floats.
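A sketch of that quantization; it assumes unit-norm embeddings, so every dimension already lies in [-1, 1] and scaling by 127 fits a signed byte:

```python
import numpy as np

def to_int8(float_vectors: np.ndarray) -> np.ndarray:
    """Scale each dimension by 127 and truncate to a signed 8-bit integer."""
    return np.clip(float_vectors * 127, -128, 127).astype(np.int8)

def int8_search(query: np.ndarray, db: np.ndarray, k: int = 10) -> np.ndarray:
    """Rank by inner product, computed in a wider dtype to avoid int8 overflow."""
    scores = db.astype(np.int32) @ query.astype(np.int32)
    return np.argsort(-scores)[:k]
```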
Binary + Rerank
- Use the binary index for fast retrieval.
- Take the top-k results from that initial pass, then rerank them precisely using the original float vectors (cosine similarity).
This hybrid strategy combines the speed and compactness of binary search with the accuracy of float-based reranking.
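A sketch of the two-stage flow; it reuses a binary Faiss index like the one above, and the candidate count `k0` is an assumption you would tune:

```python
import faiss
import numpy as np

def binary_rerank_search(query_vec: np.ndarray,
                         bin_index: faiss.IndexBinaryFlat,
                         float_vectors: np.ndarray,
                         k: int = 10, k0: int = 100) -> np.ndarray:
    """Stage 1: Hamming search over the binary index for k0 candidates.
    Stage 2: rerank those candidates by cosine similarity against the float vectors."""
    packed_query = np.packbits(query_vec[None, :] >= 0, axis=1)
    _, candidate_ids = bin_index.search(packed_query, k0)
    candidates = candidate_ids[0]

    cand_vecs = float_vectors[candidates]
    sims = cand_vecs @ query_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    return candidates[np.argsort(-sims)[:k]]
```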
Results
| name | doc_size | precision | mrr |
| --- | --- | --- | --- |
| binary-1000 | 1000 | 0.780 | 0.856940 |
| binary-10000 | 10000 | 0.750 | 0.823898 |
| binary-100000 | 100000 | 0.702 | 0.789464 |
| int8-1000 | 1000 | 0.950 | 0.975000 |
| int8-10000 | 10000 | 0.956 | 0.976667 |
| int8-100000 | 100000 | 0.943 | 0.969783 |
| binary_rerank-1000 | 1000 | 1.000 | 1.000000 |
| binary_rerank-10000 | 10000 | 1.000 | 1.000000 |
| binary_rerank-100000 | 100000 | 0.998 | 0.998000 |
Observations:
- Plain Binary Embedding dips to about 70% precision at 100k data points.
- Int8 Embedding maintains around 95% precision, even at larger scales.
- Binary + Rerank recovers almost float-level performance, reaching ~99.8% precision at 100k entries.
Interpreting the findings
- Binary Embedding: Maximizes storage savings but can reduce accuracy.
- Int8 Embedding: Better precision than pure binary, still lighter than full floats.
- Binary + Rerank: Retrieves with bits, refines with floats. The tradeoff is computational cost for that final pass.
Hypothesis: For extremely large datasets (e.g., millions of documents), you might need to pull more top-k candidates for reranking, which increases overall latency and compute usage.
Practical considerations
- Tradeoff Analysis: If your application can accept some accuracy loss, pure binary could drastically reduce storage.
- Reranking Cost: Hybrid methods need extra embedding lookups or computations for the top results.
- Caching: Caching float vectors of frequently accessed chunks can reduce reranking overhead (see the sketch after this list).
- Database Compatibility: Some vector DBs don’t optimize bitwise operations. Verify compatibility or use custom indexing.
- Scaling Up: Larger corpora may require more sophisticated balancing between retrieval speed and rerank accuracy.
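Here is a sketch of the caching idea from the list above; `load_float_vector` is a hypothetical loader that would fetch a chunk’s float vector from disk or your database:

```python
from functools import lru_cache

import numpy as np

def load_float_vector(chunk_id: int) -> np.ndarray:
    """Hypothetical loader: fetch the full float32 vector for one chunk from cold storage."""
    raise NotImplementedError

@lru_cache(maxsize=50_000)
def get_float_vector(chunk_id: int) -> np.ndarray:
    # Frequently reranked chunks stay in memory; cold ones are loaded on demand.
    return load_float_vector(chunk_id)
```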
💭 Thoughts
Binary Embedding offers a powerful way to compress massive embedding collections. If you can handle a small accuracy hit, or if you’re comfortable using a two-stage approach, you can shrink your embedding store by around 95%. As vector databases evolve, bitwise search may become a standard optimization for resource-intensive RAG solutions. If you’re currently limited by memory or infrastructure costs, flipping floats to bits might just be the smartest compression move you can make.
For more tips and insights, follow me on Twitter @Siddhant_K_code and stay updated with the latest & detailed tech content like this.