When working with OpenAI's embedding models, such as text-embedding-3-small, text-embedding-3-large, or text-embedding-ada-002, one of the most critical steps is chunking your text data. Chunking ensures that your text fits within the model's token limit while preserving context for downstream tasks like semantic search, clustering, or recommendation systems.
In this article, we'll explore the best practices for chunking text data and walk through a practical implementation in TypeScript. By the end, you'll understand why chunking is necessary, how to do it effectively, and how to generate embeddings for your text dataset.
Why Chunking Matters
OpenAI's embedding models have a token limit for each input. For example, text-embedding-3-small supports up to 8191 tokens per request. If your text exceeds this limit, you'll need to split it into smaller chunks.
However, chunking isn't just about staying within token limits. It's also about:
- Preserving Context: Concepts in text often span multiple sentences or paragraphs. Proper chunking ensures that the context is preserved across boundaries.
- Improving Embedding Quality: Smaller, well-structured chunks lead to embeddings that better capture the semantic meaning of the text.
- Optimizing Performance: Sending smaller chunks reduces the risk of hitting API limits and improves processing efficiency.
Best Practices for Chunking Text
Here are some best practices to follow when chunking text for embeddings:
- Use Token-Based Chunking: OpenAI models process text as tokens, not characters. A token is a unit of text (e.g., a word, part of a word, or punctuation). Using token-based chunking ensures you stay within the model's limits (see the short sketch after this list).
- Set an Optimal Chunk Size: A chunk size of 1000 tokens is a good starting point for most use cases. This size balances context preservation and computational efficiency.
- Add Overlap Between Chunks: Adding a small overlap (e.g., 20% of the chunk size) ensures that important context at the boundaries of chunks is not lost.
- Use Smart Splitters: Split text at logical boundaries, such as paragraphs or sentences, to avoid cutting off ideas mid-way.
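To make the first point concrete, here's a minimal sketch (assuming the tiktoken package installed in Step 1 below) that compares a string's character count to its token count:
import { encoding_for_model } from "tiktoken";
const encoder = encoding_for_model("text-embedding-3-small");
const sample = "Chunking keeps each piece of text within the model's token limit.";
// Characters and tokens are different units: a token is often a whole word,
// but it can also be a word fragment or a punctuation mark.
console.log("Characters:", sample.length);
console.log("Tokens:", encoder.encode(sample).length);
encoder.free();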
Implementing Chunking in TypeScript
Let's dive into a practical implementation of chunking and embedding generation using TypeScript. We'll use the langchain library for chunking and tiktoken for token counting.
Step 1: Install Dependencies
First, install the required libraries:
npm install langchain tiktoken
- langchain: Provides utilities for text splitting and working with embeddings.
- tiktoken: OpenAI's official tokenizer for accurate token counting.
Step 2: Chunking Text with Token-Based Splitting
Here's the full implementation:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { encoding_for_model } from "tiktoken";
async function generateChunksAndEmbeddings(text: string) {
// Initialize tiktoken encoder for the embedding model
const encoder = encoding_for_model("text-embedding-3-small");
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // Maximum tokens per chunk
chunkOverlap: 200, // 20% overlap to preserve context
separators: ["\n\n", "\n", " ", ""], // Logical split points
lengthFunction: (text) => {
// Get accurate token count using tiktoken
const tokens = encoder.encode(text);
return tokens.length;
},
});
// Split the text into chunks (the splitter calls lengthFunction, so the encoder must still be alive here)
const chunks = await textSplitter.createDocuments([text]);
// Free the encoder once splitting is done to release its memory
encoder.free();
// Initialize OpenAI embeddings
const embeddings = new OpenAIEmbeddings({
modelName: "text-embedding-3-small",
});
// Generate embeddings for each chunk
const vectorStore = await embeddings.embedDocuments(
chunks.map((chunk) => chunk.pageContent)
);
return { chunks, vectorStore };
}
Step 3: Breaking Down the Code
1. Token Counting with tiktoken
const encoder = encoding_for_model("text-embedding-3-small");
We use tiktoken to count tokens accurately. This ensures that our chunks respect the model's token limit. The encoding_for_model function initializes the tokenizer for the specific model.
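If the version of tiktoken you have installed doesn't recognize a newer model name, a common fallback (an assumption here, not something this article's code requires) is to load the underlying encoding directly; text-embedding-3-small uses cl100k_base:
import { get_encoding } from "tiktoken";
// cl100k_base is the encoding used by text-embedding-3-small,
// so token counts come out the same as with encoding_for_model.
const encoder = get_encoding("cl100k_base");
console.log(encoder.encode("Counting tokens with the base encoding.").length);
encoder.free();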
2. Recursive Character Splitting
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // Maximum tokens per chunk
chunkOverlap: 200, // 20% overlap to preserve context
separators: ["\n\n", "\n", " ", ""], // Logical split points
lengthFunction: (text) => {
const tokens = encoder.encode(text);
return tokens.length;
},
});
The RecursiveCharacterTextSplitter splits text into chunks based on logical separators (e.g., paragraphs, sentences, or words). It ensures that chunks are no larger than 1000 tokens and adds a 200-token overlap for context preservation.
The lengthFunction uses tiktoken to count tokens instead of characters, ensuring precise chunk sizing.
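If you only need the raw strings rather than Document objects, the same splitter also exposes splitText. A small sketch, reusing the textSplitter and text from the function above:
// splitText returns plain strings instead of Document objects,
// which is convenient when you don't need chunk metadata.
const rawChunks = await textSplitter.splitText(text);
console.log(`Produced ${rawChunks.length} chunk(s)`);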
3. Freeing the Encoder
encoder.free();
Once the text has been split, we free the encoder to release memory. This is a best practice when working with tiktoken.
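If the splitting step can throw (for example, on malformed input), a try/finally block guarantees the encoder is still released. This is a defensive variant of the code above, not a requirement:
const encoder = encoding_for_model("text-embedding-3-small");
try {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    lengthFunction: (chunk) => encoder.encode(chunk).length,
  });
  const docs = await splitter.createDocuments([text]);
  console.log(`Split into ${docs.length} chunk(s)`);
} finally {
  // Runs whether or not createDocuments throws, so the encoder's
  // memory is always released.
  encoder.free();
}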
4. Generating Embeddings
const embeddings = new OpenAIEmbeddings({
modelName: "text-embedding-3-small",
});
const vectorStore = await embeddings.embedDocuments(
chunks.map((chunk) => chunk.pageContent)
);
We use the OpenAIEmbeddings class to generate embeddings for each chunk. The embedDocuments method processes all chunks and returns their embeddings.
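Because embedDocuments returns plain arrays of numbers, you can compare the vectors directly. Here's a minimal semantic-search sketch; the cosineSimilarity helper and the query string are illustrative additions, and it assumes the embeddings instance and vectorStore array from the function are still in scope:
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Embed a query and rank the chunks by similarity to it.
const queryVector = await embeddings.embedQuery("How do I chunk text for embeddings?");
const ranked = vectorStore
  .map((vector, i) => ({ index: i, score: cosineSimilarity(queryVector, vector) }))
  .sort((a, b) => b.score - a.score);
console.log("Most similar chunk index:", ranked[0].index);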
Step 4: Using the Function
Here's how you can use the generateChunksAndEmbeddings function:
const text = `
OpenAI's embedding models are powerful tools for understanding and processing text.
They can be used for tasks like semantic search, clustering, and recommendation systems.
However, to use them effectively, you need to chunk your text data properly.
`;
const { chunks, vectorStore } = await generateChunksAndEmbeddings(text);
console.log("Chunks:", chunks);
console.log("Embeddings:", vectorStore);
Why This Approach Works
- Token-Based Splitting: Ensures chunks fit within the model's token limit.
- Overlap: Preserves context across chunk boundaries.
- Logical Splitting: Avoids cutting off sentences or paragraphs mid-way.
- Efficient Embedding Generation: Sends chunks to the API in batches rather than one request per chunk, which improves throughput.
Conclusion
Chunking is a critical step when working with OpenAI's embedding models. By following the best practices outlined in this article, you can ensure that your text data is processed efficiently and that the resulting embeddings are high-quality.
The provided TypeScript implementation is production-ready and leverages the latest tools (langchain and tiktoken) to handle chunking and tokenization accurately. Whether you're building a semantic search engine or a recommendation system, this approach will set you up for success.
Happy embedding! 🚀