When working with OpenAI's embedding models, such as text-embedding-3-small, text-embedding-3-large, or text-embedding-ada-002, one of the most critical steps is chunking your text data. Chunking ensures that your text fits within the model's token limit while preserving context for downstream tasks like semantic search, clustering, or recommendation systems.
In this article, we'll explore the best practices for chunking text data and walk through a practical implementation in TypeScript. By the end, you'll understand why chunking is necessary, how to do it effectively, and how to generate embeddings for your text dataset.
Why Chunking Matters
OpenAI's embedding models have a token limit for each input. For example, text-embedding-3-small supports up to 8191 tokens per request. If your text exceeds this limit, you'll need to split it into smaller chunks.
However, chunking isn't just about staying within token limits. It's also about:
- Preserving Context: Concepts in text often span multiple sentences or paragraphs. Proper chunking ensures that the context is preserved across boundaries.
- Improving Embedding Quality: Smaller, well-structured chunks lead to embeddings that better capture the semantic meaning of the text.
- Optimizing Performance: Sending smaller chunks reduces the risk of hitting API limits and improves processing efficiency.
Best Practices for Chunking Text
Here are some best practices to follow when chunking text for embeddings:
- Use Token-Based Chunking: OpenAI models process text as tokens, not characters. A token is a unit of text (e.g., a word, part of a word, or punctuation). Using token-based chunking ensures you stay within the model's limits (see the short sketch after this list).
- Set an Optimal Chunk Size: A chunk size of 1000 tokens is a good starting point for most use cases. This size balances context preservation and computational efficiency.
- Add Overlap Between Chunks: Adding a small overlap (e.g., 20% of the chunk size) ensures that important context at the boundaries of chunks is not lost.
- Use Smart Splitters: Split text at logical boundaries, such as paragraphs or sentences, to avoid cutting off ideas mid-way.
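To make the first point concrete, here's a minimal sketch (assuming the tiktoken package installed in Step 1 below) that compares a string's character count to its token count:
import { encoding_for_model } from "tiktoken";
const encoder = encoding_for_model("text-embedding-3-small");
const sample = "Chunking keeps each piece of text within the model's token limit.";
// Characters and tokens are different units: a token is often a whole word,
// but it can also be a word fragment or a punctuation mark.
console.log("Characters:", sample.length);
console.log("Tokens:", encoder.encode(sample).length);
encoder.free();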
Implementing Chunking in TypeScript
Let's dive into a practical implementation of chunking and embedding generation using TypeScript. We'll use the langchain library for chunking and tiktoken for token counting.
Step 1: Install Dependencies
First, install the required libraries:
npm install langchain tiktoken
- langchain: Provides utilities for text splitting and working with embeddings.
- tiktoken: OpenAI's official tokenizer for accurate token counting.
Step 2: Chunking Text with Token-Based Splitting
Here's the full implementation:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { encoding_for_model } from "tiktoken";
async function generateChunksAndEmbeddings(text: string) {
// Initialize tiktoken encoder for the embedding model
const encoder = encoding_for_model("text-embedding-3-small");
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // Maximum tokens per chunk
chunkOverlap: 200, // 20% overlap to preserve context
separators: ["\n\n", "\n", " ", ""], // Logical split points
lengthFunction: (text) => {
// Get accurate token count using tiktoken
const tokens = encoder.encode(text);
return tokens.length;
},
});
// Split the text into chunks (the splitter calls lengthFunction, so the encoder must still be alive here)
const chunks = await textSplitter.createDocuments([text]);
// Free the encoder once splitting is done to release its memory
encoder.free();
// Initialize OpenAI embeddings
const embeddings = new OpenAIEmbeddings({
modelName: "text-embedding-3-small",
});
// Generate embeddings for each chunk
const vectorStore = await embeddings.embedDocuments(
chunks.map((chunk) => chunk.pageContent)
);
return { chunks, vectorStore };
}
Step 3: Breaking Down the Code
1. Token Counting with tiktoken
const encoder = encoding_for_model("text-embedding-3-small");
We use tiktoken to count tokens accurately. This ensures that our chunks respect the model's token limit. The encoding_for_model function initializes the tokenizer for the specific model.
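If the version of tiktoken you have installed doesn't recognize a newer model name, a common fallback (an assumption here, not something this article's code requires) is to load the underlying encoding directly; text-embedding-3-small uses cl100k_base:
import { get_encoding } from "tiktoken";
// cl100k_base is the encoding used by text-embedding-3-small,
// so token counts come out the same as with encoding_for_model.
const encoder = get_encoding("cl100k_base");
console.log(encoder.encode("Counting tokens with the base encoding.").length);
encoder.free();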
2. Recursive Character Splitting
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // Maximum tokens per chunk
chunkOverlap: 200, // 20% overlap to preserve context
separators: ["\n\n", "\n", " ", ""], // Logical split points
lengthFunction: (text) => {
const tokens = encoder.encode(text);
return tokens.length;
},
});
The RecursiveCharacterTextSplitter splits text into chunks based on logical separators (e.g., paragraphs, sentences, or words). It ensures that chunks are no larger than 1000 tokens and adds a 200-token overlap for context preservation.
The lengthFunction uses tiktoken to count tokens instead of characters, ensuring precise chunk sizing.
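If you only need the raw strings rather than Document objects, the same splitter also exposes splitText. A small sketch, reusing the textSplitter and text from the function above:
// splitText returns plain strings instead of Document objects,
// which is convenient when you don't need chunk metadata.
const rawChunks = await textSplitter.splitText(text);
console.log(`Produced ${rawChunks.length} chunk(s)`);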
3. Freeing the Encoder
encoder.free();
Once the text has been split, we free the encoder to release memory. This is a best practice when working with tiktoken.
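If the splitting step can throw (for example, on malformed input), a try/finally block guarantees the encoder is still released. This is a defensive variant of the code above, not a requirement:
const encoder = encoding_for_model("text-embedding-3-small");
try {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    lengthFunction: (chunk) => encoder.encode(chunk).length,
  });
  const docs = await splitter.createDocuments([text]);
  console.log(`Split into ${docs.length} chunk(s)`);
} finally {
  // Runs whether or not createDocuments throws, so the encoder's
  // memory is always released.
  encoder.free();
}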
4. Generating Embeddings
const embeddings = new OpenAIEmbeddings({
modelName: "text-embedding-3-small",
});
const vectorStore = await embeddings.embedDocuments(
chunks.map((chunk) => chunk.pageContent)
);
We use the OpenAIEmbeddings class to generate embeddings for each chunk. The embedDocuments method processes all chunks and returns their embeddings.
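Because embedDocuments returns plain arrays of numbers, you can compare the vectors directly. Here's a minimal semantic-search sketch; the cosineSimilarity helper and the query string are illustrative additions, and it assumes the embeddings instance and vectorStore array from the function are still in scope:
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Embed a query and rank the chunks by similarity to it.
const queryVector = await embeddings.embedQuery("How do I chunk text for embeddings?");
const ranked = vectorStore
  .map((vector, i) => ({ index: i, score: cosineSimilarity(queryVector, vector) }))
  .sort((a, b) => b.score - a.score);
console.log("Most similar chunk index:", ranked[0].index);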
Step 4: Using the Function
Here's how you can use the generateChunksAndEmbeddings function:
const text = `
OpenAI's embedding models are powerful tools for understanding and processing text.
They can be used for tasks like semantic search, clustering, and recommendation systems.
However, to use them effectively, you need to chunk your text data properly.
`;
const { chunks, vectorStore } = await generateChunksAndEmbeddings(text);
console.log("Chunks:", chunks);
console.log("Embeddings:", vectorStore);
Why This Approach Works
- Token-Based Splitting: Ensures chunks fit within the model's token limit.
- Overlap: Preserves context across chunk boundaries.
- Logical Splitting: Avoids cutting off sentences or paragraphs mid-way.
- Efficient Embedding Generation: Sends chunks to the API in batches rather than one request per chunk, which improves throughput.
Conclusion
Chunking is a critical step when working with OpenAI's embedding models. By following the best practices outlined in this article, you can ensure that your text data is processed efficiently and that the resulting embeddings are high-quality.
The provided TypeScript implementation is production-ready and leverages the latest tools (langchain and tiktoken) to handle chunking and tokenization accurately. Whether you're building a semantic search engine or a recommendation system, this approach will set you up for success.
Happy embedding! 🚀