Ingesting documents using .NET to build a simple Retrieval Augmented Generation (RAG) system

Here is a quick post summarising how to use .NET Semantic Kernel and Qdrant to ingest markdown documents. One of the comments on a recent post on this topic asked why Python was used for ingestion instead of .NET. That was a personal preference at the time, but using .NET with Semantic Kernel to ingest documents for a simple pipeline is not necessarily any more work.

In this post, we will go through the ingestion process utilising high-level libraries available to us in the .NET ecosystem:

  • .NET Semantic Kernel and related connectors for managing the vector store
  • LangChain .NET for chunking
  • .NET Aspire to bring it all together using one of the inference APIs (Ollama on the host, Ollama as a container managed by Aspire, or OpenAI)

Use case

In the Python version, we can either pull the documents from a GitHub repository or use a file generated by the GitIngest UI. GitIngest is an open source library that allows consumers to scrape public repositories from GitHub, or to download a file manually using the Web UI linked earlier.

In this case, we have a single file that contains the markdown and .yml files from the official .NET Aspire documentation repository. This file is generated by the GitIngest UI and contains around 180 files concatenated into a single text file.

Ingestion Process

File Format

The ingestion process in this example is straightforward, and we follow the steps illustrated below.

Ingestion Process

Splitting actual files

As we are using a single file containing multiple .md and .yml files as described above, the first step is to split it into (filename, file content) pairs.

The files are separated by headers as follows:

... content
================================================
File: README.md
================================================
... content

Given this is a throwaway example, the code below is just enough to demonstrate the process without too many distractions.

public static class GitIngestFileSplitter
{
    private const string SeparatorLine = "=====================";
    private const string FilePrefix = "File:";

    public static Dictionary<string, string> ParseContent(string content)
    {
        var result = new Dictionary<string, string>();
        var lines = content.Split('\n');
        string? currentFileName = null;
        var contentBuilder = new StringBuilder();
        var isCollectingContent = false;
        var skipNextSeparatorLine = false;

        foreach (var line in lines)
        {
            if (line.Trim().Contains(SeparatorLine))
            {
                if (currentFileName != null && isCollectingContent && !skipNextSeparatorLine)
                {
                    result[currentFileName] = contentBuilder.ToString().TrimEnd();
                    contentBuilder.Clear();
                    currentFileName = null;
                    isCollectingContent = false;
                    skipNextSeparatorLine = false;
                    continue;
                }
            }
            switch (isCollectingContent)
            {
                case false when line.StartsWith(FilePrefix):
                    currentFileName = line.Replace(FilePrefix, "").Trim();
                    isCollectingContent = true;
                    skipNextSeparatorLine = true;
                    continue;
                case true when currentFileName != null:
                {
                    skipNextSeparatorLine = false;
                    if (!line.Trim().Contains(SeparatorLine) && !string.IsNullOrWhiteSpace(line))
                    {
                        contentBuilder.AppendLine(line);
                    }

                    break;
                }
            }
        }

        // Don't forget to add the last file if there is one
        if (currentFileName != null && contentBuilder.Length > 0)
        {
            result[currentFileName] = contentBuilder.ToString().TrimEnd();
        }
        return result;
    }
}
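To make the parsing behaviour concrete, here is a minimal, self-contained sketch of the same idea. The class name and sample input are made up for illustration; it is not the project's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MiniGitIngestSplitter
{
    // Scan line by line: "File: <name>" between separator lines starts a
    // new file, and everything until the next header is that file's content.
    public static Dictionary<string, string> Parse(string content)
    {
        var result = new Dictionary<string, string>();
        string? currentFile = null;
        var sb = new StringBuilder();
        foreach (var raw in content.Split('\n'))
        {
            var line = raw.TrimEnd('\r');
            if (line.StartsWith("====="))
            {
                continue; // separator lines carry no content
            }
            if (line.StartsWith("File:"))
            {
                if (currentFile != null)
                {
                    result[currentFile] = sb.ToString().TrimEnd();
                    sb.Clear();
                }
                currentFile = line.Substring("File:".Length).Trim();
                continue;
            }
            if (currentFile != null)
            {
                sb.AppendLine(line);
            }
        }
        if (currentFile != null)
        {
            result[currentFile] = sb.ToString().TrimEnd();
        }
        return result;
    }
}
```

Feeding it the sample header format shown earlier yields one dictionary entry per embedded file, keyed by the path after `File:`.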

Chunking

Now that we have a dictionary of file names and file contents, we need to get chunks for the file contents.

In this case, I have opted to experiment with the LangChain .NET project. We are using MarkdownHeaderTextSplitter and CharacterTextSplitter from LangChain .NET.

...
public class GitIngestChunker : IChunker
{
    // declarations / constructor omitted.
    public async IAsyncEnumerable<FileChunks> GetChunks(string gitIngestFilePath)
    {
        // Read the text file (this is the single file containing all markdown files)
        var gitIngestFileContent = await File.ReadAllTextAsync(gitIngestFilePath);
        // Split the files as discussed earlier
        var files = GitIngestFileSplitter.ParseContent(gitIngestFileContent);
        // Start chunking each split file.
        foreach (var file in files)
        {
            using var chunkingTimer = new MetricTimer(_metrics, MetricNames.Chunking);            
            // omitted: get TextSplitter for given file type.            
            var fileChunks = new FileChunks(file.Key, []);
            var chunks = splitter.SplitText(file.Value);
            // We are using the markdown header splitter, so if the generated chunks are large, we need to keep chunking them.
            if (chunks.Any(x => x.Length > 600))
            {
                foreach (var chunk in chunks)
                {
                    if (chunk.Length > 600)
                    {
                        var subChunks = _characterSplitter.SplitText(chunk);
                        fileChunks.Chunks.AddRange(subChunks);
                    }
                    else
                    {
                        fileChunks.Chunks.Add(chunk);
                    }
                }
            }
            else
            {
                fileChunks.Chunks.AddRange(chunks);
            }
            // return the chunks representing the current markdown or yml file
            yield return fileChunks;
        }
    }
    public bool CanChunk(DocumentType documentType)
    {
        return documentType == DocumentType.GitIngest;
    }
}
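The oversize-chunk fallback above can be isolated into a small standalone helper. This is an illustrative sketch only: the naive fixed-size cut below stands in for LangChain's CharacterTextSplitter, and the names are invented:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkPostProcessor
{
    // Any chunk longer than maxLen is re-split into fixed-size pieces
    // (a stand-in for a real character splitter); shorter chunks pass through.
    public static List<string> EnforceMaxLength(IEnumerable<string> chunks, int maxLen = 600)
    {
        var result = new List<string>();
        foreach (var chunk in chunks)
        {
            if (chunk.Length <= maxLen)
            {
                result.Add(chunk);
                continue;
            }
            for (var i = 0; i < chunk.Length; i += maxLen)
            {
                result.Add(chunk.Substring(i, Math.Min(maxLen, chunk.Length - i)));
            }
        }
        return result;
    }
}
```

A real character splitter would cut on separators with overlap rather than at hard offsets, but the control flow mirrors the chunker above: header-based splitting first, length enforcement second.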

Getting embedding for the chunks

We are using Semantic Kernel, so this part is straightforward and will work with whichever API we choose to use. Given we have so far split the file and got the chunks for each document, we can use the registered ITextEmbeddingGenerationService (this is driven by app and Aspire configuration) to compute the embeddings using the inference approach we have configured.

We also track some custom metrics that are visible on the Aspire Dashboard as we perform ingestion.

...
public class IngestionPipeline(
    Kernel kernel, ...
{
    private readonly ITextEmbeddingGenerationService _embeddingGenerator =
        kernel.GetRequiredService<ITextEmbeddingGenerationService>();

    public async Task IngestDataAsync(string filePath, DocumentType documentType)
    {
        ... get chunks
        await foreach (var fileChunk in documentChunker.GetChunks(filePath))
        {
            IList<ReadOnlyMemory<float>>? embeddings = null;

            using (new MetricTimer(metrics,
                       MetricNames.Embedding, new KeyValuePair<string, object?>("File", filePath),
                       new KeyValuePair<string, object?>("EmbeddingModel", configuration.Value.EmbeddingModel)))
            {
                embeddings = await _embeddingGenerator.GenerateEmbeddingsAsync(fileChunk.Chunks);
            }
            ... rest of the method
        }
    }    
    ... rest of the class
}


Inserting the vectors

Now that we have the embeddings, we need to insert them. This process involves a few steps:

  • Mapping a .NET class to a vector store document
  • Ensuring the collection exists (optionally recreating it)
  • Using the correct dimensions for the collection, which depend on the embedding model we use

Mapping

Microsoft has good documentation on how to build custom mappers for Vector Store Connectors, so I will not repeat it here. However, it is worth covering some aspects at a high level.

We could use attributes for mapping, but this demo supports multiple embedding models, each with different embedding vector dimensions, so using attributes would mean hardcoding these.

We can, however, define our VectorStoreRecordDefinition in code so that we can choose the correct dimensions for our collection at runtime.

So our mapping can be as simple as the following snippet from QdrantCollectionFactory.cs:

    private static readonly Dictionary<string, int> EmbeddingModels = new()
    {
        { "mxbai-embed-large", 1024 },
        { "nomic-embed-text", 768 },
        { "granite-embedding:30m", 384 }
    };


    private readonly VectorStoreRecordDefinition _faqRecordDefinition = new()
    {
        Properties = new List<VectorStoreRecordProperty>
        {
            new VectorStoreRecordKeyProperty("Id", typeof(Guid)),
            new VectorStoreRecordDataProperty("Content",
                typeof(string)) { IsFilterable = true, StoragePropertyName = "page_content" },
            new VectorStoreRecordDataProperty("Metadata", typeof(FileMetadata))
            {
                IsFullTextSearchable = false, StoragePropertyName = "metadata"
            },
            new VectorStoreRecordVectorProperty("Vector", typeof(ReadOnlyMemory<float>))
            {
                Dimensions = EmbeddingModels.GetValueOrDefault(embeddingModel, 384),
                DistanceFunction = DistanceFunction.CosineSimilarity, IndexKind = IndexKind.Hnsw,
                StoragePropertyName = "page_content_vector"
            },
        }
    };
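The model-to-dimension lookup can be pulled into a tiny helper so the fallback is in one place. A minimal sketch, reusing the model names from the table above (the class name is invented for illustration):

```csharp
using System;
using System.Collections.Generic;

public static class EmbeddingDimensions
{
    // Known embedding models and their vector dimensions, as in the snippet above.
    private static readonly Dictionary<string, int> Known = new()
    {
        { "mxbai-embed-large", 1024 },
        { "nomic-embed-text", 768 },
        { "granite-embedding:30m", 384 }
    };

    // Falls back to 384 for unknown models, matching the record definition above.
    public static int For(string model) => Known.GetValueOrDefault(model, 384);
}
```

The key point is that the dimension is resolved from configuration at runtime rather than being hardcoded in a mapping attribute.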

When bootstrapping, we can then use our factory and register it with .NET Semantic Kernel so that whenever we inject an IVectorStore, our mappers are integrated into the pipeline.

        var options = new QdrantVectorStoreOptions
        {
            HasNamedVectors = true,
            VectorStoreCollectionFactory = new QdrantCollectionFactory(embeddingModelName)
        };
        kernelBuilder.AddQdrantVectorStore(options: options);

Inserting vectors to our collection

Once we handle the registration and configuration, we are ready to consume IVectorStore in our code and make use of it. So in our IngestionPipeline.cs we need to perform the following:

  • Ensure the collection exists:
    • Create it if it does not, or recreate it if required.
  • Insert the vectors as below:
// .NET Semantic Kernel's vector store support is experimental so we need to opt in to use it.
#pragma warning disable SKEXP0001
.... code omitted
public class IngestionPipeline(
    IVectorStore vectorStore,
    AspireRagDemoIngestionMetrics metrics)
{
    private readonly IVectorStoreRecordCollection<Guid, FaqRecord> _faqCollection =
        vectorStore.GetCollection<Guid, FaqRecord>(configuration.Value.VectorStoreCollectionName);

    public async Task IngestDataAsync(string filePath, DocumentType documentType)
    {
        await EnsureCollectionExists(true);
        var documentsProcessed = 0;
        .... code omitted
        using var ingestionTimer = new MetricTimer(metrics,
            MetricNames.DocumentIngestion, new KeyValuePair<string, object?>("File", filePath),
            new KeyValuePair<string, object?>("EmbeddingModel", configuration.Value.EmbeddingModel));
        await foreach (var fileChunk in documentChunker.GetChunks(filePath))
        {
            metrics.RecordProcessedChunkCount(fileChunk.Chunks.Count);
            // embeddings computed via ITextEmbeddingGenerationService as shown earlier
            for (var i = 0; i < fileChunk.Chunks.Count; i++)
            {
                var faqRecord = new FaqRecord
                {
                    Id = Guid.NewGuid(),
                    Content = fileChunk.Chunks[i],
                    Vector = embeddings[i],
                    Metadata = new FileMetadata
                    {
                        FileName = new StringValue { Value = fileChunk.FileName }
                    }
                };
                await _faqCollection.UpsertAsync(faqRecord);
            }
            documentsProcessed++;
        }
        metrics.RecordProcessedDocumentCount(documentsProcessed);
    }

    private async Task EnsureCollectionExists(bool forceRecreate = false)
    {
        var collectionExists = await _faqCollection.CollectionExistsAsync();
        switch (collectionExists)
        {
            case true when !forceRecreate:
                return;
            case true:
                await _faqCollection.DeleteCollectionAsync();
                break;
        }

        await _faqCollection.CreateCollectionAsync();
    }
}

Summary

In this quick post, we have covered using TextSplitters from LangChain .NET, Vector Stores and embedding models via .NET Semantic Kernel, and some custom metrics captured during ingestion.

Without much code, we can get impressive results using what is available to us in the .NET world. If you would like to see the results, here is how:

  • Clone the repository
  • Use the http-ollama-local configuration in the AppHost project.
  • Run the Aspire project.
  • Wait for the models to be downloaded and started.
  • Then use src/AspireRagDemo.API/AspireRagDemo.API.http and execute the http://localhost:5026/ingest?fileName=dotnet-docs-aspire.txt call. Depending on model size and CPU, this can take somewhere between 30 seconds and 15 minutes.
  • Once ingestion has completed, access the UI from the Aspire Dashboard and run some Aspire-related queries.

RAG query: Is .NET Aspire a replacement for Kubernetes?

In addition, feel free to explore the metrics as below:

Custom metrics for the demo

Embedding timings
