Building a simple Retrieval Augmented Generation system using .Net Aspire

In this post, we will look into building a simple Retrieval Augmented Generation (RAG) system that uses Jupyter Notebooks for ingestion and a .NET Web API for the retrieval and generation parts, orchestrated with .NET Aspire and with telemetry flowing from both the Python and C# components of the system.

We will be looking into the following components to build our system:

  • Vector store: Qdrant with Aspire.Hosting.Qdrant package.
  • Ingestion: Jupyter Notebooks
    • LangChain for ingestion and OpenTelemetry for exporting telemetry to the Aspire Dashboard.
  • Experimental UI: Streamlit.
  • Embeddings and Generative models:
    • Ollama using CommunityToolkit.Aspire.Hosting.Ollama package.
    • Ollama hosted on the development machine (without Docker)
    • OpenAI
    • HuggingFace
  • API: ASP.NET Core Web API with .NET 9
    • Microsoft Semantic Kernel.

There are several posts about how to integrate Ollama, OpenAI, Semantic Kernel and emerging open source models. This post will focus on how the Aspire 9 networking enhancements help us build and debug systems that use multiple languages and frameworks, and how we can switch models and model providers with a few lines of configuration change. In addition, we will also look into how to utilise hardware acceleration when it is not available via Docker (mainly on macOS devices).

This post will focus on the following:

  • How we can use .NET Aspire for polyglot solutions where some components might be better off in a different programming language.
  • How the improved Docker networking support in Aspire 9 helps us.
  • How to utilise the power of configuration in Aspire to run Ollama either as a container or as an application on the host machine without changing any code.
    • Likewise, how to swap Ollama for OpenAI or HuggingFace inference endpoints.

The use case in this post is ingesting the .NET Aspire documentation repository and using a RAG approach to answer questions about .NET Aspire. Building such a system is easy, but not necessarily helpful if we don't have any metrics to measure success. We are not covering evaluation in this post; that will be the main subject of the next post on the topic. To achieve our use case, we will be utilising Gitingest, a Python library that scrapes GitHub repositories into a format that is easy to parse and ingest. The Python code can use the library directly or consume a text file produced by Gitingest.

Retrieval Augmented Generation (RAG)

There is almost universal awareness that the technology and architecture behind Large Language Models (LLMs) is prone to being creative and making things up. The likes of Gary Marcus and Grady Booch have been trying to raise awareness of what the architectures enabling LLMs are and what they are not.

So if a given technology is good in some areas but has well-known limitations in others, such as being creative with facts, how can we play to its strengths?

One such approach is Retrieval Augmented Generation. One of the earlier papers using the term "Retrieval Augmented Generation" is “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Lewis, Patrick, et al. (2020).

A simple RAG system relies on in-context learning: a general purpose LLM is used to summarise or extract the answer to a question, given related context retrieved from a vector store. In the next section we'll cover these building blocks.

Embeddings and Vector Storage

The R in RAG stands for Retrieval, and this is where the strength of the approach comes from. Given a repository of data, if we can retrieve relevant content (context) from a vector store, we can then pass the best matches for our query to a generative model to produce the answer we are looking for.
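
To make "best matches" concrete: vector stores rank stored chunks by a similarity measure between embedding vectors, with cosine similarity being a common choice. Here is a minimal, library-free C# illustration of that measure (the stores themselves use approximate nearest neighbour indexes rather than a brute-force loop):

// Cosine similarity between two embedding vectors.
// Values close to 1 mean the vectors point in a similar direction (similar meaning),
// values close to 0 mean they are largely unrelated.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}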

The ingestion process and impact of chunking

Ingestion for RAG

The ingestion process involves:

  • Enumerating the documents (in this case we are dealing with text only, where multiple Markdown and YAML files are merged into a single text file).
  • Breaking them down into chunks to make them manageable (chunking).
    • Typically, there is a lot to consider here. Some documents have a hierarchy which can be utilised when chunking, while others might be fine with a simple fixed-size split (a sketch of that simple approach follows below).
  • Getting the embeddings using an embedding model (the same model needs to be used for the retrieval stage later too).
  • Adding them to our Vector Store.

For more information on chunking, follow the links at the bottom of the post.
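
The chunking in this project happens in Python with LangChain inside the notebook, but the simple fixed-size approach mentioned above is language agnostic. Purely to illustrate the idea, here is a rough C# sketch with illustrative sizes rather than the values used in the project:

// Naive fixed-size chunking with overlap between consecutive chunks.
// Hierarchy-aware splitters (for example, splitting on Markdown headings first)
// usually retrieve better context, but this is the simplest possible baseline.
static IEnumerable<string> ChunkText(string text, int chunkSize = 1000, int overlap = 200)
{
    for (var start = 0; start < text.Length; start += chunkSize - overlap)
    {
        var length = Math.Min(chunkSize, text.Length - start);
        yield return text.Substring(start, length);

        if (start + length >= text.Length)
        {
            yield break;
        }
    }
}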

Retrieval and Generation

Retrieval and Generation in RAG

Once the data is ingested and the vector store is up to date, we can query our RAG system as illustrated in the diagram above.

The steps are:

  • Get embeddings for the query using the same embedding model utilised for ingestion.
  • Query the vector store for n nearest results matching our input.
  • Build the context and run our prompt against our generative model.

.Net Aspire, .NET and Python

There is experimental support for running Python projects as executables in an Aspire Application Host. However, it is also possible to run containers from a Dockerfile, which can provide more flexibility.

Jupyter Notebooks

We could run Jupyter directly from a prebuilt image. However, if we need additional modules or any customisation, using our own Dockerfile and requirements file ensures the notebook is available immediately (once built), so we don't have to install the same packages each time the container is recreated.

FROM quay.io/jupyter/minimal-notebook:python-3.12.8

# Install libmagic as a system-level dependency for the Python packages in requirements.txt.
USER root
RUN apt-get update && apt-get install -y libmagic-dev

# Bake the Python dependencies into the image so they survive container recreation.
RUN mkdir /app
COPY requirements.txt /app

RUN pip install -r /app/requirements.txt

# Switch back to the non-root notebook user before starting Jupyter.
USER ${NB_UID}
ENTRYPOINT ["start-notebook.sh"]

We can then run this as a container in our AppHost project:


var jupyter = builder
    .AddDockerfile(Constants.ConnectionStringNames.JupyterService, "./Jupyter")
    .WithBuildArg("PORT", applicationPorts[Constants.ConnectionStringNames.JupyterService])
    .WithArgs($"--NotebookApp.token={jupyterLocalSecret.Resource.Value}")
    .WithBindMount("./Jupyter/Notebooks/", "/home/jovyan/work");

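For completeness, the Qdrant and Ollama resources that the notebook talks to can be added to the same AppHost. The following is a hedged sketch rather than the project's exact code: the resource names and data volumes are illustrative, and in the real project the Ollama resource is only created when the configured provider asks for a containerised Ollama.

// Qdrant vector store via the Aspire.Hosting.Qdrant package.
var qdrant = builder.AddQdrant("qdrant")
    .WithDataVolume();

// Ollama container via CommunityToolkit.Aspire.Hosting.Ollama, pulling a model on startup.
var ollama = builder.AddOllama("ollama")
    .WithDataVolume()
    .AddModel("mistral");

// Let the Jupyter container resolve both services by their names on the Docker network.
jupyter
    .WithReference(qdrant)
    .WithReference(ollama);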

Streamlit UI

Given this was intended as a quick experiment to understand how the pieces plug together, using Streamlit made sense.

However, as Streamlit is started using a CLI, it seemed easier to run it as a container too.

FROM python:3.9-slim
ARG PORT=8501
ENV APP_PORT=$PORT
WORKDIR /app

# curl is used by the health check below and is not included in the slim base image.
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app
RUN pip3 install -r /app/requirements.txt

COPY main.py /app
COPY TraceSetup.py /app
COPY entrypoint.sh /app
RUN chmod +x /app/entrypoint.sh

EXPOSE ${PORT}

# Run as a dedicated non-root user.
RUN groupadd -r -g 65532 replitui && useradd --create-home --shell /bin/bash --uid 65532 -g replitui ui_user
USER 65532:65532

# Use APP_PORT (an ENV) rather than the build-time ARG PORT, which is not available at runtime.
HEALTHCHECK CMD curl --fail http://localhost:${APP_PORT}/_stcore/health

ENTRYPOINT ["bash", "/app/entrypoint.sh"]

Using the framework, it only takes a few lines of code to have the basic components needed for our UI. The Python and bash code are linked at the corresponding branch for this article.

Web API with Semantic Kernel

To utilise Semantic Kernel, we need to define our prompt as well as a PromptTemplateConfig.

In our prompt, we define input placeholders for the context and the question. Then, in the PromptTemplateConfig, we link the prompt and define the two input variables that will be supplied at runtime. For the RAG query, they are defined as below:

    private const string RagPromptTemplate = """
                                             You are a helpful AI assistant specialised in technical questions and good at utilising additional technical resources provided to you as additional context.
                                             Use the following context to answer the question. You always bring the necessary references.
                                             You prefer a good summary over a long explanation but also provide clear justification for the answer.
                                             If the question has absolutely no relevance to the context, please answer "I don't know the answer."
                                             Please do not include the question in the answer. You can sometimes make educated guesses if the context can imply the answer.

                                             Context:
                                             {{$context}}

                                             Question:
                                             {{$question}}                                             
                                             """;

    /// <summary>
    /// To answer the question, the AI assistant will use the provided context.
    /// </summary>
    public static readonly PromptTemplateConfig RagPromptConfig = new()
    {
        Template = RagPromptTemplate,
        InputVariables =
        [
            new InputVariable { Name = "context" },
            new InputVariable { Name = "question" }
        ]
    };

With the configuration out of the way, we can build a compact C# class that puts it all together for us, as below. The notable sections are:

  • GetContextFromVectorStore, where we query our vector store by getting embeddings for the user's question.
  • AnswerWithAdditionalContext, where we create a kernel function and execute it, passing arguments containing the user's question and the additional context retrieved from our vector store.
// omit using
#pragma warning disable SKEXP0001
public class ChatClient(
    Kernel kernel,
    IVectorStore vectorStore, 
    IOptions<ModelConfiguration> configuration,
    ILogger<ChatClient> logger) : IChatClient
{
    private const short TopSearchResults = 20;
    private readonly ITextEmbeddingGenerationService _embeddingGenerator = ...

    private readonly IVectorStoreRecordCollection<Guid, FaqRecord> _faqCollection = ....

    // additional methods omitted for brevity.

    private async Task<string> AnswerWithAdditionalContext(string context, string question)
    {
        var arguments = new KernelArguments
        {
            { "context", context },
            { "question", question }
        };

        var kernelFunction = kernel.CreateFunctionFromPrompt(PromptConstants.RagPromptConfig);
        var result = await kernelFunction.InvokeAsync(kernel, arguments);
        return result.ToString();
    }

    /// <summary>
    /// Get context from the vector store based on the question.
    ///  This method uses the vector store to search for the most relevant context based on the question:
    ///      1. Retrieve the embeddings using the embedding model
    ///      2. Search the vector store for the most relevant context based on the embeddings.
    ///      3. Return the context as a string.
    /// </summary>
    /// <param name="question"></param>
    /// <returns>Vector Search Results.</returns>
    private async Task<string> GetContextFromVectorStore(string question)
    {
        var questionVectors =
            await _embeddingGenerator.GenerateEmbeddingsAsync([question]);

        var stbContext = new StringBuilder();

        var searchResults = await _faqCollection.VectorizedSearchAsync(questionVectors[0],
            new VectorSearchOptions() { Top = TopSearchResults });

        await foreach (var item in searchResults.Results)
        {
            stbContext.AppendLine(item.Record.Content);
        }

        return stbContext.ToString();
    }
}

With very little code, we have a functioning RAG system, including a barebones UI for local testing.
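
How the ChatClient is exposed over HTTP is not shown in this post. Purely as an illustration, a minimal endpoint along the following lines is enough; the route, the request shape and the AskAsync method name are hypothetical stand-ins for the project's actual contract, and IChatClient here is the project's own interface implemented by ChatClient.

var builder = WebApplication.CreateBuilder(args);
// Semantic Kernel, vector store and ChatClient registrations omitted for brevity.
var app = builder.Build();

// Accept a question, run the RAG flow and return the generated answer.
app.MapPost("/api/chat/rag", async (QuestionRequest request, IChatClient chatClient) =>
{
    // AskAsync stands in for whatever public method ChatClient exposes to combine
    // GetContextFromVectorStore and AnswerWithAdditionalContext.
    var answer = await chatClient.AskAsync(request.Question);
    return Results.Ok(new { answer });
});

app.Run();

public record QuestionRequest(string Question);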

Aspire - Docker Networking: Communication Between Components

There are three different ways for application components to communicate:

  • Container to container
  • Container to host
  • Host to container

Aspire 9 creates a Docker network which supports all these communication options.

Demo application networking

Container to container

When using Docker Compose, we can use service names to connect from one container to another on the same Docker network; the Docker network created by Aspire 9 works the same way.

In the demo application, the Jupyter Notebook container can connect to the Ollama container (if Ollama is running as a container; more on this later) and the Qdrant container using their service names.

Container to host

The Aspire Dashboard in our project runs as an executable as opposed to a container. This means that, for the containers to use OpenTelemetry, the OTLP endpoint of the Aspire Dashboard running as an executable on our host machine needs to be accessible from the containers.

In this case, it is not possible to use localhost as the destination from inside a container, so we use host.docker.internal as the OTLP collector URL instead. This way, containers can reach services running on the host machine too.
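
Here is a hedged sketch of how the AppHost can pass that endpoint to the containerised Python components, assuming the dashboard's OTLP endpoint is visible to the AppHost via the DOTNET_DASHBOARD_OTLP_ENDPOINT_URL configuration value and that the Python side reads the standard OTLP exporter variable:

// Rewrite the dashboard's OTLP endpoint so that it is reachable from inside containers,
// then hand it to the Jupyter container as the standard OTLP exporter variable.
var otlpEndpoint = builder.Configuration["DOTNET_DASHBOARD_OTLP_ENDPOINT_URL"]
    ?.Replace("localhost", "host.docker.internal");

jupyter.WithEnvironment("OTEL_EXPORTER_OTLP_ENDPOINT", otlpEndpoint ?? string.Empty);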

Host to container

This is the case for our .NET Web API project, which runs as an executable process on our host machine and can access all containerised services using localhost and the corresponding mapped ports.
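
Aspire takes care of this when the Web API project references the container resources: the injected connection strings resolve to localhost and the mapped ports. A sketch, assuming the project is registered as Api in the solution and reusing the qdrant and ollama resources from the earlier snippet:

// The API runs as a process on the host, so the connection strings injected by
// WithReference point at localhost:<mapped port> for each container resource.
builder.AddProject<Projects.Api>("api")
    .WithReference(qdrant)
    .WithReference(ollama);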

Running Ollama as a container or not?

Just because we can run everything as containers does not mean we always should.

Hardware accelerated Docker

Currently, it is possible to utilise the GPU in Docker via the NVIDIA Container Toolkit. This setup requires a device running Linux (or Windows with WSL 2 configured correctly).

When this is the case, running Ollama as a container makes sense.

There are also cases where the host machine supports hardware acceleration for Ollama when it runs on the host, but not when it runs in a container.

For instance:

  • ARM-based MacBook Pro and other macOS devices.
    • Ollama supports acceleration natively and, depending on the specs, it can make a huge difference.
    • However, as GPU acceleration is not available to Docker on macOS, running Ollama in Docker (with or without Aspire) will end up being much slower.
  • Similarly, on Windows devices with a dedicated NVIDIA GPU but no NVIDIA Container Toolkit support, running Ollama on the host OS will provide better performance.

Our example project also allows for the following set-up with a configuration change:

Ollama running on host

Switching Models and Model providers

Given .NET Aspire shines as a development-time orchestration framework, it is no surprise that its configuration system is powerful yet simple.

In this project we can conditionally spin up Ollama (if needed) or inject a connection string for Ollama running on the host, driven by launchSettings.json in the AppHost. In addition, the models used for embeddings and generation can easily be swapped, and both the Python and .NET based components will use whichever values are injected via configuration at runtime.

It is also possible to use OpenAI for both embeddings and generation via configuration. In that case, we need to set up developer secrets containing a valid OpenAI key.

The main driver for our solution is the launchSettings.json file included with the Aspire AppHost project. By modifying this, all our components will use the desired models and providers.

{
  "$schema": "https://json.schemastore.org/launchsettings.json",
  "profiles": {
    "http": {
      "commandName": "Project",
      "dotnetRunMessages": true,
      "launchBrowser": true,
      "applicationUrl": "http://localhost:15062",
      "environmentVariables": {
        ....        
        "EMBEDDING_MODEL": "nomic-embed-text",
        "EMBEDDING_MODEL_PROVIDER": "Ollama",        
        "CHAT_MODEL": "mistral",
        "CHAT_MODEL_PROVIDER": "Ollama",        
        "VECTOR_STORE_VECTOR_NAME": "page_content_vector",
        ...
      }
    }
  }
}

In this project, we can use the following values for EMBEDDING_MODEL_PROVIDER:

  • Ollama: Spin up an Ollama container using Aspire and inject its connection string.
  • OllamaHost: Do not spin up an Ollama container; inject a host.docker.internal based endpoint into containers and localhost into the application executables.
  • OpenAI: Inject the API key from secrets and use the default OpenAI URLs.
  • HuggingFace: Inject the API key from developer secrets and use the default HuggingFace inference URLs.
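
A simplified sketch of how the AppHost can act on these values follows; the real project handles the chat model provider the same way and wires the chosen resource into the consuming components.

var embeddingProvider = builder.Configuration["EMBEDDING_MODEL_PROVIDER"] ?? "Ollama";
var embeddingModel = builder.Configuration["EMBEDDING_MODEL"] ?? "nomic-embed-text";

if (embeddingProvider == "Ollama")
{
    // Containerised Ollama created by Aspire; consumers reference it by service name.
    var ollamaEmbeddings = builder.AddOllama("ollama").AddModel(embeddingModel);
}
else if (embeddingProvider == "OllamaHost")
{
    // No container: pass the host-based endpoint instead
    // (host.docker.internal for containers, localhost for host processes).
}
else
{
    // OpenAI / HuggingFace: default endpoints are used, only the API key is needed,
    // sourced from user secrets (see below).
    var apiKey = builder.AddParameter(
        embeddingProvider == "OpenAI" ? "OpenAIKey" : "HuggingFaceKey", secret: true);
}

// The chosen resource or parameter is then passed to the API project and the Python containers.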

To use OpenAI or HuggingFace, the following user secrets need to be set with valid keys (for example via dotnet user-secrets set on the AppHost project). Please note, both the Python and .NET components use the default endpoints for these services, so connection strings are not needed:

{
  "Parameters:OpenAIKey": "",
  "Parameters:HuggingFaceKey": ""
}

Conclusion

This was a warm-up for building a metadata-driven retrieval system to query photographs using multi-modal computer vision models, which I have been posting about.

Even with quantised and smaller models that run on CPU, we can get decent results asking about .NET Aspire based on the Markdown and YAML files in the official documentation repository. In this case, we did not utilise any metadata, but that will be part of the photo search project.

Models used for testing

Embedding model: granite-embedding
Generative model: qwen2.5:1.5b

Here is a simple question with a relevant answer when using the RAG query:

RAG Search

And a made-up answer when the question is sent directly to the LLM:

Search without context

Runtime performance

In addition, if we are using Ollama on a laptop that has some level of hardware acceleration but that acceleration is not available in Docker, then Ollama installed locally gives much better runtime performance than running it as a container via Aspire. Here is a comparison using a small model:

Running as a container

We can see that ingestion took 1 minute and running two questions took about 33 seconds.

Runtime performance when running Ollama in Docker

Running natively on host machine where host has acceleration

When running Ollama natively on the laptop and using host.docker.internal to connect to it from containers, we get around 15 seconds for ingestion and 4 seconds for two queries.

Runtime Performance when running Ollama natively on a laptop with acceleration

Next Steps

With the available technologies, we can rapidly build question and answer solutions. However, if we don’t define our performance metrics and a suitable evaluation approach, there is little value in building such systems.

For instance, we can use different embedding models and generative models. We can change our chunking method or use additional metadata to query and extract relevant chunks more effectively. We can also change model parameters and the list goes on.

With so many variables, how do we compare the outcome? The next post on this topic will include the following:

  • Generating evaluation data using LLMs.
  • Defining our metrics.
  • Performing evaluation using the evaluation data and our target metrics.
  • Collecting the results from our experiments.
  • Visualising and comparing the performance of the evaluation process.

Links and References

Chunking

Tools and Frameworks
