David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

How RAG with txtai works

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

Large Language Models (LLMs) have captured the public's attention with their impressive capabilities. The Generative AI era has reached a fever pitch with some predicting the coming rise of superintelligence.

LLMs are far from perfect though, and we're still a long way from true AI. The biggest challenge is hallucination: the term for when an LLM generates output that is factually incorrect. The alarming part is that at a cursory glance, the output actually sounds like factual content. The default behavior of LLMs is to produce plausible answers even when no plausible answer exists. LLMs are not great at saying "I don't know."

Retrieval Augmented Generation (RAG) helps reduce the risk of hallucinations by limiting the context in which an LLM can generate answers. This is typically done with a search query that hydrates a prompt with relevant context. RAG has been one of the most practical use cases of the Generative AI era.
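At its core, RAG is just two steps: retrieve, then generate. Here's a minimal sketch of the pattern in plain Python (the function names here are illustrative, not txtai's API):

# Minimal RAG sketch - illustrative only, not txtai's API
def rag_answer(question, search, llm):
    # Retrieve: find passages relevant to the question
    context = "\n".join(search(question))

    # Generate: answer using only the retrieved context
    prompt = f"Answer using only this context:\n{context}\n\nquestion: {question}"
    return llm(prompt)

The rest of this article builds exactly this flow with txtai components.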

txtai has multiple ways to run RAG pipelines:

  • Embeddings instance and LLM. Run the embeddings search and plug the search results into an LLM prompt.
  • RAG (aka Extractor) pipeline, which automatically adds a search context to LLM prompts.
  • RAG FastAPI service defined with YAML configuration.

This article covers each of these methods and shows how RAG with txtai works.

Install dependencies

Install txtai and all dependencies.

pip install "txtai[api,pipeline]" autoawq

Components of a RAG pipeline

Before using txtai's RAG pipeline, we'll show how the underlying components work together. In this example, we'll load the txtai Wikipedia embeddings database and an LLM. From there, we'll run a RAG process.

from txtai import Embeddings, LLM

# Load Wikipedia Embeddings database
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Create LLM
llm = LLM("TheBloke/Mistral-7B-OpenOrca-AWQ")
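With the embeddings database and LLM loaded, a quick search is an easy way to confirm everything is wired up. A minimal sanity check (the query text is just an example):

# Sanity check: print the top search match for an example query
print(embeddings.search("history of brewing", 1))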

Next, we'll create a prompt template to use for the RAG pipeline. The prompt has a placeholder for the question and context.

# Prompt template
prompt = """<|im_start|>system
You are a friendly assistant. You answer questions from users.<|im_end|>
<|im_start|>user
Answer the following question using only the context below. Only include information
specifically discussed.

question: {question}
context: {context} <|im_end|>
<|im_start|>assistant
"""

After that, we'll generate the context using an embeddings (aka vector) query. This query finds the top 3 most similar matches to the question "How do you make beer 🍺?"

question = "How do you make beer?"

# Generate context
context = "\n".join([x["text"] for x in embeddings.search(question)])
print(context)
Brewing is the production of beer by steeping a starch source (commonly cereal grains, the most popular of which is barley) in water and fermenting the resulting sweet liquid with yeast.  It may be done in a brewery by a commercial brewer, at home by a homebrewer, or communally. Brewing has taken place since around the 6th millennium BC, and archaeological evidence suggests that emerging civilizations, including ancient Egypt, China, and Mesopotamia, brewed beer. Since the nineteenth century the brewing industry has been part of most western economies.
Beer is produced through steeping a sugar source (commonly Malted cereal grains) in water and then fermenting with yeast. Brewing has taken place since around the 6th millennium BC, and archeological evidence suggests that this technique was used in ancient Egypt. Descriptions of various beer recipes can be found in Sumerian writings, some of the oldest known writing of any sort. Brewing is done in a brewery by a brewer, and the brewing industry is part of most western economies. In 19th century Britain, technological discoveries and improvements such as Burtonisation and the Burton Union system significantly changed beer brewing.
Craft beer is a beer that has been made by craft breweries, which typically produce smaller amounts of beer, than larger "macro" breweries, and are often independently owned. Such breweries are generally perceived and marketed as emphasising enthusiasm, new flavours, and varied brewing techniques.
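As an aside, embeddings.search returns the top 3 matches by default. If a broader context is wanted, the limit can be passed explicitly. For example, to retrieve 5 passages:

# Retrieve 5 passages instead of the default 3
context = "\n".join(x["text"] for x in embeddings.search(question, 5))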

Now we'll take the question and context and put that into the prompt.

print(llm(prompt.format(question=question, context=context)))
To make beer, you need to steep a starch source, such as malted cereal grains (commonly barley), in water. This process creates a sweet liquid called wort. Then, yeast is added to the wort, which ferments the liquid and produces alcohol and carbon dioxide. The beer is then aged, filtered, and packaged for consumption. This process has been used since around the 6th millennium BC and has been a part of most western economies since the 19th century.

Looking at the generated answer, we can see it's based on the context above. The LLM generates a paragraph of text using the context as input. While the LLM could likely answer this question on its own, grounding it in a search context helps ensure the answer is based on known factual data.
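One way to lean into this grounding, given that LLMs are not great at saying "I don't know", is to explicitly tell the model it's allowed to decline. A variant of the template above (strict_prompt is our name for it, not a txtai convention):

# Variant template that permits the model to decline
strict_prompt = """<|im_start|>system
You are a friendly assistant. You answer questions from users.<|im_end|>
<|im_start|>user
Answer the following question using only the context below. If the context
does not contain the answer, reply "I don't know".

question: {question}
context: {context} <|im_end|>
<|im_start|>assistant
"""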

The RAG Pipeline

txtai has a RAG pipeline that makes this even easier. The logic to generate the context and join it with the prompt is built in. Let's try that.

from txtai import RAG

# Create RAG pipeline using existing components. LLM parameter can also be a model path.
rag = RAG(embeddings, llm, template=prompt)
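As the comment above notes, the LLM argument can also be a model path, in which case the pipeline loads the model itself:

# Equivalent: let the RAG pipeline load the LLM from a model path
rag = RAG(embeddings, "TheBloke/Mistral-7B-OpenOrca-AWQ", template=prompt)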

Let's ask a question similar to the last one. This time we'll ask "How do you make wine 🍷?"

print(rag("How do you make wine?", maxlength=2048)["answer"])
To make wine, follow these steps:

1. Select the fruit: Choose high-quality grapes or other fruit for wine production.

2. Fermentation: Introduce yeast to the fruit, which will consume the sugar present in the juice and convert it into ethanol and carbon dioxide.

3. Monitor temperature and oxygen levels: Control the temperature and speed of fermentation, as well as the levels of oxygen present in the must at the start of fermentation.

4. Primary fermentation: This stage lasts from 5 to 14 days, during which the yeast consumes the sugar and produces alcohol and carbon dioxide.

5. Secondary fermentation (optional): If desired, allow the wine to undergo a secondary fermentation, which can last another 5 to 10 days.

6. Fermentation location: Choose the appropriate fermentation vessel, such as stainless steel tanks, open wooden vats, wine barrels, or wine bottles for sparkling wines.

7. Bottle and age the wine: Transfer the finished wine into bottles and allow it to age, if desired, to develop flavors and complexity.

Remember that wine can be made from various fruits, but grapes are most commonly used, and the term "wine" generally refers to grape wine when used without a qualifier.

RAG API Endpoint

Did you know that txtai has a built-in framework for automatically generating FastAPI services? This can be done with a YAML configuration file.

# config.yml
# Load Wikipedia Embeddings index
cloud:
  provider: huggingface-hub
  container: neuml/txtai-wikipedia

# RAG pipeline configuration
rag:
  path: TheBloke/Mistral-7B-OpenOrca-AWQ
  output: flatten
  template: |
    <|im_start|>system
    You are a friendly assistant. You answer questions from users.<|im_end|>
    <|im_start|>user
    Answer the following question using only the context below. Only include information
    specifically discussed.

    question: {question}
    context: {context} <|im_end|>
    <|im_start|>assistant

Note how the same prompt template and models are set. This time instead of doing that with Python, it's done with a YAML configuration file 🔥
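Before starting the service, the configuration can also be loaded locally to verify it parses and to warm up the models. This is the same call the Dockerfile later in this article uses to cache models in the container:

# Load the YAML configuration locally to validate it and cache models
from txtai.api import API
app = API("config.yml")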

Now let's start the API service using this configuration.

CONFIG=config.yml nohup uvicorn "txtai.api:app" &> api.log &
sleep 90

Now let's run a RAG query using the API service. Keeping with the theme, we'll ask "How do you make whisky 🥃?"

curl "http://localhost:8000/rag?query=how+do+you+make+whisky&maxlength=2048"
To make whisky, follow these steps:

1. Choose the grains: Select the grains you want to use for your whisky, such as barley, corn, rye, or wheat.

2. Malt the grains (optional): If using barley, malt the grains by soaking them in water and allowing them to germinate. This process releases enzymes that help break down starches into fermentable sugars.

3. Mill the grains: Grind the grains to create a coarse flour, which will be mixed with water to create a mash.

4. Create the mash: Combine the milled grains with hot water in a large vessel, and let it sit for several hours to allow fermentation to occur. The mash should have a temperature of around 65°C (149°F) to encourage the growth of yeast.

5. Add yeast: Once the mash has cooled to around 30°C (86°F), add yeast to the mixture. The yeast will ferment the sugars in the mash, producing alcohol.

6. Fermentation: Allow the mixture to ferment for several days, during which the yeast will consume the sugars and produce alcohol and carbon dioxide.

7. Distillation: Transfer the fermented liquid, called "wash" to a copper still. Heat the wash in the still, and the alcohol will vaporize and rise through the still's neck. The vapors are then condensed back into a liquid form, creating a high-proof spirit.

8. Maturation: Transfer the distilled spirit to wooden casks, typically made of charred white oak. The spirit will mature in the casks for a specified period, usually ranging from 3 to 25 years. During this time, the wood imparts flavors and color to the whisky.

9. Bottling: Once the whisky has reached the desired maturity, it is bottled and ready for consumption.

And as before, we get an answer bound by the search context provided to the LLM. This time it comes from an API service rather than a direct Python call.
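The endpoint can also be called from Python. Here's a small sketch using the requests library, assuming the service above is still running on localhost:8000:

import requests

# Call the RAG endpoint with the same parameters as the curl example
response = requests.get(
    "http://localhost:8000/rag",
    params={"query": "how do you make whisky", "maxlength": 2048}
)
print(response.json())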

RAG API Service with Docker

txtai builds Docker images with each release. There are also Dockerfiles available to help configure API services.

The Dockerfile below builds an API service using the same config.yml.

# Set base image
ARG BASE_IMAGE=neuml/txtai-gpu
FROM $BASE_IMAGE

# Copy configuration
COPY config.yml .

# Install latest version of txtai from GitHub
RUN \
    apt-get update && \
    apt-get -y --no-install-recommends install git && \
    rm -rf /var/lib/apt/lists/* && \
    python -m pip install git+https://github.com/neuml/txtai

# Run local API instance to cache models in container
RUN python -c "from txtai.api import API; API('config.yml')"

# Start server and listen on all interfaces
ENV CONFIG="config.yml"
ENTRYPOINT ["uvicorn", "--host", "0.0.0.0", "txtai.api:app"]

The following commands build and start a Docker API service.

docker build -t txtai-wikipedia --build-arg BASE_IMAGE=neuml/txtai-gpu .
docker run -d --gpus=all -it -p 8000:8000 txtai-wikipedia

This creates the same API service, just through Docker this time. RAG queries can be run the same way.

curl "http://localhost:8000/rag?query=how+do+you+make+whisky&maxlength=2048"

Wrapping up

This article covered the various ways to run retrieval augmented generation (RAG) with txtai. We hope you find that txtai is one of the easiest and most flexible ways to get up and running fast!
