
David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Chunking your data for RAG

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

One of the major workflows in txtai is Retrieval Augmented Generation (RAG). Large Language Models (LLMs) are built to generate coherent-sounding text. While that text is often factually accurate, factual accuracy isn't what they're built for. RAG steps in to inject smaller pieces of knowledge into an LLM prompt and increase the overall accuracy of responses. The R in RAG is very important.

This article will demonstrate how to extract, chunk and index text to support retrieval operations for RAG.

Install dependencies

Install txtai and all dependencies.

pip install txtai[pipeline-text]

Data chunking and indexing

Let's dive right in and keep this example simple. The next section creates a Textractor pipeline and an Embeddings database.

The Textractor extracts chunks of text from files, and the Embeddings database takes those chunks and builds an index. We'll use a late chunker backed by Chonkie.

Then, we'll build an indexing workflow that streams chunks from two files.

from txtai import Embeddings
from txtai.pipeline import Textractor

# Text extraction pipeline with late chunking via Chonkie
textractor = Textractor(chunker="late")
embeddings = Embeddings(content=True)
def stream():
    urls = ["https://github.com/neuml/txtai", "https://arxiv.org/pdf/2005.11401"]
    for x, url in enumerate(urls):
        chunks = textractor(url)

        # Add all chunks - use the same document id for each chunk
        for chunk in chunks:
            yield x, chunk

        # Add the document metadata with the same document id
        # Can be any metadata. Can also be the entire document.
        yield x, {"url": url}

# Index the chunks and metadata
embeddings.index(stream())
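As a quick sanity check, we can confirm how many chunks were stored. This is a minimal sketch, assuming the index built above; count() returns the total number of indexed chunks, not the number of documents.

# Total number of chunks in the index - each chunk is a separate row,
# even when chunks share the same document id
print(embeddings.count())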

A commonly misunderstood element of txtai is how to best store chunks of data and join them back to the parent document. txtai allows re-using the same logical id multiple times.

Behind the scenes, each chunk gets its own unique index id. The backend database stores chunks in a table called sections and document data in a table called documents. This has been the case as far back as txtai 4.0. txtai can also store associated binary data in a table called objects. It's important to note that each associated document or object is only stored once, no matter how many chunks share its id.

To illustrate, let's look at the first 20 rows in the embeddings database created.

for x in embeddings.search("SELECT indexid, id, url, text from txtai", 20):
    print(x)
{'indexid': 0, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': '**GitHub - neuml/txtai: 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows**\n\n*💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows - neuml/txtai*\n\n\n\n**All-in-one embeddings database** \ntxtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.\n\nEmbeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases.\n\nThis foundation enables vector search and/or serves as a powerful knowledge source for large language model (LLM) applications.\n\nBuild autonomous agents, retrieval augmented generation (RAG) processes, multi-model workflows and more.\n\nSummary of txtai features:\n\n- 🔎 Vector search with SQL, object storage, topic modeling, graph analysis and multimodal indexing\n- 📄 Create embeddings for text, documents, audio, images and video\n- 💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarization and more\n- ↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.\n- 🤖 Agents that intelligently connect embeddings, pipelines, workflows and other agents together to autonomously solve complex problems\n- ⚙️ Build with Python or YAML. API bindings available for [JavaScript](https://github.com/neuml/txtai.js) , [Java](https://github.com/neuml/txtai.java) , [Rust](https://github.com/neuml/txtai.rs) and [Go](https://github.com/neuml/txtai.go) .\n- 🔋 Batteries included with defaults to get up and running fast\n- ☁️ Run local or scale out with container orchestration\ntxtai is built with Python 3.9+, [Hugging Face Transformers](https://github.com/huggingface/transformers) , [Sentence Transformers](https://github.'}
{'indexid': 1, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': 'com/UKPLab/sentence-transformers) and [FastAPI](https://github.com/tiangolo/fastapi) . txtai is open-source under an Apache 2.0 license.\n\n*Interested in an easy and secure way to run hosted txtai applications? Then join the* [txtai.cloud](https://txtai.cloud) *preview to learn more.* \n\n## Why txtai?\nNew vector databases, LLM frameworks and everything in between are sprouting up daily. Why build with txtai?\n\n- Up and running in minutes with [pip](https://neuml.github.io/txtai/install/) or [Docker](https://neuml.github.io/txtai/cloud/) \n```

\n# Get started in a couple lines\nimport txtai\n\nembeddings = txtai.Embeddings()\nembeddings.index(["Correct", "Not what we hoped"])\nembeddings.search("positive", 1)\n#[(0, 0.29862046241760254)]\n

```\n\n- Built-in API makes it easy to develop applications using your programming language of choice\n```

\n# app.yml\nembeddings:\n path: sentence-transformers/all-MiniLM-L6-v2\n

```\n\n```

\nCONFIG=app.yml uvicorn "txtai.api:app"\ncurl -X GET "http://localhost:8000/search?query=positive"\n

```\n\n- Run local - no need to ship data off to disparate remote services\n- Work with micromodels all the way up to large language models (LLMs)\n- Low footprint - install additional dependencies and scale up when needed\n- [Learn by example](https://neuml.github.io/txtai/examples) - notebooks cover all available functionality\n\n## Use Cases\nThe following sections introduce common txtai use cases. A comprehensive set of over 60 [example notebooks and applications](https://neuml.github.io/txtai/examples) are also available.\n\n\n### Semantic Search\nBuild semantic/similarity/vector/neural search applications.'}
{'indexid': 2, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': 'Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.\n\nGet started with the following examples.\n\n|Notebook|Description||\n|---|---|---|\n|[Introducing txtai](https://github.com/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb) |Overview of the functionality provided by txtai||\n|[Similarity search with images](https://github.com/neuml/txtai/blob/master/examples/13_Similarity_search_with_images.ipynb) |Embed images and text into the same space for search||\n|[Build a QA database](https://github.com/neuml/txtai/blob/master/examples/34_Build_a_QA_database.ipynb) |Question matching with semantic search||\n|[Semantic Graphs](https://github.com/neuml/txtai/blob/master/examples/38_Introducing_the_Semantic_Graph.ipynb) |Explore topics, data connectivity and run network analysis||\n\n### LLM Orchestration\nAutonomous agents, retrieval augmented generation (RAG), chat with your data, pipelines and workflows that interface with large language models (LLMs).\n\nSee below to learn more.\n\n|Notebook|Description||\n|---|---|---|\n|[Prompt templates and task chains](https://github.com/neuml/txtai/blob/master/examples/44_Prompt_templates_and_task_chains.ipynb) |Build model prompts and connect tasks together with workflows||\n|[Integrate LLM frameworks](https://github.com/neuml/txtai/blob/master/examples/53_Integrate_LLM_Frameworks.ipynb) |Integrate llama.cpp, LiteLLM and custom generation frameworks||\n|[Build knowledge graphs with LLMs](https://github.'}
{'indexid': 3, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': 'com/neuml/txtai/blob/master/examples/57_Build_knowledge_graphs_with_LLM_driven_entity_extraction.ipynb) |Build knowledge graphs with LLM-driven entity extraction||\n\n#### Agents\nAgents connect embeddings, pipelines, workflows and other agents together to autonomously solve complex problems.\n\ntxtai agents are built on top of the Transformers Agent framework. This supports all LLMs txtai supports (Hugging Face, llama.cpp, OpenAI / Claude / AWS Bedrock via LiteLLM).\n\nSee the link below to learn more.\n\n|Notebook|Description||\n|---|---|---|\n|[Analyzing Hugging Face Posts with Graphs and Agents](https://github.com/neuml/txtai/blob/master/examples/68_Analyzing_Hugging_Face_Posts_with_Graphs_and_Agents.ipynb) |Explore a rich dataset with Graph Analysis and Agents||\n|[Granting autonomy to agents](https://github.com/neuml/txtai/blob/master/examples/69_Granting_autonomy_to_agents.ipynb) |Agents that iteratively solve problems as they see fit||\n|[Analyzing LinkedIn Company Posts with Graphs and Agents](https://github.com/neuml/txtai/blob/master/examples/71_Analyzing_LinkedIn_Company_Posts_with_Graphs_and_Agents.ipynb) |Exploring how to improve social media engagement with AI||\n\n#### Retrieval augmented generation\nRetrieval augmented generation (RAG) reduces the risk of LLM hallucinations by constraining the output with a knowledge base as context. RAG is commonly used to "chat with your data".\n\nA novel feature of txtai is that it can provide both an answer and source citation.\n\n|Notebook|Description||\n|---|---|---|\n|[Build RAG pipelines with txtai](https://github.'}
{'indexid': 4, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': 'com/neuml/txtai/blob/master/examples/52_Build_RAG_pipelines_with_txtai.ipynb) |Guide on retrieval augmented generation including how to create citations||\n|[How RAG with txtai works](https://github.com/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) |Create RAG processes, API services and Docker instances||\n|[Advanced RAG with graph path traversal](https://github.com/neuml/txtai/blob/master/examples/58_Advanced_RAG_with_graph_path_traversal.ipynb) |Graph path traversal to collect complex sets of data for advanced RAG||\n|[Speech to Speech RAG](https://github.com/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) |Full cycle speech to speech workflow with RAG||\n\n### Language Model Workflows\nLanguage model workflows, also known as semantic workflows, connect language models together to build intelligent applications.\n\nWhile LLMs are powerful, there are plenty of smaller, more specialized models that work better and faster for specific tasks. This includes models for extractive question-answering, automatic summarization, text-to-speech, transcription and translation.\n\n|Notebook|Description||\n|---|---|---|\n|[Run pipeline workflows](https://github.com/neuml/txtai/blob/master/examples/14_Run_pipeline_workflows.ipynb) |Simple yet powerful constructs to efficiently process data||\n|[Building abstractive text summaries](https://github.com/neuml/txtai/blob/master/examples/09_Building_abstractive_text_summaries.ipynb) |Run abstractive text summarization||\n|[Transcribe audio to text](https://github.com/neuml/txtai/blob/master/examples/11_Transcribe_audio_to_text.'}
{'indexid': 5, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': 'ipynb) |Convert audio files to text||\n|[Translate text between languages](https://github.com/neuml/txtai/blob/master/examples/12_Translate_text_between_languages.ipynb) |Streamline machine translation and language detection||\n\n## Installation\nThe easiest way to install is via pip and PyPI\n\n```

\npip install txtai\n

```\n\nPython 3.9+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.\n\nSee the detailed [install instructions](https://neuml.github.io/txtai/install) for more information covering [optional dependencies](https://neuml.github.io/txtai/install/#optional-dependencies) , [environment specific prerequisites](https://neuml.github.io/txtai/install/#environment-specific-prerequisites) , [installing from source](https://neuml.github.io/txtai/install/#install-from-source) , [conda support](https://neuml.github.io/txtai/install/#conda) and how to [run with containers](https://neuml.github.io/txtai/cloud) .\n\n\n## Model guide\nSee the table below for the current recommended models. These models all allow commercial use and offer a blend of speed and performance.\n\n|Component|Model(s)|\n|---|---|\n|[Embeddings](https://neuml.github.io/txtai/embeddings) |[all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) |\n|[Image Captions](https://neuml.github.io/txtai/pipeline/image/caption) |[BLIP](https://hf.co/Salesforce/blip-image-captioning-base) |'}
{'indexid': 6, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': '|[Labels - Zero Shot](https://neuml.github.io/txtai/pipeline/text/labels) |[BART-Large-MNLI](https://hf.co/facebook/bart-large) |\n|[Labels - Fixed](https://neuml.github.io/txtai/pipeline/text/labels) |Fine-tune with [training pipeline](https://neuml.github.io/txtai/pipeline/train/trainer) |\n|[Large Language Model (LLM)](https://neuml.github.io/txtai/pipeline/text/llm) |[Llama 3.1 Instruct](https://hf.co/meta-llama/Llama-3.1-8B-Instruct) |\n|[Summarization](https://neuml.github.io/txtai/pipeline/text/summary) |[DistilBART](https://hf.co/sshleifer/distilbart-cnn-12-6) |\n|[Text-to-Speech](https://neuml.github.io/txtai/pipeline/audio/texttospeech) |[ESPnet JETS](https://hf.co/NeuML/ljspeech-jets-onnx) |\n|[Transcription](https://neuml.github.io/txtai/pipeline/audio/transcription) |[Whisper](https://hf.co/openai/whisper-base) |\n|[Translation](https://neuml.github.io/txtai/pipeline/text/translation) |[OPUS Model Series](https://hf.co/Helsinki-NLP) |\nModels can be loaded as either a path from the Hugging Face Hub or a local directory. Model paths are optional, defaults are loaded when not specified. For tasks with no recommended model, txtai uses the default models as shown in the Hugging Face Tasks guide.\n\nSee the following links to learn more.\n\n\n## Powered by txtai\nThe following applications are powered by txtai.'}
{'indexid': 7, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': "|Application|Description|\n|---|---|\n|[rag](https://github.com/neuml/rag) |Retrieval Augmented Generation (RAG) application|\n|[ragdata](https://github.com/neuml/ragdata) |Build knowledge bases for RAG|\n|[paperai](https://github.com/neuml/paperai) |Semantic search and workflows for medical/scientific papers|\n|[annotateai](https://github.com/neuml/annotateai) |Automatically annotate papers with LLMs|\nIn addition to this list, there are also many other [open-source projects](https://github.com/neuml/txtai/network/dependents) , [published research](https://scholar.google.com/scholar?q=txtai&hl=en&as_ylo=2022) and closed proprietary/commercial projects that have built on txtai in production.\n\n\n## Further Reading\n- [Tutorial series on Hashnode](https://neuml.hashnode.dev/series/txtai-tutorial) | [dev.to](https://dev.to/neuml/tutorial-series-on-txtai-ibg) \n- [What's new in txtai 8.0](https://medium.com/neuml/whats-new-in-txtai-8-0-2d7d0ab4506b) | [7.0](https://medium.com/neuml/whats-new-in-txtai-7-0-855ad6a55440) | [6.0](https://medium.com/neuml/whats-new-in-txtai-6-0-7d93eeedf804) | [5.0](https://medium.com/neuml/whats-new-in-txtai-5-0-e5c75a13b101) | [4.0](https://medium."}
{'indexid': 8, 'id': '0', 'url': 'https://github.com/neuml/txtai', 'text': 'com/neuml/whats-new-in-txtai-4-0-bbc3a65c3d1c) \n- [Getting started with semantic search](https://medium.com/neuml/getting-started-with-semantic-search-a9fd9d8a48cf) | [workflows](https://medium.com/neuml/getting-started-with-semantic-workflows-2fefda6165d9) | [rag](https://medium.com/neuml/getting-started-with-rag-9a0cca75f748) \n\n## Documentation\n[Full documentation on txtai](https://neuml.github.io/txtai) including configuration settings for embeddings, pipelines, workflows, API and a FAQ with common questions/issues is available.\n\n\n## Contributing\nFor those who would like to contribute to txtai, please see [this guide](https://github.com/neuml/.github/blob/master/CONTRIBUTING.md) .'}
{'indexid': 9, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\n\nPatrick Lewis†‡, Ethan Perez?,\n\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,\n\nMike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†\n\n†Facebook AI Research; ‡University College London; ?New York University;\nplewis@fb.com\n\nAbstract\n\nLarge pre-trained language models have been shown to store factual knowledge\nin their parameters, and achieve state-of-the-art results when fine-tuned on down-\nstream NLP tasks. However, their ability to access and precisely manipulate knowl-\nedge is still limited, and hence on knowledge-intensive tasks, their performance\nlags behind task-specific architectures. Additionally, providing provenance for their\ndecisions and updating their world knowledge remain open research problems. Pre-\ntrained models with a differentiable access mechanism to explicit non-parametric\nmemory have so far been only investigated for extractive downstream tasks. We\nexplore a general-purpose fine-tuning recipe for retrieval-augmented generation\n(RAG) — models which combine pre-trained parametric and non-parametric mem-\nory for language generation. We introduce RAG models where the parametric\nmemory is a pre-trained seq2seq model and the non-parametric memory is a dense\nvector index of Wikipedia, accessed with a pre-trained neural retriever. We com-\npare two RAG formulations, one which conditions on the same retrieved passages\nacross the whole generated sequence, and another which can use different passages\nper token. We fine-tune and evaluate our models on a wide range of knowledge-\nintensive NLP tasks and set the state of the art on three open domain QA tasks,\noutperforming parametric seq2seq models and task-specific retrieve-and-extract\narchitectures. For language generation tasks, we find that RAG models generate\nmore specific, diverse and factual language than a state-of-the-art parametric-only\nseq2seq baseline.\n\n1 Introduction\n\nPre-trained neural language models have been shown to learn a substantial amount of in-depth knowl-\nedge from data [47]. They can do so without any access to an external memory, as a parameterized\nimplicit knowledge base [51, 52].'}
{'indexid': 10, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'While this development is exciting, such models do have down-\nsides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into\ntheir predictions, and may produce “hallucinations” [38]. Hybrid models that combine parametric\nmemory with non-parametric (i.e., retrieval-based) memories [20, 26, 48] can address some of these\nissues because knowledge can be directly revised and expanded, and accessed knowledge can be\ninspected and interpreted. REALM [20] and ORQA [31], two recently introduced models that\ncombine masked language models [8] with a differentiable retriever, have shown promising results,\n\nar\nX\n\niv\n:2\n\n00\n5.\n\n11\n40\n\n1v\n4 \n\n [\ncs\n\n.C\nL\n\n] \n 1\n\n2 \nA\n\npr\n 2\n\n02\n1\n\n\nThe\tDivine\nComedy\t(x) q\n\nQuery\nEncoder\n\nq(x)\n\nMIPS pθ\n\nGenerator\xa0pθ\n(Parametric)\n\nMargin-\nalize\n\nThis\t14th\tcentury\twork\nis\tdivided\tinto\t3\nsections:\t"Inferno",\n"Purgatorio"\t&\n"Paradiso"\t\t\t\t\t\t\t\t\t(y)\n\nEnd-to-End Backprop through q and\xa0pθ\n\nBarack\tObama\twas\nborn\tin\tHawaii.(x)\n\nFact Verification: Fact Query\n\nsupports\t(y)\n\nQuestion Generation\n\nFact Verification:\nLabel Generation\n\nDocument\nIndex\n\nDefine\t"middle\tear"(x)\n\nQuestion Answering:\nQuestion Query\n\nThe\tmiddle\tear\tincludes\nthe\ttympanic\tcavity\tand\nthe\tthree\tossicles.\t\t(y)\n\nQuestion Answering:\nAnswer GenerationRetriever pη\n\n(Non-Parametric)\nz4\n\nz3\nz2\n\nz1\n\nd(z)\n\nJeopardy Question\nGeneration:\n\nAnswer Query\n\nFigure 1: Overview of our approach. We combine a pre-trained retriever (Query Encoder + Document\nIndex) with a pre-trained seq2seq model (Generator) and fine-tune end-to-end. For query x, we use\nMaximum Inner Product Search (MIPS) to find the top-K documents zi. For final prediction y, we\ntreat z as a latent variable and marginalize over seq2seq predictions given different documents.\n\nbut have only explored open-domain extractive question answering. Here, we bring hybrid parametric\nand non-parametric memory to the “workhorse of NLP,” i.e. sequence-to-sequence (seq2seq) models.\n\nWe endow pre-trained, parametric-memory generation models with a non-parametric memory through'}
{'indexid': 11, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'a general-purpose fine-tuning approach which we refer to as retrieval-augmented generation (RAG).\nWe build RAG models where the parametric memory is a pre-trained seq2seq transformer, and the\nnon-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural\nretriever. We combine these components in a probabilistic model trained end-to-end (Fig. 1). The\nretriever (Dense Passage Retriever [26], henceforth DPR) provides latent documents conditioned on\nthe input, and the seq2seq model (BART [32]) then conditions on these latent documents together with\nthe input to generate the output. We marginalize the latent documents with a top-K approximation,\neither on a per-output basis (assuming the same document is responsible for all tokens) or a per-token\nbasis (where different documents are responsible for different tokens). Like T5 [51] or BART, RAG\ncan be fine-tuned on any seq2seq task, whereby both the generator and retriever are jointly learned.\n\nThere has been extensive previous work proposing architectures to enrich systems with non-parametric\nmemory which are trained from scratch for specific tasks, e.g. memory networks [64, 55], stack-\naugmented networks [25] and memory layers [30]. In contrast, we explore a setting where both\nparametric and non-parametric memory components are pre-trained and pre-loaded with extensive\nknowledge. Crucially, by using pre-trained access mechanisms, the ability to access knowledge is\npresent without additional training.\n\nOur results highlight the benefits of combining parametric and non-parametric memory with genera-\ntion for knowledge-intensive tasks—tasks that humans could not reasonably be expected to perform\nwithout access to an external knowledge source. Our RAG models achieve state-of-the-art results\non open Natural Questions [29], WebQuestions [3] and CuratedTrec [2] and strongly outperform\nrecent approaches that use specialised pre-training objectives on TriviaQA [24]. Despite these being'}
{'indexid': 12, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'extractive tasks, we find that unconstrained generation outperforms previous extractive approaches.\nFor knowledge-intensive generation, we experiment with MS-MARCO [1] and Jeopardy question\ngeneration, and we find that our models generate responses that are more factual, specific, and\ndiverse than a BART baseline. For FEVER [56] fact verification, we achieve results within 4.3% of\nstate-of-the-art pipeline models which use strong retrieval supervision. Finally, we demonstrate that\nthe non-parametric memory can be replaced to update the models’ knowledge as the world changes.1\n\n2 Methods\n\nWe explore RAG models, which use the input sequence x to retrieve text documents z and use them\nas additional context when generating the target sequence y. As shown in Figure 1, our models\nleverage two components: (i) a retriever pη(z|x) with parameters η that returns (top-K truncated)\ndistributions over text passages given a query x and (ii) a generator pθ(yi|x, z, y1:i−1) parametrized\n\n1Code to run experiments with RAG has been open-sourced as part of the HuggingFace Transform-\ners Library [66] and can be found at https://github.com/huggingface/transformers/blob/master/\nexamples/rag/. An interactive demo of RAG models can be found at https://huggingface.co/rag/\n\n2\n\n[https://github.com/huggingface/transformers/blob/master/examples/rag/](https://github.com/huggingface/transformers/blob/master/examples/rag/) \n[https://github.com/huggingface/transformers/blob/master/examples/rag/](https://github.com/huggingface/transformers/blob/master/examples/rag/) \n[https://huggingface.co/rag/](https://huggingface.co/rag/) \n\nby θ that generates a current token based on a context of the previous i− 1 tokens y1:i−1, the original\ninput x and a retrieved passage z.\n\nTo train the retriever and generator end-to-end, we treat the retrieved document as a latent variable.\nWe propose two models that marginalize over the latent documents in different ways to produce a'}
{'indexid': 13, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'distribution over generated text. In one approach, RAG-Sequence, the model uses the same document\nto predict each target token. The second approach, RAG-Token, can predict each target token based\non a different document. In the following, we formally introduce both models and then describe the\npη and pθ components, as well as the training and decoding procedure.\n\n2.1 Models\n\nRAG-Sequence Model The RAG-Sequence model uses the same retrieved document to generate\nthe complete sequence. Technically, it treats the retrieved document as a single latent variable that\nis marginalized to get the seq2seq probability p(y|x) via a top-K approximation. Concretely, the\ntop K documents are retrieved using the retriever, and the generator produces the output sequence\nprobability for each document, which are then marginalized,\n\npRAG-Sequence(y|x) ≈\n∑\n\nz∈top-k(p(·|x))\n\npη(z|x)pθ(y|x, z) =\n∑\n\nz∈top-k(p(·|x))\n\npη(z|x)\nN∏\ni\n\npθ(yi|x, z, y1:i−1)\n\nRAG-Token Model In the RAG-Token model we can draw a different latent document for each\ntarget token and marginalize accordingly. This allows the generator to choose content from several\ndocuments when producing an answer. Concretely, the top K documents are retrieved using the\nretriever, and then the generator produces a distribution for the next output token for each document,\nbefore marginalizing, and repeating the process with the following output token, Formally, we define:\n\npRAG-Token(y|x) ≈\nN∏\ni\n\n∑\nz∈top-k(p(·|x))\n\npη(z|x)pθ(yi|x, z, y1:i−1)\n\nFinally, we note that RAG can be used for sequence classification tasks by considering the target class\nas a target sequence of length one, in which case RAG-Sequence and RAG-Token are equivalent.\n\n2.2 Retriever: DPR\n\nThe retrieval component pη(z|x) is based on DPR [26]. DPR follows a bi-encoder architecture:\n\npη(z|x) ∝ exp\n(\nd(z)>q(x)\n\n)\nd(z) = BERTd(z), q(x) = BERTq(x)'}
{'indexid': 14, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'where d(z) is a dense representation of a document produced by a BERTBASE document encoder [8],\nand q(x) a query representation produced by a query encoder, also based on BERTBASE. Calculating\ntop-k(pη(·|x)), the list of k documents z with highest prior probability pη(z|x), is a Maximum Inner\nProduct Search (MIPS) problem, which can be approximately solved in sub-linear time [23]. We use\na pre-trained bi-encoder from DPR to initialize our retriever and to build the document index. This\nretriever was trained to retrieve documents which contain answers to TriviaQA [24] questions and\nNatural Questions [29]. We refer to the document index as the non-parametric memory.\n\n2.3 Generator: BART\n\nThe generator component pθ(yi|x, z, y1:i−1) could be modelled using any encoder-decoder. We use\nBART-large [32], a pre-trained seq2seq transformer [58] with 400M parameters. To combine the input\nx with the retrieved content z when generating from BART, we simply concatenate them. BART was\npre-trained using a denoising objective and a variety of different noising functions. It has obtained\nstate-of-the-art results on a diverse set of generation tasks and outperforms comparably-sized T5\nmodels [32]. We refer to the BART generator parameters θ as the parametric memory henceforth.\n\n2.4 Training\n\nWe jointly train the retriever and generator components without any direct supervision on what\ndocument should be retrieved. Given a fine-tuning training corpus of input/output pairs (xj , yj), we\n\n3\n\n\nminimize the negative marginal log-likelihood of each target,\n∑\nj − log p(yj |xj) using stochastic\n\ngradient descent with Adam [28]. Updating the document encoder BERTd during training is costly as\nit requires the document index to be periodically updated as REALM does during pre-training [20].'}
{'indexid': 15, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'We do not find this step necessary for strong performance, and keep the document encoder (and\nindex) fixed, only fine-tuning the query encoder BERTq and the BART generator.\n\n2.5 Decoding\n\nAt test time, RAG-Sequence and RAG-Token require different ways to approximate argmaxy p(y|x).\n\nRAG-Token The RAG-Token model can be seen as a standard, autoregressive seq2seq genera-\ntor with transition probability: p′θ(yi|x, y1:i−1) =\n\n∑\nz∈top-k(p(·|x)) pη(zi|x)pθ(yi|x, zi, y1:i−1) To\n\ndecode, we can plug p′θ(yi|x, y1:i−1) into a standard beam decoder.\n\nRAG-Sequence For RAG-Sequence, the likelihood p(y|x) does not break into a conventional per-\ntoken likelihood, hence we cannot solve it with a single beam search. Instead, we run beam search for\neach document z, scoring each hypothesis using pθ(yi|x, z, y1:i−1). This yields a set of hypotheses\nY , some of which may not have appeared in the beams of all documents. To estimate the probability\nof an hypothesis y we run an additional forward pass for each document z for which y does not\nappear in the beam, multiply generator probability with pη(z|x) and then sum the probabilities across\nbeams for the marginals. We refer to this decoding procedure as “Thorough Decoding.” For longer\noutput sequences, |Y | can become large, requiring many forward passes. For more efficient decoding,\nwe can make a further approximation that pθ(y|x, zi) ≈ 0 where y was not generated during beam\nsearch from x, zi. This avoids the need to run additional forward passes once the candidate set Y has\nbeen generated. We refer to this decoding procedure as “Fast Decoding.”\n\n3 Experiments\n\nWe experiment with RAG in a wide range of knowledge-intensive tasks. For all experiments, we use\na single Wikipedia dump for our non-parametric knowledge source. Following Lee et al. [31] and\nKarpukhin et al. [26], we use the December 2018 dump. Each Wikipedia article is split into disjoint'}
{'indexid': 16, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': '100-word chunks, to make a total of 21M documents. We use the document encoder to compute an\nembedding for each document, and build a single MIPS index using FAISS [23] with a Hierarchical\nNavigable Small World approximation for fast retrieval [37]. During training, we retrieve the top\nk documents for each query. We consider k ∈ {5, 10} for training and set k for test time using dev\ndata. We now discuss experimental details for each task.\n\n3.1 Open-domain Question Answering\n\nOpen-domain question answering (QA) is an important real-world application and common testbed\nfor knowledge-intensive tasks [20]. We treat questions and answers as input-output text pairs (x, y)\nand train RAG by directly minimizing the negative log-likelihood of answers. We compare RAG to\nthe popular extractive QA paradigm [5, 7, 31, 26], where answers are extracted spans from retrieved\ndocuments, relying primarily on non-parametric knowledge. We also compare to “Closed-Book\nQA” approaches [52], which, like RAG, generate answers, but which do not exploit retrieval, instead\nrelying purely on parametric knowledge. We consider four popular open-domain QA datasets: Natural\nQuestions (NQ) [29], TriviaQA (TQA) [24]. WebQuestions (WQ) [3] and CuratedTrec (CT) [2]. As\nCT and WQ are small, we follow DPR [26] by initializing CT and WQ models with our NQ RAG\nmodel. We use the same train/dev/test splits as prior work [31, 26] and report Exact Match (EM)\nscores. For TQA, to compare with T5 [52], we also evaluate on the TQA Wiki test set.\n\n3.2 Abstractive Question Answering\n\nRAG models can go beyond simple extractive QA and answer questions with free-form, abstractive\ntext generation. To test RAG’s natural language generation (NLG) in a knowledge-intensive setting,\nwe use the MSMARCO NLG task v2.1 [43]. The task consists of questions, ten gold passages'}
{'indexid': 17, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'retrieved from a search engine for each question, and a full sentence answer annotated from the\nretrieved passages. We do not use the supplied passages, only the questions and answers, to treat\n\n4\n\n\nMSMARCO as an open-domain abstractive QA task. MSMARCO has some questions that cannot be\nanswered in a way that matches the reference answer without access to the gold passages, such as\n“What is the weather in Volcano, CA?” so performance will be lower without using gold passages.\nWe also note that some MSMARCO questions cannot be answered using Wikipedia alone. Here,\nRAG can rely on parametric knowledge to generate reasonable responses.\n\n3.3 Jeopardy Question Generation\n\nTo evaluate RAG’s generation abilities in a non-QA setting, we study open-domain question gen-\neration. Rather than use questions from standard open-domain QA tasks, which typically consist\nof short, simple questions, we propose the more demanding task of generating Jeopardy questions.\nJeopardy is an unusual format that consists of trying to guess an entity from a fact about that entity.\nFor example, “The World Cup” is the answer to the question “In 1986 Mexico scored as the first\ncountry to host this international sports competition twice.” As Jeopardy questions are precise,\nfactual statements, generating Jeopardy questions conditioned on their answer entities constitutes a\nchallenging knowledge-intensive generation task.\n\nWe use the splits from SearchQA [10], with 100K train, 14K dev, and 27K test examples. As\nthis is a new task, we train a BART model for comparison. Following [67], we evaluate using the\nSQuAD-tuned Q-BLEU-1 metric [42]. Q-BLEU is a variant of BLEU with a higher weight for\nmatching entities and has higher correlation with human judgment for question generation than\nstandard metrics. We also perform two human evaluations, one to assess generation factuality, and\none for specificity. We define factuality as whether a statement can be corroborated by trusted external'}
{'indexid': 18, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'sources, and specificity as high mutual dependence between the input and output [33]. We follow\nbest practice and use pairwise comparative evaluation [34]. Evaluators are shown an answer and two\ngenerated questions, one from BART and one from RAG. They are then asked to pick one of four\noptions—quuestion A is better, question B is better, both are good, or neither is good.\n\n3.4 Fact Verification\n\nFEVER [56] requires classifying whether a natural language claim is supported or refuted by\nWikipedia, or whether there is not enough information to decide. The task requires retrieving\nevidence from Wikipedia relating to the claim and then reasoning over this evidence to classify\nwhether the claim is true, false, or unverifiable from Wikipedia alone. FEVER is a retrieval problem\ncoupled with an challenging entailment reasoning task. It also provides an appropriate testbed for\nexploring the RAG models’ ability to handle classification rather than generation. We map FEVER\nclass labels (supports, refutes, or not enough info) to single output tokens and directly train with\nclaim-class pairs. Crucially, unlike most other approaches to FEVER, we do not use supervision on\nretrieved evidence. In many real-world applications, retrieval supervision signals aren’t available, and\nmodels that do not require such supervision will be applicable to a wider range of tasks. We explore\ntwo variants: the standard 3-way classification task (supports/refutes/not enough info) and the 2-way\n(supports/refutes) task studied in Thorne and Vlachos [57]. In both cases we report label accuracy.\n\n4 Results\n\n4.1 Open-domain Question Answering\n\nTable 1 shows results for RAG along with state-of-the-art models. On all four open-domain QA\ntasks, RAG sets a new state of the art (only on the T5-comparable split for TQA). RAG combines\nthe generation flexibility of the “closed-book” (parametric only) approaches and the performance of\n"open-book" retrieval-based approaches. Unlike REALM and T5+SSM, RAG enjoys strong results\nwithout expensive, specialized “salient span masking” pre-training [20]. It is worth noting that RAG’s\nretriever is initialized using DPR’s retriever, which uses retrieval supervision on Natural Questions\nand TriviaQA. RAG compares favourably to the DPR QA system, which uses a BERT-based “cross-'}
{'indexid': 19, 'id': '1', 'url': 'https://arxiv.org/pdf/2005.11401', 'text': 'encoder” to re-rank documents, along with an extractive reader. RAG demonstrates that neither a\nre-ranker nor extractive reader is necessary for state-of-the-art performance.\n\nThere are several advantages to generating answers even when it is possible to extract them. Docu-\nments with clues about the answer but do not contain the answer verbatim can still contribute towards\na correct answer being generated, which is not possible with standard extractive approaches, leading\n\n5\n\n\nTable 1: Open-Domain QA Test Scores. For TQA,\nleft column uses the standard test set for Open-\nDomain QA, right column uses the TQA-Wiki\ntest set. See Appendix D for further details.\n\nModel NQ TQA WQ CT\n\nClosed\nBook\n\nT5-11B [52] 34.5 - /50.1 37.4 -\nT5-11B+SSM[52] 36.6 - /60.5 44.7 -\n\nOpen\nBook\n\nREALM [20] 40.4 - / - 40.7 46.8\nDPR [26] 41.5 57.9/ - 41.1 50.6\n\nRAG-Token 44.1 55.2/66.1 45.5 50.0\nRAG-Seq. 44.5 56.8/68.0 45.2 52.2\n\nTable 2: Generation and classification Test Scores.\nMS-MARCO SotA is [4], FEVER-3 is [68] and\nFEVER-2 is [57] *Uses gold context/evidence.\nBest model without gold access underlined.\n\nModel Jeopardy MSMARCO FVR3 FVR2\nB-1 QB-1 R-L B-1 Label Acc.\n\nSotA - - 49.8* 49.9* 76.8 92.2*\n\nBART 15.1 19.7 38.2 41.6 64.0 81.1\n\nRAG-Tok. 17.3 22.2 40.1 41.5 72.5 89.5RAG-Seq. 14.7 21.4 40.8 44.2\n\nto more effective marginalization over documents. Furthermore, RAG can generate correct answers\neven when the correct answer is not in any retrieved document, achieving 11.8% accuracy in such\ncases for NQ, where an extractive model would score 0%.\n\n4.2 Abstractive Question Answering'}

Note that the id and metadata are the same for every chunk of a document, while the indexid and chunk text change with each row.
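Since every chunk of a document shares the same logical id, we can pull a single document's chunks back with a simple filter. Below is a minimal sketch, assuming the index built above, where id '1' is the arXiv paper from the indexing workflow.

# Retrieve chunks for a single logical document (id "1" = the RAG paper)
for row in embeddings.search("SELECT indexid, id, text FROM txtai WHERE id = '1'", 3):
    print(row["indexid"], row["text"][:80])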

Retrieval

The last thing to cover is a couple of retrieval operations. LLMs are great at generating answers when we properly bound the context. See the two examples below.

print(embeddings.search("What is it called when LLM generation is bounded with factually correct data?", 1)[0]["text"])
including less of emphasis on lightly editing a retrieved item, but on aggregating content from several
pieces of retrieved content, as well as learning latent retrieval, and retrieving evidence documents
rather than related training pairs. This said, RAG techniques may work well in these settings, and
could represent promising future work.

6 Discussion

In this work, we presented hybrid generation models with access to parametric and non-parametric
memory. We showed that our RAG models obtain state of the art results on open-domain QA. We
found that people prefer RAG’s generation over purely parametric BART, finding RAG more factual
and specific. We conducted an thorough investigation of the learned retrieval component, validating
its effectiveness, and we illustrated how the retrieval index can be hot-swapped to update the model
without requiring any retraining. In future work, it may be fruitful to investigate if the two components
can be jointly pre-trained from scratch, either with a denoising objective similar to BART or some
another objective. Our work opens up new research directions on how parametric and non-parametric
memories interact and how to most effectively combine them, showing promise in being applied to a
wide variety of NLP tasks.

9


Broader Impact

This work offers several positive societal benefits over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct benefit to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.

With these advantages also come potential downsides: Wikipedia, or any potential external knowledge
source, will probably never be entirely factual and completely devoid of bias. Since RAG can be
employed as a language model, similar concerns as for GPT-2 [50] are valid here, although arguably
to a lesser extent, including that it might be used to generate abuse, faked or misleading content in
the news or on social media; to impersonate others; or to automate the production of spam/phishing
content [54]. Advanced language models may also lead to the automation of various jobs in the
coming decades [16].
print(embeddings.search("Tell me about semantic search", 1)[0]["text"])
Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.

Get started with the following examples.

|Notebook|Description||
|---|---|---|
|[Introducing txtai](https://github.com/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb) |Overview of the functionality provided by txtai||
|[Similarity search with images](https://github.com/neuml/txtai/blob/master/examples/13_Similarity_search_with_images.ipynb) |Embed images and text into the same space for search||
|[Build a QA database](https://github.com/neuml/txtai/blob/master/examples/34_Build_a_QA_database.ipynb) |Question matching with semantic search||
|[Semantic Graphs](https://github.com/neuml/txtai/blob/master/examples/38_Introducing_the_Semantic_Graph.ipynb) |Explore topics, data connectivity and run network analysis||

### LLM Orchestration
Autonomous agents, retrieval augmented generation (RAG), chat with your data, pipelines and workflows that interface with large language models (LLMs).

See below to learn more.

|Notebook|Description||
|---|---|---|
|[Prompt templates and task chains](https://github.com/neuml/txtai/blob/master/examples/44_Prompt_templates_and_task_chains.ipynb) |Build model prompts and connect tasks together with workflows||
|[Integrate LLM frameworks](https://github.com/neuml/txtai/blob/master/examples/53_Integrate_LLM_Frameworks.ipynb) |Integrate llama.cpp, LiteLLM and custom generation frameworks||
|[Build knowledge graphs with LLMs](https://github.

Note how both answers give more than enough information for an LLM to answer the question.
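To close the loop, here's a hedged sketch of how a retrieved chunk can bound an LLM prompt. It uses txtai's LLM pipeline; the model path is simply the recommendation from the model guide shown earlier, and the prompt wording is illustrative, so substitute any instruction-tuned model and prompt that fit your setup.

from txtai import LLM

# Example model only - swap in any instruction-tuned model available to you
llm = LLM("meta-llama/Llama-3.1-8B-Instruct")

question = "Tell me about semantic search"

# Bound the prompt with the best matching chunk from the retrieval step above
context = embeddings.search(question, 1)[0]["text"]

prompt = f"""
Answer the following question using only the context below. Only include information
specifically discussed.

question: {question}
context: {context}
"""

print(llm(prompt))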

Wrapping up

This article covered how to build a retrieval system for RAG with txtai. Chunking and retrieval are key pieces of a RAG system, arguably the most important. As LLMs become commoditized, how data is presented to them matters more and more. When given concise information, LLMs can take it from there!
