With the OpenAI Realtime API, you can build speech-to-speech applications that let you interact directly with a generative AI model by speaking with it. Talking directly to a model feels really natural, and the Realtime API makes it possible to build experiences like this into your own applications and businesses.
One example of this was built by Twilio: it enables you to connect a phone call to GPT-4o with Node.js (or, if you prefer, Python). The example is great, but it only shows connecting to a plain GPT-4o with a system prompt that encourages owl facts and jokes. Much as I like owl facts, I wanted to see what else we could achieve with a voice agent like this.
In this post, we'll show you how to extend the original assistant into an agent that can choose to use tools to augment its response. We'll give it additional, up-to-date knowledge via retrieval-augmented generation (RAG) using Astra DB.
Want to try it out before we dive into the details? Call (855) 687-9438 (that's 855-6-TSWIFT) and have a chat!
Prerequisites
First, you'll need to set up the application from the Twilio blog post; that means a Twilio account and an OpenAI API key. Make sure you can make a call and chat with the bot successfully.
You will also need a free DataStax account so you can set up RAG with Astra DB.
What we’re going to build
We already have a voice-capable bot that you can speak to over the phone. We're going to gather some up-to-date data and store it in Astra DB to help the bot answer questions.
The OpenAI Realtime API enables you to define tools that the model can use to execute functions and extend its capabilities. We’ll give the model a tool that enables it to search the database for additional information (this is an example of agentic RAG).
Ingesting data
To test out this agent, we're going to write a quick script to load and parse a web page, turn the content into chunks, turn those chunks into vector embeddings, and store them in Astra DB.
Create your database
To kick this process off, you'll need to create a database. Log into your DataStax account and, on the Astra DB dashboard, click Create a Database. Choose a Serverless (Vector) database, give it a name, and pick a provider and region. That will take a couple of minutes to provision. While it's doing that, have a think about some good web pages you might want to ingest into this database.
Once the database is ready, click on the Data Explorer tab and then the Create Collection + button. Give your collection a name, ensure it is a vector-enabled collection and choose NVIDIA as the embedding generation method. This will automatically generate vector embeddings for the content we insert into the collection.
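If you'd rather script this step, the Data API client can also create a vectorize-enabled collection. Here's a minimal sketch; it assumes the @datastax/astra-db-ts client and the environment variables we set up in the next section, and the embedding model name is a placeholder, so match it to whatever you choose in the dashboard:
// create-collection.js — optional sketch: create the vector collection via the Data API
// Assumes the .env variables described below; the embedding model name is a placeholder.
import { DataAPIClient } from "@datastax/astra-db-ts";
import dotenv from "dotenv";
dotenv.config();

const client = new DataAPIClient(process.env.ASTRA_DB_APPLICATION_TOKEN);
const db = client.db(process.env.ASTRA_DB_API_ENDPOINT);

await db.createCollection(process.env.ASTRA_DB_COLLECTION_NAME, {
  vector: {
    metric: "cosine",
    // NVIDIA embedding generation, as selected in the dashboard
    service: { provider: "nvidia", modelName: "NV-Embed-QA" },
  },
});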
Connect to the database
Open the application code in your favourite text editor. To get the application running, you should already have created a .env file and populated it with your OpenAI API key (if you didn't do that yet, now is definitely the time). Open that .env file and add some more environment variables:
ASTRA_DB_APPLICATION_TOKEN=
ASTRA_DB_API_ENDPOINT=
ASTRA_DB_COLLECTION_NAME=
Fill in the variables with the information from your database. You can find the API endpoint and generate an application token from the database overview in the Astra DB dashboard. Enter the name of the collection you just created, too.
Now we can connect to the database in the application. Install the Astra DB client from npm.
npm install @datastax/astra-db-ts
Create a new file in the application called db.js. Open the file and enter the following code:
import { DataAPIClient } from "@datastax/astra-db-ts";
import dotenv from "dotenv";
dotenv.config();
const {
  ASTRA_DB_APPLICATION_TOKEN,
  ASTRA_DB_API_ENDPOINT,
  ASTRA_DB_COLLECTION_NAME,
} = process.env;
const client = new DataAPIClient(ASTRA_DB_APPLICATION_TOKEN);
const db = client.db(ASTRA_DB_API_ENDPOINT);
export const collection = db.collection(ASTRA_DB_COLLECTION_NAME);
This code imports the Astra DB client and loads the variables from the .env file into the environment. It then uses those environment variables as credentials to connect to the collection, and exports the collection object to be used elsewhere in the application.
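If you want to confirm the connection works before going further, a quick throwaway check like this (a hypothetical check-db.js, not part of the final application) will do:
// check-db.js — quick sanity check that the credentials and collection name are right
import { collection } from "./db.js";

// Resolves without throwing if the connection works; returns null while the collection is empty
const doc = await collection.findOne({});
console.log("Connected to Astra DB. Sample document:", doc);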
Get some data
Now let's create a script that loads and parses a web page, then splits it into chunks and stores it in Astra DB. This script combines techniques from earlier blog posts about scraping web pages, chunking text, and creating vector embeddings; check out those posts to read about each step in more depth.
Install the dependencies:
npm install @langchain/textsplitters @mozilla/readability jsdom
Create a file called ingest.js and copy the following code:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";
import { collection } from "./db.js";
import { parseArgs } from "node:util";
const { values } = parseArgs({
  args: process.argv.slice(2),
  options: { url: { type: "string", short: "u" } },
});
const { url } = values;
const html = await fetch(url).then((res) => res.text());
const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);
const article = reader.parse();
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 100,
});
const docs = (await splitter.splitText(article.textContent)).map((chunk) => ({
  $vectorize: chunk,
}));
await collection.insertMany(docs);
This script:
- uses the Node.js argument parser to get a URL from the command line arguments
- loads the web page at that URL
- parses the content from the page using Readability.js and JSDOM
- splits the text into 500-character chunks with a 100-character overlap using the RecursiveCharacterTextSplitter
- turns the chunks into objects where the chunk of text becomes the $vectorize property
- inserts all the documents into the collection

Using the $vectorize property tells Astra DB to automatically create vector embeddings for this content.
We can now run this file from the command line. For example, here's how to ingest the Wikipedia page on Taylor Swift:
node ingest.js --url https://en.wikipedia.org/wiki/Taylor_Swift
Once this command has been run, check the collection in the DataStax dashboard to see the contents and the vectors.
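You can also spot-check the data from code. A throwaway script like this (a hypothetical query.js, reusing the collection exported from db.js) runs a vector search against the freshly ingested chunks:
// query.js — throwaway script to spot-check the ingested chunks with a vector search
import { collection } from "./db.js";

const cursor = collection.find(
  {},
  {
    // Sorting by $vectorize asks Astra DB to embed the query and rank by similarity
    sort: { $vectorize: "When was Taylor Swift born?" },
    limit: 3,
    projection: { $vectorize: 1 },
  }
);

for (const doc of await cursor.toArray()) {
  console.log(doc.$vectorize);
}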
Build the voice agent
To turn our existing voice assistant into an agent that can choose to search the database for more information, we need to provide it with a tool, or function, that it can choose to use.
Create a new file called tools.js and open it in your editor. Start by importing collection from db.js:
import { collection } from "./db.js";
Next we need to create the function that the agent can use to search the database.
When the OpenAI agent provides parameters to call a function with, it does so as an object. So the function should receive an object from which we can destructure the query. We'll then use the query to perform a vector search against our collection.
We can use Astra DB Vectorize to automatically create a vector embedding of the query. We'll also limit the results to the top 10 and ensure we return the text from the chunks by selecting $vectorize in the projection.

Calling find on the collection with these arguments will return a cursor, which we can turn into an array by calling toArray. We then iterate over the array of documents, extract just the text, and join the resulting array with a newline to create a single string result that can be provided as context to the agent.
async function taylorSwiftFacts({ query }) {
  // Vector search: sorting by $vectorize embeds the query and returns the closest chunks
  const docs = await collection.find(
    {},
    { sort: { $vectorize: query }, limit: 10, projection: { $vectorize: 1 } }
  );
  // Join the matching chunks into a single string of context for the agent
  return (await docs.toArray()).map((doc) => doc.$vectorize).join("\n");
}
I've called the function taylorSwiftFacts because that's what I loaded with my ingestion script; feel free to use a different name.
This is our first tool; we can write more, but for now we can just export this as an object of tools.
export const TOOLS = {
  taylorSwiftFacts,
};
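Before wiring the tool into the voice agent, it's worth sanity-checking it on its own. A tiny script like this (a hypothetical test-tool.js) calls it the same way the agent eventually will:
// test-tool.js — call the tool directly with a parameters object, as the agent will
import { TOOLS } from "./tools.js";

const result = await TOOLS.taylorSwiftFacts({ query: "Taylor Swift's early career" });
console.log(result);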
To help the model choose when to use this tool, it needs a description of what it can do and the arguments it expects. For each tool you provide a type, name, description, and the parameters.
For our function, the type will be "function" and the name is taylorSwiftFacts. The description will tell the agent that we have up-to-date information about Taylor Swift that it can search for. The parameters are a JSON Schema description of the arguments your function expects; this tool is relatively simple, as it only requires one parameter called query, which is a string. The full description looks like this:
export const DESCRIPTIONS = [
  {
    type: "function",
    name: "taylorSwiftFacts",
    description:
      "Search for up to date information about Taylor Swift from her wikipedia page",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "The search query",
        },
      },
    },
  },
];
Our tool definition is complete for now, so let's add it to our agent.
Handling function calls in a voice agent
We've been building supporting functions around the existing application so far, but to connect our tool to the agent we need to dig into the main body of code. Open index.js in your editor and start by importing the tools we just defined:
import Fastify from 'fastify';
import WebSocket from 'ws';
import dotenv from 'dotenv';
import fastifyFormBody from '@fastify/formbody';
import fastifyWs from '@fastify/websocket';
import { DESCRIPTIONS, TOOLS } from "./tools.js";
We need to update the system prompt to more accurately describe what the agent is capable of with the tool available to it. Since we ingested the Wikipedia page for Taylor Swift earlier, we can update the prompt so the assistant behaves like a Taylor Swift superfan.
Find the SYSTEM_MESSAGE constant and update it with:
const SYSTEM_MESSAGE = "You are a helpful and bubbly AI assistant who loves Taylor Swift. You can use your knowledge about Taylor Swift to answer questions, but if you don't know the answer, you can search for relevant facts with your available tools.";
Next we need to provide the tool we have built to the agent. Find the initializeSession function; it defines a sessionUpdate object that includes all the details to initialize the agent. Add a tools property to the session object using the DESCRIPTIONS array we imported earlier:
const sessionUpdate = {
  type: 'session.update',
  session: {
    turn_detection: { type: 'server_vad' },
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
    voice: VOICE,
    instructions: SYSTEM_MESSAGE,
    modalities: ["text", "audio"],
    temperature: 0.8,
    tools: DESCRIPTIONS
  }
};
We can also provide tools on a request-by-request basis, but this agent will benefit from access to this tool in all its interactions.
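For reference, a per-request override looks roughly like this: the Realtime API lets a response.create event carry response settings, including tools, so you could supply them for a single response instead of the whole session. Treat the exact fields as an assumption and check the Realtime API reference:
// Sketch: supply tools for one response only, rather than on the whole session.
// Field names follow the Realtime API's response.create event; verify against the docs.
const responseCreate = {
  type: "response.create",
  response: {
    instructions: "Use the Taylor Swift search tool if you need extra facts.",
    tools: DESCRIPTIONS,
  },
};
openAiWs.send(JSON.stringify(responseCreate));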
Finally, we need to handle the event when the model requests to use a tool. Find the event handler for when the connection to OpenAI receives a message; it looks like openAiWs.on('message', … ).

Change the event handler to an async function:
openAiWs.on('message', async (data) => {
When the Realtime API wants to use a tool, it sends an event with the type "response.done". Within the event object there are outputs, and if one of those outputs has a type of "function_call", we know the model wants to use one of its tools.
The output provides the name of the function it wants to call and the arguments. We can look up the tool in the TOOLS object that we imported, then call it with the arguments.
When we have the result of the function call, we pass it back to the model so that it can choose what to do next. We do so by creating a new message with the type "conversation.item.create"; within that message we include an item with the type "function_call_output", the output of the function call, and the call ID from the original event, so that the model can tie the response to the original query.
We send this to the model, along with another message with the type "response.create", which requests that the model use this new information to return a new response.
Overall, this enables the model to request to use the database search function we defined and provide the arguments it wants to call the function with. We are then responsible for calling the function and returning the results to the model. The whole code looks like this:
openAiWs.on('message', async (data) => {
  try {
    const response = JSON.parse(data);

    if (LOG_EVENT_TYPES.includes(response.type)) {
      console.log(`Received event: ${response.type}`, response);
    }

    if (response.type === "response.done") {
      const outputs = response.response.output;
      const functionCall = outputs.find(
        (output) => output.type === "function_call"
      );
      if (functionCall && TOOLS[functionCall.name]) {
        // Call the matching tool with the arguments the model provided
        const result = await TOOLS[functionCall.name](
          JSON.parse(functionCall.arguments)
        );
        // Send the result back to the model, tied to the original request by call_id
        const conversationItemCreate = {
          type: "conversation.item.create",
          item: {
            type: "function_call_output",
            call_id: functionCall.call_id,
            output: result,
          },
        };
        openAiWs.send(JSON.stringify(conversationItemCreate));
        // Ask the model to generate a new response using the tool output
        openAiWs.send(JSON.stringify({ type: "response.create" }));
      }
    }

    // ... the rest of the existing event handling goes here
  } catch (error) {
    console.error('Error processing OpenAI message:', error, 'Raw message:', data);
  }
});
Start the application and make sure it is connected to your Twilio number as described in the Twilio blog post. Now we can call and chat about all things Taylor Swift.
If you want to try this out with my assistant, you can give it a call on (855) 687-9438.
This is a new way to connect with the Taylor Swift bot we built a while back, so now you can chat with SwiftieGPT online or on the phone.
Give your voice assistants some agency
Real-time voice agents are very cool, but they have all the same drawbacks as a plain LLM. In this post we added agentic RAG capabilities to our voice agent and it was able to use up-to-date knowledge to answer our questions about Taylor Swift.
When you provide a voice agent with tools, like a vector database search for extra context, the results are very impressive. The combination of Twilio, OpenAI, and Astra DB creates a very powerful agent.
You can find the code for this in my fork of the Twilio project. You don't have to stop here, though; you can define and add further tools to the agent. Make sure you check out OpenAI's best practices for defining functions for your models.
If you're interested in building other agents, check out how to work with Langflow and Composio or the workshop and videos from the recent Hacking Agents event.
Are you excited about voice agents or agentic RAG? Come chat about it and what you're building in the DataStax Devs Discord.
Want to roll up your sleeves and build with OpenAI, Twilio, Cloudflare, Unstructured, and DataStax? Join us on Feb. 28 in San Francisco for the Hacking Agents Hackathon, an epic 24-hour hackathon where we'll be diving into what developers can build with the latest and greatest in AI tooling.