In this post, I will show you how to build a real-time voice assistant with Mistral AI and FastRTC.
Mistral AI is one of the leading LLM providers out there, and they have made their LLM API easily accessible to developers.
FastRTC, on the other hand, is a real-time communication library for Python that lets you quickly turn any Python function into a real-time audio or video stream over WebRTC or WebSockets.
Building A Real-time Voice Assistant
First, let's install the required libraries by running the command below in your terminal:
pip install mistralai fastrtc
Next, set an environment variable and import the libraries. Create a .env file in your project and save your Mistral API key there:
MISTRAL_API_KEY="<your-api-key>"
Import the libraries:
from mistralai import Mistral
from fastrtc import (ReplyOnPause, Stream, get_stt_model, get_tts_model)
from dotenv import load_dotenv
import os
load_dotenv()
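If you are curious what load_dotenv() does, here is a simplified sketch of the idea (an illustration only, not the actual python-dotenv implementation): it reads KEY=value lines from a .env file and places them into the process environment, where os.environ can read them.

```python
# Simplified sketch of what load_dotenv() does conceptually
# (not the real python-dotenv code): parse a KEY=value line
# and put it into the process environment.
import os

def load_env_line(line):
    key, _, value = line.partition("=")
    os.environ[key.strip()] = value.strip().strip('"')

load_env_line('MISTRAL_API_KEY = "<your-api-key>"')
print(os.environ["MISTRAL_API_KEY"])  # <your-api-key>
```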
To get your Mistral API key, you will need to create an account on their website.
Although we are using a Mistral LLM in this project, you can plug in any LLM and get real-time voice responses.
In the code above, we imported Mistral and the specific functions we need from FastRTC, namely ReplyOnPause(), Stream(), get_stt_model(), and get_tts_model():
ReplyOnPause(): This wraps a Python audio function. It monitors the incoming audio and, when it detects a pause, takes that as the cue to generate a reply.
Stream(): This streams the audio reply.
get_stt_model(): This returns the speech-to-text model used to convert audio into text.
get_tts_model(): This returns the text-to-speech model used to convert text back into audio.
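To build intuition for what ReplyOnPause() is doing, here is a conceptual sketch of pause detection (an assumption about the general idea, not FastRTC's actual implementation): track the energy of incoming audio frames and treat a run of consecutive quiet frames as the cue to reply.

```python
# Conceptual sketch of pause detection (not FastRTC's real code):
# a pause is confirmed once the audio stays below an energy
# threshold for several consecutive frames.

def detect_pause(frame_energies, threshold=0.1, quiet_frames=3):
    """Return the index of the frame where a pause is confirmed, or None."""
    quiet = 0
    for i, energy in enumerate(frame_energies):
        if energy < threshold:
            quiet += 1
            if quiet >= quiet_frames:
                return i  # enough consecutive quiet frames: treat as a pause
        else:
            quiet = 0  # speech resumed, reset the counter
    return None

# Speech (high energy) followed by silence (low energy):
energies = [0.8, 0.9, 0.7, 0.05, 0.04, 0.03, 0.02]
print(detect_pause(energies))  # pause confirmed at frame 5
```

FastRTC handles all of this for you; you only supply the function that produces the reply.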
Now, let's initialize the Mistral client with the API key stored in the .env file:
api_key = os.environ["MISTRAL_API_KEY"]
model = "mistral-large-latest"
client = Mistral(api_key=api_key)
Here we are using the Mistral Large model; however, you can try other Mistral models too.
We will now build the audio function that takes the user's audio, turns it into a prompt, and returns a spoken response:
stt_model = get_stt_model()
tts_model = get_tts_model()
def echo(audio):
    prompt = stt_model.stt(audio)
    chat_response = client.chat.complete(
        model=model,
        messages=[
            {
                "role": "user",
                "content": prompt
            },
        ]
    )
    for audio_chunk in tts_model.stream_tts_sync(chat_response.choices[0].message.content):
        yield audio_chunk
Above, we wrote a function called echo. The function takes an audio input and passes it to the speech-to-text model, which converts it into a user prompt for the LLM. The LLM's response is then passed to the text-to-speech model and streamed synchronously.
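The flow above can be seen more clearly with the models stubbed out. In this sketch, fake_stt, fake_llm, and fake_tts_stream are hypothetical placeholders (not the real Mistral or FastRTC APIs) that show the generator pattern: each TTS chunk is yielded as soon as it is ready instead of waiting for the whole reply.

```python
# Stand-in pipeline with the models stubbed out, to show the
# generator pattern used by echo(). The fake_* functions are
# placeholders, not real Mistral/FastRTC calls.

def fake_stt(audio):
    return "hello"                    # pretend transcription

def fake_llm(prompt):
    return f"You said: {prompt}"      # pretend chat completion

def fake_tts_stream(text):
    for word in text.split():         # pretend one audio chunk per word
        yield f"<audio:{word}>"

def echo(audio):
    prompt = fake_stt(audio)
    reply = fake_llm(prompt)
    for chunk in fake_tts_stream(reply):
        yield chunk

print(list(echo(b"...")))  # ['<audio:You>', '<audio:said:>', '<audio:hello>']
```

Because echo is a generator, FastRTC can start playing the first chunk while later ones are still being synthesized, which keeps latency low.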
Finally, we will run the application:
stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()
This will launch the UI at http://127.0.0.1:7860/
Now, you can activate the microphone and say something to your assistant, which will reply immediately.
Change Voice
If you do not like the default voice, you can change it by passing an instance of KokoroTTSOptions() to the text-to-speech method.
First, import KokoroTTSOptions from FastRTC by adding it to the import tuple:
from fastrtc import (ReplyOnPause, Stream, get_stt_model, get_tts_model, KokoroTTSOptions)
Next, define the options:
tts_model = get_tts_model(model="kokoro")
options = KokoroTTSOptions(
voice="af_bella",
speed=1.0,
lang="en-us"
)
Then pass the options to the text-to-speech method in your audio function:
for audio_chunk in tts_model.stream_tts_sync(chat_response.choices[0].message.content, options=options):
    yield audio_chunk
For more voice options, you can check out KokoroTTS documentation.
Complete Project Code
Here is the complete code we used to create the real-time voice assistant:
import os
from fastrtc import (ReplyOnPause, Stream, get_stt_model, get_tts_model, KokoroTTSOptions)
from dotenv import load_dotenv
from mistralai import Mistral
load_dotenv()
api_key = os.environ["MISTRAL_API_KEY"]
model = "mistral-large-latest"
client = Mistral(api_key=api_key)
options = KokoroTTSOptions(
voice="af_bella",
speed=1.0,
lang="en-us"
)
stt_model = get_stt_model()
tts_model = get_tts_model(model="kokoro")
def echo(audio):
    prompt = stt_model.stt(audio)
    chat_response = client.chat.complete(
        model=model,
        messages=[
            {
                "role": "user",
                "content": prompt
            },
        ]
    )
    for audio_chunk in tts_model.stream_tts_sync(chat_response.choices[0].message.content, options=options):
        yield audio_chunk
stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()
I hope you found this post useful. Thanks for reading!