In the last chapter, we added the first audio capability to our chatbot by allowing the user to interact with the model using their voice. In this chapter, we are going to add the opposite skill: giving a voice to our chatbot.
Text-to-speech
In recent years, models have improved dramatically at generating audio from text input. In some cases, model providers offer standalone Text-to-speech models, like TTS from OpenAI. We also have the option of using more powerful models that support both multimodal input (text, image, video and audio) and output (text, image and voice). Some examples of these more powerful models are Gemini 2.0 Flash from Google and GPT-4o-realtime from OpenAI.
The possibility of generating high quality audio thanks to these TTS models, combined with the potential of powerful text models (like GPT-4o), has enabled many use cases that were unimaginable just a few years ago. For example, in 2024, Google released NotebookLM, an application that generates podcasts based on sources uploaded by the user. If you are researching evaluation techniques for LLMs, you can upload materials such as papers or articles, and the application creates a podcast where two AI voices have a conversation summarizing and explaining your material.
Text-to-speech on Semantic Kernel
In November 2024, Microsoft added support for audio capabilities to Semantic Kernel. For the Text-to-speech scenario, we will build the following workflow:
- Receive a text or audio input (see the previous chapter, where we added Audio-to-text functionality).
- Use a standard LLM to generate a response from the user's input.
- Use the TTS model from OpenAI to convert the response into audio (WAV format).
- Play the generated audio back to the user.
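Putting these steps together, the loop we are building looks roughly like this (a high-level sketch; get_user_input is a placeholder, and the assistant helpers and AudioPlayer class are introduced later in this chapter):

text = get_user_input()                             # 1. text (or transcribed audio) input
response = await assistant.generate_response(text)  # 2. answer from the standard LLM
audio = await assistant.generate_audio(response)    # 3. convert the answer to WAV bytes
AudioPlayer().play_wav_from_bytes(audio)            # 4. play the audio back to the user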
Based on our previous chatbot, the first two steps are already covered. Let's now focus on converting the text response into audio with the TTS model.
Generate audio
First of all, we need to inject a new service into our Kernel. In this case, we register an AzureTextToAudio service:

# Inject the service into the Kernel
self.kernel.add_service(AzureTextToAudio(
    service_id='text_to_audio_service'
))

# Get the service from the Kernel
self.text_to_audio_service: AzureTextToAudio = self.kernel.get_service(type=AzureTextToAudio)
Because the service is declared as an Azure service, it uses the following environment variables:
- AZURE_OPENAI_TEXT_TO_AUDIO_DEPLOYMENT_NAME: the name of the model deployed in Azure OpenAI.
- AZURE_OPENAI_API_KEY: the API key associated with the Azure OpenAI instance.
- AZURE_OPENAI_ENDPOINT: the endpoint associated with the Azure OpenAI instance.
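If you prefer not to rely on environment variables, the same values can usually be passed directly to the connector's constructor (a sketch; the argument names are assumed from the other Azure OpenAI connectors, so double-check them against your Semantic Kernel version):

# Assumed constructor arguments for explicit configuration
self.kernel.add_service(AzureTextToAudio(
    service_id='text_to_audio_service',
    deployment_name='tts',                                 # your Azure OpenAI deployment
    endpoint='https://<your-resource>.openai.azure.com/',  # your Azure OpenAI endpoint
    api_key='<your-api-key>'
))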
Similarly, Semantic Kernel offers many other AI connectors, like the OpenAITextToAudio service. In that case, the names of the variables would be:
- OPENAI_TEXT_TO_AUDIO_MODEL_ID: the OpenAI text-to-audio model ID to use.
- OPENAI_API_KEY: the API key associated with your organization.
- OPENAI_ORG_ID: the unique identifier for your organization.
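For reference, registering the non-Azure connector looks very similar (a sketch; the import path may vary slightly between Semantic Kernel versions):

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudio

# Register the plain OpenAI text-to-audio connector instead of the Azure one
self.kernel.add_service(OpenAITextToAudio(
    service_id='text_to_audio_service'
))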
You can check all the settings used by Semantic Kernel in the official GitHub repository.
The TextToAudio service is quite simple to use. It has two important methods:
- get_audio_contents: returns a list of generated audio contents. Some models do not support generating multiple audio outputs from a single input; in that case, the list contains a single element.
- get_audio_content: identical to the previous method, but it always returns the first element of the list.
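For example, if you prefer to work with the list-based method, a minimal sketch (assuming the same service instance registered above, with a hypothetical helper name) could look like this:

# Hypothetical variant of the helper using the list-based API
async def generate_audio_candidates(self, message: str) -> bytes:
    contents = await self.text_to_audio_service.get_audio_contents(message)
    # Most models return a single element, so we keep the first one
    return contents[0].data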
Both methods accept an optional OpenAITextToAudioExecutionSettings argument to customize the behavior of the service. With the current version of Semantic Kernel, you can adjust the playback speed, the voice used (Alloy being the default), and the output format. In this case, I have decided to use the echo voice in WAV format:
async def generate_audio(self, message: str) -> bytes:
    audio_settings = OpenAITextToAudioExecutionSettings(voice='echo', response_format='wav')
    audio_content = await self.text_to_audio_service.get_audio_content(message, audio_settings)
    return audio_content.data
The method returns the generated audio as raw bytes. Now we can easily use the text response produced by the LLM to generate the corresponding audio:
response = await assistant.generate_response(text)
add_message_chat('assistant', response)
if config['audio'] == 'enabled':
    audio = await assistant.generate_audio(response)
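At this point, audio holds the WAV bytes returned by generate_audio. If you want to double-check them outside the user interface, you can simply write them to a file and open it with any audio player (a debugging sketch; the file name is just an example):

# Quick check: dump the generated audio to disk (hypothetical file name)
with open('response.wav', 'wb') as f:
    f.write(audio)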
Playing the audio
Once the audio is generated, we need some code to play it on the user's computer. For that purpose, I have created a simple AudioPlayer class using the pyaudio library:
import io
import wave

import pyaudio


class AudioPlayer:
    def play_wav_from_bytes(self, wav_bytes, chunk_size=1024):
        p = pyaudio.PyAudio()
        try:
            # Wrap the raw bytes in a file-like object so the wave module can read them
            wav_io = io.BytesIO(wav_bytes)
            with wave.open(wav_io, 'rb') as wf:
                channels = wf.getnchannels()
                rate = wf.getframerate()
                # Open an output stream matching the WAV parameters
                stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                                channels=channels,
                                rate=rate,
                                output=True)
                # Stream the audio in chunks until the file is exhausted
                data = wf.readframes(chunk_size)
                while len(data) > 0:
                    stream.write(data)
                    data = wf.readframes(chunk_size)
                stream.stop_stream()
                stream.close()
        finally:
            p.terminate()
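You can test the class on its own with any WAV file, for instance the response.wav we saved earlier (a quick sketch; the file name is just an example):

# Standalone test of the player with a WAV file from disk
player = AudioPlayer()
with open('response.wav', 'rb') as f:
    player.play_wav_from_bytes(f.read())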
Finally, we call the play_wav_from_bytes method to play the audio generated by the model:
# Generate response with standard LLM
response = await assistant.generate_response(text)
# Add response to the user interface
add_message_chat('assistant', response)
if config['audio'] == 'enabled':
    # Generate audio from the text
    audio = await assistant.generate_audio(response)
    # Play the audio
    player = AudioPlayer()
    player.play_wav_from_bytes(audio)
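One detail to keep in mind: pyaudio playback blocks until the audio finishes. In an async application you may prefer to push it to a worker thread so the event loop stays responsive (a sketch, assuming Python 3.9+; adapt it to your own async setup):

import asyncio

# Optional: run the blocking playback in a worker thread
await asyncio.to_thread(player.play_wav_from_bytes, audio)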
Summary
In this chapter, we have given our chatbot a voice thanks to a Text-to-speech model. We have turned our agent into a multimodal agent that supports text and audio as both input and output.
In the next chapter, we will integrate the chatbot with Ollama to enable the use of locally run models.
Remember that all the code is already available on my GitHub repository 🐍 PyChatbot for Semantic Kernel.