In the last chapter, we added the first audio capability to our chatbot by allowing the user to interact with the model using their voice. In this chapter, we are going to add the opposite skill: giving a voice to our chatbot.
Text-to-speech
In recent years, models have improved dramatically at generating audio from text input. In some cases, model providers offer standalone Text-to-speech models, like TTS from OpenAI. We also have the option of using more powerful models that support both multimodal input (text, image, video and audio) and output (text, image and voice). Some examples of these more powerful models are Gemini 2.0 Flash from Google and GPT-4o-realtime from OpenAI.
The possibility of generating high quality audio thanks to these TTS models, combined with the potential of powerful text models (like GPT-4o), has enabled many use cases that were unimaginable just a few years ago. For example, in 2024, Google released NotebookLM, an application that generates podcasts based on sources uploaded by the user. If you are researching evaluation techniques for LLMs, you can upload materials such as papers or articles, and the application creates a podcast where two AI voices have a conversation summarizing and explaining your material.
Text-to-speech on Semantic Kernel
In November 2024, Microsoft added support for audio capabilities to Semantic Kernel. For the Text-to-speech scenario, we will build the following workflow:
- Receive a text or audio input (see the previous chapter, where we added Audio-to-text functionality).
- Use a standard LLM to generate a response from the user's input.
- Use the TTS model from OpenAI to convert the response into audio (WAV format).
- Play the generated audio back to the user.
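Putting these steps together, the loop we are building looks roughly like this (a high-level sketch; get_user_input is a placeholder, and the assistant helpers and AudioPlayer class are introduced later in this chapter):

text = get_user_input()                             # 1. text (or transcribed audio) input
response = await assistant.generate_response(text)  # 2. answer from the standard LLM
audio = await assistant.generate_audio(response)    # 3. convert the answer to WAV bytes
AudioPlayer().play_wav_from_bytes(audio)            # 4. play the audio back to the user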
Based on our previous chatbot, the first two steps are already covered. Let's now focus on converting the text response into audio with the TTS model.
Generate audio
First of all, we need to inject a new service into our Kernel. In this case, we register an AzureTextToAudio service:

# Inject the service into the Kernel
self.kernel.add_service(AzureTextToAudio(
    service_id='text_to_audio_service'
))

# Get the service from the Kernel
self.text_to_audio_service: AzureTextToAudio = self.kernel.get_service(type=AzureTextToAudio)
Because the service is declared as an Azure service, it uses the following environment variables:
- AZURE_OPENAI_TEXT_TO_AUDIO_DEPLOYMENT_NAME: the name of the model deployed in Azure OpenAI.
- AZURE_OPENAI_API_KEY: the API key associated with the Azure OpenAI instance.
- AZURE_OPENAI_ENDPOINT: the endpoint associated with the Azure OpenAI instance.
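If you prefer not to rely on environment variables, the same values can usually be passed directly to the connector's constructor (a sketch; the argument names are assumed from the other Azure OpenAI connectors, so double-check them against your Semantic Kernel version):

# Assumed constructor arguments for explicit configuration
self.kernel.add_service(AzureTextToAudio(
    service_id='text_to_audio_service',
    deployment_name='tts',                                 # your Azure OpenAI deployment
    endpoint='https://<your-resource>.openai.azure.com/',  # your Azure OpenAI endpoint
    api_key='<your-api-key>'
))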
Similarly, Semantic Kernel offers many other AI connectors, like the OpenAITextToAudio service. In that case, the names of the variables would be:
- OPENAI_TEXT_TO_AUDIO_MODEL_ID: the OpenAI text-to-audio model ID to use.
- OPENAI_API_KEY: the API key associated with your organization.
- OPENAI_ORG_ID: the unique identifier for your organization.
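For reference, registering the non-Azure connector looks very similar (a sketch; the import path may vary slightly between Semantic Kernel versions):

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudio

# Register the plain OpenAI text-to-audio connector instead of the Azure one
self.kernel.add_service(OpenAITextToAudio(
    service_id='text_to_audio_service'
))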
You can check all the settings used by Semantic Kernel in the official GitHub repository.
The TextToAudio service is quite simple to use. It has two important methods:
- get_audio_contents: returns a list of generated audio contents. Some models do not support generating multiple audio outputs from a single input; in that case, the list contains a single element.
- get_audio_content: identical to the previous method, but it always returns the first element of the list.
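For example, if you prefer to work with the list-based method, a minimal sketch (assuming the same service instance registered above, with a hypothetical helper name) could look like this:

# Hypothetical variant of the helper using the list-based API
async def generate_audio_candidates(self, message: str) -> bytes:
    contents = await self.text_to_audio_service.get_audio_contents(message)
    # Most models return a single element, so we keep the first one
    return contents[0].data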
Both methods accept an optional OpenAITextToAudioExecutionSettings argument to customize the behavior of the service. With the current version of Semantic Kernel, you can adjust the playback speed, the voice used (Alloy being the default), and the output format. In this case, I have decided to use the echo voice in WAV format:
async def generate_audio(self, message: str) -> bytes:
    audio_settings = OpenAITextToAudioExecutionSettings(voice='echo', response_format='wav')
    audio_content = await self.text_to_audio_service.get_audio_content(message, audio_settings)
    return audio_content.data
The method returns the generated audio as raw bytes. Now we can easily use the text response produced by the LLM to generate the corresponding audio:
response = await assistant.generate_response(text)
add_message_chat('assistant', response)
if config['audio'] == 'enabled':
    audio = await assistant.generate_audio(response)
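At this point, audio holds the WAV bytes returned by generate_audio. If you want to double-check them outside the user interface, you can simply write them to a file and open it with any audio player (a debugging sketch; the file name is just an example):

# Quick check: dump the generated audio to disk (hypothetical file name)
with open('response.wav', 'wb') as f:
    f.write(audio)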
Playing the audio
Once the audio is generated, we need some code to play it on the user's computer. For that purpose, I have created a simple AudioPlayer class using the pyaudio library:
import io
import wave

import pyaudio


class AudioPlayer:
    def play_wav_from_bytes(self, wav_bytes, chunk_size=1024):
        p = pyaudio.PyAudio()
        try:
            # Wrap the raw bytes in a file-like object so the wave module can read them
            wav_io = io.BytesIO(wav_bytes)
            with wave.open(wav_io, 'rb') as wf:
                channels = wf.getnchannels()
                rate = wf.getframerate()
                # Open an output stream matching the WAV parameters
                stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                                channels=channels,
                                rate=rate,
                                output=True)
                # Stream the audio in chunks until the file is exhausted
                data = wf.readframes(chunk_size)
                while len(data) > 0:
                    stream.write(data)
                    data = wf.readframes(chunk_size)
                stream.stop_stream()
                stream.close()
        finally:
            p.terminate()
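You can test the class on its own with any WAV file, for instance the response.wav we saved earlier (a quick sketch; the file name is just an example):

# Standalone test of the player with a WAV file from disk
player = AudioPlayer()
with open('response.wav', 'rb') as f:
    player.play_wav_from_bytes(f.read())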
Finally, we call the play_wav_from_bytes method to play the audio generated by the model:
# Generate response with standard LLM
response = await assistant.generate_response(text)
# Add response to the user interface
add_message_chat('assistant', response)
if config['audio'] == 'enabled':
    # Generate audio from the text
    audio = await assistant.generate_audio(response)
    # Play the audio
    player = AudioPlayer()
    player.play_wav_from_bytes(audio)
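One detail to keep in mind: pyaudio playback blocks until the audio finishes. In an async application you may prefer to push it to a worker thread so the event loop stays responsive (a sketch, assuming Python 3.9+; adapt it to your own async setup):

import asyncio

# Optional: run the blocking playback in a worker thread
await asyncio.to_thread(player.play_wav_from_bytes, audio)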
Summary
In this chapter, we have given our chatbot a voice thanks to a Text-to-speech model. We have turned our agent into a multimodal agent that supports text and audio as both input and output.
In the next chapter, we will integrate the chatbot with Ollama to enable the use of locally run models.
Remember that all the code is already available on my GitHub repository 🐍 PyChatbot for Semantic Kernel.