Kevin Naidoo
Voice AI: How to build a voice AI assistant?

Voice is an interesting platform. Can you imagine building a voice service that can pick up the phone and hold a conversation with a human in real time?

This may have been far-fetched a few years ago, but now it's totally possible and relatively simple to achieve.

How to achieve this?

To build a Voice AI service, you need a few components:

[Diagram: voice AI call flow]

  1. SIP: A protocol that allows a regular phone to make and receive calls over the internet. Usually with SIP you need a softphone like Linphone and a SIP account (a username, password, and SIP domain). For voice AI, you'll need to build a headless client that can run as a daemon. An alternative is to just use Twilio media streams.

  2. Audio transcription: A service such as OpenAI Whisper. This is responsible for transcribing the caller's audio into text.

  3. Text-to-speech: A service such as OpenAI TTS, which converts the generated reply back into audio.

To keep this article simple, we'll just use Twilio's media stream service.
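Before any audio reaches your server, Twilio has to be told to open a stream for the incoming call. As a rough sketch (the route name and URL here are placeholders), your voice webhook can return TwiML that connects the call to your WebSocket, using Twilio's Python helper library:

from flask import Flask
from twilio.twiml.voice_response import Connect, VoiceResponse

app = Flask(__name__)

@app.route("/voice", methods=["POST"])
def voice():
    # Tell Twilio to open a bidirectional media stream to our WebSocket.
    response = VoiceResponse()
    connect = Connect()
    connect.stream(url="wss://your-domain.example/websocket")
    response.append(connect)
    return str(response), 200, {"Content-Type": "text/xml"}

You'd set this endpoint as the "A call comes in" webhook on your Twilio phone number.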

What are Twilio media streams?

Twilio is one of the most popular digital telecoms service providers around. They offer a wide variety of voice, WhatsApp, and SMS services, which makes our lives as developers much easier. Instead of building a headless SIP phone from scratch, we can just use media streams.

I am not affiliated with Twilio in any way, but I do use quite a bit of their APIs in my own projects.

Media streams give you a WebSocket connection to a call's live audio, so you can receive and send audio programmatically.

When a call comes in, Twilio connects to your WebSocket and streams the call audio to you. You can then process it with AI and send audio straight back to the caller.
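Each message on that WebSocket is a small JSON frame. The one you'll handle most often is the "media" event, whose payload is base64-encoded 8 kHz mu-law audio. A minimal sketch of decoding it (the values below are illustrative, not real Twilio identifiers):

import base64
import json

# An illustrative "media" frame, shaped like the ones Twilio sends:
frame = json.loads(
    '{"event": "media", "streamSid": "MZxxxx", "media": {"payload": "/////w=="}}'
)

# Decoding the payload yields raw 8 kHz mu-law bytes:
raw_mulaw = base64.b64decode(frame["media"]["payload"])
print(len(raw_mulaw), "bytes of mu-law audio")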

Setting up a WebSocket

Flask is a bit dated! I prefer FastAPI; however, since the Twilio docs generally use Flask, we'll stick with Flask to make things easier to follow.
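If you want to follow along, the sample below leans on these packages (names as published on PyPI; note that pydub also needs ffmpeg installed on your system):

pip install flask flask-sockets gevent gevent-websocket openai pydub pywav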

import base64
import json
import os
import time
import uuid
from io import BytesIO

import pywav
from flask import Flask
from flask_sockets import Sockets
from openai import OpenAI
from pydub import AudioSegment

HTTP_SERVER_PORT = 8000

app = Flask(__name__)
sockets = Sockets(app)

def buffer_size(audio_buffer):
    # 8 kHz mu-law audio is one byte per sample, so the number of
    # buffered bytes approximates the number of samples.
    sample_rate = 8000
    total_samples = sum(len(chunk) for chunk in audio_buffer)
    duration_seconds = total_samples / sample_rate
    return duration_seconds

def log(msg, *args):
    print("Media WS:", msg, *args)

def respond_to_call_transcription(text):
    """
    Pass the transcription on to your LLM logic and generate an answer.
    """
    answer = None
    return answer

def transcribe_to_text(audio_file):
    tmpFileName = f"/tmp/_audio_buffer_{uuid.uuid4()}.wav"

    def convert_to_wav(audio):
        # Wrap the raw mu-law chunks in a WAV container:
        # 1 channel, 8000 Hz, 8 bits per sample, format 7 (mu-law).
        data_bytes = b"".join(audio)
        wave_write = pywav.WavWrite(tmpFileName, 1, 8000, 8, 7)
        wave_write.write(data_bytes)
        wave_write.close()

        return open(tmpFileName, "rb")

    client = OpenAI()

    try:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=convert_to_wav(audio_file)
        )
        return transcription.text
    except Exception as ex:
        log(ex)
    finally:
        if os.path.exists(tmpFileName):
            os.unlink(tmpFileName)

    return None

def text_to_speech(text):
    client = OpenAI()
    i = 0
    # Retry a few times; TTS calls can fail transiently.
    while i <= 3:
        try:
            i += 1
            response = client.audio.speech.create(
                model="tts-1",
                voice="nova",
                input=text,
                response_format="wav"
            )

            tmpFileName = f"/tmp/_audio_output_buffer_{uuid.uuid4()}.wav"
            response.stream_to_file(tmpFileName)

            # Twilio expects 8 kHz, mono, mu-law audio, base64-encoded.
            audio = AudioSegment.from_file(tmpFileName, format="wav")
            audio = audio.set_frame_rate(8000)
            audio = audio.set_channels(1)
            raw_audio = BytesIO()
            audio.export(raw_audio, format="mulaw")
            raw_audio.seek(0)
            raw_data = raw_audio.read()
            os.unlink(tmpFileName)

            return base64.b64encode(raw_data).decode("utf-8")
        except Exception as ex:
            log(ex)

    # Give up after the retries; the WebSocket handler skips
    # sending audio when this returns None.
    return None

@sockets.route('/websocket')
def echo(ws):
    count = 0
    audio_buffer = []
    streamSid = None
    silence = None

    while not ws.closed:
        message = ws.receive()
        if message is None:
            log("No message received...")
            continue

        data = json.loads(message)

        if data['event'] == 'start':
            log("Connection accepted", data)
            streamSid = data['streamSid']

        if data['event'] == "media":
            buff = data['media']['payload']
            if silence is None and "////////////////////w==" in str(buff):
                silence = time.time()
            elif silence is not None and (time.time() - silence) > 0.3 and buffer_size(audio_buffer) >= 2:
                silence = None
                transcribe = transcribe_to_text(audio_buffer)

                if transcribe:
                    answer = ""
                    try:
                        log("Prompting AI: " + transcribe)
                        response = respond_to_call_transcription(transcribe)
                        answer = response
                        log("AI Said: " + answer)
                    except Exception as ex:
                        answer = open("./audio/error.txt", "r").read()
                        log(ex)

                    try:
                        payload ={
                            "event": "media",
                            "media": {
                                "payload": text_to_speech(answer),
                                },
                                "streamSid": streamSid
                            }

                        ws.send(json.dumps(payload))
                        ws.send(json.dumps({
                                "event": "mark",
                                "streamSid": streamSid,
                                "mark": {
                                    "name": f"chunk_{time.time()}"
                                }
                            }))
                    except Exception as ex:
                        log(ex)

                    audio_buffer = []

            elif "////////////////////w==" not in str(buff):
                silence = None
                audio_data = base64.b64decode(buff)
                audio_buffer.append(audio_data)

        if data['event'] == "stop":
            log("STOPED Message received", message)
        if data['event'] == "closed":
            log("Closed Message received", message)

            break
        count += 1

    log("Connection closed. Received a total of {} messages".format(count))


if __name__ == '__main__':
    from gevent import pywsgi
    from geventwebsocket.handler import WebSocketHandler

    server = pywsgi.WSGIServer(('', HTTP_SERVER_PORT), app, handler_class=WebSocketHandler)
    print("Server listening on: http://0.0.0.0:" + str(HTTP_SERVER_PORT))
    server.serve_forever()

This code is not very optimized and is just a sample to give you a general idea of how to handle the WebSocket connection.
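To try it end to end, the server has to be reachable from Twilio over a public wss:// URL. Assuming you save the sample as app.py and use a tunnel such as ngrok, something like this works:

python app.py
ngrok http 8000
# Point your <Stream> URL at wss://<your-ngrok-host>/websocket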

In my next series of articles, I will take a deeper dive into each aspect of this code and eventually build out a production-grade WebSocket service.
