Devesh Bhushan

Building an AI Sales Agent: From Voice to Pitch.

The Story Behind It

When I signed up for EnCode 2025, the problem presented to me was to build an agent capable of delivering high-quality, natural-sounding voice interactions: it should feel like talking to a real person, with very low latency.

So I built a system that can handle complete sales conversations for online coaching centers, from greeting potential customers to understanding their needs and pitching relevant courses, all in positive, human-like speech. Think of it as a tireless sales agent that never has a bad day!

Technical Stack

Speech Processing: Whisper Large V3 Turbo for crystal-clear understanding.
Brain Power: LLaMA 3.3 70B for intelligent conversations.
Voice Output: F5 TTS for natural-sounding responses.
Memory: Pinecone vector database for context and information retrieval.
Demo Platform: Google Colab

How It All Works

At a high level, the flow passes through three main systems:

  • Speech to Text (STT)

  • Large Language Model (LLM)

  • Text to Speech (TTS)

So, the loop is:

User → STT → LLM → TTS → User

In detail:

  1. Customer speaks β†’ Whisper transcribes.
  2. Stage Manager (using regex) tracks conversation phase.
  3. Pinecone retrieves the most relevant chunks from the vector database.
  4. LLaMA 3.3 70B crafts the perfect response.
  5. F5 TTS brings it to life with natural speech.
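
The five steps above can be sketched as a single turn-handling function. Everything here is a stub: the real system wraps Whisper, the regex Stage Manager, Pinecone, LLaMA 3.3 70B, and F5 TTS behind these calls, and the function names are illustrative, not the project's actual API.

```python
def transcribe(audio_chunk: bytes) -> str:
    # Stand-in for Whisper Large V3 Turbo speech-to-text
    return audio_chunk.decode("utf-8")

def detect_stage(text: str, state: dict) -> str:
    # Stand-in for the regex-based Stage Manager
    return "greeting" if "hello" in text.lower() else state.get("stage", "pitch")

def retrieve_context(text: str) -> list[str]:
    # Stand-in for a Pinecone similarity search
    return ["course catalog snippet"]

def generate_reply(text: str, context: list[str], state: dict) -> str:
    # Stand-in for the LLaMA 3.3 70B call
    return f"[{state['stage']}] reply using {len(context)} context chunk(s)"

def synthesize(reply: str) -> bytes:
    # Stand-in for F5 TTS
    return reply.encode("utf-8")

def handle_turn(audio_chunk: bytes, state: dict) -> bytes:
    text = transcribe(audio_chunk)                # 1. customer speaks -> STT
    state["stage"] = detect_stage(text, state)    # 2. track conversation phase
    context = retrieve_context(text)              # 3. fetch relevant chunks
    reply = generate_reply(text, context, state)  # 4. craft the response
    return synthesize(reply)                      # 5. speak it back
```

Keeping each step behind its own function is also what makes the modular debugging mentioned later possible.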

Cool Features You Might Like

  • Smart Voice Selection: 6 different AI voices (2 male, 4 female)
  • Context-Aware Responses: Thanks to vector similarity search
  • Structured Conversation Flow: Managed by a dedicated Stage Manager

Real Talk: Current Limitations

  • Demo runs on Colab.
  • Memory constraints with 8k token limit.
  • Heavy on computational resources.
  • API dependency for core functions.
  • Latency is quite high.

What I Learned

Technical Insights

  • Use of Vector Databases: Working with Pinecone showed me how vector DBs can be game-changers when you have a limited context window. The ability to perform similarity searches over conversation history and training materials in milliseconds is very powerful.
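
The core idea behind that similarity search can be shown with a toy in-memory version: embed the query, score it against stored vectors by cosine similarity, and return the top-k chunks. The real system would call Pinecone's hosted index instead; the vectors and chunk texts here are made up.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # store holds (chunk_text, embedding) pairs; return the k closest chunks
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Pinecone does the same ranking server-side over millions of vectors, which is why it stays fast even as the knowledge base grows.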

  • Stage Management Matters: Once you know which stage the conversation is in, it becomes much easier to inject stage-specific examples into the prompt, such as how to deliver a pitch or what kinds of questions to ask.

  • Web Integration: The most critical part of all of this is passing data between the frontend and backend effectively using FastAPI. By using WebSockets, we could pass data back and forth while initialising the AI only once per call and keeping the connection open for the whole conversation.

System Design Lessons

  • Chunking is Crucial: Processing audio in 5-second chunks rather than waiting for complete utterances significantly improved the user experience by cutting processing delay. It's all about finding the sweet spot between accuracy and speed.
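
Fixed-size chunking is simple to sketch. The 5-second figure comes from the post; the 16 kHz mono 16-bit PCM format is an assumption for illustration.

```python
SAMPLE_RATE = 16_000       # assumed: 16 kHz mono
BYTES_PER_SAMPLE = 2       # assumed: 16-bit PCM
CHUNK_SECONDS = 5
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS  # 160,000 bytes

def chunk_audio(pcm: bytes):
    """Yield 5-second chunks so STT can start before the utterance ends."""
    for offset in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[offset:offset + CHUNK_BYTES]
```

Smaller chunks cut latency but risk splitting words mid-utterance, which is exactly the accuracy/speed trade-off mentioned above.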

  • Modular Architecture Wins: Breaking the system into distinct services (STT, LLM, TTS) made development and debugging so much easier. When something went wrong, I could quickly pinpoint which part needed fixing.

Real-World Constraints

  • API Economics: Managing multiple API calls (Whisper, LLaMA) taught me the importance of optimizing API usage. Staying fast while keeping the number of API calls to a minimum is very challenging.

  • Reducing Latency: Latency is very hard to reduce when you are constantly fetching and processing data over the internet. In the future I will try to minimize the number of round trips needed to upload or download data.

Unexpected Challenges

  • Prompt Engineering: Prompt engineering is crucial and can decide whether your model is as coherent as a human or repeats the same sentence over and over again.

  • Context Window Limitations: The 8k token limit forced me to be creative with context management. Instead of storing all the information in the prompt, fetching only the relevant chunks from the vector database let me design a data structure that gives the LLM exactly the information it needs to respond.
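
One way to implement that budget-aware context management is to pack retrieved chunks in relevance order until an approximate token budget is hit. The 4-characters-per-token estimate below is a rough heuristic, not the model's real tokenizer.

```python
def pack_context(chunks: list[str], max_tokens: int = 8_000) -> list[str]:
    """Keep adding chunks (assumed sorted by similarity) until the budget is full."""
    packed, used = [], 0
    for chunk in chunks:
        est_tokens = max(1, len(chunk) // 4)  # crude chars-to-tokens estimate
        if used + est_tokens > max_tokens:
            break  # next chunk would overflow the context window
        packed.append(chunk)
        used += est_tokens
    return packed
```

In practice you would reserve part of the budget for the system prompt and conversation history before filling the rest with retrieved chunks.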

What's Next?

  • Using Multithreading to reduce latency.
  • Adding Multi-lingual Support.
  • Adding more bot types, like a 'Lead Bot' that calls the customer back to close the deal after the initial lead.
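
The multithreading idea can be sketched with a thread pool: independent pipeline steps, such as stage detection and vector retrieval, run concurrently instead of back to back. The lambdas below are toy stand-ins for the real calls.

```python
from concurrent.futures import ThreadPoolExecutor

def process_turn(text: str) -> tuple[str, list[str]]:
    # Run two independent steps of the pipeline in parallel threads
    with ThreadPoolExecutor(max_workers=2) as pool:
        stage_future = pool.submit(lambda: "pitch" if "price" in text else "greeting")
        ctx_future = pool.submit(lambda: [f"chunk for: {text}"])
        return stage_future.result(), ctx_future.result()
```

Since both stubs here are I/O-bound in the real system (API and database calls), threads overlap the network waits even under Python's GIL.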

Try It Yourself


GitHub

If you have suggestions regarding the system, please share them in the comments.
