Devesh Bhushan

Building an AI Sales Agent: From Voice to Pitch.

The Story Behind It

When I signed up for EnCode 2025, the problem presented to me was to build an agent capable of delivering high-quality, natural-sounding voice interactions: it should feel like talking to a real person, with very low latency.

So I built a system that can handle complete sales conversations for online coaching centers, from greeting potential customers to understanding their needs and pitching relevant courses, all in positive, human-like speech. Think of it as a tireless sales agent that never has a bad day!

Technical Stack

Speech Processing: Whisper Large V3 Turbo for crystal-clear understanding.
Brain Power: LLaMA 3.3 70B for intelligent conversations.
Voice Output: F5 TTS for natural-sounding responses.
Memory: Pinecone vector database for context and information retrieval.
Demo Platform: Google Colab

How It All Works

At a high level, the flow passes through three main systems:

  • Speech to Text (STT)

  • Large Language Model (LLM)

  • Text to Speech (TTS)

So, the loop is:

User → STT → LLM → TTS → User

In detail:

  1. Customer speaks β†’ Whisper transcribes.
  2. Stage Manager (using regex) tracks conversation phase.
  3. Pinecone retrieves the most relevant chunks from the vector database.
  4. LLaMA 3.3 70B crafts the perfect response.
  5. F5 TTS brings it to life with natural speech.
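
The five steps above can be sketched as a single turn-handling function. Everything here is a stub: the real system wraps Whisper, the regex Stage Manager, Pinecone, LLaMA 3.3 70B, and F5 TTS behind these calls, and the function names are illustrative, not the project's actual API.

```python
def transcribe(audio_chunk: bytes) -> str:
    # Stand-in for Whisper Large V3 Turbo speech-to-text
    return audio_chunk.decode("utf-8")

def detect_stage(text: str, state: dict) -> str:
    # Stand-in for the regex-based Stage Manager
    return "greeting" if "hello" in text.lower() else state.get("stage", "pitch")

def retrieve_context(text: str) -> list[str]:
    # Stand-in for a Pinecone similarity search
    return ["course catalog snippet"]

def generate_reply(text: str, context: list[str], state: dict) -> str:
    # Stand-in for the LLaMA 3.3 70B call
    return f"[{state['stage']}] reply using {len(context)} context chunk(s)"

def synthesize(reply: str) -> bytes:
    # Stand-in for F5 TTS
    return reply.encode("utf-8")

def handle_turn(audio_chunk: bytes, state: dict) -> bytes:
    text = transcribe(audio_chunk)                # 1. customer speaks -> STT
    state["stage"] = detect_stage(text, state)    # 2. track conversation phase
    context = retrieve_context(text)              # 3. fetch relevant chunks
    reply = generate_reply(text, context, state)  # 4. craft the response
    return synthesize(reply)                      # 5. speak it back
```

Keeping each step behind its own function is also what makes the modular debugging mentioned later possible.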

Cool Features You Might Like

  • Smart Voice Selection: 6 different AI voices (2 male, 4 female)
  • Context-Aware Responses: Thanks to vector similarity search
  • Structured Conversation Flow: Managed by a dedicated Stage Manager

Real Talk: Current Limitations

  • Demo runs on Colab.
  • Memory constraints with 8k token limit.
  • Heavy on computational resources.
  • API dependency for core functions.
  • Latency is quite high.

What I Learned

Technical Insights

  • Use of Vector Databases: Working with Pinecone showed me how vector DBs can be game-changers when you have a limited context window. The ability to perform similarity searches over conversation history and training materials in milliseconds is very powerful.
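
The core idea behind that similarity search can be shown with a toy in-memory version: embed the query, score it against stored vectors by cosine similarity, and return the top-k chunks. The real system would call Pinecone's hosted index instead; the vectors and chunk texts here are made up.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # store holds (chunk_text, embedding) pairs; return the k closest chunks
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Pinecone does the same ranking server-side over millions of vectors, which is why it stays fast even as the knowledge base grows.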

  • Stage Management Matters: Once you know which stage the conversation is in, it becomes much easier to inject stage-specific examples into the prompt, such as how to deliver a pitch or what kinds of questions to ask.

  • Web Integration: The most critical part of all of this is passing data between the frontend and backend effectively using FastAPI. By using WebSockets, we could pass data back and forth while initialising the AI only once per call and keeping the connection open for the whole conversation.

System Design Lessons

  • Chunking is Crucial: Processing audio in 5-second chunks rather than waiting for complete utterances significantly improved the user experience by cutting processing delay. It's all about finding the sweet spot between accuracy and speed.
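
Fixed-size chunking is simple to sketch. The 5-second figure comes from the post; the 16 kHz mono 16-bit PCM format is an assumption for illustration.

```python
SAMPLE_RATE = 16_000       # assumed: 16 kHz mono
BYTES_PER_SAMPLE = 2       # assumed: 16-bit PCM
CHUNK_SECONDS = 5
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS  # 160,000 bytes

def chunk_audio(pcm: bytes):
    """Yield 5-second chunks so STT can start before the utterance ends."""
    for offset in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[offset:offset + CHUNK_BYTES]
```

Smaller chunks cut latency but risk splitting words mid-utterance, which is exactly the accuracy/speed trade-off mentioned above.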

  • Modular Architecture Wins: Breaking the system into distinct services (STT, LLM, TTS) made development and debugging so much easier. When something went wrong, I could quickly pinpoint which part needed fixing.

Real-World Constraints

  • API Economics: Managing multiple API calls (Whisper, LLaMA) taught me the importance of optimizing API usage. Staying fast while keeping the number of API calls to a minimum is very challenging.

  • Reducing Latency: Latency is very hard to reduce when you are constantly fetching and processing data over the internet. In the future I will try to minimize the number of round trips needed to upload or download data.

Unexpected Challenges

  • Prompt Engineering: Prompt engineering is crucial and can decide whether your model is as coherent as a human or repeats the same sentence over and over again.

  • Context Window Limitations: The 8k token limit forced me to be creative with context management. Instead of storing all the information in the prompt, fetching only the relevant chunks from the vector database let me design a data structure that gives the LLM exactly the information it needs to respond.
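
One way to implement that budget-aware context management is to pack retrieved chunks in relevance order until an approximate token budget is hit. The 4-characters-per-token estimate below is a rough heuristic, not the model's real tokenizer.

```python
def pack_context(chunks: list[str], max_tokens: int = 8_000) -> list[str]:
    """Keep adding chunks (assumed sorted by similarity) until the budget is full."""
    packed, used = [], 0
    for chunk in chunks:
        est_tokens = max(1, len(chunk) // 4)  # crude chars-to-tokens estimate
        if used + est_tokens > max_tokens:
            break  # next chunk would overflow the context window
        packed.append(chunk)
        used += est_tokens
    return packed
```

In practice you would reserve part of the budget for the system prompt and conversation history before filling the rest with retrieved chunks.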

What's Next?

  • Using Multithreading to reduce latency.
  • Adding Multi-lingual Support.
  • Adding more bot types, like a 'Lead Bot' that calls the customer back to close the deal after the initial lead.
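
The multithreading idea can be sketched with a thread pool: independent pipeline steps, such as stage detection and vector retrieval, run concurrently instead of back to back. The lambdas below are toy stand-ins for the real calls.

```python
from concurrent.futures import ThreadPoolExecutor

def process_turn(text: str) -> tuple[str, list[str]]:
    # Run two independent steps of the pipeline in parallel threads
    with ThreadPoolExecutor(max_workers=2) as pool:
        stage_future = pool.submit(lambda: "pitch" if "price" in text else "greeting")
        ctx_future = pool.submit(lambda: [f"chunk for: {text}"])
        return stage_future.result(), ctx_future.result()
```

Since both stubs here are I/O-bound in the real system (API and database calls), threads overlap the network waits even under Python's GIL.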

Try It Yourself


GitHub

If you have suggestions regarding the system, please share them in the comments.
