DEV Community

Haseeb Arshad


Sayings: Bringing Social Media to Life Through Voice

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

In the digital age, social media has become the go-to place for people to share their thoughts and connect with others. However, traditional social media platforms often lack the depth of human interaction: text lacks tone, emotion, and the nuances of voice. To bridge this gap, I built Sayings., a social media application centered on voice interactions.

Sayings. is designed to make online interactions more personal and expressive by allowing users to post voice messages instead of text. Users can record their voice, and the app transcribes the audio, detects emotions, and extracts topics from the speech. The goal is to create a platform where users can truly understand each other by hearing the tone, emotion, and expression in each other's voices.

Demo

Login screen of Sayings.me

Signup screen of Sayings.me

Home page of Sayings.me displaying the timeline and top topics sidebar

Playing audio of filtered posts on the home page

Recording audio after pressing the audio button

Personalized user profile with personality traits, a profile review, and voice posts

Journey

Inspiration

The inspiration behind Sayings. stemmed from the realization that text-based communication often fails to convey the full spectrum of human emotions. I wanted to create a platform that brings the warmth and authenticity of real-life conversations to the digital and social world.

Technology Stack

To build Sayings., I used the following technologies:

  • Frontend:
    • Next.js for server-side rendering
    • React for building interactive UI components
  • Backend:
    • Node.js and Express.js for handling API requests and server-side logic
  • Database:
    • MongoDB with the Mongoose ODM (object data modeling library) for schemas and data modeling
  • Authentication:
    • JSON Web Tokens (JWT) for authenticating requests and bcrypt for password hashing
  • Speech-to-Text and AI:
    • AssemblyAI's Universal-2 Speech-to-Text API for transcribing user voice posts into text with high accuracy
    • Hume AI's Emotion Detection API for detecting emotions from voice inputs
    • Grok's API for generating personality insights and profile reviews based on user interactions
  • File Storage:
    • Pinata's IPFS API for decentralized and secure audio file storage
  • Security:
    • Rate limiting, CORS policies, and input validation to harden the API against abuse and malformed requests
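The post doesn't say how rate limiting is configured; in an Express backend it is commonly handled by middleware such as `express-rate-limit`. As a hypothetical illustration of the underlying idea (not the Sayings. code), here is a fixed-window limiter: each client key gets a request counter that resets once its window expires. The class name, the limit, and the window length are all assumptions.

```typescript
// Hypothetical fixed-window rate limiter; not taken from the Sayings. codebase.
class FixedWindowLimiter {
  private readonly limit: number;    // max requests per window
  private readonly windowMs: number; // window length in milliseconds
  // Per-client state: when the current window started and how many requests it has seen.
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Returns true if the request may proceed, false if the client is over its limit. */
  allow(clientKey: string, nowMs: number = Date.now()): boolean {
    const w = this.windows.get(clientKey);
    // First request from this client, or its previous window has expired: open a new window.
    if (w === undefined || nowMs - w.start >= this.windowMs) {
      this.windows.set(clientKey, { start: nowMs, count: 1 });
      return true;
    }
    if (w.count < this.limit) {
      w.count += 1;
      return true;
    }
    return false;
  }
}

// Example: at most 3 requests per minute per client IP.
const limiter = new FixedWindowLimiter(3, 60_000);
const results = [0, 1_000, 2_000, 3_000].map((t) => limiter.allow("203.0.113.7", t));
console.log(results.join(", ")); // true, true, true, false
console.log(limiter.allow("203.0.113.7", 61_000)); // true: a new window has started
```

A fixed window is the simplest variant, but it lets a client send up to twice the limit in a short burst straddling a window boundary; a sliding window or token bucket smooths that out at the cost of a little more per-client state.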

Architecture Overview

The application is divided into several key components:

  1. User Authentication and Profile Management:

    Users can create profiles, log in securely, and manage their personal information. The authentication system ensures user data privacy and security.

  2. Voice Posting and Transcription:

    Users can record voice messages directly in the app. These audio files are sent to the backend, where they are processed using AssemblyAI's Universal-2 Speech-to-Text API to generate accurate transcriptions.

  3. Emotion and Topic Detection:

    The transcribed text and audio are analyzed using Hume AI's Emotion Detection API to identify the emotions expressed. Additionally, topics are extracted to categorize the content, which feeds into trending topics and personalized feeds.

  4. Personality Insights:

    Using data from user interactions and posts, Grok's API generates personality insights and profile reviews, providing users with a deeper understanding of themselves and others.

  5. Feed and Interaction:

    Users can interact with posts by liking, commenting, and sharing. The feed is personalized based on user interests, trending topics, and interactions.
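The transcription step described above follows AssemblyAI's asynchronous REST pattern: create a transcript job, then poll it until it reaches a terminal status. Below is a minimal sketch of that flow, not the app's actual code. It assumes the recording is already reachable at a public URL (for example a Pinata IPFS gateway link) and takes `fetchFn` and `sleep` as parameters so it can run without network access; `transcribe`, `FetchLike`, and the other type names are hypothetical. It does not set a speech model explicitly, so check AssemblyAI's documentation for how Universal-2 is selected.

```typescript
// Sketch only: submit an audio URL to AssemblyAI, then poll until the transcript is ready.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface Word { text: string; start: number; end: number; confidence: number }
interface Transcript { id: string; status: string; text?: string; words?: Word[]; error?: string }

const API_BASE = "https://api.assemblyai.com/v2";

async function transcribe(
  audioUrl: string,
  apiKey: string,
  fetchFn: FetchLike,
  pollMs = 3000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Transcript> {
  const headers = { authorization: apiKey, "content-type": "application/json" };

  // 1. Create the transcription job from a publicly reachable audio URL.
  const created = await fetchFn(`${API_BASE}/transcript`, {
    method: "POST",
    headers,
    body: JSON.stringify({ audio_url: audioUrl }),
  });
  if (!created.ok) throw new Error(`submit failed: HTTP ${created.status}`);
  const job = (await created.json()) as Transcript;

  // 2. Poll until the job is "completed" or "error" (it starts as "queued", then "processing").
  while (true) {
    const res = await fetchFn(`${API_BASE}/transcript/${job.id}`, { headers });
    if (!res.ok) throw new Error(`poll failed: HTTP ${res.status}`);
    const transcript = (await res.json()) as Transcript;
    if (transcript.status === "completed") return transcript;
    if (transcript.status === "error") throw new Error(`transcription failed: ${transcript.error}`);
    await sleep(pollMs);
  }
}
```

On Node 18 or later, the built-in `fetch` can be passed as `fetchFn`, with the API key read from an environment variable. AssemblyAI also publishes an official Node SDK that wraps these same steps.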

Incorporating AssemblyAI's Universal-2 Speech-to-Text Model

AssemblyAI's Universal-2 Speech-to-Text model is at the heart of Sayings. It provides:

  • High Accuracy Transcriptions:

    Ensuring that voice posts are accurately converted to text, capturing the nuances of speech.

  • Timestamp Generation:

    Allows synchronization of transcriptions with audio playback, enabling features like highlighting words as they are spoken.

  • Proper Noun Recognition and Formatting:

    Preserves the integrity of names, places, and other proper nouns, enhancing readability and comprehension.

By leveraging Universal-2, users can trust that their spoken words are faithfully represented in text, which is crucial for features like topic detection and sentiment analysis.
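The timestamp feature comes from the word-level timings in a completed transcript (a `words` array whose entries carry `text`, `start`, and `end` in milliseconds). A hypothetical helper for highlighting the word currently being spoken could binary-search those timings on each playback update; `activeWordIndex` and `TimedWord` are illustrative names, not part of any SDK.

```typescript
// Hypothetical helper: find which transcript word is being spoken at a playback position.
// Assumes words are sorted by start time and use millisecond timestamps.
interface TimedWord { text: string; start: number; end: number }

/** Returns the index of the word spoken at `timeMs`, or -1 if none (e.g. during a pause). */
function activeWordIndex(words: TimedWord[], timeMs: number): number {
  let lo = 0;
  let hi = words.length - 1;
  // Binary search: O(log n) per call, cheap enough to run on every playback update.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = words[mid];
    if (timeMs < w.start) hi = mid - 1;
    else if (timeMs >= w.end) lo = mid + 1;
    else return mid; // w.start <= timeMs < w.end
  }
  return -1;
}

const words: TimedWord[] = [
  { text: "Hello", start: 0, end: 400 },
  { text: "from", start: 450, end: 700 },
  { text: "Sayings", start: 720, end: 1300 },
];
console.log(activeWordIndex(words, 500)); // 1
console.log(activeWordIndex(words, 420)); // -1 (gap between words)
```

In the browser, `HTMLMediaElement.currentTime` is reported in seconds, so a `timeupdate` handler would call `activeWordIndex(words, audio.currentTime * 1000)` and highlight the returned word.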

Architecture diagram of the Sayings.me project

Key Features

  1. Sophisticated Speech-to-Text

    Sayings. showcases the power of AssemblyAI's Universal-2 model by:

    • Implementing advanced transcription features, including punctuation, capitalization, and formatting.
    • Utilizing timestamps for word-level synchronization with audio playback.
    • Extracting key topics and sentiments from transcriptions for personalized content delivery.
  2. No More Monkey Business

    By integrating LeMUR, the application elevates user interactions through:

    • Semantic Search: Users can search for posts based on content, topics, or emotions, thanks to the deep understanding provided by LeMUR.
    • Insight Extraction: Generates summaries and key insights from voice posts, enriching the user experience.
    • Personalized Recommendations: LeMUR's capabilities enable more accurate content recommendations based on user preferences and behaviors.

Note: Support for the challenge's real-time streaming prompt is planned but not yet implemented in the current version.
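Of the LeMUR features listed, insight extraction maps most directly onto a single call to LeMUR's task endpoint, which runs a free-form prompt over one or more existing transcripts. The sketch below is an illustration under assumptions rather than the app's code: `extractInsights` and the prompt text are hypothetical, model selection is left out (check AssemblyAI's LeMUR reference for the fields your use case needs), and `fetchFn` is injected so the call can be tested offline.

```typescript
// Sketch only: endpoint per AssemblyAI's LeMUR task API; prompt and names are hypothetical.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const LEMUR_TASK_URL = "https://api.assemblyai.com/lemur/v3/generate/task";

/** Runs a free-form LeMUR prompt over one transcript and returns the generated text. */
async function extractInsights(
  transcriptId: string,
  apiKey: string,
  fetchFn: FetchLike,
): Promise<string> {
  const res = await fetchFn(LEMUR_TASK_URL, {
    method: "POST",
    headers: { authorization: apiKey, "content-type": "application/json" },
    body: JSON.stringify({
      transcript_ids: [transcriptId],
      // Hypothetical prompt; adjust to what the feed or profile page should display.
      prompt: "Summarize this voice post in one sentence and list up to three key topics.",
    }),
  });
  if (!res.ok) throw new Error(`LeMUR request failed: HTTP ${res.status}`);
  // The generated text is returned in the `response` field.
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

Because LeMUR works on transcript IDs, the ID returned by the transcription step can be reused here without re-uploading the audio.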

Challenges and Solutions

  • Real-Time Audio Processing:

    Ensuring minimal latency between recording and transcription required efficient handling of API calls and data processing.

  • Emotion Detection Accuracy:

    Fine-tuning the integration with Hume AI's API to accurately detect and reflect user emotions.

  • User Privacy and Security:

    Implementing robust authentication, data encryption, and secure storage practices to protect user data.
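The post doesn't describe how Hume AI's output becomes the emotions shown on a post. One simple approach, sketched here with a simplified, assumed input shape (one list of `{ name, score }` pairs per speech segment, loosely modeled on Hume's prosody predictions), is to average each emotion across segments and keep the strongest few. `topEmotions`, `EmotionScore`, and `Segment` are hypothetical names.

```typescript
// Hypothetical aggregation of per-segment emotion scores; the input shape is a simplified assumption.
interface EmotionScore { name: string; score: number }
type Segment = EmotionScore[]; // emotion scores predicted for one stretch of speech

/** Averages each emotion over all segments and returns the `k` strongest, highest first. */
function topEmotions(segments: Segment[], k = 3): EmotionScore[] {
  const totals = new Map<string, number>();
  for (const segment of segments) {
    for (const { name, score } of segment) {
      totals.set(name, (totals.get(name) ?? 0) + score);
    }
  }
  // Divide by the segment count, so an emotion absent from a segment counts as 0 there.
  const n = Math.max(segments.length, 1); // guard against empty input
  return Array.from(totals, ([name, total]) => ({ name, score: total / n }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const segments: Segment[] = [
  [{ name: "Joy", score: 0.8 }, { name: "Calmness", score: 0.4 }, { name: "Anger", score: 0.1 }],
  [{ name: "Joy", score: 0.6 }, { name: "Calmness", score: 0.6 }, { name: "Anger", score: 0.1 }],
];
console.log(topEmotions(segments, 2).map((e) => e.name).join(", ")); // Joy, Calmness
```

Averaging damps a single noisy segment; weighting each segment by its duration would be a natural refinement for posts with long pauses.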

Outcomes and Benefits

Sayings. aims to revolutionize social media by:

  • Enhancing Communication:

    Bringing back the emotional depth of conversations that text alone cannot provide.

  • Fostering Understanding:

    Allowing users to hear the tone and emotion behind messages, leading to better empathy and connections.

  • Personal Growth:

    Providing personality insights and feedback to help users understand themselves and others better.

Building Sayings. has been an exciting journey. By integrating advanced AI technologies and focusing on voice interactions, I've created a platform that brings a new dimension to social media. While the project is still in its early phases, I'm committed to refining and improving it based on user feedback. I built it entirely on my own, without any team members.

You can try out Sayings. at sayings.me. Please note that the backend is not yet fully optimized; updates to improve performance and user experience are underway.

Thank you for your understanding and for the opportunity to participate in this challenge.

Thanks for reading, and I look forward to your feedback!
