This is a submission for the AssemblyAI Challenge: No More Monkey Business.
What I Built
I developed EchoSense, a portable hardware device that captures spoken content in settings like meetings, classes, brainstorming sessions, and conferences. It features a web interface that shows real-time transcriptions of everything picked up by its microphone. Users can ask questions about the discussion or generate summaries on the fly, making it an invaluable tool for live events.
The device operates on a modest 240 MHz SoC with 4 MB of RAM. It’s lightweight, efficient, and can run on a tiny lithium battery, making it highly portable.
Tech Used
- Vue, TypeScript, shadcn/ui
- ESP32, Rust, Espressif IoT Development Framework (IDF)
- WebSocket, SendGrid, AssemblyAI
Demo
Since this is a hardware device, providing a link to a demo isn’t feasible. However, I’ve recorded a video showcasing it in action, along with instructions on how to build one yourself.
Here is the GitHub repository with the source code:
milewski / echosense-challenge
Portable device for real-time audio transcription and interactive summaries.
This is the main repository for my submission to the AssemblyAI Challenge.
- Esp32: The firmware source code for the ESP32 device.
- Frontend: The UI that communicates with the device via WebSocket.
Each subfolder includes instructions for running the project locally.
Screenshots
Journey
When powered on, the device automatically connects to the configured Wi-Fi network and requests a temporary token from AssemblyAI, valid for one hour. It establishes a real-time transcription WebSocket connection and generates a local network URL, displayed as a QR code on the OLED screen.
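To make the token step concrete, here is a minimal sketch of that request, written as a host-side Rust program using the blocking `reqwest` client rather than the esp-idf HTTP client the firmware actually uses. The endpoint and one-hour expiry follow AssemblyAI's real-time token API; the function name is mine.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct TokenResponse {
    token: String,
}

fn fetch_temporary_token(api_key: &str) -> Result<String, Box<dyn std::error::Error>> {
    // POST /v2/realtime/token exchanges the permanent API key for a
    // short-lived token the device can safely embed in the WebSocket URL.
    let response: TokenResponse = reqwest::blocking::Client::new()
        .post("https://api.assemblyai.com/v2/realtime/token")
        .header("authorization", api_key)
        .json(&serde_json::json!({ "expires_in": 3600 })) // valid for one hour
        .send()?
        .error_for_status()?
        .json()?;

    Ok(response.token)
}
```

Keeping the permanent API key on the device and handing browsers only the temporary token means a scanned QR code never exposes the real credentials.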
The QR code directs users to the device’s IP address, where a web server runs on port 80. The server hosts a Vue.js-based interface, with all assets (CSS, JS, images) inlined into a single minified and mangled HTML file.
This optimization ensures minimal memory usage—essential in a resource-constrained environment where every byte counts.
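As a rough sketch of the single-file approach: the entire UI is baked into the binary at compile time, and every request receives the same page. A plain `TcpListener` stands in here for the device's actual port-80 server, and the asset path is hypothetical.

```rust
use std::io::Write;
use std::net::TcpListener;

// All CSS, JS, and images are inlined into this single minified HTML
// file at build time (path is illustrative).
const INDEX_HTML: &str = include_str!("../frontend/dist/index.html");

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:80")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // Every request gets the same single-page app; no filesystem
        // lookups, no per-asset buffers.
        write!(
            stream,
            "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: {}\r\n\r\n{}",
            INDEX_HTML.len(),
            INDEX_HTML
        )?;
    }
    Ok(())
}
```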
As the user speaks, audio is streamed in ~500 ms chunks, sampled at 16 kHz in PCM16 format, via the WebSocket connection to AssemblyAI. Transcriptions are returned and displayed live to any user who scans the QR code. Simultaneously, the audio is saved locally on the device’s SD card for further use.
The following diagram illustrates this functionality:
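For a concrete picture of the streaming loop, here is a host-side sketch using the `tungstenite` crate; the firmware's WebSocket client is different, but the endpoint and chunking are the same idea. The `read_chunk` callback is a hypothetical stand-in for the I2S microphone driver.

```rust
use tungstenite::{connect, Message};

const SAMPLE_RATE: usize = 16_000;
// 500 ms of 16-bit mono samples: 16_000 samples/s * 0.5 s * 2 bytes.
const CHUNK_BYTES: usize = SAMPLE_RATE / 2 * 2;

fn stream_audio(token: &str, mut read_chunk: impl FnMut(&mut [u8]) -> bool) {
    // The temporary token goes straight into the connection URL.
    let url = format!(
        "wss://api.assemblyai.com/v2/realtime/ws?sample_rate={SAMPLE_RATE}&token={token}"
    );
    let (mut socket, _response) = connect(url).expect("failed to connect");

    let mut buffer = vec![0u8; CHUNK_BYTES];
    // `read_chunk` fills the buffer with raw PCM16 audio and returns
    // false once the microphone stops producing data.
    while read_chunk(&mut buffer) {
        // Each ~500 ms chunk is sent as one binary WebSocket frame.
        socket
            .send(Message::Binary(buffer.clone().into()))
            .expect("failed to send audio chunk");
    }
}
```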
Prompts Qualification
My submission qualifies for two prompts:
- Really Rad Real-Time
- No More Monkey Business
Incomplete features
The SD card was initially intended to store recordings that would later be attached to emails. However, I realized that the files would grow too large, exceeding email attachment limits; working around this would require a backend to receive the recordings and convert them from raw PCM16 to MP3. Since building and hosting a backend wasn't the main focus of the challenge, I left this feature unfinished.
Currently, there’s no way to configure Wi-Fi, API keys, or recording options via the web UI; all keys are injected at build time (a minimal sketch of this follows). Ideally, users would set up the device via a local Wi-Fi connection between their phone and the device, but this setup would require additional work.
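As an illustration of the build-time injection, Rust's `env!` macro bakes environment variables into the binary at compile time. The variable names below are hypothetical, not necessarily the ones the repository uses.

```rust
// Build-time configuration: `env!` reads the variable when the crate
// is compiled, not when the device boots. Names are illustrative.
const WIFI_SSID: &str = env!("WIFI_SSID");
const WIFI_PASSWORD: &str = env!("WIFI_PASSWORD");
const ASSEMBLYAI_API_KEY: &str = env!("ASSEMBLYAI_API_KEY");
```

A nice side effect is that compilation fails outright if a variable is missing, so a misconfigured build can never boot with empty credentials.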
I had planned to design and 3D print a case, possibly as a cube, to align with names like MeetingBox or MetaCube. Unfortunately, I didn’t have time to complete this, so the prototype was built and presented on a breadboard.
If anyone has any questions, feel free to ask below or open an issue on GitHub, and I’ll be happy to help!