Hey, I'm Matt — creator of Kollabe, a digital whiteboard tool used by teams for retrospectives and planning poker. Recently, I tackled a small but tricky issue that reminded me how real-time apps can often have hidden quirks. It was a classic race condition where under just the right circumstances, some users didn't show up on the board—even though we knew they were there.
Before I dive in, I want to emphasize that this scenario was seemingly pretty rare and quickly resolved. But it was an interesting puzzle that illustrates an important lesson in real-time app development. Let's unpack what happened, how we discovered it, and the simple fix we landed on.
The Race Condition 🏃🏃
In most sessions, everything works seamlessly: a user joins, the retrospective updates in real time, and life is good. But in a handful of cases, especially when multiple team members joined at the exact same moment, we saw a timing issue: a user's presence wasn't reflected on the board.
When Does This Happen?
1. User Opens the Retrospective: We fetch the board's initial state from our API.
2. Board Renders UI: The user sees the layout, existing cards, and other participants, based on the snapshot in the database.
3. WebSocket Connection Establishes: Real-time events start flowing.
If someone else joined during the tiny window between Step 1 and Step 3, the client never received the "user-joined" event (or any other event sent in that window), because its WebSocket was not listening yet. This rare timing gap meant the new arrival occasionally wouldn't appear on someone else's board.
Events 3 and 4 are missed because they are sent after the snapshot is captured, but before it is returned from the server and before the WebSocket connection is established.
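To make the gap concrete, here is a minimal TypeScript sketch that reproduces it with an in-memory stand-in for the server. `FakeServer` and `naiveJoin` are hypothetical names invented for illustration, not Kollabe's code, and the whole thing runs synchronously to keep the timing deterministic.

```typescript
// Hypothetical in-memory stand-in for the backend (illustration only).
type BoardEvent = { seq: number; type: "user-joined"; userId: string };

class FakeServer {
  private participants: string[] = [];
  private listeners: Array<(e: BoardEvent) => void> = [];
  private seq = 0;

  // REST-style snapshot: a copy of the state at the moment of the call.
  snapshot(): string[] {
    return this.participants.slice();
  }

  // WebSocket-style subscription: only events emitted after this call arrive.
  subscribe(listener: (e: BoardEvent) => void): void {
    this.listeners.push(listener);
  }

  join(userId: string): void {
    this.participants.push(userId);
    const joined: BoardEvent = { seq: ++this.seq, type: "user-joined", userId };
    for (const listener of this.listeners) listener(joined);
  }
}

// Fetch first, subscribe later, with other users joining in between.
function naiveJoin(server: FakeServer, joinsDuringGap: string[]): string[] {
  const board = server.snapshot();                      // Step 1: fetch state
  for (const id of joinsDuringGap) server.join(id);     // someone joins in the gap
  server.subscribe((e) => { board.push(e.userId); });   // Step 3: socket goes live
  return board;
}

const server = new FakeServer();
server.join("alice");
const board = naiveJoin(server, ["bob"]);
server.join("carol");
// board now holds alice and carol; bob joined in the gap and never shows up.
```

The server's own state contains all three users, so the bug is purely on the client: its view is built from a stale snapshot plus an event stream that started too late.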
Why Not Just Connect the WebSocket First? 🤔
A logical idea might be: "Let's open the WebSocket before fetching the board data, so we never miss a beat." In our scenario, however, we have other responsibilities:
- Permission Checks & Session Setup: We have to confirm a user is allowed to join and set up relevant session data.
- User Experience: Opening the WebSocket prematurely can add unnecessary overhead or even delay the moment the user sees the board. We want to show the UI as soon as possible (Step 2) instead of waiting on a WebSocket handshake.
Given these realities, we decided not to flip the order.
Our Simple, Reliable Fix 🔧
Instead, we built a small "event cache" that holds incoming WebSocket messages until the UI is ready to process them, and paired it with a second state fetch once the socket is live. The cost is fetching our state twice, but it keeps the user experience seamless, especially when our WebSocket connection has to reconnect.
- Initial Fetch: We start by fetching the board’s state as before, so users see a current snapshot right away.
- Connect WebSocket & Queue Events: Connect to the WebSocket and start adding incoming events to a queue instead of applying them right away.
- Second Fetch: After the WebSocket connection is established, we fetch the state again. This ensures we haven’t missed any updates that occurred between the first fetch and the WebSocket going live.
- Process the Queue: Finally, once the second fetch is done, we replay all queued messages in the correct order—making sure every participant’s actions are accurately reflected on the board.
By combining a second state fetch with queued event processing, we guarantee no updates slip through the cracks—regardless of timing.
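The four steps above can be sketched roughly as follows. This is a simplified illustration, not our production code: `EventCache`, `loadSnapshot`, `handle`, and `goLive` are hypothetical names, the fetches are represented by plain arrays, and the sketch assumes events are idempotent (adding a user twice is harmless), so replaying an event that the second snapshot already includes does no damage.

```typescript
// Hypothetical names throughout; a sketch of the queue-then-replay idea.
type BoardEvent = { type: "user-joined" | "user-left"; userId: string };

class EventCache {
  private queue: BoardEvent[] = [];
  private live = false;
  readonly participants = new Set<string>();

  // Replace local state with a snapshot returned by a state fetch.
  loadSnapshot(userIds: string[]): void {
    this.participants.clear();
    for (const id of userIds) this.participants.add(id);
  }

  // Wire this to the WebSocket message handler.
  handle(event: BoardEvent): void {
    if (this.live) this.apply(event);
    else this.queue.push(event); // not ready yet: hold the event
  }

  // Call once the second fetch has resolved.
  goLive(secondSnapshot: string[]): void {
    this.loadSnapshot(secondSnapshot);
    for (const event of this.queue) this.apply(event); // replay in arrival order
    this.queue = [];
    this.live = true;
  }

  private apply(event: BoardEvent): void {
    if (event.type === "user-joined") this.participants.add(event.userId);
    else this.participants.delete(event.userId);
  }
}

const cache = new EventCache();
cache.loadSnapshot(["alice"]);                          // initial fetch
// bob joins here, before the socket is connected: no event reaches us
cache.handle({ type: "user-joined", userId: "carol" }); // socket is up: queued
cache.goLive(["alice", "bob"]);                         // second fetch, then replay
cache.handle({ type: "user-joined", userId: "dave" });  // live: applied immediately
const visible = Array.from(cache.participants).sort();
```

If your events are not idempotent (counters, for example), replaying blindly could double-apply them. In that case you would likely need a version or sequence number on both the snapshot and the events so that stale ones can be skipped.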
The Ideal Solution 💡
In a perfect world, the sequence might look like this:
However, real-world constraints like permission checks and user experience considerations led us to our caching solution instead. Rather than trying to perfect the timing, we simply cache events until we're ready to process them.
Alternatives 👀
Our solution isn't perfect, and there are other approaches you could take.
Server-Side Message Replays
If your WebSocket solution supports replaying missed messages, you could drop the second state fetch entirely. Once the WebSocket is connected, the client requests the backlog from the server, which replays the events that occurred during the gap.
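As a rough sketch of what replay could look like, assume the server numbers every event and the state snapshot reports the last sequence number it includes. `ReplayLog` and its methods are hypothetical and do not correspond to any particular WebSocket library's API.

```typescript
// Hypothetical server-side log supporting "give me everything after seq N".
type SeqEvent = { seq: number; userId: string };

class ReplayLog {
  private log: SeqEvent[] = [];

  // Record an event and assign it the next sequence number.
  append(userId: string): SeqEvent {
    const entry: SeqEvent = { seq: this.log.length + 1, userId };
    this.log.push(entry);
    return entry;
  }

  // Sequence number of the newest event (0 when the log is empty).
  lastSeq(): number {
    return this.log.length;
  }

  // Backlog requested by a client that has seen everything up to `seq`.
  since(seq: number): SeqEvent[] {
    return this.log.filter((e) => e.seq > seq);
  }
}

const log = new ReplayLog();
log.append("alice");
const snapshotSeq = log.lastSeq();    // the snapshot covers events up to here
log.append("bob");                    // these two happen during the gap
log.append("carol");
const missed = log.since(snapshotSeq); // requested once the socket is connected
```

The client applies `missed` on top of its snapshot and then continues with live events, so no second fetch is needed.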
Short-Lived Event Buffer
Another option is to store events on the server in a short-lived queue. When a client connects (or reconnects), it automatically receives any queued events since its last known state. This shifts the responsibility from the client to the server, requiring additional server resources but potentially simplifying client logic.
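A short-lived buffer might look something like the sketch below. `ShortLivedBuffer`, the five-second lifetime, and the explicit `now` parameter (used instead of a real clock so the example is deterministic) are all assumptions made for illustration.

```typescript
// Hypothetical server-side buffer that forgets events older than ttlMs.
type BufferedEvent = { seq: number; at: number; payload: string };

class ShortLivedBuffer {
  private events: BufferedEvent[] = [];
  private nextSeq = 1;
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Store an event with its arrival time and return its sequence number.
  push(payload: string, now: number): number {
    this.prune(now);
    const seq = this.nextSeq++;
    this.events.push({ seq, at: now, payload });
    return seq;
  }

  // Unexpired events newer than the client's last known sequence number.
  eventsSince(lastSeenSeq: number, now: number): BufferedEvent[] {
    this.prune(now);
    return this.events.filter((e) => e.seq > lastSeenSeq);
  }

  // Drop everything older than the configured lifetime.
  private prune(now: number): void {
    this.events = this.events.filter((e) => now - e.at <= this.ttlMs);
  }
}

const buffer = new ShortLivedBuffer(5000);
buffer.push("alice joined", 0);
buffer.push("bob joined", 4000);
// A client connecting at t=6000 having seen nothing: the first event has expired.
const recent = buffer.eventsSince(0, 6000);
```

The lifetime is the trade-off to watch: a client whose gap is longer than the buffer's lifetime still needs a full state fetch as a fallback.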
We chose not to use these approaches because they introduce additional complexity and overhead on the server side. In our current architecture, a second state fetch is a simpler, more straightforward solution that builds on existing REST endpoints and keeps most of the real-time logic in the client.
Key Takeaways 🔑
Since implementing the event cache, we've seen:
- Fewer Support Tickets: The rare "invisible participant" scenario vanished.
- Better Reliability: Even if someone's connection hiccups, we know the events will queue up until the app is ready again.
- Peace of Mind: We no longer worry about out-of-sync state.
It's a subtle improvement, but in a real-time product used by thousands of teams, these small fixes add up to a more polished experience.
Wrapping Up 🎁
Real-time collaboration is powerful, but it's also full of hidden details that can catch you off guard. Our event cache solution was quick to implement and ensured we never lose events—even in that tiny timing gap when a user joins.
If you're looking to run slick retrospectives or planning poker sessions (without invisible participants!), check out Kollabe. It's a robust, production-tested platform that's constantly evolving to create the best real-time collaboration experience possible.
Thanks for reading!