Design a Ticket Booking Site
When developing distributed systems, it’s essential to design with a clear focus on several critical goals: supporting resource sharing, ensuring transparency in distribution, maintaining openness, achieving scalability, and avoiding common pitfalls that could compromise system performance. These goals are particularly important in high-demand applications like a Ticket Booking Web Application, where users expect seamless experiences despite varying loads and complex operations.
In this case study, I will explore how I addressed these design goals in the development of a ticket booking platform. The system not only supports the basic functionalities of viewing and booking events but also meets stringent non-functional requirements like high availability, low latency, and scalability under extreme conditions. Through a combination of innovative solutions and industry best practices, I ensured the application could handle millions of users during peak times, maintain consistent data, and deliver fast search results, all while safeguarding against the challenges inherent in distributed system design.
**Functional requirements:**
- Users should be able to view events
- Users should be able to search for events
- Users should be able to book tickets to events
Below the line (out of scope for our discussion):
- Users should be able to view their booked events
- Admins or event coordinators should be able to add events
- Popular events should have dynamic pricing
**Non-functional requirements:**
- The system should prioritize availability for searching & viewing events, but should prioritize consistency for booking events (no double booking)
- The system should be scalable and able to handle high throughput in the form of popular events (10 million users, one event)
- The system should have low latency search (< 500ms)
- The system is read heavy, and thus needs to be able to support high read throughput (100:1)
Below the line (out of scope for our discussion):
- The system should protect user data and adhere to GDPR
- The system should be fault tolerant
- The system should provide secure transactions for purchases
- The system should be well tested and easy to deploy (CI/CD pipelines)
- The system should have regular backups
The functional requirements describe what users should be able to do, while the non-functional requirements define the system's qualities and how it should behave. Our application has clients that communicate with our servers through an API gateway to perform operations like search, event CRUD, booking, and Stripe payments. Our servers in turn interact with our database.

Let's take one of the functional components: search. In our initial design, searching for anything means traversing the entire events table in our database. The execution time of this kind of scan grows linearly with the number of rows in the table, which leads to delays.
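To make the cost of that baseline concrete, here is a minimal sketch of the scan-based approach, assuming a hypothetical relational `events` table with `id`, `name`, and `venue` columns (SQLite is used only to keep the example self-contained):

```python
import sqlite3

# Hypothetical schema: events(id, name, venue, performer, event_date)
conn = sqlite3.connect("tickets.db")

def naive_search(term: str) -> list[tuple]:
    """Full-table scan: the leading wildcard prevents index use, so the
    database examines every row and latency grows with the number of events."""
    cur = conn.execute(
        "SELECT id, name, venue FROM events WHERE name LIKE ?",
        (f"%{term}%",),
    )
    return cur.fetchall()

# Example: scans the whole events table even for a single popular query.
# results = naive_search("Westlife")
```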
**A better solution would be to query a search engine like Elasticsearch**, which builds an inverted index to make searching documents by term much faster. For example, when a document is indexed, its text is tokenized into terms, and each term is mapped, in a hashmap of sorts, to the documents (events) in which it appears. For instance, "Westlife": [event1, event2 …] or "playoffs": [event3, event4 …]. This approach makes searching for any term and retrieving results super fast.
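As an illustration, here is a minimal sketch using the 8.x Python Elasticsearch client, assuming a hypothetical `events` index and a locally reachable cluster; the field names and documents are illustrative only:

```python
from elasticsearch import Elasticsearch

# Hypothetical cluster URL and index name.
es = Elasticsearch("http://localhost:9200")

# Index an event document; Elasticsearch tokenizes the text fields and stores
# each term -> [document ids] mapping in an inverted index.
es.index(index="events", id="event1", document={
    "name": "Westlife Reunion Tour",
    "venue": "Croke Park",
    "performer": "Westlife",
})

# Term lookups hit the inverted index instead of scanning every document.
resp = es.search(index="events", query={"match": {"name": "Westlife"}})
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["name"])
```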
Now the issue arises: how do we send data to our database and Elasticsearch while ensuring that the data remains consistent, available, and fault-tolerant at all times? One approach is to send the data synchronously to the two databases simultaneously, but the problem with this is that if one system fails completely, we can no longer persist data, or worse, even if we do persist our data, we may end up with inconsistent data. A better approach is to use Change Data Capture (CDC), which captures changes in our primary data store and puts them onto a stream to be consumed asynchronously. For this, we use Debezium as a CDC and Kafka as a streaming service.
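As a sketch of how the CDC side might be wired up, the following registers a hypothetical Debezium PostgreSQL connector with a Kafka Connect worker over its REST API; the hostnames, credentials, topic prefix, and table list are all assumptions for illustration:

```python
import requests

# The connector tails the primary database's write-ahead log and publishes
# each row change to a Kafka topic (e.g. "ticketing.public.events").
connector = {
    "name": "ticketing-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "primary-db",      # hypothetical host
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",           # placeholder credential
        "database.dbname": "ticketing",
        "topic.prefix": "ticketing",
        "table.include.list": "public.events,public.tickets",
    },
}

# Hypothetical Kafka Connect worker address.
resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
resp.raise_for_status()
```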
Kafka itself relies on a Raft-based consensus protocol, KRaft, for its cluster metadata: a leader (controller) replicates new log entries to its followers, much like Raft's AppendEntries RPCs drive each follower's state machine. The change events that Debezium captures from the primary database are appended to Kafka's durable, replicated topic log, and downstream consumers read from that log to update Elasticsearch. This ensures that, in the case of failovers, the change events are still available in the log to be consumed, maintaining data consistency and availability at all times.
With CDC and streaming, we also need to account for edge cases. For instance, if we have a surge of requests for a big event, we need to protect our servers. Several techniques help here:
- Data batching: instead of sending an individual request to Elasticsearch for each change, we accumulate a batch of changes and send them in a single bulk request. This reduces the number of HTTP requests, minimizes network overhead, and improves throughput.
- Filtering and deduplication: the consumer can filter out unnecessary changes or deduplicate events before they are batched.
- Rate limiting or throttling: the consumer service controls how fast data is sent to Elasticsearch to avoid overwhelming the cluster.
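A minimal consumer sketch combining batching, deduplication, and throttling might look like the following, assuming a hypothetical CDC topic, a local Elasticsearch cluster, and change events already flattened into plain documents with an `id` field:

```python
import json
import time
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, helpers

# Hypothetical topic name, index name, and hosts.
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "ticketing.public.events",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v),
    enable_auto_commit=False,
)

BATCH_SIZE = 500          # changes accumulated per bulk request
MAX_DOCS_PER_SEC = 2000   # crude rate limit to avoid overwhelming the cluster

while True:
    records = consumer.poll(timeout_ms=1000, max_records=BATCH_SIZE)
    batch = {}  # keyed by event id: later changes overwrite earlier ones (dedup)
    for messages in records.values():
        for msg in messages:
            change = msg.value  # assumed to be a flattened change event
            batch[change["id"]] = change
    if not batch:
        continue
    # One bulk request instead of one HTTP call per change.
    helpers.bulk(es, (
        {"_op_type": "index", "_index": "events", "_id": doc_id, "_source": doc}
        for doc_id, doc in batch.items()
    ))
    consumer.commit()
    time.sleep(len(batch) / MAX_DOCS_PER_SEC)  # throttle the write rate
```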
This search solution successfully meets our low-latency search requirement, keeping searches super fast even under load.
We can further improve search performance by deploying a Content Delivery Network (CDN) between our client and API gateway. The CDN caches API responses for a short time and serves them to users in the same geographical region during traffic surges. This is particularly useful during high-traffic periods, such as Black Friday sales or popular events: when many users search for the exact same term, the CDN can return cached results immediately, significantly speeding up the search. However, this approach is only effective when the users making the requests are geographically close to where the surge occurs.
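One way to make the search endpoint CDN-friendly is to return short-lived caching headers so an edge cache can answer repeated identical queries. A minimal FastAPI sketch, with a stubbed-out search helper standing in for the Elasticsearch query shown earlier; the route, TTLs, and helper are assumptions:

```python
from fastapi import FastAPI, Response

app = FastAPI()

def search_events(q: str) -> list[dict]:
    # Stand-in for the Elasticsearch query shown earlier.
    return []

@app.get("/events/search")
def search(q: str, response: Response):
    results = search_events(q)
    # A short, shared max-age lets a CDN serve repeated identical queries
    # (e.g. thousands of users searching the same headline act) from edge
    # caches close to the user instead of hitting the origin every time.
    response.headers["Cache-Control"] = "public, max-age=30, s-maxage=60"
    return {"query": q, "results": results}
```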
A downside to this approach is that it becomes less useful as the permutations of search queries increase. Additionally, if our system evolves to provide personalized recommendations, a CDN may not suffice. This is because, with a CDN, every user making the same API call receives the exact same result, which would not be appropriate for a recommendation system.
Let's take another part of our system: scalability during surges for popular events and how we handle ticket purchases. For really popular events, the availability a user sees can go stale quickly. One approach is to hold tickets for a short period once they have been reserved; we use a ticket lock for this. When new users want to view the remaining tickets, we fetch the tickets from our database, compare them against the ticket lock, remove the ones currently held, and return the rest as available. If a ticket's hold expires, it is removed from the ticket lock, and the next time availability is viewed it shows up as available again. We can also use a WebSocket or Server-Sent Events (SSE) connection to inform users in real time about tickets booked and the number remaining.
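A minimal sketch of such a ticket lock, assuming Redis as the lock store, a hypothetical `ticket_lock:<id>` key scheme, and a 10-minute hold:

```python
import redis

# Hypothetical Redis host and key scheme.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

HOLD_SECONDS = 600  # how long a reserved ticket stays locked (10 minutes)

def try_reserve(ticket_id: str, user_id: str) -> bool:
    """Place a short-lived hold on a ticket. SET NX only succeeds if no one
    else holds it; EX makes the hold expire automatically, so an abandoned
    checkout releases the ticket without any cleanup job."""
    return bool(r.set(f"ticket_lock:{ticket_id}", user_id, nx=True, ex=HOLD_SECONDS))

def available_tickets(event_tickets: list[str]) -> list[str]:
    """Filter the tickets fetched from the database against active holds."""
    return [t for t in event_tickets if not r.exists(f"ticket_lock:{t}")]

# Example:
# tickets_from_db = ["t1", "t2", "t3"]   # fetched from the primary database
# try_reserve("t1", "user42")            # -> True, t1 is now held
# available_tickets(tickets_from_db)     # -> ["t2", "t3"]
```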
To protect our servers and better balance the load on our servers and database during big occasions like Black Friday or, in our case, big events, we can set up a choke point between our API gateway and our servers, since in such scenarios they can easily be bombarded with requests. We use a queue built on Redis sorted sets, where incoming requests are held and released periodically, say 100 at a time, to the database. This keeps the write load on the database steady and reduces the need for extra replication just to absorb the spike. Of course, this choke point should be admin-enabled only for special occasions like big events; otherwise it would just delay requests on a normal day.
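A minimal sketch of this choke point, assuming a Redis sorted set keyed by arrival time and a hypothetical batch size of 100 requests per drain:

```python
import time
import redis

# Hypothetical Redis instance shared by the API gateway and the workers.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

QUEUE_KEY = "booking_queue"
DRAIN_BATCH = 100  # requests released toward the database per tick

def enqueue(request_id: str) -> None:
    """API gateway side: park the request, scored by arrival time (FIFO)."""
    r.zadd(QUEUE_KEY, {request_id: time.time()})

def drain_once() -> list[str]:
    """Worker side: atomically pop the oldest batch and forward it to the
    booking service, keeping the write load on the database roughly constant."""
    popped = r.zpopmin(QUEUE_KEY, DRAIN_BATCH)
    return [member for member, _score in popped]

# A scheduler (cron, Celery beat, etc.) would call drain_once() periodically,
# and an admin would enable this choke point only for big on-sales.
```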
Since reads far outnumber writes (roughly 100:1), we can reduce the read load on our database by caching data that rarely changes, such as event names, venues, and performers. It makes sense to cache these reads in a Redis store and invalidate or update the cache whenever an event name, venue, or performer changes. By doing this, we significantly reduce the load on our database while keeping these reads extremely fast. These are the techniques and design patterns I take into account as a software developer when designing systems, especially distributed systems, to ensure openness, scalability, resiliency, fault tolerance, and high availability of resources.
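To illustrate the cache-aside pattern described above, here is a minimal sketch with Redis; the key names, TTL, and stubbed database helpers are assumptions for the example:

```python
import json
import redis

# Hypothetical Redis cache in front of the events table.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 3600  # safety net; explicit invalidation is the primary mechanism

def get_event(event_id: str) -> dict:
    """Cache-aside read: event names, venues, and performers rarely change,
    so most reads are served straight from Redis."""
    cached = cache.get(f"event:{event_id}")
    if cached is not None:
        return json.loads(cached)
    event = load_event_from_db(event_id)
    cache.set(f"event:{event_id}", json.dumps(event), ex=TTL_SECONDS)
    return event

def update_event(event_id: str, fields: dict) -> None:
    """On the rare write, update the database and invalidate the cached copy
    so the next read repopulates it with fresh data."""
    write_event_to_db(event_id, fields)
    cache.delete(f"event:{event_id}")

def load_event_from_db(event_id: str) -> dict:   # stub standing in for the DB layer
    return {"id": event_id, "name": "Example Fest", "venue": "Main Arena"}

def write_event_to_db(event_id: str, fields: dict) -> None:  # stub
    pass
```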
In summary, designing a robust and scalable Ticket Booking Web Application involves addressing both functional and non-functional requirements through thoughtful architectural decisions. By integrating Elasticsearch, we achieve low-latency search capabilities, enabling rapid retrieval of event data even under heavy loads. The use of Change Data Capture (CDC) with Kafka ensures that our data remains consistent and available, even in the face of system failures, by capturing and streaming changes asynchronously to both the primary database and Elasticsearch.
To further enhance performance during high-traffic periods, such as major events or sales, deploying a Content Delivery Network (CDN) allows us to cache API responses, delivering them quickly to geographically proximate users. This approach significantly reduces server load and improves response times during surges.
For ticket booking, implementing ticket locks prevents double bookings by temporarily reserving tickets and updating availability in real-time. Additionally, by introducing server-side queuing, we can manage write operations more efficiently, preventing server overload during peak times. Caching frequently accessed data, like event details, in a Redis store further reduces the strain on our database, ensuring fast read operations.
These strategies collectively contribute to a system that is not only scalable and resilient but also capable of delivering a seamless and efficient user experience, even in the face of high demand and complex data requirements.
Top comments (5)
If we rely on the ticket lock, which is a separate store, there is a chance its status gets out of sync with the DB, which is the source of truth for ticket availability. So I think we should add an optimistic lock on the ticket table to handle double-booking updates landing at the DB due to inconsistent cache data.
Agreed. If you land on a boundary case, then we'll also want concurrency protection at the DB. I thought I mentioned that, but you may have missed it.
Your guides are some of the best I've come across—truly top-notch. One question I have: how do you keep the event data cache in sync with the available tickets? Does the booking service send requests to the CRUD service to update the ticket information in the database, which then updates the cache?
No, you wouldn't do that! Managing it that way would be a headache. Instead, when someone views an event page, you'd quickly fetch the event details from the event cache. For ticket availability, you'd separately query the ticket table to check available tickets and cross-reference them with reserved tickets in the lock.
In practice, this would involve two API endpoints: one for quickly loading event details to ensure a smooth experience, and another, which might take a bit longer, to return the available seats.
Why not keep a copy of booked tickets in Redis? Since you have to query Redis anyway to get the reserved tickets. Then you don't have to go to the DB for availability.