Distributed File Storage Architecture (e.g., Amazon S3, Azure Storage)

#distributedsystems #techtalks #systemdesign #softwareengineering

Welcome to another Sunday blog, where we explore fascinating concepts in distributed systems and their real-world applications!

Applications often need to store many large static files, such as documents, videos, and audio. A single server copes poorly with that, so a distributed storage system becomes necessary. In this post we explore the architecture of Azure Storage, a managed system in which every stored file can be accessed via a URL. We will also examine how durability and high availability are maintained.

Blob Storage Architecture

Azure Storage (AS) consists of several clusters deployed across different geographic locations. Each cluster contains multiple racks of nodes (a node is an individual machine). Each rack is built as an independent unit (a separate fault domain), and data is replicated across nodes, so when a node fails the cluster keeps serving requests and availability is maintained.

A file in blob storage is accessed via a URL. At a high level, the URL comprises two parts: the account name and the file name. The account name is chosen by the customer who owns the storage account, and the AS DNS uses it to identify the cluster where the file is stored. The cluster then uses the file name to determine which rack and node contain the file.
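To make the two-part URL concrete, here is a minimal sketch in Python. The host name `myaccount.blob.example.net` and the helper `split_blob_url` are made up for illustration; the sketch assumes the account name is the first label of the host name and that the rest of the path is the file name.

```python
from urllib.parse import urlparse


def split_blob_url(url: str) -> tuple[str, str]:
    """Split a blob URL into (account name, file name)."""
    parsed = urlparse(url)
    # Assumption: the account name is the first label of the host name.
    account = parsed.hostname.split(".")[0]
    # Assumption: everything after the leading slash identifies the file.
    file_name = parsed.path.lstrip("/")
    return account, file_name


account, file_name = split_blob_url(
    "https://myaccount.blob.example.net/videos/intro.mp4"
)
print(account, file_name)
```

DNS only ever needs the first part to find the cluster; the second part is interpreted inside the cluster.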

A central location service allocates a cluster to an account when the account is created, and reallocates it when the account is reconfigured. Its allocation algorithms may consider factors such as load and distance. The location service also configures the DNS to map the account name to the public IP of the chosen cluster, and updates that cluster's configuration so it starts accepting requests for the account.
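The allocation step could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Cluster` fields, the scoring formula that mixes load and distance, and the in-memory dictionaries standing in for DNS and cluster configuration. The real allocation algorithm is not described in this post.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    public_ip: str
    load: float         # fraction of capacity in use, 0.0 to 1.0
    distance_km: float  # distance from the account's preferred location


class LocationService:
    def __init__(self, clusters):
        self.clusters = clusters
        self.dns = {}       # account name -> public IP of its cluster
        self.accepted = {}  # cluster name -> accounts it serves

    def allocate(self, account: str) -> Cluster:
        # Hypothetical score: lower load and shorter distance are better.
        best = min(self.clusters, key=lambda c: c.load + c.distance_km / 10_000)
        # Point DNS at the chosen cluster and tell the cluster to accept the account.
        self.dns[account] = best.public_ip
        self.accepted.setdefault(best.name, set()).add(account)
        return best


service = LocationService([
    Cluster("east", "203.0.113.10", load=0.9, distance_km=100),
    Cluster("north", "203.0.113.20", load=0.3, distance_km=2000),
    Cluster("west", "203.0.113.30", load=0.4, distance_km=500),
])
chosen = service.allocate("myaccount")
print(chosen.name, service.dns["myaccount"])
```

With these made-up numbers the nearest cluster loses because it is heavily loaded, which is the kind of trade-off an allocator has to make.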

The following figure (taken from the book Understanding Distributed Systems) shows how a lookup for a file works. In the figure, a cluster consists of three layers. We will examine these layers one by one.

Stream Layer

The stream layer stores data in units of replication (called extents in the Azure Storage design). Every write to a unit is replicated synchronously along a chain of storage nodes, a technique known as chain replication. When a file is uploaded and a new unit is needed, the stream manager assigns the unit to a chain of storage servers and returns the list of servers/nodes that hold its replicas. The client can cache this list so that future writes directly target the primary node at the head of the chain.
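A toy version of synchronous chain replication might look like this. The class names and the rule of picking the first few nodes are assumptions made for the sketch; a real stream manager would choose nodes by health, capacity, and placement, and would handle failures in the chain.

```python
class StorageNode:
    def __init__(self, name):
        self.name = name
        self.log = []  # append-only data held by this node

    def append(self, data, downstream):
        # Store locally, then forward to the next node in the chain.
        self.log.append(data)
        if not downstream:
            return True  # tail reached: every replica has the write
        return downstream[0].append(data, downstream[1:])


class StreamManager:
    def __init__(self, nodes):
        self.nodes = nodes
        self.chains = {}  # unit id -> ordered chain of nodes

    def allocate(self, unit_id, replicas=3):
        # Assumption: simply take the first `replicas` nodes.
        chain = self.nodes[:replicas]
        self.chains[unit_id] = chain
        return chain


nodes = [StorageNode(n) for n in ("node-a", "node-b", "node-c", "node-d")]
manager = StreamManager(nodes)

# The client asks once, caches the chain, and then writes to its head.
chain = manager.allocate("unit-1")
acked = chain[0].append(b"hello", chain[1:])
print(acked, [n.name for n in chain])
```

The write is acknowledged only after the last node in the chain has stored it, which is what makes the replication synchronous.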

Partition Layer

The partition layer contains the partition manager, which maintains an index of every file in the cluster. Each index entry holds the file's metadata (file name and account) and a pointer to the exact location of its data on the chain-replicated nodes. The partition manager also balances load across partition servers, dynamically splitting partitions that become too busy and merging them again when required.
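One way to picture the index and its repartitioning is a sorted list of key boundaries, as sketched below. Range partitioning by an `account/file` key is an assumption of this sketch, as are the names `IndexEntry`, `PartitionMap`, and the server names.

```python
import bisect
from dataclasses import dataclass


@dataclass
class IndexEntry:
    account: str
    file_name: str
    replica_nodes: list  # pointer to the nodes holding the data


class PartitionMap:
    """Key ranges mapped to partition servers.

    boundaries[i] is the first key owned by servers[i + 1].
    """

    def __init__(self, boundaries, servers):
        assert len(servers) == len(boundaries) + 1
        self.boundaries = list(boundaries)
        self.servers = list(servers)

    def server_for(self, key):
        return self.servers[bisect.bisect_right(self.boundaries, key)]

    def split(self, at_key, new_server):
        # Keys from at_key up to the next boundary move to new_server.
        i = bisect.bisect_right(self.boundaries, at_key)
        self.boundaries.insert(i, at_key)
        self.servers.insert(i + 1, new_server)


entry = IndexEntry("media", "video.mp4", ["node-a", "node-b", "node-c"])
key = f"{entry.account}/{entry.file_name}"

pmap = PartitionMap(["h", "p"], ["ps1", "ps2", "ps3"])
before = pmap.server_for(key)
pmap.split("k", "ps4")  # the middle range got too busy
after = pmap.server_for(key)
print(before, after)
```

Splitting only inserts one boundary, so keys outside the affected range keep their partition server.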

Front-End Layer

The front-end service, functioning as a reverse proxy, is a stateless component responsible for authenticating requests and routing them to the appropriate partition server based on the mapping maintained by the partition manager.
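The sketch below shows what "stateless, authenticate, then route" can mean in practice. The HMAC signature scheme, the `sign` helper, and the routing function are assumptions for illustration only; they are not Azure's actual authentication protocol.

```python
import hashlib
import hmac


def sign(secret: bytes, account: str, file_name: str) -> str:
    # Hypothetical request signature over the account and file name.
    message = f"{account}/{file_name}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class FrontEnd:
    def __init__(self, account_keys, route):
        self.account_keys = account_keys  # account name -> shared secret
        self.route = route                # key -> partition server name

    def handle(self, account, file_name, signature):
        secret = self.account_keys.get(account)
        if secret is None:
            return "403 Forbidden"
        expected = sign(secret, account, file_name)
        if not hmac.compare_digest(expected, signature):
            return "403 Forbidden"
        # No per-request state is kept: look up the server and forward.
        key = f"{account}/{file_name}"
        return f"forward to {self.route(key)}"


keys = {"myaccount": b"s3cret"}
front_end = FrontEnd(keys, route=lambda key: "ps1" if key < "n" else "ps2")

good = front_end.handle(
    "myaccount", "intro.mp4", sign(b"s3cret", "myaccount", "intro.mp4")
)
bad = front_end.handle("myaccount", "intro.mp4", "not-a-valid-signature")
print(good, bad)
```

Because the front end holds nothing between requests, any instance can serve any request, which makes it easy to add or remove instances behind a load balancer.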

Conclusion

Azure Blob Storage leverages a multi-layered architecture to ensure durability, scalability, and high availability for managing and accessing large-scale data efficiently.

