Manan Doshi

Distributed Hash Generation Algorithm in Python — Twitter Snowflake Approach

Into The Background…

In any software development lifecycle, unique IDs play a major role in manipulating and presenting data. Much of a system's correctness depends on each piece of data associated with a user transaction carrying a unique identifier.

There are several ways to generate unique IDs, but producing them at a rate of roughly 10,000 per second in a distributed fashion becomes a computational challenge.

The Twitter Snowflake approach is a widely used technique among recent distributed ID-generation algorithms. Several other approaches can be used for hash generation, which will be covered in a future article!

Some Knowledge on Approach…

To enable distributed hash generation, this algorithm applies a “divide and conquer” principle to a 64-bit hash: the 64 bits are split into five different fields that together establish uniqueness and prevent collisions between generated hashes.

Twitter Snowflake Hash template

The five parameters are as follows:

  • Timestamp (41 bits): This is the binary form of the number of milliseconds elapsed between the current time and a custom epoch chosen by whoever deploys the algorithm. (In our example, we have set it to Twitter’s default epoch, i.e., November 4, 2010, at 01:42:54 UTC.)
  • Datacenter ID (5 bits): This is the binary version of the Datacenter ID, which can accommodate 2⁵ datacenters, totaling 32 data centers.
  • Machine ID (5 bits): This is the binary version of the Machine ID, which can accommodate 2⁵ machines per datacenter, totaling 32 machines per datacenter. (Virtual addressing can be introduced to increase the count, but currently, the algorithm limits it to 32 machines.)
  • Sequence Number (12 bits): This is the binary format of the sequence number, which counts the number of hashes generated by a machine/process within any millisecond. The sequence number resets to 0 after each millisecond. After each ID generation, the sequence number increments by one if the request is received within the same millisecond.
  • Sign Bit (1 bit, defaults to 0): The one extra bit in the system is reserved for any custom case while using this algorithm on a large scale. It can be used as a parity bit, to distinguish between signed and unsigned numbers, to set flags, or for any future uses.

Each machine in the system involved in generating hashes follows this template to produce a hash.
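
Before jumping into the full class, here is a rough sketch of how these five fields map onto a single 64-bit integer using bit shifts. The pack_snowflake helper below is a hypothetical illustration, not part of the code later in this article (that code builds the same layout with binary strings instead):

# Hypothetical sketch: packing the five Snowflake fields into one 64-bit integer
def pack_snowflake(timestamp_ms, datacenter_id, machine_id, sequence):
    # Mask each field so it stays within its allotted width
    timestamp_ms &= (1 << 41) - 1   # 41-bit timestamp
    datacenter_id &= (1 << 5) - 1   # 5-bit datacenter ID
    machine_id &= (1 << 5) - 1      # 5-bit machine ID
    sequence &= (1 << 12) - 1       # 12-bit sequence number

    # The sign bit stays 0, so the result is a positive 64-bit integer:
    # 1 sign | 41 timestamp | 5 datacenter | 5 machine | 12 sequence
    return (timestamp_ms << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence

# Example: 123456789 ms since the custom epoch, datacenter 1, machine 3, sequence 0
print(pack_snowflake(123456789, 1, 3, 0))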

Let’s Code…

While writing the code for generating a hash, if you don’t have a distributed architecture or the resources to implement one, you can set the default bits of ‘Datacenter ID’ as ‘00000’ and ‘Machine ID’ as ‘00000’. This scenario is also reflected in the following code:

from datetime import datetime, timedelta

class GenerateHash:
    def __init__(self):
        self.sequence_timestamp = datetime.utcnow()
        self.sequence_number = 0

        # Setting the default epoch to 04 November 2010 at 01:42:54 UTC
        self.time_ref = datetime(2010, 11, 4, 1, 42, 54)

    def generate(self):        
        ## Sign bit: Will be useful for future references
        sign_bit = '0'

        ## Datacenter bit (Default Datacenter: '00000')
        datacenter_bit = '0' * 5

        ## Machine bit (Default Machine: '00000')
        mac_bit = '0' * 5

        ## Timestamp bit                
        # Getting the current time in UTC
        time_now = datetime.utcnow()

        # Time Difference since default epoch
        time_diff = time_now - self.time_ref

        # Total time_diff in milliseconds
        ms = int(time_diff.total_seconds() * 1000)

        # Milliseconds converted to a zero-padded 41-bit binary string
        timestamp_bit = format(ms, '041b')

        ## Sequence bit
        # Millisecond count of the previous request, measured from the same epoch
        last_ms = int((self.sequence_timestamp - self.time_ref).total_seconds() * 1000)
        # If the current request falls in a new millisecond: reset the sequence number
        if ms != last_ms:
            self.sequence_number = 0

        # Set the sequence timestamp to the current request's timestamp
        self.sequence_timestamp = time_now

        # Joining all parts into one 64-bit binary string
        # self.sequence_number is formatted as a zero-padded twelve-bit binary value,
        # so at most 4096 hashes can be generated per machine per millisecond
        final_bin = sign_bit + timestamp_bit + datacenter_bit + mac_bit + format(self.sequence_number, '012b')

        # Incrementing the sequence_number for the next request in this millisecond
        self.sequence_number += 1

        # Converting the final binary string to an integer, then to a hex value
        hex_val = hex(int(final_bin, 2))
        # Stripping the '0x' prefix added by the hex() function
        final_hex = hex_val[2:]
        return final_hex

hash_generator = GenerateHash()
new_hash = hash_generator.generate()
print(new_hash)

Here we have used the ‘datetime’ library for time operations, working in UTC so that machines/servers deployed globally produce consistent timestamps.
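
To sanity-check the layout, here is a small hypothetical decode_hash helper (an illustration only, reusing the GenerateHash class and the same Twitter epoch from above) that splits a generated hash back into its fields:

from datetime import datetime, timedelta

# Hypothetical helper: split a generated hex hash back into its value fields
def decode_hash(hex_hash):
    value = int(hex_hash, 16)
    sequence = value & 0xFFF               # lowest 12 bits
    machine_id = (value >> 12) & 0x1F      # next 5 bits
    datacenter_id = (value >> 17) & 0x1F   # next 5 bits
    ms = value >> 22                       # remaining 41 bits (the sign bit is 0)
    created_at = datetime(2010, 11, 4, 1, 42, 54) + timedelta(milliseconds=ms)
    return created_at, datacenter_id, machine_id, sequence

print(decode_hash(hash_generator.generate()))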

The Benefits, Concerns, and more…

Benefits First :)

  1. We generate alphanumeric hashes in a distributed fashion ensuring consistency and scalability across our systems.
  2. These hashes are timestamp-based, allowing us to track their creation time, sort them accordingly (see the short snippet after this list), and trace clicks on features, among other uses.
  3. Not only can we control when these hashes are generated, but we can also customize how they are generated. This flexibility enables us to modify or update our hash generation methods as our systems evolve.
  4. This algorithm supports the generation of at least 10,000 hashes per second.
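
As a quick, hypothetical check of point 2 above (reusing the GenerateHash class defined earlier), a hash generated later corresponds to a larger integer, so sorting hashes numerically also sorts them by creation time:

import time

gen = GenerateHash()
first = gen.generate()
time.sleep(0.01)   # wait ~10 ms so the timestamp field advances
second = gen.generate()

# The later hash corresponds to the larger integer value
print(int(first, 16) < int(second, 16))   # expected: True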

Some Concerns :(

  1. The hash IDs are timestamp-based and embed specific datacenter and machine IDs, which makes future IDs partially predictable.
  2. Due to the predefined guidelines for hash generation and the lengthy hash string, there is limited scope for customization. This constraint can restrict the algorithm’s utility and adaptability in various systems.

Something more to focus on…

  • In our current algorithm, we do not have built-in support for clock synchronization in a multi-machine environment. We currently assume that our ID generation servers have synchronized clocks. However, the optimal and widely adopted solution would be to implement the Network Time Protocol (NTP), which we will cover in future articles. So stay tuned…

Thank you so much for getting this far!
