Bloom Filters

#bloomfilters #database #caching #bitarray

Imagine walking into a mobile shop that only sells Samsung and OnePlus phones. You ask the shopkeeper for an iphone. Without wasting any time, he says, “No, we don’t sell iphones.” Now, you change your mind to buy a One Plus model(a rare one). Instead of denying straightaway, the shopkeeper responds, “It might be in stock, but I can’t be 100% sure until I look.”

This is similar to how a Bloom Filter works, it can definitely tell you if something doesn’t exist(like an iphone in an android shop), but if it says something might exist, there is a certain uncertainty of the data’s presence.

To keep it more simple, Bloom Filter can give you a false positive(i.e. return true for something that is false). But it can never give you a false negative(i.e. return false for something that is true).

We all have encountered those warning messages while signing up on some platform that says, “Username already taken/exists”. That’s Bloom Filters in picture(not always as there are some more methods like querying the database or cache but that becomes inefficient when the user traffic explodes).

Let’s understand this Data Structure bit by bit:

A Bloom Filter is a probabilistic data structure. Now what does that mean?

Probabilistic in context of data structures means that this data structure provides answers with a certain possibility of being correct. This filter can definitely say when an element is not present inside it, but it might be wrong in some cases when it says, “Yes, I have this data”.

A bloom filter uses a bit array to store the data. Now when I say it stores the data, it doesn’t store the actual raw data but applies some hash functions to those data sets and accordingly marks the corresponding positions in the bit array. For example, if you want to check if “Kartik” exists inside a bloom filter it will apply multiple hash functions to this input and then mark their corresponding indexes as 1.

Let’s break this down:

Input — ‘Kartik’ to be added inside a bloom filter.
Hash functions applied: Since a bloom filter uses multiple hash functions, we will consider it has 3 for better understanding.
Positions to be marked: Since this filter uses a bit array(a relatively large size bit arrray) where all indexes are set to 0(switch off) by default, based on the hash function’s result the positions will be marked as 1(switch on).
That’s it. The data is stored now. Go ahead to check if it is present or not.
Below is a visual representation of hash functions being applied to an input:

Now, the outputs of the above hash functions(3, 5, 9) are the positions that need to be set. Below is its visual representation:

As you can see the positions 3,5 and 9 are set to 1.

Checking if the data is present in the filter or not:

You give the input. Let’s say the input is again “Kartik”.
The same hash functions are applied to “Kartik.”
On applying the hash functions we would get the same result for obvious reasons which would fetch us the same positions inside the bit array i.e. 3,5 and 9.
The filter checks if these positions in the bit array are set to 1.
Oh, these indexes are already marked as 1(they are on). This means the data might be present.(Username already exists, please try using a different username).
However, if any of the hash function results point to a position that is still 0, the filter can confidently say, “No, the data is not here.” This is because the same input, when hashed, will always give the same result, so if those positions aren’t set, the data was never added.

A Bloom filter can give false positives (saying something might exist when it doesn’t), but it can never give false negatives (saying something doesn’t exist when it actually does). So, when you check for “Kartik” and all bits are 1, it means “Kartik” might exist. If even one bit is 0, the filter is sure it doesn’t exist.

To minimize the number of false positives you carefully need to decide the length of the bit array to be used(within your system’s capabilities) and also the number of hash functions.

Increasing the number of hash functions exponentially would result in slowing down the process but decreasing them won’t increase the speed but would result in an increased number of false positives.

Read more about how to calculate the above specifications here

That’s all that I know about Bloom Filters as of now. I have used it in my personal projects but not on big scale ones so the choice of using this data structure should be entirely yours. I’m attaching a link where you can experiment on this data structure and find out how many times you get a false positive?

Here you go:

https://llimllib.github.io/bloomfilter-tutorial/

If you read it till here, I’d like you to please share your story in case you used this data structure in your experience. If not used, do consider giving it a try and if you liked this post, please like, comment and share to increase the reach. You can also follow me for more such content. I post on weekends. Thanks :)

DEV Community

Bloom Filters

Top comments (0)

Read next

Load Balancing Algorithms with Examples

🌐 Comprehensive Guide to Building a Production-Level CI/CD Pipeline with Full-Stack Technologies 🚀

AI Interview Prep Helper with PostgreSQL, Ollama, and Open-Source Vector Extensions 🚀

How to use migrations with Golang