DEV Community

Cover image for The Birthday Paradox: A Statistical Breakdown and How it Relates to Online Security
Connor Anastasio
Connor Anastasio

Posted on

The Birthday Paradox: A Statistical Breakdown and How it Relates to Online Security

The Birthday Paradox is one of the most well known paradoxes and a staple in many Statistics classes. One common phrasing of it is this:

Given a room of only 23 randomly selected people, the probability that two of them have the same birthday is over 50%.

How is this possible? Does it even make sense? Thankfully, we already understand the math behind it and can prove this is the case. In other words, the only “paradox” here is that our brains have so much trouble understanding it.

Since our probability space contains only two possible outcomes (shares same birthday and does not) we can find the probability two people share the same birthday by first finding the probability that they do NOT share the same birthday and subtracting it from 100%. (For simplicity we will be ignoring leap days in our calculations. The end result is essentially the same; the math is just a bit messier.)

The key to understanding why the Birthday Paradox is true is that we are not looking for two particular people that share a specific date; we are comparing each person against all other individuals and looking for ANY date that is a match. Consider the simplest case with two people: the chance of not sharing a birthday is

36436599.726%\frac{364}{365} \approx 99.726\%

This makes sense, as there are 364 other days in the year the second person could have been born on in order to not share the same date. When we add in a third person, the chance becomes

(364365)×(363365)99.179%\left(\frac{364}{365}\right) \times \left(\frac{363}{365}\right) \approx 99.179\%

We are subtracting one from the numerator because the third person must not share a birthday with either the first OR the second, and is compared to both. This means each additional person we add has one less day in the year they can be born on in order for there to successfully be no matches with anyone else.

Continuing this way,
Five people:

(364365)×(363365)×(362365)×(361365)97.286%\left(\frac{364}{365}\right) \times \left(\frac{363}{365}\right) \times \left(\frac{362}{365}\right) \times \left(\frac{361}{365}\right) \approx 97.286\%

Six people:

(364365)×(363365)×(362365)×(361365)×(360365)95.954%\left(\frac{364}{365}\right) \times \left(\frac{363}{365}\right) \times \left(\frac{362}{365}\right) \times \left(\frac{361}{365}\right) \times \left(\frac{360}{365}\right) \approx 95.954\%

We can follow the pattern from these examples to derive a general formula for finding the probability of n people not sharing the same birthday:

364!(364(n1))!365n1\frac{364!}{\left(364 - (n-1)\right)! \cdot 365^{n-1}}

Using this for 23 people, we get:

364!(36422)!3652249.270%\frac{364!}{\left(364 - 22\right)! \cdot 365^{22}} \approx 49.270\%

Subtracting this from 100% gives us a roughly 50.7297% chance that two people share the same birthday. In other words, the chance of finding 2 random people in a room of 23 people with matching birthdays is about the same as getting heads on a coin flip.

A visual interpretation may be helpful here:

Image description

Recall that we are asking if ANY two people share ANY birthdate; with 23 people, that gives us 253 possible pairings to examine between just this small group. The chance jumps to over 99% at 57 people (1,596 possible pairings), and over 99.9% at 75 people (2,775 possible pairings)!

Image description

The solution to the Birthday Paradox has far-reaching implications, an important one being the field of cryptographic hashes.

The Birthday Paradox and The Birthday Attack

Hashes are a type of encryption that generate a sort of “digital fingerprint” for a file; they are generated by feeding the actual bytes that make up the file into a private algorithm that generates an output string. Since a file's hash is generated using the file itself, it can act as an easy way to verify that a file is truly what it claims to be. Additionally, hash algorithms are purposely structured so that making even tiny changes to a file will produce wildly different hash values.

Image description

There are many cryptographic hash algorithms in use today, the most common at the time of writing being MD5 (Message Digest Algorithm 5) and SHA-2 (Secure Hash Algorithm 2). SHA-2 is great. MD5? Not so much.

You may be saying, “Alright, sure, I guess. But how does this relate to internet security?” Well, hackers realized that the same line of thought used in understanding the birthday paradox could be applied to breaking certain forms of hash encryptions, most notably MD5.

On paper, MD5 sounds relatively secure. Being 128-bit encryption, it is capable of generating 3.4*10^38 possible unique hashes. Unfortunately, it turns out this is not nearly enough with the computational speed of modern computers. The breakthrough came when hackers realized that even with a number that large, the algorithm doesn't promise or guarantee that every single file will always have a unique hash. In fact this is actually impossible to be true, since files of any size can be hashed and the length of the hash (only 38 characters long) couldn't possibly be unique for every file. If someone made a program to generate millions and millions of dummy files and then compared their hashes they would eventually find a match. Finding any two hashes that were identical, regardless of what that hash actually was, they would be able to change information in one file with that hash and have it affect the other. This is known as a “hash collision.” They would then be able to take note of the exact changes to each hash after a change to the files and compare the two; this would allow someone to slowly figure out what the private hash algorithm actually is, effectively destroying its security.

The basic outline is this: find two hashes that are the same, regardless of what they may be, and you’ll be well on your way to defeating that encryption. Because we only care about finding any two files with matching hashes, and each file hash is compared to all others we have, the amount of files that need to be checked before finding a match is significantly smaller than all 3.4*10^38 possibilities. Sound familiar? This is known as using a "Birthday Attack."

In 2008, the CMU Software Engineering Institute concluded that MD5 was essentially "cryptographically broken and unsuitable for further use"; yet it is still the most common hash algorithm in use today. Even massive tech companies that should know better continue to use it: in 2012 Microsoft was exploited using MD5 hash collisions when hackers were able to generate as many counterfeit Windows SSL certificates as they wanted, indistinguishable from genuine ones.

Thankfully, we have much more secure hashing algorithms (such as the previously mentioned SHA-2) available to us, which will take much, much longer to break in this manner. So, at least at the moment, we are safe from The Birthday Paradox. And before you think about using MD5 to secure information on your website, remember it was defeated by a couple of birthdays and some basic statistics.

Top comments (0)