Every year tens of thousands of respondents trust the State of JavaScript and State of CSS surveys with their data, some of it quite personal and sensitive, and I'm fully conscious of the responsibility this represents.
So ever since starting to run the surveys, I've hoped that I would never have to write the dreaded "data leak" post. But sadly today is the day I need to address this issue.
TL;DR
An encryption key that makes it possible to decrypt publicly-available encrypted email addresses and link them to survey responses was mistakenly committed to a public GitHub repo.
Key Points
- This is a human error, not a malicious attack.
- The leak is now closed.
- You are affected if you answered the State of JS or State of CSS surveys in 2020 or earlier (the 2021 JS and CSS surveys are not affected).
- So far there is no evidence that the mistake was actually exploited, but I'll keep monitoring the situation.
- Passwords were not affected as they use a completely separate hashing mechanism.
What Happened
This situation resulted from three separate mistakes:
- Two years ago, I decided to add email hashes (or so I thought) to the publicly available survey response datasets (for surveys up to 2020; the 2021 datasets had not been published yet), to serve as IDs that would make it possible to track how a given respondent's answers evolved over time.
- An open-source contributor contributed the function that generates those "hashes" and used a 2-way encryption function. Over time, I somehow came to assume that it was a 1-way hashing function instead.
- About a month ago, another open-source contributor committed private credentials (which included the encryption function's key) to a public repo while working on a separate project. Although the contributor noticed the issue and scrubbed the history right away, the faulty commit apparently stayed accessible on its own as a "ghost commit" outside of any branch.
Both because of the holidays, and because I didn't realize the consequences of the leak right away, the encryption key stayed accessible in theory for about a month.
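To make the difference between the two kinds of function concrete, here is a minimal sketch using Node's built-in crypto module. The cipher (AES-256-GCM), the key, and the email address are all assumptions chosen for illustration; the post does not say which algorithm the survey code actually used. The only point is that encrypted output can be reversed by anyone holding the key, while a hash digest has no inverse function.

```typescript
import { createCipheriv, createDecipheriv, createHash, randomBytes } from "node:crypto";

// Hypothetical key and address, for illustration only.
const key = randomBytes(32);
const email = "respondent@example.com";

interface SealedBox {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

// Two-way: AES-256-GCM encryption. Anyone holding `k` can reverse it.
function encrypt(plaintext: string, k: Buffer): SealedBox {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", k, iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function decrypt(box: SealedBox, k: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", k, box.iv);
  decipher.setAuthTag(box.tag);
  return Buffer.concat([decipher.update(box.data), decipher.final()]).toString("utf8");
}

// One-way: SHA-256 digest. There is no key and no decrypt counterpart.
function sha256Hex(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

const box = encrypt(email, key);
const recovered = decrypt(box, key); // the original address comes back
const digest = sha256Hex(email); // 64 hex characters with no inverse
```

Note that this sketch uses a random IV, so the same address encrypts differently each time; a scheme meant to produce a stable ID would have to be deterministic, which is a further assumption not covered here.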
What This Means For You
The risks to survey respondents are two-fold:
- Someone could use the dataset to generate an email list used for spamming purposes.
- Someone could link personal data (salary, etc.) to the email address you used.
Was the Leak Exploited?
The "good" news is that the repo the key was committed to is very low traffic and had no forks, watchers, or stars, making it less likely that ill-intentioned people randomly stumbled on the encryption key.
Moreover, even with the key in hand an attacker would've had to then figure out where the key was being used (which happens in a separate repo); what it was being used for; and where the relevant encrypted emails were made available; none of which is obvious unless one is already familiar with the project.
So while I don't have any way to tell with certainty if anybody actually went through the process of decrypting the encrypted emails and correlating responses with them, I personally think the probability of this happening is fairly low. But I apologize for not being able to give you more certainty.
Steps Taken
I've taken the following steps:
- Stopped using the leaked encryption key.
- Made the repo private so that the encryption key is no longer accessible.
- Took down the public datasets containing the encrypted emails until I can re-upload versions without them.
Note: if you happen to have a copy of the datasets or are hosting a mirror, please get in touch or delete your copies if you can!
In the future, I will also focus on making it possible to complete the survey without having to provide an email, which is something that survey respondents have often asked for.
Ironically enough, the leak happened in the process of migrating the survey app to a newer, more robust codebase in order to make it easier to change the way accounts work.
Going Forward
The surveys are an open-source project, created in the open by a mostly-volunteer group of contributors from around the world. And while this can sometimes make it tougher to properly coordinate and avoid situations like this one, I also think being community-driven is one of the project's major strengths.
So while it's totally understandable if a leak like this one makes you question sharing any data with us in the future, I hope you'll be able to give the project another chance.
And if you're not fully comfortable sharing personal information just yet, here's a reminder that you can always skip any question in any survey. Another option that might put you more at ease is to use an email alias that can't easily be tied back to you.
I deeply apologize again, and if you have any questions about this whole thing, just leave a comment here and I'll do my best to answer.
Note: I am very grateful to Troy Hunt for pointing me to this great article about the proper way to handle such matters. I recommend it if you ever end up in the same situation!
Top comments (34)
Thanks for being so transparent about this! I reckon most companies don’t even bother disclosing anything until they know for certain data was actually decrypted by someone. Hell, I’ve seen companies actively downplay the severity of a situation even when they know for sure passwords have been leaked.
Well, without people’s trust the surveys can’t really work. So I’ve always tried to do everything in the open from the start. Thanks for the kind words!
Thanks for the transparency and clear communication. I would imagine it's a tough and nerve-wracking experience to post this article, so thank you also for your courage to show the (IMHO) right way to handle this.
A+++ would answer survey again.
Honest mistake, commendable recovery. (Who’s ever gonna misuse that data anyway. Let's hope only people who still use too many float:left's and too many !important's get spammed with beginner CSS tutorials! Sorry stupid joke.)
An e-mail address in combination with your development preferences could be used to target customized phishing attacks against devs. We can be an attractive target, given IT is one of the best paying industries out there. That being said, we're also one of the most aware, and thanks to Sacha's quick and honest reaction, we're now aware that things like that can take place.
"given IT is one of the best paying industries out there."
Lol, not where I work at... 😅🥲😣
Human errors happen all the time, unfortunately. On the other hand, transparency is a rare value, thank you very much for being worthy of trust because of your honesty!
+1000
Thank you for being so transparent and honest about this! Everyone makes mistakes and I appreciate the effort that goes into these surveys every year.
Thanks for your kind words!
Please ensure you consult experts on security and privacy before choosing a new approach, and also seek community feedback once you come up with a new plan.
For example, it’s not enough to simply use an ordinary one way hash of email addresses, because nothing stops an adversary simply applying the same function to some publicly known email addresses and looking for matches in your dataset. I suspect this is probably what the original developer had in mind when they chose an encryption function instead.
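The matching attack described in this comment can be sketched in a few lines. Everything below is hypothetical (the rows, the field names, the addresses), and SHA-256 stands in for "an ordinary one-way hash". The keyed variant at the end (HMAC) is one common mitigation rather than anything the project has said it uses, and it only helps while the key stays secret.

```typescript
import { createHash, createHmac } from "node:crypto";

const sha256Hex = (value: string): string =>
  createHash("sha256").update(value, "utf8").digest("hex");

interface PublishedRow {
  id: string;
  salary: string;
}

// Hypothetical published dataset: each row carries an unkeyed hash as its ID.
const publishedRows: PublishedRow[] = [
  { id: sha256Hex("alice@example.com"), salary: "range_a" },
  { id: sha256Hex("bob@example.com"), salary: "range_b" },
];

// Addresses the adversary already knows from elsewhere (also hypothetical).
const knownEmails = ["carol@example.com", "alice@example.com"];

// Index the published rows by ID.
const rowsById = new Map<string, PublishedRow>();
for (const row of publishedRows) {
  rowsById.set(row.id, row);
}

// Hash every known address and look it up among the published IDs.
const reidentified: { email: string; salary: string }[] = [];
for (const email of knownEmails) {
  const match = rowsById.get(sha256Hex(email));
  if (match !== undefined) {
    reidentified.push({ email, salary: match.salary });
  }
}

// A keyed hash (HMAC) cannot be recomputed without the secret.
const hmacHex = (value: string, secret: string): string =>
  createHmac("sha256", secret).update(value, "utf8").digest("hex");
```

Here `reidentified` ends up linking alice@example.com to her row even though no hash was ever "reversed".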
Yes, we will not publish hashes at all going forward. We do need to store one way email hashes privately for log in purposes, but they won’t be part of any public dataset.
This is a good reminder to not store encryption keys in a repo. Ideally use something like Hashicorp Vault, but at least don't store them in files within the repo.
Hosting systems like Netlify, Azure, etc. let you provide secrets via their UI, and those secrets can then be accessed from code through the process environment (process.env in Node).
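As a rough sketch of that approach, the helper below reads a secret from the environment and fails fast when it is missing, so nothing has to live in a file inside the repo. The helper name and the variable name in the usage comment are hypothetical.

```typescript
// Read a required secret from the environment, failing loudly if it is absent.
// `env` defaults to process.env but can be swapped out for testing.
function requireSecret(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required secret: ${name}`);
  }
  return value;
}

// Usage (hypothetical variable name, set in the hosting provider's UI):
// const encryptionKey = requireSecret("ENCRYPTION_KEY");
```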
I'm not a huge fan of this solution either (it can lead to a lot of insecure copy/pasting into Slack or Dropbox when you need to share the secrets, multiplying the number of places the secret exists), but it's true it would have avoided the problem in this specific case.
It always comes back to that human error of the Post-it on the monitor with the password. Lol
What was the original motivation to "track how a given respondent's answers were evolving over time"?
Let’s imagine that in 2020 Famous React Developer Foo shares the survey and brings in their audience; and then in 2021 Famous Vue Developer Bar shares the survey and in turn brings in their audience. In theory you could have shifts in survey answers just because different people are answering the survey. By adding these ids my idea was that you could isolate a constant cohort of respondents if you wanted to remove the influence of audience shifts.
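The cohort idea can be illustrated with a small sketch: given responses tagged with a stable per-respondent ID, keep only the people who answered in every year of interest. The record shape and field names are hypothetical and are not the survey's actual schema.

```typescript
interface SurveyResponse {
  respondentId: string; // hypothetical stable identifier
  year: number;
  favoriteFramework: string;
}

// Keep only responses from people who answered in every one of `years`.
function constantCohort(responses: SurveyResponse[], years: number[]): SurveyResponse[] {
  // Collect the set of years each respondent answered.
  const yearsById = new Map<string, Set<number>>();
  for (const r of responses) {
    const seen = yearsById.get(r.respondentId) ?? new Set<number>();
    seen.add(r.year);
    yearsById.set(r.respondentId, seen);
  }
  const wanted = new Set<number>(years);
  return responses.filter((r) => {
    const seen = yearsById.get(r.respondentId);
    return seen !== undefined && wanted.has(r.year) && years.every((y) => seen.has(y));
  });
}

const sample: SurveyResponse[] = [
  { respondentId: "a", year: 2020, favoriteFramework: "React" },
  { respondentId: "a", year: 2021, favoriteFramework: "Vue" },
  { respondentId: "b", year: 2020, favoriteFramework: "React" },
  { respondentId: "c", year: 2021, favoriteFramework: "Vue" },
];

// Only respondent "a" answered in both years, so only their two rows remain.
const cohort = constantCohort(sample, [2020, 2021]);
```

Comparing answers within `cohort` across years then reflects changes in the same people's opinions rather than a change in who showed up.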
Thanks for the quick reply. Two concerns:
People generally have the expectation that such surveys are anonymous and that results are only gathered in aggregate. It is also safer. It looks like you have realized the value of these propositions.
So: Wouldn't you still be able to achieve your intention by survey respondents revealing their previous framework exposure? Like checking "I have mostly experience with..." React / Vue / Angular, etc. Then you could see the influence of audience shifts.
I don't think that achieves quite the same thing. I think the simplest solution is to have two datasets, one without any kind of identifiers for the general public and one with (secure) identifiers which we would only make available to data researchers who want to specifically do a cohort analysis if they get in touch with us.
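One way to picture the two-dataset split is a function that builds the public row from an allowlist of fields, so the identifier can never be carried over by accident. This is only a sketch under assumed names: the `cohortId` and `answers` fields are hypothetical and not the project's real schema.

```typescript
interface StoredRow {
  cohortId: string; // hypothetical identifier, kept for the restricted dataset only
  answers: Record<string, string>;
}

interface PublicRow {
  answers: Record<string, string>;
}

// Build the public row from an explicit allowlist of fields rather than
// deleting fields from the stored row, so new private fields stay private by default.
function toPublicRow(row: StoredRow): PublicRow {
  return { answers: { ...row.answers } };
}

const stored: StoredRow = {
  cohortId: "abc123",
  answers: { favorite_framework: "react" },
};

const publicRow = toPublicRow(stored);
```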
What would be the limitation? Unless you actually want to model the relationships between the influencers and their audiences, I don't see how you actually need to track personally identifiable information to track trends in demographics...
If we want to track how cohorts evolve over time then we should just track that in a secure manner; or not track it at all if we can't do it right. It just seems like a simpler approach than finding some other more "fuzzy" metric to use as a proxy.
The question is whether it's really necessary to track 'cohorts' per se, with the security risk and disfavored UX that entails, if you can get a decent enough statistic from other, more aggregate means.
I appreciate the disclosure and the transparency, and I sympathize with the incident. However, I don't see key steps in this post that would make me trust the survey going forward. To be blunt, the fact that you mistook a 2-way encryption for a hash makes me think that you do not have the security expertise to be responsible for this data.
The "Steps Taken" section still talks about mitigating the encryption mechanism. Is that the same 2-way encryption that caused the issue? Why isn't the first step to remove the encryption mechanism and replace it with a 1-way hash? If you still need to continue using keys, is there a better option for key management than simply making the repo private? The "Going Forward" section doesn't mention security improvements at all.
Before I'd trust the surveys again, I'd like to see you talk about third-party security audits, and how you're going to verify security-related contributions going forward.
Thank you for the write up.
I don't think I received an email about this but a friend who also took the survey said he got an email about it. Were all participants emailed about the data breach?
Can you explain what a "ghost commit" is?
Yes, all participants were emailed. Maybe you unsubscribed from the mailing list in the past?
And as I understand it, the "ghost commit" was a commit that was not part of a branch or linked from anywhere on GitHub, but was still independently accessible if you had the direct URL.
You know you can use BFG, right?
docs.github.com/en/authentication/...