Gather 'round and let me tell you about that time my teammates and I took down one of the platform's most critical services.
For the sake of narrative, I'll name my coworkers S (senior) and M (mid-level). I've also changed the dates and some of the technical details to keep things as anonymous as possible.
'Twas a lovely Thursday afternoon at the end of July. The weather was milder than in the days before, and my teammates and I had just moved into a new space in the office.
Life was mostly good.
The week before had passed without our most senior dev, S, but M and I had managed to complete some tasks on our own.
The changes we introduced included some pretty heavy stuff (new database fields, new endpoints), so we decided to wait for S's review before merging them.
Database changes and migrations 📚
In the meantime another dev reviewed our changes and mentioned that the operations in our migration scripts could be a lot of work for the database server and could keep it locked for a while. So he suggested some alternatives.
If you don’t know what a database migration is, it’s a series of SQL scripts that — when executed sequentially — create the whole database schema from scratch (tables, columns, default values, constraints, etc.). Usually they run on every deployment, but only if there’s something new to apply––that is, a new migration script that has never been run.
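Just to make that concrete, a migration script can be as small as a single statement. Something like this (the file, table, and column names here are invented for illustration, not the real ones from this story):

```sql
-- 0042_add_shipping_notes.sql  (hypothetical migration script)
-- Adds a new, nullable column; existing rows simply hold NULL.
ALTER TABLE orders
    ADD COLUMN shipping_notes TEXT NULL;
```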
We didn't want to run our change as a migration, because then it would be executed by our continuous integration system and be completely out of our control if something went wrong.
We had to run it manually.
So, we sat down with a replica of the production database and started playing around with queries trying to find the most efficient alternative.
This was fun for a while, the three of us trying to guess what would happen when we ran each line of the script separately. Would it take long? Or run instantly? And why? Where should we set the default value so that new entries wouldn't break anything?
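To give you a flavour of the kind of trade-off we were weighing (with invented names and PostgreSQL-style syntax, not our actual scripts), the usual trick is to split one heavy statement into smaller, less lock-hungry steps:

```sql
-- Heavy version: on some engines (or older versions) adding a NOT NULL
-- column with a default rewrites the whole table while holding a lock.
ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'new';

-- Split version: each step is cheaper on its own.
ALTER TABLE orders ADD COLUMN status VARCHAR(20);          -- quick metadata-only change
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';  -- new inserts are covered from here on
UPDATE orders SET status = 'new' WHERE status IS NULL;     -- backfill old rows (ideally in batches)
```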
Showtime! 👯♀️
By now it was 5 PM and we were finally happy with our query. It did everything in the right order and didn't lock anything down for very long. We sent a ticket to the infrastructure team to run it directly on our servers, since obviously we didn't have credentials for that. You'll see exactly why in just a moment.
Infra was really diligent about this. We know they’re a busy bunch, and it’s okay if they take a while to do stuff. But not this time, oh no, this time the ticket was closed before I could blink twice.
Everything seemed OK. It was almost the end of the day, so we split off to our own desks and finished up some tasks.
Some minutes later, a member of another team came by and asked me if we were changing stuff on that feature, because it was throwing errors. It hit me like an ice bucket challenge:
Oh no, I mean, yes… we f***ed it up.
Can you guess what happened? I know it's harder to spot omissions than mistakes, but I think you might be able to figure out what we needed to do (and didn't) without knowing the service, the feature, or the queries.
If you do, scroll to the bottom and leave a comment. I'll trust you're not peeking below this photo 👇
Alright, if you have no idea what could have happened, that's OK too. I feel I should have known, because it had happened locally during development like a million times. And that very day I had been starting a very similar feature on my own, dealing with the same error.
Ready for it?
Reality hits hard 👊
We added the column in the database, but we never merged the code that handled this new data. This resulted in the application not knowing WTF to do with those fields, and throwing errors on all GET requests 😱
Now, this might not have happened in some languages, which would probably have discarded the extra, unparsed data and moved on. I shouldn't disclose much more, but I'll say that we were using a technology that's not as chill about rogue database columns: it demands that you tell it how to handle everything it receives.
Not merging the code for the new fields wasn’t the problem though. Here’s where our little evil villain comes in: meet *
😈 malefic, echoing laughter
See, all of our retrieval queries were using * (known as an asterisk or star), which means the results include every field on the table. This was the day I learned that this is a very, very, very bad idea.
If we had listed the fields to retrieve instead of using *, nothing bad would have happened. But we didn't. And all three of us knew it. It just slipped our minds 😰
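Just to make the difference concrete (made-up table and column names again, not our real queries), the fix is as unglamorous as it sounds:

```sql
-- What we had: * happily returns every column, including the brand-new
-- one the application code didn't know how to handle yet.
SELECT * FROM orders WHERE customer_id = 42;

-- What would have saved us: an explicit column list, so a new column
-- stays invisible until the code that actually uses it is deployed.
SELECT id, customer_id, total, created_at
FROM orders
WHERE customer_id = 42;
```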
The real, real problem 🤦🏻♀️
OK, but we're human, right? We make mistakes. No one can keep all of a company's codebases in their working memory. This is why we have tests: not because our code is fragile and error-prone, but because our minds are. And it's our minds that create the code.
So the problem wasn’t really that we broke it. The problem was that no alarms went off.
We had production tests, but they weren’t notifying anyone. We had error alarms in the logs, but they weren’t monitoring the thing that caused the most errors. We had end-to-end tests, but they were disabled and marked as “broken”. This is what was truly unacceptable 🍋
Epilogue 📜
Ultimately, though, it was a big opportunity for learning: you never learn as hard as when you f*** up. That’s a lesson that never leaves you.
So, for you children out there, here’s my list of learnings from that day:
- NEVER, I MEAN NEVER use * in a query in your code. Even if you have to write one hundred column names, it's better that way.
- Be scared of database changes; may fear guide you into caution.
- Migrations should be decoupled from deploys, in order to provide better control over them.
- When you introduce a database change, make sure all the code is already in place and that it can handle both the previous db state and the one after the change.
- Have happy-path tests in place running on production for your critical features, working and notifying the right people.
- Test database changes as soon as they are applied, and monitor the logs for a couple of minutes afterwards (depending on the amount of traffic you get).
- Aaand brush your teeth before you go to sleep 😬
Disclaimer ⚠️
This is a story told by the least experienced and knowledgeable of its characters. If you have any input, anything to add or correct, I would really enjoy reading it, but please be nice about it. 🙂
Credits
Cover Photo by Sebastian Fröhlich on Unsplash
Photo by 胡 卓亨 on Unsplash
Thanks to Nicolas Grenié a.k.a. @picsoung and Eva Casado de Amezua for their reviews and input.
This story was originally published on HackerNoon.
Top comments
Good article, but I disagree with your bullet "migrations should be decoupled...". This decoupling is what caused the situation: the schema version and the code version moved independently. The desire to have "control" over your migrations sounds like you don't trust your delivery pipeline. Most migration tools have an up/down or roll-forward/revert path for this very reason. If you need control, you aren't really doing migrations; you're just calling manual schema updates "migrations". And yes, don't use *; your DBA should have yelled at your team long ago about that.
That is what I was going to touch on too! I have worked on software that tightly couples the code to the db migrations and it has worked very well so far. The application just waits for all migrations to finish before trying to start the main execution path, so long-running schema changes are pretty painless.
Hey! Yeah, it's not that we didn't trust the pipeline, it's that adding fields with constraints and default values on a huge... HUGE db used by a system with heavy traffic would have been a mess. In the migration we don't usually care about optimizing for performance, we just wanna get the correct schema for the correct code. But if you have a situation like this where you should minimize locking operations as much as you can then I feel it's safer to run it manually.
The thing is that by trying to avoid downtime, we ended up causing even more downtime. But I will modify that bullet point to be more specific about our case.
The migration aspect (new columns with a default value) is something I keep having to remind my fellow developers is a bad idea.
There are ways around that, like creating a new table with the default column and backfilling the data from the renamed original table. It is also possible to first update the schema and then update the table in batches of, say, 100 to avoid locking the entire table, finishing up by setting the default value once all records have been updated. A third option is to just update each row as it is fetched from the database if the value is nil.
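Roughly, the batched backfill from that second option could look like this (invented names, PostgreSQL-flavoured syntax; run it after the nullable column has been added, and repeat until it updates zero rows):

```sql
-- Backfill the freshly added, still-nullable column in chunks of 100
-- so no single statement holds a lock on the whole table.
UPDATE orders
SET status = 'new'
WHERE id IN (
    SELECT id
    FROM orders
    WHERE status IS NULL
    LIMIT 100
);
```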
I am perplexed by how every company I work for seems to have this problem. I've resorted to every possible way of solving it. It also took down our entire stack once, at the worst possible time. We lost a lot of money that time and never forgot about default values again. I suspect you will have an eye on this in the future :)
One item I should add that I've seen overlooked on teams: the DB schema can progress without the code as long as the code handles each schema state. E.g. you check in code that adds a single column (in this case, if you're selecting * you may need to update a DTO or something), then deploy. That deployment does only a portion of the migration you ultimately want. So a large, disruptive DB change may take 5 or 10 deployments, whatever makes sense. This approach still captures every single state of the schema in source control. Similarly, this is how I recommend making destructive changes to the schema: push a deployment that stops relying on a column, then drop that column in a subsequent deployment. This way, WHEN a migration doesn't run or does something weird, you're less likely to explode. Anabella, in your case we're all speaking in generalities and it's easy to critique without knowing the details. It's possible that, given all the information, I wouldn't have a better solution. I liked your article and you have guts putting yourself out there.
Hopefully yours is not a culture of blame.
In order for this migration to work, you would have had to be extremely careful to get it exactly right. If it blows up, it's easy to say that you should have been more careful. And of course, you say that you'll be more careful next time.
That sounds good, but an environment in which everyone just has to be extremely careful or everything blows up is going to face catastrophic failures often. I've been there.
Developers make mistakes. That's reality. It doesn't have to break production systems if we understand it and work with it, not against it.
That's one reason why we write unit tests for our code. We catch little mistakes. If we didn't make small mistakes, then why write unit tests?
We also test the behavior of our applications, including QA tests. If we never introduced bugs, why test our applications?
Then we deploy our code. That can be a complex process, as you experienced. Do we test that deployment process? We test everything else, so why would we not test the part that has the greatest potential to crash everything?
That means having a staging environment which perfectly mirrors production, where we can test our deployments. Containerization also helps.
Was it your personal idea (or S's or M's) not to have a staging environment for such tests? I bet it wasn't. The risk of catastrophic failure was built into your deployment process. It was a matter of time.
"Let's all be more careful" is not an answer. I doubt that you or anyone else were careless. Humans simply aren't capable of mentally calculating every variable and visualizing how complex systems will behave. (We do okay sometimes, but relying on that is a horrible idea.) If we could do that then we wouldn't need computers. We could just do everything in our heads.
Testing everything, which includes having the environment for such tests, is the answer.
Wise words! Loved your comment, it's almost like a mini-article on its own 👏
We did introduce a lot of monitoring and testing after this happened, but we still don't have staging as part of the deployment process. We did have lower environments, but they were used in more of an "ad hoc" manner, meaning they were only used for testing when someone decided something needed to be tested there first... And this might have been the biggest problem.
A very nice article, Anabella.
Unexpected things happen, so one must try as much as possible to guard against them with reviews and tests, but unfortunately we can never guarantee bug-free code (or an unhackable system).
Well, next time you have a migration or a big change to do, I would suggest:
All the best, Anabella, and make sure to discuss with your team (dev, ops, qa, ...) what happened and how to prevent it in the future.
Thank you for your advice and your kind words :) It's nice to find honest, friendly feedback when talking about f*-ups.
I agree with your points and we did talk extensively about what happened and introduced a lot of testing and monitoring to prevent it in the future. Luckily neither the company nor the team are guilt-oriented and we just focused on learning and prevention <3
I wouldn't really consider this a screwup on your part. Sure, you probably didn't do everything perfectly, and the senior dev probably should have caught the potential problem. What went wrong, though, was a failure of the development process and the accumulation of technical debt. You should never have been able to introduce a problem on that scale to begin with. You might have still managed to introduce a bug, but it should have been minor at worst.
Ouch! I hope recovery was possible without excessive downtime...
In case people haven't met it already, the Refactoring Databases book by Pramod Sadalage has a number of strategies for this and other similar situations. The website has pretty much all you need to know, unless you like paper copies, and/or would like to support Pramod :)
databaserefactoring.com/
I personally like the use of migration scripts, tested & deployed with the code through the standard pipeline (perhaps as part of a management application). Decoupling when they are applied and leaving control of that to the appropriate ops team keeps the humans happy too.
Unfortunately the biggest "mistake" is architectural - your systems seem to require an outage to ensure code and migrations are in place before bringing the system back online. This may not be the fault of anyone - legacy systems are what they are and may never support in-place updates.
Ideally your migration code sits in the same repo as your code, so the same version will deploy the right scripts alongside the correct code. With that in place your team can start building automation around your deployments.
You really should have a complete dev environment with a DB schema matching production to test out your migration on before deploying to production. And you should have automated unit tests to smoke test the code after the migration completed; the select * is just one of many things that could have gone wrong.
THIS! This has never happened to me when the dev env matches prod. And if it does? There's a SQL statement/proc etc. to ensure the tables match (if that's your DB, of course).
This comment really struck a chord: you need to match your environments.
If you ask your senior, he/she will probably tell you about the pain that not using * implies. Add a new field to the table, and you have to hunt down every single query that might need it. Call a method of your object, only to discover that it depends (directly or indirectly) on fields you didn't retrieve. And it goes on and on... it's permanent maintenance hell, don't do it. The good solution for this case is to prepare the code to work with or without the new field; the good-enough solution for many cases (depending on how critical the affected requests are) is to be ready to trigger the deploy as soon as the migration completes, accepting that some errors will be thrown in the meantime.
Also take into consideration that using * may save you problems when you add a new field (at a huge cost, as explained), but then you'll find yourself renaming or removing fields... and in those cases, your field lists will not only not save you the pain of the deploy, they'll add extra pain to the code changes.
Love the article! I don't have much experience with database migrations, so I will have to keep this stuff in mind.
One thought I had, and this is really just me thinking out loud: with the ease of access to cloud database solutions, would simply creating a brand-new database from scratch be effective? Add the changes and migrate the data, then when you're ready all you have to do is change the database endpoint and shut down the old DB.
Thinking about this as I write, I don't think this would be viable with extremely large datasets, but there are managed solutions from AWS and others to help out with migrations, and some services don't charge for internal data exchanges (for instance, if both databases are on the same server).
Sounds good on paper, but just that.
Creating a new database and moving all the data can take hours, days, even weeks. Also, if the database is write-heavy, the data has to be synchronized right before the switch. This creates even more complexity.
I've been dealing with this for the past 4 years, and sometimes even a new column with a default value can become a headache.
Not to mention that databases usually have more than just data: replication, subscriptions to data, CDC solutions, permissions and more. Stuff that's hard to automate or difficult to version.
You never go full *