Don't deploy on Friday afternoons!
This expression is taken as programmer wisdom but I hate it. I'm going to try and kill it, with words and experience.
The motivation behind it is sound. I don't want to spend my Friday nights debugging a production problem either.
To me the expression smacks of unprofessionalism. Software development as an industry has a poor reputation and phrases like this do not help.
If you went to an ER on a Friday afternoon and were turned away because the doctors don't trust their own tools and know-how, how would you feel?
If we want people to take our craft seriously we need to own it, and not give the impression that we don't understand the systems we are making well enough to make changes at the end of the week.
Why a lot of people don't want to deploy on Fridays
- You don't understand your system. You're making changes but you're not 100% sure it'll be OK. Ask yourself why this is. Why are you scared of changing your software?
- Poor monitoring. If your users are the first to tell you something is wrong, that feedback arrives during your time at the pub rather than when you deploy.
- Overly-complicated rollback/branching. How is your disaster recovery? How easy is it to push a fix once you solve a bug?
Continuous Delivery (CD)
I have worked on teams that have deployed new versions of various services in a distributed system multiple times at 4:30pm and not broken a sweat.
Why? Because deployment is fully automated and is a non-event. Here is the groundbreaking process (with a sketch of the build step just after it).
- Write some code
- `git commit -am "made the button pop"`
- `git pull -r && ./build && git push`
- Make a cup of tea
- It's now live.
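For the curious, a `./build` script in that flow doesn't need to be clever. Here is a minimal sketch, assuming a project where tests and packaging are wired up behind `make` targets (the target names are invented for illustration):

```bash
#!/usr/bin/env bash
# Hypothetical ./build script: run the same checks locally that the
# pipeline runs, so pushing is a non-event.
set -euo pipefail

make test       # assumed target: unit and integration tests
make lint       # assumed target: static analysis / formatting checks
make package    # assumed target: produce the deployable artefact
```

If those checks pass, `git push` hands over to the pipeline, which repeats them and takes the same artefact to live.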
Not so long ago it was considered normal for there to be releases every 6 months, or even just one at the end of a project.
The forward thinkers in that age saw problems with this:
- Poor feedback loops
- Stressed development teams
- Catastrophic failures.
So the industry as a whole worked on lots of tooling, techniques and best practices to allow us to release software far quicker.
It's now generally accepted that releasing often reduces risk, but teams still tend to settle on weekly or fortnightly releases, usually matching the cadence of their sprints.
What are the problems with weekly/fortnightly releases?
- The feedback loops are still not great. When you do a release there can be quite a lot of commits going live at once, and if something is wrong it can be challenging to figure out exactly what broke. Especially if you wrote it 2 weeks ago.
- Still overly reliant on manual processes. I have seen teams actually skip a release because a QA was on holiday. This is surely unacceptable in 2018. Manual testing does not scale into the future. People leave, things get forgotten, etc.
- Lets you fall into the trap of writing stories that are dependent on other stories being finished in a "sprint". When they aren't, things can get very complicated.
With CD we recognise that we can go further, deploying new software to live every time the build is green. This has some amazing benefits:
- Extremely fast feedback loops. No longer do you have to think about code you wrote 2 weeks ago when there is a problem in live.
- Forces best practices. In order to be able to deploy to live on green you need excellent monitoring and tests. These are all good things in their own right.
- Reduces stress. "Releasing" is no longer a thing. You can be confident in writing your software again!
- Vastly improves agility. Found a bug? Just fix it! This encourages a leaner way of working vs lots of upfront planning. There isn't even an option for a convoluted release process; you have to keep it simple.
- Forces you to work on stories that are actually releasable, not dependent on stories x, y and z. Forces the user-story best practices that everyone acknowledges but often ignores.
But what if you break things?
Often people say of CD:
> Yeah it's nice, but what if it breaks? We should have a QA check things over.
Here's the thing, no process in the world prevents bugs. You will ship broken code. What's really important is how quickly you can detect and recover from it. Hoping manual testing will catch everything is wishful thinking.
How to CD on a new project
It is much easier to do CD on a new project since you can start small and evolve.
Generally your work should be centered on delivering the most valuable user journeys first, so this is an excellent chance to practice ensuring those journeys work without any humans checking anything.
- Write an end-to-end test. These are expensive to write and run, and should be reserved only for your most important journeys
- Have monitoring with threshold alerts for when things go wrong
- Set up your pipeline so that when your code is pushed all the automated tests are run; if they pass, the code goes to production (a sketch of such a pipeline follows this list).
- Have some kind of blue/green release mechanism. Run your automated tests on the deployed release candidate and if they don't pass, don't ship it.
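As a hedged sketch of how those points might hang together, here is what a pipeline script could look like. The `deploy.sh`, `switch-traffic.sh` and smoke-test scripts, and the URLs, are assumptions about your setup rather than any particular CI tool's API:

```bash
#!/usr/bin/env bash
# Hypothetical pipeline: test, deploy to the idle slot, smoke-test it,
# then switch traffic. Any failure stops the release.
set -euo pipefail

make test                                    # unit and integration tests
make e2e                                     # the few end-to-end journey tests

./deploy.sh --slot green                     # assumed script: deploy the candidate to the idle slot
./scripts/smoke-test.sh https://green.internal.example.com   # test the candidate before it gets traffic

./switch-traffic.sh --to green               # assumed script: point the load balancer at green
./scripts/smoke-test.sh https://www.example.com              # confirm live is healthy after the switch
```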
For each subsequent story ask yourself
- How will we know this is working? (monitoring; a minimal alert-check sketch follows this list)
- What tests do I need to have enough confidence this will work without any humans checking? Not every story needs a full end-to-end test on a huge distributed system but obviously you'll need some tests.
- Is this story as small as it can be? If your user stories are massive they are more likely to go wrong. If the story takes a week then that's back to slow feedback loops.
- If you can't answer these questions then you need to rethink the story. Notice these are all just basic agile principles for user stories. Continuous delivery forces you to adhere to the principles that so often get ignored.
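On the monitoring question, even a tiny scheduled check like the one below is a start; the metrics endpoint, the 1% threshold and the paging script are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical threshold alert: shout if the checkout error rate exceeds 1%.
set -euo pipefail

# Assumed endpoint returning a plain number, e.g. "0.4"
ERROR_RATE=$(curl -fsS https://metrics.internal.example.com/checkout/error-rate)

if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
  ./scripts/page-on-call.sh "Checkout error rate is ${ERROR_RATE}%"   # assumed hook into your alerting tool
fi
```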
How to CD on an existing project
Peel away at the manual process
- You may have some kind of "run book" that is used when shipping the software. See what you could do to automate it (a sketch follows this list).
- Find out which manual processes are happening. Ask why they are needed and what could be done to automate them.
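A gentle way to start is to turn the run book into a script one step at a time, so the wiki page shrinks as the script grows. A minimal sketch, with invented step names and helper scripts:

```bash
#!/usr/bin/env bash
# Hypothetical run book turned into a script; each function replaces a
# step that used to be copy-pasted from a wiki page.
set -euo pipefail

VERSION="$1"

notify_team()             { ./scripts/post-to-chat.sh "Deploying ${VERSION}"; }
run_database_migrations() { ./scripts/migrate.sh; }
deploy_new_version()      { ./deploy.sh --version "${VERSION}"; }
run_smoke_tests()         { ./scripts/smoke-test.sh https://www.example.com; }

notify_team
run_database_migrations
deploy_new_version
run_smoke_tests
```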
CD up to staging.
Some companies have many environments in their delivery pipeline. A good first step is to automatically ship all the way up to the environment before live. A better step is to remove as many of them as you can. It's OK to have some kind of "dev" environment to experiment in, but ask yourself why you can't just test these things locally in the first place.
Identify a sub-system you could work with as a starting point
If you're working with a distributed system you might be able to identify a system which is easier to CD than the rest. Start with that because it'll give your team some insights into the new way of working and can help you begin to break the cultural barriers.
CD is a cultural issue as much as a technical one
Roles and responsibility
Often a product owner or project manager wants to be the one who is in charge of releasing.
There are circumstances where exposing features to users should be controlled by a non-technical member of your team, but this can be managed with feature toggles.
But the copying of code from one computer to another is the responsibility of the developers on the team. After all we are the ones who are responsible for making sure the system works. It is a technical concern, not a business one.
What do QAs do now?
CD is actually liberating for QAs
- Rather than spending their time manually testing poorly tested systems, they can now focus on a more holistic view of the system, trying to facilitate an environment for CD so that the whole team can be confident things are working
- QAs spend more effort helping developers define what needs to be tested and monitored for a story to be written.
- More time for exploratory testing
Re-evaluate your tolerance for defects
Lots of companies think they cannot have any defects and will spend a lot of time and effort on complicated, time-consuming (and therefore expensive) processes to try to stop them.
But think about the cost of all this. If you push a change to production that isn't covered by tests, perhaps a CSS change, is it really catastrophic if there's a small visual fault in some browsers?
Maybe it is, in which case there are techniques to test specifically for this too.
Recovery
Each release you do with CD will have the following qualities
- Plenty of tests
- Good monitoring
- Small scope
- Still "fresh" in the developer's mind
So in my experience fixing anything that falls through the cracks is easy. It's much less complicated than trying to look through 2 weeks' worth of git history.
I would recommend in most cases not rolling back (unless it's really bad), but just fixing the issue and releasing it. Rollback is sometimes not an option anyway (e.g. database migrations), so the fact that your system is geared to releasing quickly is actually a real strength of CD.
Other quick tips
- Fast feedback loops are key. Measure your `git push` to live time and keep it low (a sketch of one way to measure it follows this list).
- If things are getting slow, re-evaluate your end-to-end tests. If you removed a test or refactored it into a unit test, would you be any less confident? If not, then refactor away.
- You may need to invest some time in making your components more testable to avoid writing lots of slow end-to-end tests. This is a good thing.
- Feature toggles are a useful tool but can become messy, keep an eye on the complexity
- Celebrate it. Moving from one release every 2 weeks to 50 a day feels great.
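On measuring push-to-live time (the first tip above), one rough approach is to compare the deployed commit's timestamp with the moment the deploy finishes. A sketch of a post-deploy hook, using commit time as a proxy for push time; where the result gets recorded is an assumption:

```bash
#!/usr/bin/env bash
# Hypothetical post-deploy hook: record how long this commit took to reach live.
set -euo pipefail

COMMIT_TIME=$(git log -1 --format=%ct)        # unix timestamp of the deployed commit
NOW=$(date +%s)
LEAD_TIME_MINUTES=$(( (NOW - COMMIT_TIME) / 60 ))

echo "$(git rev-parse --short HEAD),${LEAD_TIME_MINUTES}" >> deploy-lead-times.csv
echo "Commit reached live in ${LEAD_TIME_MINUTES} minutes"
```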
Wrapping up
This has been a small introduction to CD; it's a huge topic with plenty of resources to investigate.
Continuous delivery is a big effort both technically and culturally but it pays off massively.
I have worked on distributed systems with over 30 separately deployable components with CD and it has been less stressful than another project I've worked on that had just a handful of systems but a ton of process and ceremony.
Being able to release software when it's written puts a higher emphasis on quality and reduces risk for the team. It also forces you into using agile best practices like testable, small, independently releasable user stories.
Maybe most importantly, it demands that you understand your system and that you're not deploying to production and crossing your fingers. That feels more professional to me.
Top comments (39)
A great article, thanks. I'm working on a project where agile has become "we do one release every two weeks" which always gets delayed because everything gets tested the day of the release or the day before. It's actually painful and not agile at all. I often don't even remember what has been waiting in the staging pipeline for 2 weeks 🤔 I feel like we can do better but the biggest hurdle is that agile has been translated into 2-week sprints and nothing else. It's funny that we became rigid with a methodology that's supposed to be flexible and adaptable.
At least we have continuous integration...
There have been lots of anti-agile posts lately. One good bit of advice is to simply decouple project processes and technical processes as much as you can get away with. Some organizations just are going to reinvent waterfall no matter what label you slap on the process.
Just because you are doing scrum to plan what you do doesn't mean you can't do CD. If you want to stop time once every two weeks and celebrate all the stuff that went live over the past two weeks, fine. But don't block deployments on some poor overworked PM staring at an issue tracker. That's the opposite of agile. Instead, it ships when it is better than what you had; if it isn't don't merge it. If what you merged is not good enough or can be improved, change it some more. That's called iterating; it needs to happen more often than once every two weeks to get good results.
Can't upvote this enough. I tried to argue that a lot of what CD advocates is what agile is meant to be. You know, being able to respond to business needs as simply and easily as possible!
Either way, decoupling technical and business processes is a good first step. Does your PO/Scrum master/whatever need to know when you copy code from one computer to another?
@jillesvangurp I have nothing against agile :-D just that in this particular case of this particular project it has just become another buzz word that didn't change the status quo much.
I agree, but in this case it's the PM that's also doing the tests that's blocking the release cycle, not the devs :-D It's a combined problem of lack of resources (no QA/test team) and not applying agile at all. Anyhow we can still try to do CD, it's a cultural problem as Chris said.
There's your problem: you have a human in the loop. You can't do CD with humans in the loop. The whole point of CD is to rely 100% on your automated tests. Your PM is part of the problem, not the solution. If he insists on signing off manually on everything before it ships that means there's a lack of trust for people to do the right things. That's bad; call it out.
When it comes to acceptance testing, your PM should be doing that in production and keep on insisting on incremental improvements until the feature can be signed off on. Likewise, the pressure should be on developers to not merge things that aren't production ready and to keep production functioning correctly. If unacceptable/poorly tested changes are landing in production regularly, that means your PM messed up planning the feature or the team messed up building the feature. Either way, that needs to be fixed if that happens regularly but not by crippling CD. If that happens, you have a competence problem, not a process problem.
The longer PMs drag their heels, the more expensive deployments get. The cost of a deployment grows exponentially with the size of the delta. The price of shipping a 1 line change: almost zero. The price of shipping a delta that touches double digit percentages of your code base: weeks of testing; integration headaches; haggling; overtime, etc. It gets ugly very quickly. Add to that the cost of not having features that were ready to go weeks ago adding to your revenue. The economics here are brutal. Not doing CD is very expensive. You end up wasting lots of time and money.
The opposite is also true: the cost of fixing issues resulting from small deltas is very low. So, shipping lots of small things ASAP lowers costs by decreasing testing effort, improving time to market, and decreasing the cost of defect fixing.
Thank you!
Thanks for the kind words.
It can be tough that's why I felt it was important to mention it's a cultural challenge just as much as a technical one. You have to try and convince the right people that releasing more often with less manual process is less risky and scales better.
Good luck!
I agree, the hard part is to find a better release schedule
thanks
Testing the day before release assumes there are no issues to fix.
Our team got around it by adding +50% time to the sprint for fixes or adjustments. If there are no issues to fix, we have this already-booked time to deal with tech debt or fix a bunch of smaller issues that never get the highest priority on their own.
In our case what happens is that the "release day" is not a fixed day because there was too much to test and this is a problem that speaks for itself :D
I'll work on changing the approach!
We struggled too, as our tester was part of another department, hardly ever available. Once we managed to explain to management that it was a bottleneck, and it didn't happen overnight, eventually we got a dedicated tester for our team. Think it was around two years ago. It has made a massive difference, we get our feedback very quickly, no slipping deadlines, everyone's happy :)
I know several experienced people across different companies that have gotten rid of their staging server and now do CD straight to production. I'm considering the same as it mostly just creates deploy bureaucracy.
Nice comments
I would argue that you should be able to test this locally. When we run locally it runs every schema change script on a local containerised DB. When you make a change just TDD the change like normal.
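A minimal sketch of what that local setup might look like (the image, port and migration layout are assumptions):

```bash
#!/usr/bin/env bash
# Hypothetical local check: run every schema change against a throwaway
# containerised Postgres before it goes anywhere near the pipeline.
set -euo pipefail

docker run -d --name schema-check -e POSTGRES_PASSWORD=local -p 5433:5432 postgres:11
trap 'docker rm -f schema-check' EXIT

sleep 5   # crude wait for the database to start accepting connections

for script in migrations/*.sql; do          # assumed layout for schema change scripts
  docker exec -i schema-check psql -U postgres < "$script"
done
```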
Sure works for toy databases. But that alter table that works fine locally might completely cripple your production DB. Once it does, all the fixes are ugly. Some things just resist closed world assumptions.
+1 to this and not just the obvious alter tables.
EXACTLY
That's why you don't ship broken code on a Friday afternoon. Because people are less available and you want 0 chances of getting a problem while an important stakeholder is already in a plane for the week-end.
The point is if I ship something at 4:30 on a Friday it will be a small change backed by excellent tests and monitoring. If something goes wrong it will be trivial to fix.
If it's something I wouldn't spot easily, I'm just as likely to ship something on Thursday that's not spotted until the weekend.
Bunching up changes (and forgetting about them) for Monday is more likely to cause problems.
Not sure that holds to my experience. We deploy dozens of services multiple times a day; many by CD. We also deploy on Fridays but try to avoid afternoon deploys.
There are fewer eyes on things at the weekends so subtle problems get missed, and you sometimes get rushed work as folks try to squeeze that last thing out to production before the weekend.
We have excellent monitoring and logging; we aren't afraid of breaking things but the human cost of having to debug something at beer o'clock on a Friday is something we don't advocate.
I work at a company that has a CI/CD pipeline, we routinely deploy 10-20 times a day, including on Fridays...but we still say this occasionally. It's a matter of context. There are lots of deploys that we don't do on Fridays precisely because if it's that easy to deploy, what are the odds that you've got a critical yet risky change that a) didn't make it through your pipeline and b) must go out now, at Friday at 4pm?
I don't think this is the right analogy. How about:
If you went to an ER on a Friday afternoon, would you want to be seen by the surgeon who's finishing her shift or the one who's starting hers? Also, how would you feel about seeing the one who's finishing first, and having her hand you off to the one who's starting?
Before you answer, here are some scary numbers: statnews.com/2016/02/01/communicat...
It's always a matter of context :)
I'd say that if the context of your deployment makes it risky, then it's a bit of a smell. I'd start trying to work out why it's risky to deploy this software, and how I can reduce that risk, rather than avoiding it.
Clearly changing a typo in a greeting message, or adding a single line of CSS to adjust a margin on something, is always going to be less risky than a functional update that adds a new feature to a product/service. With all the best efforts to reduce the risk, there is always going to be a difference in risk level for any single piece of work, just like there are differences in the effort of work for any given user story. Taking this into account, deployments are always contextual, and like smells (code or otherwise), that's not necessarily a bad thing. It's just a single measure of risk, and something that shouldn't be taken in isolation.
While I don't disagree that CD is good, that isn't a good argument to purposefully release software with decreased availability. I understand that any type of issue should be quick to diagnose... but why? It can wait.
You said it best:
From start to close, the whole article seems like it is looking for validation from some onlooker to say "wow you really ARE professional, aren't you!". Who cares? Someone judging you based on your professionalism won't have any concept of what CD is in the first place.
That is really not the impression I'm trying to give. What I've failed to explain is CD results in more reliable software as it stops you taking shortcuts. It forces you to address shortcomings in your testing/monitoring/etc if you're going to be able to release when code is checked into master.
Again, miscommunication on my part I suppose. I would prefer to raise the game of teams I work on by emphasising that releasing and crossing our fingers, so much that we don't want to release on Fridays, is not good enough.
IMHO it's also down to how your user base prefers to get updates. Sure, no user wants to use buggy code, but bundling and explaining changes at regular intervals could also be a better use of their attention span.
There's a separation though between releasing and delivery.
In an ideal world, in terms of simplicity, you deliver features as they're ready (continuous delivery).
However you can still ship your code and hide new features with toggles. This means your code is always being deployed, so you get all the benefits of feedback loops but you control what the user sees. The downside is feature toggles have a cost in terms of maintainability and effort.
Yeah, I think that after a threshold of complexity and user-base size basically every new feature will have to be behind an A/B test or feature toggles.
Oh yes, someone finally said that! I've been deploying on Friday afternoons for more than two years, but every time there is someone (from a different company or even a different team) who says it's unprofessional. Same as you, I think it's the other way around: it's unprofessional to not be able to deliver when it's needed/ready. Improve your process instead of trying to find excuses. Or at least stop pretending you're doing agile.
Thank you for sharing your insights and experience. I think you can't stress enough the importance of good automation and solid tests for your codebase to be able to release confidently.
We do not deploy on Fridays though for two reasons: the first reason is that we have a desktop software which needs to be installed by the users. So if we push something and find a bug afterwards, we can't just update the css and after a page refresh everything works. The users will have to install the software again, which will annoy them if it happens too often.
The second reason is for our backends. We have tests and automation in place, but if something should go wrong, it can get pretty expensive pretty soon. So if you say that people should be professional like doctors or lawyers, this would mean that they should also be held responsible alike. So if a lawyer makes a mistake that costs his client €10k, he had better have insurance to cover the costs. What happens when we as developers make such a mistake? Should we be responsible and compensate the loss? This would be a strong motivation to get it right with every deployment, but it would probably be unrealistic to happen. The minimum would be to go and fix stuff even in the afternoon or on the weekend if it breaks. Good monitoring does not help if it goes to your work inbox and will not be noticed till Monday.
Please mind that I don't want to oppose your arguments for confident releases with a good CD pipeline, but I think there are still good reasons to be careful with the deployment time. I'd rather wait till Monday to deploy, so I can keep an eye on it, than push it live and leave for the weekend, even if it does work most of the time.
You make some fair points; I tried to communicate that it requires you to evaluate your risk tolerance. I would contend most companies don't need to spend so much time/effort on weekly (or whatever) releases in the name of not shipping bugs; but there are of course exceptions. Certainly desktop software has very different shipping requirements!
I don't get it. After the release, if there's a bug, you need to release next version anyway and people would need to re-install the software again. So what's the difference if you release on Friday, Monday 6 AM or Christmas? I always thought it is easier with regard of deploy time for "boxed" products rather than continuously delivered services.
The point was going in the direction of "there will be bugs anyway, so you don't need a QA before the release" (which was not meant this way, I know!). Therefore, I was bringing up a different aspect for desktop software. Of course, if there is a bug, it needs to be fixed, but it can also make sense to wait and collect a few bugs and fix them a few days after the release instead of doing a minor release per bug that was found (depending on the severity).
The main difference about the timing is: If you publish software with a critical bug on Christmas eve, you might not spend Christmas with your family because you are hotfixing. If you do this two days before Christmas or just after New Year, you might spend one or even three days fixing that bug as well, but the bug has your full attention during your regular working hours.
Of course, if a developer does not care, this developer would also not spend Christmas fixing a bug, but I assumed that people reading here are more of the kind that take developing to heart ;-)
Yeah right. Everything is cool with more testing and a better CI/CD life cycle, but I don't like to spend my weekends debugging some random thing related to an important new feature that can't be turned off because of business reasons. I would deploy a PoC or a friends-and-family release on a Friday afternoon though. It is not laziness. It is the pressure of knowing that the business is losing millions of dollars every hour and it is your responsibility to fix it, and you've had a couple of drinks because it is Saturday night, and the other people who were responsible for the on-call can't be reached. It's prioritizing my me-time. On my weekends I code other stuff and I'd rather do that than the things I do 8 hours every day. On my weekends I disconnect from the office and engage with my family.
You start off by comparing apples and oranges. A standard code release into production is more like a scheduled operation, walking into the ER is like applying an emergency hotfix. It's also fairly typical for hospitals to not have operations over weekends and last thing on Fridays for the same reason most developers don't release code on those times: there's not enough staff on hand to support those things.
I think it's unreasonable to expect any single dev to understand 100% of a medium to large system. Bear in mind, these things are built by teams, and one product or service might actually have hundreds of microservices behind supporting it. Any single one of those could go wrong in a myriad of ways. Even Rain Man would have trouble keeping up with that.
Well, that's just patently wrong. A good suite of automated tests supported with good manual QA will prevent bugs. It won't stop everything, but it's far better than not testing.
Some things just can't be easily tested locally, and some things shouldn't be tested by the developer who wrote them. Having testing environments that very closely match production is vital for proper continuous delivery.
While you still have some form of argument for deploying late in the day and just before the weekend, this only works for
Anything else means that you care enough about the project that you don't want to release bugs that could have been caught with a few hours more testing, or that you really aren't 100% sure about the scope of your changes. That latter is ok; as a developer you should be confident that you're not causing side-effects with your changes, but sometimes there can be no guarantee.
My point is that for most teams we don't need this overhead; in fact the additional overhead of process and delayed feedback loops makes it even more likely that your system will be problematic.
I agree, but for a given story a developer should have 100%(ish) knowledge of the surface area of what they're changing. Remember agile advocates for small stories, so I don't think we're asking too much here. In addition, everything else should be tested so there is very low risk in that respect.
I think you may have misinterpreted. What I am trying to say is no matter what you do, you will still probably encounter bugs; even with automated tests, manual checking etc.
Re your other points, this approach is successfully used on large systems. Go to any devops-themed conference and they emphasise that this, along with the best practices I described (monitoring, observability, testing, small releases, etc.), is the way forward. Re "production-like staging environments", I recently went to a conference where the talk had a lot about how the industry is kidding itself if it thinks it can be "production-like" in non-prod, particularly with distributed systems and the unpredictability of user behaviour. Our time is much better served optimising for MTTR (mean time to recovery). enterprisedevops.org/article/devop...