One of the hardest problems I’ve ever dealt with is getting ECS deployments to work, and work nicely, with CloudFormation, which is hard for reasons that only AWS knows.
TLDR
If your CloudFormation stack update containing an ECS service doesn’t finish in a reasonable amount of time (roughly 10-15 minutes, depending on previous deployment times and container image size), you have a problem with your service.
Why does it stay stuck?
Idk, it just does. This is one of those things that CloudFormation (like any deployment tool) really sucks at: CloudFormation waits for several hours before rolling back to the previous configuration. If the issue that's causing the stack failure continues during the rollback, the stack gets stuck in UPDATE_ROLLBACK_IN_PROGRESS status. Finally, the stack changes to UPDATE_ROLLBACK_FAILED status, and that is DevOps hell.
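To make the status progression above concrete, here is a minimal Python sketch that turns a stack status and a start time into a rough verdict. The status strings are real CloudFormation statuses; the function name, the grouping, and the 15-minute budget (borrowed from the TLDR) are my own assumptions, so tune them to your usual deployment times.

```python
from datetime import datetime, timedelta, timezone

# Assumption: anything past this budget is treated as "probably stuck".
STUCK_BUDGET = timedelta(minutes=15)


def describe_stack_state(status, update_started_at, now=None):
    """Return a rough verdict for a CloudFormation stack status.

    update_started_at must be a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    elapsed = now - update_started_at
    if status == "UPDATE_ROLLBACK_FAILED":
        return "rollback failed: manual intervention needed"
    if status in ("UPDATE_IN_PROGRESS", "UPDATE_ROLLBACK_IN_PROGRESS"):
        if elapsed > STUCK_BUDGET:
            return "probably stuck: check the ECS service deployments"
        return "still within the normal deployment window"
    if status in ("UPDATE_COMPLETE", "UPDATE_ROLLBACK_COMPLETE"):
        return "finished"
    return "unrelated status: " + status
```

In practice the status would come from something like the `StackStatus` field of a `describe_stacks` call, but the helper itself only needs the string and a timestamp.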
What’s really happening though?
ECS really likes to be stable. If it can’t be stable, it won’t finish updating (it won’t tell CloudFormation that the update finished), and it won’t roll back either. So the main goal is to get the service to reach stability. There can be many reasons for it not reaching stability, and covering all of them is beyond the scope of this article.
So what could’ve gone wrong?
Here are some common reasons why an Amazon ECS service can fail to launch new tasks:
- Container image issues
- A lack of necessary resources for launching tasks
- A health check failure on a load balancer
- Instance configuration or Amazon ECS container agent issues
An Amazon ECS service that fails to launch tasks causes AWS CloudFormation to get stuck in UPDATE_IN_PROGRESS status. You can quickly check this by going into the service, selecting Deployments, and checking the status of the latest deployment.
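If you'd rather check this from a script than from the console, the sketch below summarizes the latest (PRIMARY) deployment from an ECS `describe_services` response. The field names follow the shape of that response as I understand it; the function names and the choice to show three events are assumptions for illustration.

```python
def summarize_primary_deployment(response):
    """Summarize the PRIMARY deployment of the first service in a
    describe_services-style response dict."""
    service = response["services"][0]
    primary = next(d for d in service["deployments"] if d["status"] == "PRIMARY")
    return {
        "rollout_state": primary.get("rolloutState", "UNKNOWN"),
        "desired": primary["desiredCount"],
        "running": primary["runningCount"],
        "failed_tasks": primary.get("failedTasks", 0),
        # Service events are usually where the real failure reason shows up.
        "recent_events": [e["message"] for e in service.get("events", [])[:3]],
    }


def fetch_and_summarize(cluster, service):
    """Call ECS and summarize; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the parsing helper works without boto3

    ecs = boto3.client("ecs")
    response = ecs.describe_services(cluster=cluster, services=[service])
    return summarize_primary_deployment(response)
```

A deployment whose running count stays below its desired count while failed tasks keep climbing is a strong hint that the service will never stabilize on its own.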
So what do I even do now?
Delete the stack and recreate!
The main goal is to get your CloudFormation stack to say “Update Complete” (UPDATE_COMPLETE). Left alone, it can take the AWS CloudFormation stack several hours to stabilize. To stabilize your stack more quickly, just lie to it.
Disclaimer
The following resolution is intended to help you stabilize an AWS CloudFormation stack quickly without waiting for the stack to time out. The resolution isn't intended for production environments, as the Amazon ECS service is out of sync with the known state of AWS CloudFormation. To sync resources between your Amazon ECS service and the AWS CloudFormation stack, you must perform an error-free update on the stack.
Resolution
Change the desired task count of the Amazon ECS service to 0
- Open the Amazon ECS console.
- Choose your cluster.
- Select the service, and then choose Update.
- Set Number of tasks to 0, and then save the configuration.
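The console steps above can also be sketched with boto3. This is a minimal example, assuming you pass in an ECS client yourself; the function name and the cluster and service names in the usage comment are hypothetical. The same disclaimer applies: this puts the service out of sync with what CloudFormation thinks it deployed.

```python
def scale_service_to_zero(ecs_client, cluster, service):
    """Set the desired task count of an ECS service to 0.

    ecs_client is expected to behave like boto3.client("ecs").
    Returns the desired count reported back by ECS.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=0,
    )
    return response["service"]["desiredCount"]


# Usage (hypothetical names, needs boto3 and AWS credentials):
#   import boto3
#   scale_service_to_zero(boto3.client("ecs"), "my-cluster", "my-service")
```

With zero desired tasks the service has nothing left to start, so it can report itself as stable and CloudFormation can finish the update.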
So what do I do then?
Get your service to run and stabilize, and make sure the process can actually start.
Top comments (1)
To me this is a big issue with CloudFormation. For instance, the Terraform resource aws_ecs_service has an argument wait_for_steady_state that does the trick.
For CloudFormation, I have previously ended up implementing this myself: instead of using the ECS construct, I used a custom resource that simply wraps the ECS CLI. It does the ECS deploy and relies on the circuit breaker to roll back, instead of a CloudFormation rollback, if health checks fail.
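As a rough illustration of that idea (not the commenter's actual implementation, which wraps the ECS CLI), here is a Python sketch using boto3-style calls: update the service with the deployment circuit breaker enabled, then wait for the service to become stable. The function name is hypothetical, and the custom resource plumbing around it is left out.

```python
def deploy_with_circuit_breaker(ecs_client, cluster, service, task_definition):
    """Deploy a task definition and let the ECS circuit breaker handle rollback.

    ecs_client is expected to behave like boto3.client("ecs").
    Returns the task definition the service reports after it is stable.
    """
    ecs_client.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=task_definition,
        deploymentConfiguration={
            "deploymentCircuitBreaker": {"enable": True, "rollback": True}
        },
    )
    # Block until ECS reports the service as stable (or the waiter gives up).
    ecs_client.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    described = ecs_client.describe_services(cluster=cluster, services=[service])
    return described["services"][0]["taskDefinition"]
```

A stable service does not necessarily mean the new version is running: if the circuit breaker rolled back, I would expect the returned task definition to be the previous one, so the caller should compare it with what was requested.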