I'm Sergii Sergieiev, an Engineering Manager with over 20 years of experience in software development. I've spent most of my career in big tech, and for the past four years I've been working at Manychat.
In this article, we'll explore why CI/CD observability is so crucial for the development process and look at tools that can help improve it.
What is CI/CD
First, let's cover the basics. CI/CD stands for Continuous Integration and Continuous Deployment (the "CD" is also read as Continuous Delivery). These practices are designed to improve your development process by making the integration and deployment of new features fast and routine.
When I first joined the company, deploying was a one-step process that took about 10 minutes. Curious about whether this was too long, I asked my teammates. One casually replied, "It's just long enough to grab a coffee."
Now, think about it: is 10 minutes too long for your build? What about 5 minutes - does that feel more reasonable?
At first glance, these times might seem fine. But the real question isn't just about the minutes - it's about what they represent for efficiency and performance.
…when you can measure what you are speaking about, and express it in numbers, you know something about it.
William Thomson, 1st Baron Kelvin
Observability allows us to monitor and measure various aspects of our CI/CD pipeline, helping us identify bottlenecks, optimize processes, and ultimately deliver software more quickly and reliably.
Key steps of our CI/CD
- Preparation: npm, environment, etc.
- Build phases
- Testing: Unit, UI, and End-to-End (E2E) tests
- Deployment
We should measure how much time and how many resources each of these steps costs us. Knowing how long a deployment takes, and where that time goes, is what lets us identify areas for improvement.
DORA Metrics
The DevOps Research and Assessment (DORA) team focuses on four meaningful metrics in the CI/CD process:
- Deployment Frequency (DF) measures how often a team successfully releases to production. High-performing teams typically deploy more frequently, ranging from multiple deployments per day to a few per week, indicating agility and responsiveness to the market and customer needs.
- Lead Time for Changes (LT) measures the time it takes for a commit to be deployed into production. Shorter lead times indicate faster delivery of value to customers.
- Time to Restore Service (TRS) measures the time it takes for the team to recover from a failure in the production environment.
- Change Failure Rate (CFR) measures the percentage of deployments causing a failure in production that requires immediate remedy (such as a rollback or a hotfix). A lower change failure rate indicates better quality control and higher reliability of the deployment process.
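The four metrics above can be computed from a plain list of deployment records. The following is a minimal sketch, not the DORA team's reference method: the `Deployment` record shape and the `dora_metrics` function are hypothetical names introduced for illustration, and time to restore is approximated as the gap between a failed deployment and the moment service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # needed a rollback or hotfix?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute DF, LT, TRS and CFR over a reporting period of `period_days`."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: restore time is measured from the failed deployment itself.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used rather than means so that a single unusually slow change does not dominate the lead-time figure; that is a design choice of this sketch, and a team may prefer percentiles instead.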
A look back: the Manychat build process
How we started: long run, > 10 min
Four years ago, the Manychat build process had a single large step taking over 10 minutes with unpredictable timing. If something went wrong, we couldn't identify exactly what caused the problem.
We had to dig through the logs and spend significant time investigating each issue. Because the step was not broken into smaller parts, a fix could take multiple attempts, each consuming about 12 minutes. In the worst case, it might take an hour just to fix the CI/CD pipeline.
Additionally, we couldn't run jobs in parallel because the tooling we used was based on an outdated version that didn't support it.
Although it wasn't my main focus, I began investigating tools that could fix this problem and help us gain better control of our process.
Implementing GitHub Actions
We use GitHub for development, and the first tool we considered was GitHub Actions. Upon implementing GitHub Actions, we immediately saw our first results: the ability to run parallel jobs.
Perks we've got
- Improved speed: We can run multiple jobs in parallel, eliminating the need to wait for tasks like type checking to finish before starting the build process.
- Visual flow: Allows us to identify issues and pinpoint time-consuming tasks more effectively.
- Custom runners: GitHub Actions supports self-hosted runners, so we could run jobs on our own machines with substantial resources.
Later, GitHub introduced hosted runners that can be easily configured for the resources a job consumes, which let us stop running our own runners on separate infrastructure.
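To see why parallel jobs matter, it helps to compare the sum of all job durations with the longest chain of dependent jobs. The sketch below models a pipeline as jobs with a `needs` list, mirroring the `needs` key in GitHub Actions workflows. The job names and durations are illustrative values, not Manychat's real figures, and the model assumes there is always a free runner for every job that is ready to start.

```python
from functools import lru_cache

# Hypothetical pipeline: durations in seconds, `needs` lists prerequisite jobs.
JOBS = {
    "prepare":    {"seconds": 60,  "needs": []},
    "type_check": {"seconds": 90,  "needs": ["prepare"]},
    "build":      {"seconds": 60,  "needs": ["prepare"]},
    "unit_tests": {"seconds": 180, "needs": ["prepare"]},
    "deploy":     {"seconds": 30,  "needs": ["type_check", "build", "unit_tests"]},
}


def sequential_time(jobs: dict) -> int:
    """Wall-clock time when every job waits for the previous one."""
    return sum(job["seconds"] for job in jobs.values())


def parallel_time(jobs: dict) -> int:
    """Wall-clock time when independent jobs run at once (critical path).

    Assumes the dependency graph has no cycles and runners are unlimited.
    """
    @lru_cache(maxsize=None)
    def finish(name: str) -> int:
        job = jobs[name]
        # A job starts once the slowest of its prerequisites has finished.
        start = max((finish(dep) for dep in job["needs"]), default=0)
        return start + job["seconds"]

    return max(finish(name) for name in jobs)


print(sequential_time(JOBS), parallel_time(JOBS))  # prints "420 270"
```

In this toy pipeline the critical path runs through `unit_tests`, so speeding up `build` or `type_check` would not shorten the total at all. That is the kind of insight per-job timing is meant to provide.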
Before vs. after
After setting up this configuration, we gained several key benefits. Firstly, we now have clear timings for each CI/CD step, allowing us to see exactly how much time is spent on each meaningful stage. This step separation enables us to segment the CI/CD process according to our needs.
With this detailed insight, we can focus on the most time-consuming steps and resolve issues more effectively. As a result, our build times have decreased significantly. The main steps now take roughly:
- Build: ≈ 1 min
- Unit tests: ≈ 3 min
- UI tests: ≈ 1 min
- Deploy: ≈ 30 sec
We set up monitoring for these steps so we could observe them dynamically.
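One simple way to collect per-step timings like these is to wrap each step in a timer. The following is a minimal sketch of the idea, not the monitoring we actually run: `StepTimer` is a hypothetical helper, and the injectable `clock` parameter exists only so the behaviour can be checked without waiting in real time.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Records how long each named pipeline step takes (illustrative helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations: dict[str, float] = {}

    @contextmanager
    def step(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.durations[name] = self._clock() - start

    def slowest(self) -> str:
        """Name of the step that took the longest."""
        return max(self.durations, key=self.durations.get)

    def report(self) -> str:
        """One line per step, slowest first."""
        ordered = sorted(self.durations.items(), key=lambda kv: kv[1], reverse=True)
        return "\n".join(f"{name}: {seconds:.1f}s" for name, seconds in ordered)
```

Usage is a `with timer.step("build"):` block around each stage, followed by `timer.report()` at the end of the run. Recording in `finally` means a failing step still shows up in the report, which is often the step you most want to see.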
Tools
These are the tools that helped us improve the process.
Esbuild
We have transitioned to Esbuild for our project's build process and have seen a substantial increase in speed compared to our previous setup with Webpack. The primary bottleneck we faced with Webpack was its long build times. Now, with Esbuild integrated into our development environment, our build times have dropped to about five seconds, which makes our workflow far more efficient.
Additionally, Esbuild offers an intuitive plugin system that is particularly user-friendly for frontend developers. This system allows creation of custom plugins, which streamline various processes within Esbuild, further increasing its versatility and usefulness in our development process.
Playwright
Playwright, a UI testing framework developed by Microsoft, offers an exceptional developer experience. It streamlines the management and debugging of UI test errors, making it easier to identify and resolve issues. Playwright also runs tests quickly, which shortens test execution and improves overall efficiency.
CloudWatch (AWS)
We have also integrated CloudWatch into our infrastructure to monitor metrics, which has made our monitoring simpler and more effective.
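As an illustration of how step timings could reach CloudWatch, the sketch below builds a payload for the `put_metric_data` call of the boto3 CloudWatch client. It is an assumption about one possible setup, not a description of our integration: the namespace `CI/Pipeline`, the metric name `StepDuration`, the dimension names and both function names are hypothetical. The client is passed in so the code can be exercised without AWS credentials.

```python
def build_metric_data(pipeline: str, durations: dict[str, float]) -> list[dict]:
    """Turn {step: seconds} into CloudWatch MetricData entries."""
    return [
        {
            "MetricName": "StepDuration",  # hypothetical metric name
            "Dimensions": [
                {"Name": "Pipeline", "Value": pipeline},
                {"Name": "Step", "Value": step},
            ],
            "Value": seconds,
            "Unit": "Seconds",
        }
        for step, seconds in durations.items()
    ]


def publish_step_durations(durations, pipeline="frontend",
                           namespace="CI/Pipeline", client=None):
    """Send one data point per step; creates a boto3 client if none is given."""
    if client is None:
        import boto3  # third-party; only needed when talking to real AWS
        client = boto3.client("cloudwatch")
    client.put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(pipeline, durations),
    )
```

Keeping the step name as a dimension rather than part of the metric name makes it possible to graph all steps of one pipeline side by side on a single dashboard.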
Custom Visualization
To effectively monitor the timing and key focus areas within our CI/CD pipeline, we created a detailed graph. This visual aid clarifies the sequence and pivotal steps of the process. To achieve this, I initiated an internal hackathon. In just one day, a team of two developers built an internal service to visualize this workflow, giving us a comprehensive view of our entire CI/CD process.
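The internal service itself is not shown here, but the core idea of drawing steps on a shared time axis fits in a few lines. The following is a toy text rendering under stated assumptions: `render_timeline` is a hypothetical function, and each step is given as a `(name, start_seconds, end_seconds)` tuple measured from the start of the run.

```python
def render_timeline(steps, width=40):
    """Draw each step as a bar positioned on a common time axis.

    steps: list of (name, start_seconds, end_seconds) tuples.
    width: number of characters representing the whole run.
    """
    total = max(end for _, _, end in steps)
    label_width = max(len(name) for name, _, _ in steps)
    rows = []
    for name, start, end in steps:
        offset = round(start / total * width)
        # Always draw at least one character so very short steps stay visible.
        length = max(1, round((end - start) / total * width))
        bar = " " * offset + "#" * length
        rows.append(f"{name.ljust(label_width)} |{bar} {end - start:.0f}s")
    return "\n".join(rows)


print(render_timeline(
    [("build", 0, 60), ("tests", 0, 180), ("deploy", 180, 210)],
    width=21,
))
```

Even in this crude form, the picture shows at a glance which steps overlap and which one the deploy is waiting for.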
Big wins we've scored
To date, our achievements include segmenting our process into distinct parts, resulting in a threefold increase in build speed and a significant enhancement of our feedback loop.
This improvement has enabled us to closely monitor our entire CI/CD pipeline, making the development of internal tools not only productive but also enjoyable. Interestingly, although it wasn't our initial goal, this approach has led to an impressive 200% reduction in our monthly resource consumption.
Where to start
To optimize your CI/CD pipeline, start by breaking down the process into distinct steps. Prioritize not only the major steps but also those with the most significant impact on the overall timeline. Collaboration is essential: integrate tool building and CI/CD infrastructure setup seamlessly.
Additionally, transition from on-demand to continuous measurement of these steps to maintain a daily overview of releases and process efficiency. This approach is crucial for anyone aiming to enhance observability and the tools that support it.