Page It to the Limit
Smart Alerting and Service Level Objectives With Yuri Grinshteyn
The History of Alerting
Yuri discusses the history of alerting and why alerting should be on things that impact the customer experience.
Yuri: “It’s one of the soapboxes that I find myself on when talking to customers. It so often happens that customers will look for help with alerting: ‘How will I know if I am having an issue with my infrastructure? How will I know if I have high memory consumption?’ You should never alert on things in your infrastructure; you should only alert on things that impact your customer experience.”
Yuri and Julie discuss common mistakes customers have made since the early days of alerting, and how products like PagerDuty have changed the way alerting should be done.
A Little Bit About PagerDuty
Yuri and Julie discuss where PagerDuty came from and where it is today.
Yuri: “The fundamental problem that you are trying to solve, which is like, ‘Hey, I get 1,000 alerts an hour and I don’t know which ones are important.’ That’s not the problem we should be buying PagerDuty to solve; that’s the problem we should be addressing at the root.”
Julie and Yuri continue to discuss the issues with email filtering and essentially “training people to ignore alerts”.
Service Level Indicator and Service Level Objective Usage
Yuri explains that this comes down to the service owner: who is ultimately accountable for the reliability of the system, and what user happiness means in that context. He goes on to discuss SLIs and SLOs.
Yuri: “We use SLIs as a proxy for user happiness.”
Yuri and Julie discuss setting up alerting with SLIs and SLOs in mind, the need for alerts to be humanly actionable, and a little bit about error budgets.
Yuri: “Things that are not directly contributing to or impacting user happiness, those should be created as tickets in a ticketing system… there is no need to wake someone in the middle of the night.”
Continued discussion around when alerting should wake someone up in the middle of the night.
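To make the page-versus-ticket idea concrete, here is a minimal sketch that is not taken from the episode. It assumes an availability-style SLI (good requests divided by total requests), treats the error budget as 1 minus the SLO target, and pages only when a user-facing SLI is spending that budget quickly. The names `Window`, `burn_rate`, and `route`, and the paging threshold of 10, are hypothetical choices for illustration, not PagerDuty features or Yuri's recommendations.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    good: int   # requests that met the user-facing success criterion
    total: int  # all valid requests in the window


def sli(window: Window) -> float:
    """Fraction of good requests; treated as 1.0 when there was no traffic."""
    return window.good / window.total if window.total else 1.0


def burn_rate(window: Window, slo_target: float) -> float:
    """Error rate divided by the error budget; 1.0 means spending exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return (1.0 - sli(window)) / error_budget


def route(window: Window, slo_target: float, user_facing: bool,
          page_threshold: float = 10.0) -> str:
    """Return 'page', 'ticket', or 'none' for one window.

    page_threshold is an assumed value; tune it for your own service.
    """
    if not user_facing:
        # Signals that do not reflect user happiness never wake anyone up.
        return "ticket"
    rate = burn_rate(window, slo_target)
    if rate >= page_threshold:
        return "page"
    if rate > 1.0:
        # Over budget, but slowly enough to handle during working hours.
        return "ticket"
    return "none"
```

With a 99.9% target, a window of 9,500 good requests out of 10,000 burns budget roughly 50 times too fast and would page, while 9,985 out of 10,000 burns about 1.5 times too fast and would only open a ticket.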
Julie: “Making sure every alert that wakes a human up is humanly actionable sounds great but isn’t always easy, and it comes down to fine tuning. Do you have recommendations?”
Yuri: “People often feel that if they don’t have an alert for it, it’s not actually happening.”
Decrypting the Metrics
Julie: “Let’s go deeper on how do we really dig deep into what the customer experience means when you are looking at service level indicators and service level objectives.”
Yuri: “You have to have a good understanding of what are people actually trying to do, and then some way of quantifying.”
The discussion continues on what metrics we use to quantify customer success and performance.
Yuri: “The closer we are able to collect this information to the customer, the more accurate it is going to be.”
Technical Debt
The conversation shifts to how technical debt shows up as a gap in knowledge, and how people treat their systems as a black box.
Yuri: “You’ll hear the term ‘black box monitoring’ because they don’t actually know how it works.”
Continued discussion on how technical debt manifests itself in monitoring.
Service Ownership
Yuri discusses what service ownership means to him, how it is really “engineer empowerment”, and what that means for reliability.
Additional Resources
- PagerDuty Home Page
- Episode transcribed by Rev