How to create an SLO for Cloud Run programmatically

#googlecloud #monitoring #slo #sre

The goal of this post is not to explain what Cloud Run or an SLO is, but to provide sample code showing how to set one up programmatically using Google's APIs.

If you want more context around SLOs and general SRE concepts, I recommend taking a look at the free Google SRE book, and more specifically the chapter on SLOs.

Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.

Why this post? Recently I needed to create Cloud Run services dynamically. Creating an SLO through the Cloud Run UI takes only a few simple steps, but if you are creating services with the Python SDK (or the SDK for any other programming language), the SLO operations are not available.

At the time this post was written, there was no public documentation on using SLOs with Cloud Run through the APIs, so I wanted to share how I did it.

How to create the SLO

Many users are not aware of this, but the newest API operations and features are not always immediately available in Google's SDKs. For those cases there is something Google calls the discovery API client:

In summary, the Google API Discovery service simplifies the process of working with Google APIs by providing structured and standardised documentation, which Google's own client libraries use under the hood.

Basically, it's a document that tells machines how to interact with Google's APIs, and it can sometimes be useful as documentation for humans too. I recommend always using the SDK for each Google service, and relying on the discovery client only if the operation is not available in the SDK or if you want more detail on what the service offers and what its models look like.
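To make the "document for machines" idea a bit more concrete, here is a minimal sketch that walks the "resources" tree of a discovery document and lists the operations it describes. The SAMPLE_DOCUMENT below is a hand-written, heavily trimmed stand-in I made up for illustration, not the real monitoring document, and iter_methods is a hypothetical helper name.

```python
from typing import Dict, Iterator, Tuple


def iter_methods(resources: Dict, prefix: str = "") -> Iterator[Tuple[str, str, str]]:
    """Yield (dotted method name, HTTP verb, path) for every method found in a
    discovery document's "resources" tree, including nested resources."""
    for resource_name, resource in sorted(resources.items()):
        dotted = f"{prefix}{resource_name}"
        for method_name, method in sorted(resource.get("methods", {}).items()):
            yield f"{dotted}.{method_name}", method["httpMethod"], method["path"]
        # Resources can nest (for example services -> serviceLevelObjectives).
        yield from iter_methods(resource.get("resources", {}), prefix=f"{dotted}.")


# Hand-written, trimmed stand-in for a real discovery document (assumption).
SAMPLE_DOCUMENT = {
    "resources": {
        "services": {
            "methods": {
                "create": {"httpMethod": "POST", "path": "v3/{+parent}/services"},
            },
            "resources": {
                "serviceLevelObjectives": {
                    "methods": {
                        "create": {
                            "httpMethod": "POST",
                            "path": "v3/{+parent}/serviceLevelObjectives",
                        },
                    },
                },
            },
        },
    },
}

for name, verb, path in iter_methods(SAMPLE_DOCUMENT["resources"]):
    print(f"{name}: {verb} {path}")
```

Run against the real discovery JSON (loaded with json.load), the same walk should give a quick overview of which operations a service exposes.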

So how do you use it?

First you start by installing the google-api-python-client PyPI package.

Next, after looking at the discovery JSON that you can get at this link and finding the right service and operation to call, you build the service object.

By inspecting what the Cloud Run UI was doing, I found that it calls the monitoring service, and that I basically needed three steps:
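As a rough sketch of that step, assuming the monitoring v3 API and the same discovery URL pattern used in the full sample at the end of this post: discovery_url and build_monitoring_service are helper names I introduce here for illustration. The build call needs the google-api-python-client package and working Application Default Credentials, so it is wrapped in a function and not executed.

```python
def discovery_url(api_name: str, api_version: str) -> str:
    """Return the per-API discovery document URL."""
    return f"https://{api_name}.googleapis.com/$discovery/rest?version={api_version}"


def build_monitoring_service():
    """Build a discovery-based client for the Cloud Monitoring API.

    Requires google-api-python-client and default credentials, so it is
    defined here but not called.
    """
    # Imported lazily so the URL helper above works without the package.
    import googleapiclient.discovery

    return googleapiclient.discovery.build(
        "monitoring",
        "v3",
        discoveryServiceUrl=discovery_url("monitoring", "v3"),
    )


print(discovery_url("monitoring", "v3"))
```

The returned object exposes the resources from the discovery document as methods, which is why the full sample can call service.services().create(...).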

1. First, make sure you have created your Cloud Run service, and copy its name.
2. Call the service create operation of the monitoring API with your Cloud Run service name.
3. Call the create_service_level_objective API for each SLO, using the service name generated in #2 and not the one from #1.

I ended up creating two SLOs:
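The part that is easy to get wrong is which name goes where, so here is a small sketch of just that piece. The request body mirrors the one in the full sample further down; cloud_run_service_body and slo_parent are hypothetical helper names, and the resource name in the example is made up.

```python
from typing import Dict


def cloud_run_service_body(cloud_run_service_name: str, location: str) -> Dict:
    """Body for the monitoring 'services.create' call (step 2)."""
    return {
        "displayName": cloud_run_service_name,
        "cloudRun": {
            "serviceName": cloud_run_service_name,
            "location": location,
        },
    }


def slo_parent(project_id: str, monitoring_service_name: str) -> str:
    """Derive the SLO parent (step 3) from the name returned by step 2.

    The last path segment of the returned name is the monitoring service ID,
    which is not the Cloud Run service name from step 1.
    """
    service_id = monitoring_service_name.split("/")[-1]
    return f"projects/{project_id}/services/{service_id}"


# Made-up example of a name as it might come back from step 2.
returned_name = "projects/123456789/services/abc123"
print(slo_parent("my-project", returned_name))
```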

SLO for latency using a calendar day config
SLO for status health using a rolling day config
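For comparison, this is roughly how the two configurations look as REST-style JSON bodies, which is the shape you would pass if you also created the SLOs through the discovery client instead of the monitoring_v3 client. The field names follow the Monitoring API's ServiceLevelObjective resource; the helper names and the 1.2 second threshold in the example are my own illustrative assumptions.

```python
from typing import Dict


def latency_slo_body(goal: float, threshold_seconds: float) -> Dict:
    """SLO measured per calendar day on request latency."""
    return {
        "displayName": f"{goal:.0%} - Latency - Calendar day",
        "goal": goal,
        # A calendar period resets at the start of each day.
        "calendarPeriod": "DAY",
        "serviceLevelIndicator": {
            # In JSON, durations are strings in seconds with an "s" suffix.
            "basicSli": {"latency": {"threshold": f"{threshold_seconds}s"}},
        },
    }


def availability_slo_body(goal: float, rolling_days: int) -> Dict:
    """SLO measured over a rolling window on availability."""
    return {
        "displayName": f"{goal:.0%} - Availability - Rolling day",
        "goal": goal,
        # A rolling period always looks back over the last N days.
        "rollingPeriod": f"{rolling_days * 86400}s",
        "serviceLevelIndicator": {"basicSli": {"availability": {}}},
    }


print(latency_slo_body(goal=0.9, threshold_seconds=1.2))
print(availability_slo_body(goal=0.9, rolling_days=1))
```

The only structural difference between the two is the period field: calendarPeriod for the first, rollingPeriod for the second.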

The full code sample is here, hope it helps!

import logging
import os

from google.cloud import monitoring_v3
import googleapiclient.discovery
from googleapiclient import errors


logger = logging.getLogger(__name__)


def run(project_id: str, location: str, service_name: str) -> None:
    try:
        monitoring_client = monitoring_v3.ServiceMonitoringServiceClient()

        api_service_name = 'monitoring'
        api_version = 'v3'
        # https://developers.google.com/apis-explorer/
        discovery_url = f'https://{api_service_name}.googleapis.com/$discovery/rest?version={api_version}'
        service = googleapiclient.discovery.build(api_service_name, api_version, discoveryServiceUrl=discovery_url)

        # Describe the Cloud Run service that Monitoring should track.
        body = {
            "displayName": service_name,
            "cloudRun": {
                "serviceName": service_name,
                "location": location
            }
        }

        created_service = service.services().create(parent=f'projects/{project_id}', body=body).execute()

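        if created_service:
            # The last segment of the returned resource name is the monitoring
            # service ID; the SLOs are created under it, not the Cloud Run name.
            service_id = created_service['name'].split("/")[-1]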
            slo_configuration = monitoring_v3.ServiceLevelObjective()
            slo_configuration.display_name = '90% - Latency - Calendar day'
            slo_configuration.goal = 0.9

            request = monitoring_v3.CreateServiceLevelObjectiveRequest()
            slo_configuration.calendar_period = "DAY"
            sli_configuration = monitoring_v3.ServiceLevelIndicator()
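            # NOTE: the threshold is a duration in seconds, so "1200s" means
            # 1200 seconds (20 minutes). If 1200 ms was intended, use "1.2s".
            sli_configuration.basic_sli = {
                "latency": {
                    "threshold": "1200s"
                }
            }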
            slo_configuration.service_level_indicator = sli_configuration
            request.service_level_objective = slo_configuration
            service_name_for_slo = f'projects/{project_id}/services/{service_id}'
            request.parent = service_name_for_slo
            monitoring_client.create_service_level_objective(request)

            slo_configuration = monitoring_v3.ServiceLevelObjective()
            slo_configuration.display_name = '90% - Availability - Rolling day'
            slo_configuration.goal = 0.9

            request = monitoring_v3.CreateServiceLevelObjectiveRequest()
            slo_configuration.rolling_period = "86400s"
            sli_configuration = monitoring_v3.ServiceLevelIndicator()
            sli_configuration.basic_sli = {
                "availability": {}
            }
            slo_configuration.service_level_indicator = sli_configuration
            request.service_level_objective = slo_configuration
            service_name_for_slo = f'projects/{project_id}/services/{service_id}'
            request.parent = service_name_for_slo
            monitoring_client.create_service_level_objective(request)
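    except errors.HttpError as err:
        # The discovery client raises HttpError for any failed call; only a
        # 409 (conflict) means the monitoring service already exists.
        if err.resp.status != 409:
            raise
        logger.info("Monitoring service already exists, skipping SLO creation")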


if __name__ == '__main__':
    project_id = os.getenv('PROJECT_ID')
    location = os.getenv('LOCATION')
    service_name = os.getenv('CLOUD_RUN_SERVICE_NAME')
    run(project_id=project_id,
        location=location,
        service_name=service_name)