Yash Nigam

Understanding OpenTelemetry and Observability for SRE

An understanding of OpenTelemetry and observability is essential for an SRE in any org. This blog post is my attempt to lay down a good understanding of OpenTelemetry (OT) after reading the following book:

  • Cloud-Native Observability with OpenTelemetry from Packt Publishing

At a high level, OT can be described as:

  • A framework to produce telemetry from your applications using open standards
  • Built around the concept of signals - traces, metrics, and logs
  • Telemetry for these signals is produced using the OT APIs
  • Provides tools to gain visibility into the performance of your services by combining tracing, metrics, and logging
  • Allows you to instrument your application code through vendor-neutral APIs, libraries, and tools

Before proceeding with OpenTelemetry, let us list and understand some interconnected concepts and technologies:


Cloud Native Applications

  • There has been a shift to a microservices-based architecture for deploying and running applications, aided by cloud services such as Kubernetes and serverless.
  • Applications are now distributed among multiple cloud services and scaled horizontally, producing logs in multiple places.
  • The services are loosely coupled and operate independently.
  • In such cases, latency is introduced between calling services, as each service sits in its own container.

A Shift towards DevOps

  • Small teams (4 to 6 people) manage their own microservices
  • Developers own the lifecycle of the code through all its stages: they write, test, and build the code, then package, deploy, and operate it in production (with the aid of SRE)
  • This accelerates feature development
  • However, as microservices multiply, no one has the full picture, and it becomes difficult to find what caused an outage
  • Dev teams have to learn multiple tools for building, deploying, monitoring, etc., which shifts their focus from their main task - coding
  • They may struggle to identify the root cause of production issues as there is not enough visibility across the managed systems

Observability

Observability can be defined in different ways:

  1. As per https://en.wikipedia.org/wiki/Observability, "In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."
  2. The ability to answer questions such as:
    • Is the system doing what I think it should be?
    • If a problem occurred in production, what evidence would you have to be able to identify it?
    • Why is this service suddenly overwhelmed when it was fine just a minute ago?
    • If a specific condition from a client triggers an anomaly in some underlying service, would you know it without customers or support calling you?
  3. Empowering the people who build and operate distributed applications to understand their code's behaviour while running in production

OT ultimately enables observability for the application on which it is configured. Historically, observability has been achieved using the following:

1. Centralized logging

  • For an application which is large and distributed across enough systems, searching through the logs on individual machines is not practical.
  • Applications can also run on ephemeral machines that may no longer be present when we need those logs.
  • The logs need to be made available in a central location for persistent storage and searchability, and thus centralized logging was born
  • Tools for logging
    • Fluentd
    • Logstash
    • Apache Flume

2. Metrics and dashboards

  • Measuring application and system performance via the collection of metrics (signals)
  • Metrics can also be used to configure alerting when an error rate becomes greater than an acceptable percentage.
  • Tools:
    • Prometheus
    • StatsD
    • Graphite
    • Grafana

3. Tracing and analysis

  • Tracing applications means having the ability to run through the application code and ensure it's doing what is expected (generally done via a debugger in an IDE).
  • This becomes impossible when debugging an application that is spread across multiple services on different hosts across a network.
  • Google's whitepaper on this: Dapper (https://research.google/pubs/pub36356/)
  • Tools:
    • OpenTracing
    • Zipkin
    • Jaeger

4. Challenges

  • Multiple tools for logging, tracing and metrics monitoring
  • Multiple standards, libraries, methods
  • Time needed to instrument the code/application to generate logs, traces, and metrics, and the time needed to integrate the tools, depending on complexity
  • ROI of the instrumentation effort

Describing OpenTelemetry

OT is an ecosystem, or a framework, for applications running on cloud-native services. Its goals and components include:

  1. Standardize how applications are instrumented and how telemetry data is generated, collected, and transmitted
  2. Give users the tools necessary to correlate that telemetry across systems, languages, and applications
  3. An open specification
  4. Language-specific APIs and SDKs
  5. Instrumentation libraries
  6. Semantic conventions
  7. An agent to collect telemetry
  8. A protocol to organize, transmit, and receive the data
  9. OpenTelemetry has implementations in 11 languages

Core Concepts/Categories of Concerns of OpenTelemetry

1. Signals

  • Signals represent the core of the telemetry data that is generated by instrumenting code.
  • The signals are: a) Tracing b) Baggage c) Metrics d) Logging
  • The real power of OpenTelemetry is allowing its users to correlate data across signals to get a better understanding of their systems

2. Specification

  • The open specification defines the cross-language requirements and expectations that all OpenTelemetry implementations follow.

3. Data Model

  • Defines how telemetry data is structured; the OpenTelemetry protocol (OTLP) describes how that data is organized, transmitted, and received.

4. API

  • Providing users with an API allows them to go through the process of instrumenting their code in a way that is vendor-agnostic.
  • The API is decoupled from the code that generates the telemetry, allowing users the flexibility to swap out the underlying implementations as they see fit
  • A user who instruments their code using the API but does not configure the SDK will, by design, not see any telemetry produced (see the sketch below).
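
As a minimal sketch (with "shopper" as an illustrative tracer name), instrumenting through the API alone looks like this; without a configured SDK, these calls are no-ops:

```python
# Instrumenting with the OpenTelemetry API only; no telemetry is produced
# until an SDK (a configured TracerProvider) is registered elsewhere.
from opentelemetry import trace

tracer = trace.get_tracer("shopper")  # illustrative instrumentation name

def add_to_cart(item: str) -> None:
    # Creates a real span only if an SDK has been configured; otherwise a no-op.
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("item", item)
```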

5. SDK

  • The SDK does most of the heavy lifting in OT.
  • Implements the underlying system that generates, aggregates, and transmits telemetry data.
  • Provides the controls to configure how telemetry should be collected, where it should be transmitted, and how.
  • Configuration of the SDK is supported via in-code configuration, as well as via environment variables defined in the specification.
  • As it is decoupled from the API, using the SDK provided by OpenTelemetry is an option for users, but it is not required. Users and vendors are free to implement their own SDKs
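
A minimal sketch of configuring the SDK in code, assuming the opentelemetry-sdk package and using a console exporter for demonstration:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a configured provider globally; API calls made after this point
# produce real telemetry instead of no-ops.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("shopper")  # illustrative name
with tracer.start_as_current_span("checkout"):
    pass  # this span is printed to the console by the exporter
```

The same choices can also be driven by the environment variables defined in the specification, such as OTEL_TRACES_EXPORTER.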

6. Instrumentation Libraries

  • Ensure users can get up and running quickly
  • Provide instrumentation for popular open source projects and frameworks; in Python, the instrumentation libraries include Flask, Requests, Django, and others (see the sketch below)
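
For example, a rough sketch of enabling two instrumentation libraries for a Flask service, assuming the opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests packages are installed:

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # spans for incoming HTTP requests
RequestsInstrumentor().instrument()      # spans for outgoing requests calls

@app.route("/products")
def products():
    return {"products": ["orange", "milk"]}  # illustrative handler
```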

7. Pipelines

  • A pipeline takes the telemetry generated by a signal and exports it to a data store.
  • Each signal implementation offers a series of mechanisms to generate, process, and transmit telemetry.
  • Provider > Generator > Processor > Exporter

Providers

  • The starting point of the telemetry pipeline is the provider.
  • A provider is a configurable factory that is used to give application code access to an entity used to generate telemetry data.
  • Although multiple providers may be configured within an application, a default global provider may also be made available via the SDK.
  • Providers should be configured early in the application code, prior to any telemetry data being generated.

Generators

  • To generate telemetry data at different points in the code, the telemetry generator instantiated by a provider is made available in the SDK.
  • This generator is what most users will interact with through the instrumentation of their application and the use of the API.
  • Generators are named differently depending on the signal: the tracing signal calls this a tracer, and the metrics signal calls it a meter.

Processors

  • Once the telemetry data has been generated, processors provide the ability to further modify the contents of the data.
  • Processors may determine the frequency at which data should be processed or how the data should be exported.

Exporters

  • Exporters translate the internal data model of OpenTelemetry into the format that best matches the configured destination's understanding.
  • Multiple export formats and protocols are supported by the OpenTelemetry project:
    • OpenTelemetry protocol
    • Console
    • Jaeger
    • Zipkin
    • Prometheus
    • OpenCensus
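
Swapping exporters is a configuration change rather than an instrumentation change. Here is a sketch using the OTLP exporter over gRPC; the import path and endpoint assume the opentelemetry-exporter-otlp package and a collector listening on localhost:4317:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    # Only the exporter differs from the console example earlier;
    # the application's instrumentation code stays untouched.
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```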

8. Resources

  • Used to identify the source of the telemetry data, whether a machine, container, or function
  • Used at analysis time to correlate different events occurring in the same resource
  • Resource attributes are added to the telemetry data from signals at export time
  • Resources are associated with providers
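
A small sketch of attaching a resource to a tracer provider; the attribute values are illustrative, while the keys follow OpenTelemetry's semantic conventions:

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Every span emitted via this provider carries these attributes, identifying
# which service, version, and environment produced the telemetry.
resource = Resource.create({
    "service.name": "shopper",
    "service.version": "0.1.2",
    "deployment.environment": "dev",
})
provider = TracerProvider(resource=resource)
```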

9. Context propagation

  • The core concept behind distributed tracing
  • Provides the ability to pass valuable contextual information between services that are separated by a logical boundary
  • Context propagation is what allows distributed tracing to tie requests together across multiple systems
  • Allows user-defined values (baggage) to be propagated as well
  • A context API is defined as part of the OpenTelemetry specification
  • Python has a built-in context mechanism, ContextVar, which the implementation builds on
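
A rough sketch of propagating context (including baggage) across an HTTP boundary using the propagation API; the service URL, header handling, and names are illustrative:

```python
import requests
from opentelemetry import baggage, trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("shopper")

# Client side: within an active span, attach baggage and inject the
# traceparent/baggage headers into the outgoing request.
with tracer.start_as_current_span("visit_store"):
    ctx = baggage.set_baggage("customer.id", "c-123")
    headers = {}
    inject(headers, context=ctx)
    requests.get("http://legacy-grocery:5000/products", headers=headers)

# Server side: restore the context from the incoming headers before creating
# spans, so the new spans join the caller's trace.
def handle_request(incoming_headers: dict) -> None:
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("list_products", context=ctx):
        pass
```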

Auto Instrumentation, Manual instrumentation and challenges

  • Why Auto instrumentation

    • The upfront cost of instrumenting code can be a deterrent to even getting started, especially if a solution is complicated to implement and will fail to deliver any value for a long time.
    • Auto-instrumentation looks to alleviate some of the burdens of instrumenting code manually
  • Challenges of manual instrumentation

    • The libraries and APIs that are provided by telemetry frameworks can be hard to learn how to use
    • Instrumenting applications can be tricky. This can be especially true for legacy applications where the original author of the code is no longer around
    • Knowing what to instrument and how it should be done takes practice
    • Modifying code means compiling again, building the artifact again, and deploying again
    • Providing the ability to disable instrumentation for a specific code block/module/plugin
  • Components of auto-instrumentation

    • 1. Instrumentation libraries
      • In Python: Flask, Django, boto, and others
    • 2. Agent/runner
      • Automatically invokes the instrumentation libraries without additional work on the part of the user
      • Configures OpenTelemetry and loads the instrumentation libraries, which can then be used to generate telemetry
    • What it cannot do
      • Cannot instrument application-specific code
      • It may instrument things you're not interested in. This may result in the same network call being recorded multiple times, or in data being generated that you're not interested in using
  • Instrumentation libraries in Python

    • Calls to libraries are intercepted and replaced at runtime via a technique known as monkey patching (https://en.wikipedia.org/wiki/Monkey_patch).
    • The instrumenting library receives the original call, produces telemetry data, and then calls the underlying library.
    • The Python implementation ships a script that can be called to wrap any Python application.
    • The opentelemetry-instrument script finds all the instrumentations that have been installed in an environment by loading the entry points registered under the opentelemetry_instrumentor name
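
As a hedged example (assuming the opentelemetry-distro package and an illustrative app.py application), the agent can be run roughly like this:

```sh
pip install opentelemetry-distro
opentelemetry-bootstrap -a install   # install instrumentations for the libraries it detects
OTEL_SERVICE_NAME=shopper \
  opentelemetry-instrument --traces_exporter console python app.py
```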

Overview of Traces, Spans, Logs, and Metrics using a sample application with OpenTelemetry

A sample application running in a Docker Compose environment is used below.

Traces

  • The W3C Trace Context specification
  • A distributed trace contains events that cross process, network and security boundaries
  • The work captured in a trace is broken into separate units or operations, each represented by a span
  • This specification defines standard HTTP headers and a value format to propagate context information that enables distributed tracing scenarios
  • Distributed tracing is the foundation behind the tracing signal of OpenTelemetry.
  • A distributed trace is a series of event data generated at various points throughout a system tied together via a unique identifier.
  • This identifier is propagated across all components responsible for any operation required to complete the request, allowing each operation to associate the event data to the originating request
  • Example Jaeger trace:

  • A Trace shows
    • Trace ID
    • Start date time
    • Duration
    • Count of services

Spans

  • A span can represent a method call or a subset of the code being called within a method.
  • Multiple spans within a trace are linked together in a parent-child relationship, with each child span containing information about its parent.
  • The first span in a trace is called the root span and is identified because it does not have a parent span identifier

  • Two spans can be seen in the example trace
  • The first has a duration of 7.01 milliseconds, the second 260 milliseconds
  • Each span has a span ID
  • Tags: key-value pairs which give information about the operation being performed
  • Process: represents which process executed this operation

  • SpanContext:

    • Contains information about the trace and must be propagated throughout the system.
    • The elements of a trace available within a span context include the following:
      • A unique identifier, referred to as a trace ID, identifies the request through the system.
      • A second identifier, the span ID, is associated with the span that last interacted with the context. This may also be referred to as the parent identifier.
      • Trace flags include additional information about the trace, such as the sampling decision and trace level.
      • Vendor-specific information is carried forward using a trace state field. This allows individual vendors to propagate information necessary for their systems to interpret the tracing data.
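
A minimal sketch of the parent-child relationship and the span context, assuming a configured SDK and an illustrative tracer name:

```python
from opentelemetry import trace

tracer = trace.get_tracer("shopper")

with tracer.start_as_current_span("visit_store"):                     # root span, no parent
    with tracer.start_as_current_span("add_item_to_cart") as child:   # child span
        ctx = child.get_span_context()
        # Every span in the trace shares the same trace_id;
        # each span carries its own span_id.
        print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x}")
```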

Metrics

  • Metrics provide information about the state of a running system to developers and operators
  • The data collected via metrics can be aggregated over time to identify trends and patterns in applications, and graphed through various tools and visualizations.
  • Metrics are critical to monitoring the health of an application and deciding when an on-call engineer should be alerted
  • Metrics form the basis of service level indicators (SLIs) (https://en.wikipedia.org/wiki/Service_level_indicator) that measure the performance of an application.
  • These indicators are then used to set service level objectives (SLOs) (https://en.wikipedia.org/wiki/Service-level_objective) that organizations use to calculate error budgets.
  • OpenTelemetry's metrics signal primarily builds on existing standards and tools:
    • OpenMetrics
    • StatsD
    • Prometheus
  • Metrics may capture data in various data point types (for example, sums, gauges, and histograms), as sketched below
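
A minimal sketch of a counter instrument, assuming a recent opentelemetry-sdk release with the stable metrics API; the metric name mirrors the request_counter queried below:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export aggregated metrics periodically (to the console here; the sample
# application would export towards Prometheus or a collector instead).
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("shopper")
request_counter = meter.create_counter(
    "request_counter", description="number of incoming requests"
)
request_counter.add(1, {"http.route": "/products"})  # increment on each request
```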

Searching a Metric in Prometheus

  • Prometheus collects and stores metrics over time in a time-series database, which can be queried by metric name
  • The request counter metric is a counter which counts the number of incoming requests to a service
  • Here we can see that after querying by the metric name "request_counter", we are returned 3 rows

Prometheus Metrics

  • Each row is for a different service and shows the request count value, which is an integer; a metric of the counter type only ever increases

Logs

  • A log is a record of events written to an output
  • Loki stores all the logs generated by the grocery store application, and Grafana is used to view them
  • Filter the logs using the {job="shopper"} query to retrieve all the logs generated by the shopper application
  • A normal message on console output would be: shopper | INFO:shopper:message="add orange to cart"

Application log collected by Loki viewed in Grafana

  • However, the same message as stored in Loki appears as shown below

A Correlated trace of the above log

  • It contains more details such as the trace ID, span ID, time, etc.
  • Since the same log contains the trace ID, it can be correlated in Jaeger with the trace and span details, as sketched below
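
A rough sketch of how the trace and span IDs end up on the log line, assuming the opentelemetry-instrumentation-logging package; the logger name and message are illustrative:

```python
import logging

from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Rewrites the logging format so each record carries otelTraceID / otelSpanID.
LoggingInstrumentor().instrument(set_logging_format=True)

log = logging.getLogger("shopper")
tracer = trace.get_tracer("shopper")

with tracer.start_as_current_span("add_item_to_cart"):
    # The emitted record includes the active trace and span IDs, which is what
    # allows the Loki log line to be correlated with the Jaeger trace.
    log.warning('message="add orange to cart"')
```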
