Karina Babcock for Causely

Launching our new integration with OpenTelemetry

Bridging the gap between observability data and actionable insight

Observability has become a cornerstone of application reliability and performance. As systems grow more complex—spanning microservices, third-party APIs, and asynchronous messaging patterns—the ability to monitor and debug these systems is both a necessity and a challenge.

OpenTelemetry (OTEL) has emerged as a powerful, open source framework that standardizes the collection of telemetry data across distributed systems. It promises unprecedented visibility into logs, metrics, and traces, empowering engineers to identify issues and optimize performance across multiple languages, technologies and cloud environments.

But with great visibility comes a hidden cost. While OTEL democratizes observability, it also exacerbates the “big data problem” of modern DevOps.

This is where Causely comes in—today, we announced a new integration with OTEL that bridges the gap between OTEL's data deluge and actionable insights. In this post, we’ll explore the strengths and limitations of OpenTelemetry, the challenges it introduces, and how Causely transforms raw telemetry into precise, cost-effective analytics.

The OpenTelemetry opportunity

Modern applications are a tangled web of interdependent microservices that communicate over REST or gRPC. Asynchronous systems like Kafka shuttle messages between loosely coupled services. Infrastructure dynamically scales resources to meet demand. Observability has become the glue that holds these systems together, enabling engineers to monitor performance, troubleshoot issues, and ensure reliability.

At the heart of the observability revolution is OpenTelemetry (OTEL), an open-source standard that unifies the instrumentation and collection of telemetry data across logs, metrics, and traces. Its modular architecture, community-driven development, and broad compatibility with existing observability tools have made OTEL the de facto choice for modern DevOps teams.

What does OpenTelemetry do?

OpenTelemetry provides APIs, SDKs, and tools to capture three primary types of telemetry data:

  • Logs: Detailed, timestamped records of system events (e.g., errors, warnings, and custom events).
  • Metrics: Quantitative measurements of system health and performance (e.g., CPU usage, request latency, error rates).
  • Traces: End-to-end views of requests flowing through distributed systems, mapping dependencies and execution paths.

With OTEL, engineers can instrument their code to emit these telemetry signals, use an OpenTelemetry Collector to aggregate and process the data, and export it to observability backends like Prometheus, Tempo, or Elasticsearch.
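
As a concrete illustration, here is a minimal Go sketch of that wiring: the SDK batches spans and ships them over OTLP to a local Collector, which forwards them to whichever backend you choose. The endpoint, service name, and span names are illustrative, not prescribed by OTEL.

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()

    // Export spans over OTLP/gRPC to a local OpenTelemetry Collector.
    // The endpoint is illustrative; in practice it is often set via the
    // standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }

    // Batch spans and attach a service.name resource attribute so backends
    // can group spans by service.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            "", // schema URL omitted in this sketch
            attribute.String("service.name", "checkout"),
        )),
    )
    defer func() { _ = tp.Shutdown(ctx) }()
    otel.SetTracerProvider(tp)

    // Any instrumented code can now create spans via otel.Tracer(...).
    _, span := otel.Tracer("checkout").Start(ctx, "placeOrder")
    span.End()
}

Because the application only speaks OTLP to the Collector, swapping Prometheus, Tempo, or Elasticsearch in or out is a Collector configuration change rather than an application change.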

Why OpenTelemetry is a game-changer

OpenTelemetry addresses a critical pain point in observability: fragmentation. Historically, different tools and platforms required unique instrumentation libraries, making it difficult to standardize observability across an organization. OTEL simplifies this by providing:

  • Vendor-Agnostic Instrumentation: A single API to instrument applications regardless of the backend.
  • Centralized Data Collection: The OpenTelemetry Collector serves as a pluggable data pipeline, consolidating telemetry from various sources.
  • Interoperability: Native support for popular backends like Prometheus, Tempo, and other vendors, allowing teams to integrate OTEL into their existing observability stack.

Technical example: Debugging latency issues

Consider a microservices-based e-commerce application experiencing high latency during checkout. With OTEL traces, you can capture a wealth of information about this service's performance, but it is still hard to determine what is actually responsible for the latency. For example, here is the instrumentation of the dispatch service in a sample application: https://github.com/esara/robot-shop/blob/instrumentation/dispatch/main.go#L172


func processOrder(headers map[string]interface{}, order []byte) { 
    start := time.Now() 
    log.Printf("processing order %s\n", order) 
    tracer := otel.Tracer("dispatch") 

    // headers is map[string]interface{} 
    // carrier is map[string]string 
    carrier := make(propagation.MapCarrier) 
    // convert by copying k, v; skip non-string header values rather than panicking
    for k, v := range headers {
       if s, ok := v.(string); ok {
          carrier[k] = s
       }
    }

    ctx := otel.GetTextMapPropagator().Extract(context.Background(), carrier) 

    opts := []oteltrace.SpanStartOption{ 
       oteltrace.WithSpanKind(oteltrace.SpanKindConsumer), 
    } 
    ctx, span := tracer.Start(ctx, "processOrder", opts...) 
    defer span.End() 

    span.SetAttributes( 
       semconv.MessagingOperationReceive, 
       semconv.MessagingDestinationName("orders"), 
       semconv.MessagingRabbitmqDestinationRoutingKey("orders"), 
       semconv.MessagingSystem("rabbitmq"), 
       semconv.NetAppProtocolName("AMQP"), 
    ) 
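    // ... the remainder of processOrder is omitted here; see the linked source file for the full function.
}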


By exporting these traces to a backend like Tempo, engineers can visualize the request flow and identify bottlenecks, such as consuming messages from RabbitMQ in the dispatch service and inserting an order in a MongoDB database.
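
To make a bottleneck like the MongoDB insert show up as its own segment in the trace, the write can be wrapped in a child span. The sketch below is illustrative rather than the repo's actual code; the insertOrderWithSpan helper, the insert callback, and the "orders" database name are assumptions.

package dispatch

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/codes"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    oteltrace "go.opentelemetry.io/otel/trace"
)

// insertOrderWithSpan wraps a (hypothetical) MongoDB insert in a client span,
// so a slow write appears as a distinct child of the processOrder span.
func insertOrderWithSpan(ctx context.Context, order []byte,
    insert func(context.Context, []byte) error) error {

    ctx, span := otel.Tracer("dispatch").Start(ctx, "orders.insert",
        oteltrace.WithSpanKind(oteltrace.SpanKindClient))
    defer span.End()

    span.SetAttributes(
        semconv.DBSystemMongoDB,  // db.system = "mongodb"
        semconv.DBName("orders"), // db.name = "orders" (assumed database name)
    )

    if err := insert(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "order insert failed")
        return err
    }
    return nil
}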

The Big Data problem of observability

OpenTelemetry’s ability to capture detailed telemetry data is a double-edged sword. While it empowers engineers with unprecedented visibility into their systems, it also introduces challenges that can hinder the very goals observability aims to achieve. The sheer volume of data collected (logs, metrics, and traces from thousands of microservices) can overwhelm infrastructure, slow down workflows, inflate costs, and, most importantly, drown engineers in data. This “big data problem” of observability is a natural consequence of OpenTelemetry’s strengths, but it must be addressed to make the most of its potential.

OpenTelemetry collects a lot of data

At its core, OpenTelemetry is designed to be exhaustive: engineers can instrument their systems to capture every possible detail. For example:

A high-traffic e-commerce site might generate logs for every HTTP request, metrics for CPU and memory usage, and traces for each request spanning multiple services.

OpenTelemetry auto-instrumentation libraries are an easy way to instrument HTTP, gRPC, messaging, database, and caching libraries across languages, but they generate metrics and traces for every call between every microservice, managed service, database, and third-party API.
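
For example, in Go the contrib otelhttp wrapper instruments an HTTP server with no span management in the handler itself; every inbound request automatically becomes a server span (the route, port, and handler below are illustrative):

package main

import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
    checkout := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Business logic only; the wrapper records the span, status, and latency.
        w.Write([]byte("ok"))
    })

    // Wrap the handler so each inbound request produces a trace span.
    http.Handle("/checkout", otelhttp.NewHandler(checkout, "checkout"))
    http.ListenAndServe(":8080", nil)
}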

Consider a production environment running thousands of microservices, each processing hundreds of requests per second. Using OpenTelemetry:

  • Logs: A single request might generate dozens of log entries, resulting in millions of logs per minute.
  • Metrics: Resource utilization metrics are emitted periodically, adding continuous streams of quantitative data.
  • Traces: Distributed traces can contain hundreds of spans, each adding its own metadata.
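
To put rough, purely illustrative numbers on this: 1,000 service instances each handling 100 requests per second, at 10 log lines per request, produce about 1 million log lines per second, or roughly 60 million per minute, before a single metric data point or trace span is counted.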

While this level of detail is invaluable for debugging and optimization, it quickly scales beyond what many teams are prepared to manage. The amount of data makes it difficult to troubleshoot problems, manage escalations, be proactive about deploying new code, and plan for future investments.

The cost of data

The problem with this massive volume of telemetry data isn’t just about storage; it’s also about processing and time-to-insight. Let’s break it down:

  • Networking Costs: Transmitting telemetry data from distributed systems, microservices, or edge devices to central storage or processing locations incurs significant bandwidth usage. This can result in substantial networking costs, especially for real-time telemetry pipelines or when dealing with geographically dispersed infrastructure.
  • Storage Costs: Logs, metrics, and traces consume vast amounts of storage, often requiring specialized solutions like Elasticsearch, Amazon S3, or Prometheus’s TSDB. These systems must scale horizontally, adding significant operational overhead.
  • Compute Costs: Telemetry data needs to be parsed, indexed, queried, and analyzed. Complex queries, such as joining multiple traces to identify bottlenecks, can place a heavy burden on compute resources.
  • Time Costs: During a high-severity incident, every second counts. Pinpointing the root cause is like looking for a needle in a haystack. With OpenTelemetry, the haystack is much bigger, making the task harder and longer.
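
Continuing the illustrative numbers above: 60 million log lines per minute at roughly 200 bytes each is about 12 GB of raw logs per minute, on the order of 17 TB per day, before indexing, replication, and the accompanying metrics and traces are added.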

Time-to-insight delays

Imagine a scenario where an outage occurs in a distributed system. An engineer might start by querying logs for errors, then switch to metrics to identify anomalies, and finally inspect traces to pinpoint the failing service. Each query takes time, and engineers often waste effort chasing irrelevant leads. This delay increases Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR), directly impacting uptime and user satisfaction.

Noise vs. signal

Another challenge is separating the signal (useful insights) from the noise (redundant or irrelevant data). With OTEL:

  • Logs can be overly verbose, capturing routine events that clutter debugging efforts.
  • Metrics might lack the context needed to tie resource anomalies back to specific root causes.
  • Traces can become overwhelming in high-traffic systems, with thousands of spans providing more detail than is actionable.

While OTEL excels at capturing data, it doesn’t inherently prioritize it. This creates a bottleneck for engineers who need actionable insights quickly.

The need for top-down analytics

Along with the benefits of modern observability tooling come challenges that need to be addressed. OpenTelemetry (OTEL) may unify telemetry data collection, but its bottom-up approach leaves teams drowning in redundant metrics, irrelevant logs, and sprawling traces. Without a clear purpose, teams end up collecting everything “just in case,” overwhelming engineers with noise and diluting the actionable insights needed to keep systems running.

A top-down approach to observability flips the script. Instead of starting with what data is available, it begins with defining the goals: root cause analysis, SLO compliance, or performance optimization. By focusing on purpose, teams can build the analytics required to achieve those goals and then collect only the data necessary to power those insights.

For example:

  • If the goal is root cause analysis, focus on traces that map dependencies across microservices, rather than capturing every granular log.
  • If the goal is performance optimization, prioritize metrics that highlight latency bottlenecks over exhaustive resource utilization data.

This shift reduces noise, minimizes data storage and processing costs, and accelerates time-to-insight.

The cost of ignoring purpose

The current approach to observability is plagued by fragmentation. Point tools like APMs, native Kubernetes instrumentation, and cloud-specific monitors operate in silos, each with its own data model and semantics. This forces engineers to manually correlate information across dashboards, increasing time to resolution and undermining efficiency. Over time, the storage, compute, and human costs of managing fragmented data become unsustainable.

Ask yourself:

  • How much of your telemetry data is redundant or irrelevant?
  • Are your engineers spending more time troubleshooting tools than resolving incidents?
  • Is your observability stack delivering insights or merely adding complexity?

Without a unified purpose and targeted analytics, observability becomes another “big data problem,” and your total cost of ownership (TCO) spirals out of control.

Causely can help

Causely transforms OpenTelemetry’s raw telemetry data into actionable insights by applying a top-down, purpose-driven approach. Instead of drowning in logs, metrics, and traces, Causely’s platform leverages built-in causal models and advanced analytics to automatically pinpoint root causes, prioritize issues based on service impact, and predict potential failures before they occur. This turns observability from a reactive big data challenge into a system that continuously assures application reliability and performance.

How Causely brings focus

Causely’s platform addresses these challenges head-on. Its causal reasoning starts with defining what matters: actionable insights to keep systems performing reliably and efficiently. Using built-in causal models and top-down analytics, Causely automatically pinpoints root causes and eliminates noise. By integrating with OTEL and other telemetry sources, Causely ensures that only the most critical data is collected, processed, and presented in real time.

For example:

  • In a microservices architecture, Causely maps dependencies and pinpoints the root cause of cascading failures, reducing MTTR.
  • With async messaging systems like Kafka, Causely pinpoints the bottlenecks that cause consumer lag or delivery failures and provides actionable context, ensuring faster resolution.
  • When third-party software is the root cause of an issue, Causely pinpoints it by analyzing the impact on dependent services.

This approach not only reduces the TCO of observability but also ensures teams can focus on delivering value rather than managing data.

How Causely works with OpenTelemetry

The Causely Reasoning Platform is a model-driven, purpose-built agentic AI system that delivers multiple AI workers built on a common data model.

Causely integrates seamlessly with OpenTelemetry, using its telemetry streams as input while applying context and intelligence to deliver precise, actionable outputs. Here’s how Causely solves common observability challenges:

  • Automated topology discovery: Causely automatically builds a dependency map of your entire environment, identifying how applications, services, and infrastructure components interact. OpenTelemetry’s traces provide raw data, but Causely’s topology discovery transforms it into a visual graph that highlights critical paths and dependencies.
  • Root cause analysis in real time: Using causal models, Causely automatically maps all potential root causes to the observable symptoms they may cause. Causely uses this mapping in real time to automatically pinpoint the root causes based on the observed symptoms, prioritizing those that directly impact SLOs. For instance, when request latency spikes are detected across multiple services, Causely pinpoints whether the spikes stem from a database query (and which database), a messaging queue (and which queue), or an external API (and which one), reducing MTTD and MTTR.
  • Proactive prevention: Beyond solving problems, Causely helps prevent them. Its analytics can simulate “what-if” scenarios to predict the impact of configuration changes, workload spikes, or infrastructure upgrades. For example, Causely can warn you if scaling down a Kubernetes node pool might lead to resource contention under expected load.

Example 1: Causely, OTEL, and microservices

In a distributed e-commerce platform, a checkout service experiences intermittent failures. OpenTelemetry traces capture the flow of requests, but the data alone doesn’t explain the root cause. Causely’s causal models analyze the traces and identify that a dependent payment service is timing out due to a slow database query. This insight allows the team to address the issue without wasting time on manual debugging.

Example 2: Causely, OTEL, and third-party software

A team using a third-party CRM API notices degraded response times during peak hours. OpenTelemetry provides metrics showing increased latency, but engineers are left guessing whether the issue lies with their application or the external service. Causely reasons about the API latency and the third-party requests and identifies that the CRM is rate-limiting requests, prompting the team to implement retry logic.

Example 3: Causely, OTEL, and async messaging with Kafka

A Kafka-based event pipeline shows sporadic delays in message processing. While OpenTelemetry traces highlight lagging consumers, they don’t explain why. Causely, reasoning about the behavior of the consumer microservices, identifies the root cause in the application’s mutex locking, which is slowing down consumption. The engineering team can focus on improving the locking around the shared data structure, without the messaging infrastructure team having to scale up resources or waste time debugging Kafka.
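
As a sketch of the pattern involved (illustrative Go, not the actual application code): a consumer that holds a single mutex for the entire duration of message processing serializes all consumers on that lock, so lag grows even though Kafka itself is healthy. Narrowing the critical section removes the bottleneck.

package consumer

import "sync"

type orderStore struct {
    mu     sync.Mutex
    orders map[string][]byte
}

func newOrderStore() *orderStore {
    return &orderStore{orders: make(map[string][]byte)}
}

// Coarse locking: the mutex is held while the (potentially slow) processing
// runs, so concurrent consumers queue behind each other and consumer lag grows.
func (s *orderStore) handleMessageCoarse(key string, value []byte, process func([]byte) []byte) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.orders[key] = process(value) // heavy work inside the critical section
}

// Fine-grained locking: do the heavy work outside the critical section and
// lock only around the shared-map update, so consumers proceed in parallel.
func (s *orderStore) handleMessageFine(key string, value []byte, process func([]byte) []byte) {
    result := process(value) // heavy work without holding the lock
    s.mu.Lock()
    s.orders[key] = result
    s.mu.Unlock()
}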

Reducing the big data burden

Causely’s approach minimizes the data burden by focusing on relevance. Unlike traditional observability stacks that collect and store massive volumes of telemetry data, Causely processes raw metrics and traces locally, pushing only relevant context (e.g., topology and symptoms) to its backend analytics. This reduces storage and compute costs while ensuring engineers get the insights they need without delay.

Conclusion: Transforming observability with Causely

OpenTelemetry has redefined observability by standardizing how telemetry data is collected and processed, but its bottom-up approach leaves teams overwhelmed by the sheer volume of logs, metrics, and traces. Observability shouldn’t be about how much data you collect—it’s about how much insight you can gain to keep your systems running efficiently. Without clear prioritization and contextual insights, the observability stack can quickly become a costly burden—both in terms of infrastructure and engineering time.

Causely integrates seamlessly with OpenTelemetry and helps bring order to the chaos, empowering teams to make smarter, faster decisions that directly impact reliability and user experience. Causely uses causal models, automated topology discovery and real-time analytics to pinpoint root causes, prevent incidents, and optimize performance. This reduces noise, eliminates unnecessary data collection, and allows teams to focus on delivering reliable systems rather than managing observability overhead.

Ready to move beyond data overload and transform your observability strategy? Book a demo or start your free trial to see how Causely can help you take control of your telemetry data and build more reliable cloud-native applications.
