Many people think of "Distributed Tracing" as being valuable only when MANY apps/services in a system participate in it.

But even if you enable it for just a SINGLE service in your system, it can save you hours...

Here is WHY (based on a real issue):

We have a service that uses an open-source SDK for part of its functionality.

One day, we hit an issue where calls made through this SDK were taking several seconds to complete.

Our service itself is instrumented using #OpenTelemetry Tracing APIs. But how do we diagnose a problem that occurs inside a library it uses?

(Thread 1/3)

Typically, diagnosing this would involve lengthy back-and-forth between the owners of the service, the owners of the SDK, the owners of the various downstream services the SDK calls, etc.

Adding more logs or metrics to *our* service wouldn't have helped, of course: the delay was somewhere inside the SDK, not in our code.

So, how did DT help here?

(Thread 2/3)

The good news is that this SDK was instrumented using OpenTelemetry. So, it emitted spans for various key operations.

Thanks to OpenTelemetry's extensible design, all the spans (from the SDK and from our own service alike) can be exported to a backend of choice, where the full trace is reconstructed.

By viewing such a representative trace, we were able to pinpoint the exact sub-operation within the SDK that was causing the delay.
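To make "pinpointing the sub-operation" concrete, here is a toy sketch in plain Python. This is NOT the OpenTelemetry data model or our actual trace; it just illustrates how a reconstructed span tree narrows a multi-second call down to one culprit (all span names and timings are invented):

```python
# Toy model of a reconstructed trace: a tree of (name, duration) spans.
# Illustration only; not the OpenTelemetry SDK.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: float
    children: list = field(default_factory=list)

def slowest_leaf(span):
    """Walk the span tree and return the slowest span with no children."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)

# A trace shaped like the one we saw: our span wraps the SDK's spans.
trace_root = Span("our-service:handle_request", 4200, [
    Span("sdk:operation", 4150, [
        Span("sdk:auth", 40),
        Span("sdk:upload_chunk", 4050),   # the hidden culprit
        Span("sdk:finalize", 35),
    ]),
])

print(slowest_leaf(trace_root).name)  # sdk:upload_chunk
```

A trace viewer does essentially this for you visually: the waterfall makes the one wide bar under the SDK's span impossible to miss.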

As more libraries adopt OpenTelemetry instrumentation, Distributed Tracing can improve Observability even WITHIN a single application/service, and save hours of back-and-forth investigation...

This is the beauty of OpenTelemetry's mission "to enable effective observability by making high-quality, portable telemetry ubiquitous."

(Thread 3/3)