J. Kalyana Sundaram

81 Followers
70 Following
21 Posts

Love to demystify topics related to Reliability, Perf, and Observability of cloud systems.

Principal Software Engineer in Azure @ Microsoft, working on building Observability platforms. Previous contributions include Azure Event Grid, Windows 10 Search, BizTalk RFID server, Windows Update, etc.

I also serve as co-chair of the W3C Distributed Tracing Working Group, where we develop the TraceContext and Baggage specifications.

Love hiking, singing, board games, travel.

All opinions my own.

Blog: https://blog.techlanika.com

Used an app to measure the sound levels of my electric razor: It is 80+ dB. I had always felt that it was very loud but hadn't done anything about it. Now, I have started using it with regular earplugs and it is much better!

The blender in my kitchen is likely 90+ dB, so I'm planning to use earplugs for it too.

I realized only recently that the decibel scale is a logarithmic scale. So, compared to 60 dB, 80 dB is 100 times more intense and 4 times louder.
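The arithmetic behind that claim, as a quick sketch (using the usual rules of thumb: intensity grows 10x per 10 dB, while perceived loudness roughly doubles per 10 dB):

```python
# A +20 dB jump (60 dB -> 80 dB) means 10^(20/10) = 100x the
# sound intensity, but only about 2^(20/10) = 4x the perceived
# loudness, since loudness roughly doubles every +10 dB.

def intensity_ratio(db_from: float, db_to: float) -> float:
    """Ratio of sound intensities for a given dB difference."""
    return 10 ** ((db_to - db_from) / 10)

def loudness_ratio(db_from: float, db_to: float) -> float:
    """Approximate perceived-loudness ratio (2x per 10 dB)."""
    return 2 ** ((db_to - db_from) / 10)

print(intensity_ratio(60, 80))  # 100.0
print(loudness_ratio(60, 80))   # 4.0
```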

Musical instruments that my kids play are at 90 - 95 dB - dangerous sound levels. Ordered some "music earplugs" that won't muffle the sound...

The good news is that this SDK was instrumented using OpenTelemetry. So, it emitted spans for various key operations.

Due to the extensible nature of OpenTelemetry, all the spans (whether from this SDK or emitted by our own service) can be exported to a backend of choice, where the trace can be reconstructed.

By viewing such a representative trace, we were able to pinpoint the exact sub-operation within the SDK that was causing the delay.

As more such libraries instrument using OpenTelemetry, Distributed Tracing can improve Observability even WITHIN the context of a single application/service, and save several hours of back-and-forth investigations...

This is the beauty of OpenTelemetry's mission "to enable effective observability by making high-quality, portable telemetry ubiquitous."

(Thread 3/3)

Typically this would involve lengthy back-and-forth interactions between the owners of the service, the owners of the SDK, the owners of the various services that the SDK is calling, etc.

Adding more logs or metrics to *our* service wouldn't have helped, of course.

So, how did Distributed Tracing help here?

(Thread 2/3)

Many people think of "Distributed Tracing" as being valuable only when MANY apps/services in a system participate in it.

But even if you enable it for just a SINGLE service in your system, it can save you hours...

Here is WHY (based on a real issue):

We have one service that uses an open-source SDK to achieve a part of its functionality.

One day, we saw an issue where calls made to this SDK were taking several seconds to complete.

Our service itself is instrumented using #OpenTelemetry Tracing APIs. But how do we diagnose this problem happening in the library it uses?

(Thread 1/3)

A short hike at Seward Park, Seattle…

Earlier this week, I read the "Tail At Scale" paper by Jeffrey Dean and Luiz André Barroso.

I really liked the intuitive techniques described in it.

I wrote the blog post below to try to draw an analogy with a physical-world example, and to summarize my main takeaways from it on:

- What is tail latency
- Why should we care about it
- Why reducing component level variability is not sufficient
- Two classes of patterns to become tail-tolerant
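One of the tail-tolerant patterns described in the paper, "hedged requests", can be sketched roughly as follows (the replica RPC is simulated with a sleep; the hedge delay would normally be set near the 95th-percentile latency):

```python
# Hedged requests: send the request to one replica; if it hasn't
# answered within a small delay, send the same request to a second
# replica and take whichever answer arrives first. This bounds the
# impact of one replica hitting its latency tail.
import concurrent.futures
import random
import time

def query_replica(replica_id: int) -> str:
    """Stand-in for a real RPC; occasionally very slow (the tail)."""
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.5]))
    return f"result-from-replica-{replica_id}"

def hedged_request(hedge_after: float = 0.05) -> str:
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(query_replica, 1)
        try:
            return first.result(timeout=hedge_after)   # fast path
        except concurrent.futures.TimeoutError:
            second = pool.submit(query_replica, 2)     # the hedge
            done, _ = concurrent.futures.wait(
                [first, second],
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            return done.pop().result()

print(hedged_request())
```

The cost is a small amount of extra load (the hedge fires only for the slowest requests), in exchange for a much shorter tail.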

https://blog.techlanika.com/reducing-tail-latency-three-patterns-to-improve-responsiveness-of-large-scale-systems-47b5664baf61

Reducing Tail Latency: Three Patterns to Improve Responsiveness of Large-Scale Systems

Let’s say you run a travel agency. You deal with customer requests to look up travel information from different datasets. To start with, you are the only person, and soon you start getting a lot of…


The recent problems with Southwest Airlines are a good example of a metastable failure at scale in the physical world:

TRIGGERS: Capacity-reducing triggers (reduced staff capacity due to sickness; snow storms in Denver, Chicago, and across the rest of the country).

AMPLIFICATION: Capacity-degradation amplification caused by a combination of factors such as:
- the point-to-point business model meant the crew was not in the right places,
- scheduling software breaking down, resulting in manual matching of flights to crews (can't even imagine how tedious this would have been... kudos to the manual schedulers),
- crew not able to communicate with the airline (!) due to phone systems being down, likely a metastable failure of the phone system itself, caused by overload from customers trying to reach the airline for rescheduling.

So, even if a flight had been matched to a crew, the crew might not have been aware of that assignment! Thus, even as "system capacity" (airports, flights, crew) started becoming available, it couldn't be used effectively…

MITIGATION: As with many metastable failures, the mitigation was load shedding: they temporarily reduced the number of flights to about one-third of the usual number…
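In software terms, load shedding is straightforward admission control: cap the work you accept so the system can drain its backlog instead of amplifying it. A minimal sketch (the class and capacity numbers are illustrative, not from any particular system):

```python
# Load shedding as a metastable-failure mitigation: once in-flight
# work reaches capacity, reject new work outright rather than
# queueing it, so the backlog cannot grow without bound.
class LoadShedder:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_flight = 0

    def try_admit(self) -> bool:
        if self.in_flight >= self.capacity:
            return False          # shed: reject instead of queueing
        self.in_flight += 1
        return True

    def complete(self) -> None:
        self.in_flight -= 1

shedder = LoadShedder(capacity=2)
print([shedder.try_admit() for _ in range(3)])  # [True, True, False]
```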

Looks like the airline was running the system in an extremely vulnerable state: optimizing for fast turnaround to improve efficiency and packing the schedule without any headroom to absorb overloads caused by capacity degradation.

Hope they do a thorough incident analysis using the metastable failure framework and make improvements…

References:

https://www.cnn.com/2022/12/27/business/southwest-airlines-service-meltdown/index.html

https://www.cnn.com/2022/12/29/business/southwest-airlines-service-meltdown/index.html

Just published this blog post on the beauty of the 4 positive feedback loops that writing brings to the table:

https://blog.techlanika.com/4-reasons-why-you-should-write-online-487e1b5831dd

Feedback/comments welcome.

Happy Holidays!

4 Reasons Why You Should Write Online - J. Kalyana Sundaram - Medium

It is almost the end of 2022. In this post, I want to reflect on how writing creates positive feedback loops. Earlier this year, I wrote a post called “Metastable Failures in Distributed Systems…


To read an industry/research paper, understand and internalize it reasonably well, and, more importantly, *retain* the learnings from it (so as to be able to apply them later), two forcing functions I have found useful are:

- Presenting/talking about it in a forum; committing to doing this creates an accountability/deadline to get it done.

(and/or)

- Writing/blogging about it in my own words.

[Edit: And of course, the above should be done with the goal that attendees or readers get good value out of it (since they are "paying" for it with their time)].

Along those lines, I had the opportunity to share my learnings from the TAOBench paper (https://www.vldb.org/pvldb/vol15/p1965-cheng.pdf) last month in the "Distributed Systems Reading Group" (thanks for the opportunity!) that's organized by Aleksey Charapko (http://charap.co/).

In case you find it useful, here's a link to my presentation: https://www.youtube.com/watch?v=PClXmtEetgA.

The same channel (the DistSys Reading Group channel) also has recordings of many other presentations.

Best of Metadata in 2022

It is time to reflect back on the year. Here are some selected papers I covered in 2022. Production systems Amazon Aurora: Design Considerat...