Throughout my career, I have believed that it is essential to dig into mysterious pathological systems behavior even if it seems somewhat tangential, for it can often reveal problems deep in the system that can have much more damaging presentation.

For a vivid example of pathological behavior, a deep underlying problem, and (especially) the methodology to connect one to the other, read this extraordinary analysis from @oxidecomputer engineer Dave Pacheco:

https://github.com/oxidecomputer/omicron/issues/1146

cockroachdb crashed in Go runtime during test run: `s.allocCount != s.nelems` · Issue #1146 · oxidecomputer/omicron

@bcantrill @oxidecomputer that's not a bug report, that is a thesis

@bcantrill also this is a very solid Undergrad Moment but reading the first comment is reminding me of the compilers project i finished

two weeks ago

huh

@bcantrill I would not blame the Golang runtime if it got a restraining order against Pacheco, this seems intrusive and excessively obsessive. /s

Top tier hacking right there!

@oxidecomputer

@bcantrill @oxidecomputer
> cockroachdb

Stay. Away. From. That. Cursed. Thing.

Well unless you have the resources to fix all its problems.

@bcantrill @oxidecomputer

Read the writeup and I must say my gut reaction to the title was wrong.

That's an excellent writeup and impressive debugging. I wish my job allowed me to go down such rabbit holes.

I'm still curious how well cockroach works out for you and if you had any issues with its network timeouts yet.

@bcantrill thanks for sharing that fascinating (thesis length!) debugging tale.

It seems to be the second “compiler assumed register was preserved, in reality it wasn’t always” weird bug I’ve seen written up recently. The other one was Chrome / C++ in the context of third-party code injected into the process (which seemed to be as difficult to debug as it sounds!)

https://randomascii.wordpress.com/2022/12/14/compiler-tricks-to-avoid-abi-induced-crashes/

Compiler Tricks to Avoid ABI-Induced Crashes

Random ASCII - tech blog of Bruce Dawson

@bcantrill @oxidecomputer
The whole analysis also teaches how to analyze a problem and shows how mdb and DTrace are used to ask questions of the system to find out what's happening. Once you start using mdb and DTrace, it is hard to go back to systems that don't have those tools.

@bcantrill @oxidecomputer wow, a pretty detailed report with a VERY deep investigation. Kudos to Dave! 👍
Would make a great Netflix show 🍿

@bcantrill @oxidecomputer Great debugging and very well written notes. Makes me wish I was back in the game.

@bcantrill @oxidecomputer i look for some kind of blanket term to advocate fearlessness, the drive to keep going lower and lower through the machine until it starts making sense. that willingness to roll up your sleeves & dive in & find out is an essential mindset to have available when working with computers, is how we attain real knowledge & make informed choices.

this is an incredible tour de force going so very deep into the machine. but even just breaking open the source of the library we might be using is a critical step a lot of devs bounce off of.

trying to use Maven plugins was perhaps the greatest teacher of this lesson to me, so so long ago. most ended up being fairly direct & simple, but the documentation & interfaces, as an external user, had always left vast mystery & uncertainty & confusion.

getting lower in the machine needs stronger advocacy! yes we can! we can find out!

@oxidecomputer @bcantrill That is an epic bug hunt. “Trust nothing” is sometimes the attitude needed.

@bcantrill @oxidecomputer ... and mysteries are everywhere! Here are some screenshots of our 3am mysterious "mosquito swarm" response times from our bare metal API.