This is a great writeup of a DB corruption bug and its detection and resolution. Much respect to Claire: these "things that should always happen in the right order have happened in the wrong order because of some particular set of extreme conditions, with surprise downstream consequences" bugs are absolutely the worst. "Let's reason backwards from effects to causes, with the caveat that causality maybe sometimes doesn't exist" is so hard.

https://thomasp.vivaldi.net/2023/07/28/what-happened-to-vivaldi-social/

What happened to Vivaldi Social? | Thomas Pike’s other blog

A deep dive into the events of Saturday 8 July 2023, when user accounts started disappearing from the Vivaldi Social Mastodon instance.

A few things jumped out at me that were barely mentioned though, particularly around practices at the margins. Editing a whole-DB SQL file in Vim when you're exhausted? Using the edited file without tool validation first? This is a good incident report, but IMO they could stand to do a proper retrospective.

@mhoye

“I know of no case study in history that describes an organization that has been managed out of a crisis. Every single one of them was led.” - Simon Sinek, "Leaders Eat Last"

@mhoye also wtf, splitting by characters not lines?
@mhoye Wow! Very thankful for sysadmins!
@mhoye At one point I was thinking the cause would be something like this xkcd: https://xkcd.com/327/
Exploits of a Mom


@mhoye

This is the kind of post-mortem report that you want to see.

I do not believe I would have even considered trying to edit a 54GB file.

cc:@jerry

@mhoye What jumped out at me was the fact that a significant percentage of the Mastodon dev team were engaged over the weekend on this issue. Thanks to them for pitching in, but it doesn’t seem sustainable long term. They need to grow their team or risk burnout and stalling the project. Not sure what plans they have in the works to do that?
@dschwarz that struck me as well, but to understand that we’d need more insight into the frequency of that kind of event than I have. Black swan events are actually fine if you learn from them, but they should be relatively rare.

@mhoye this is not stated explicitly in the postmortem, but it appears to me that the root cause is that their database setup uses _asynchronous_ replication while performing read queries on the secondary (slave) server.

If I'm right, this wouldn't have happened with synchronous replication (at the expense of a huge drop in write performance).

Having deployed this kind of setup for a customer, an event like this was my #1 fear, and I asked my customer to explicitly assume the risks.
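
To make the failure mode concrete, here is a toy in-memory sketch of why reading from an asynchronously replicated secondary can return stale data. It is an illustration only: the `ToyPrimary` class, the key names, and the example URL are all made up, and it models neither PostgreSQL's replication protocol nor the setup described in the postmortem.

```python
from collections import deque


class ToyPrimary:
    """Toy model of a primary with one replica (hypothetical, not a real DB)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.data = {}          # state on the primary
        self.replica = {}       # state on the secondary
        self.pending = deque()  # writes not yet applied to the secondary

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: the write is applied to the replica before returning.
            self.replica[key] = value
        else:
            # Asynchronous: the write is queued and applied some time later.
            self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def catch_up(self):
        # Apply every queued write to the replica, in order.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


url = "https://example.social/users/alice"  # hypothetical value

async_db = ToyPrimary(synchronous=False)
async_db.write("account:1:url", url)
stale = async_db.read_from_replica("account:1:url")  # replica has not caught up
async_db.catch_up()
fresh = async_db.read_from_replica("account:1:url")

sync_db = ToyPrimary(synchronous=True)
sync_db.write("account:1:url", url)
sync_read = sync_db.read_from_replica("account:1:url")

print(stale, fresh, sync_read)
```

In the asynchronous case the first read sees nothing because the write is still queued; in the synchronous case every write pays the cost of updating the replica before returning, which is the write-performance trade-off mentioned above.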

@dek @mhoye that, and the fact that creating the account and setting its URL are not done in the same transaction, which is an application design issue.
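
A minimal sketch of the "same transaction" point, using Python's built-in `sqlite3` purely for illustration. The `accounts` table, its columns, the `create_account` helper, and the `fail_before_url` flag are all hypothetical and are not Mastodon's actual schema or code. The idea is only that if the insert and the URL update share one transaction, a failure between them leaves no half-created row behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT NOT NULL, url TEXT)"
)


def create_account(conn, username, url, fail_before_url=False):
    # `with conn` commits if the block succeeds and rolls back if an
    # exception escapes, so both statements are applied together or not at all.
    with conn:
        cur = conn.execute(
            "INSERT INTO accounts (username) VALUES (?)", (username,)
        )
        account_id = cur.lastrowid
        if fail_before_url:
            # Hypothetical failure injected between the two steps.
            raise RuntimeError("simulated crash between the two steps")
        conn.execute(
            "UPDATE accounts SET url = ? WHERE id = ?", (url, account_id)
        )
    return account_id


alice_id = create_account(conn, "alice", "https://example.social/users/alice")

try:
    create_account(
        conn, "bob", "https://example.social/users/bob", fail_before_url=True
    )
except RuntimeError:
    pass  # the INSERT for bob was rolled back along with the failed transaction

rows = conn.execute("SELECT username, url FROM accounts ORDER BY id").fetchall()
print(rows)
```

With the two statements in separate transactions, the failed call would instead leave a `bob` row with a NULL `url`, which is the kind of intermediate state a lagging reader could pick up.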