If you are, or have ever been, a sysadmin, grab a bag of popcorn and cruise through this one. Thrills and chills! https://thomasp.vivaldi.net/2023/07/28/what-happened-to-vivaldi-social/
What happened to Vivaldi Social? | Thomas Pike’s other blog

A deep dive into the events of Saturday 8 July 2023, when user accounts started disappearing from the Vivaldi Social Mastodon instance.

@timbray that was a great write up. Gotta love replication delays and race conditions 🙈
@timbray Wow, that took me back to my sysadmin days ... I'm appreciating retirement a little more after reading that adventure.
@timbray Ouch, that made me wince. I dealt with a similarly convoluted issue a couple weeks ago. Shout out to all the devs and sys admins who keep this platform running!

@timbray oh my, so, umm: a race condition, mismatched parentheses, call by ref vs. val, a possible Unicode string ... only missing an off-by-one error for Yahtzee.

This is a very good write up and I wish we got more that are this candid.

@timbray I'm glad they were able to restore the accounts. I'm a Vivaldi browser user (for my work tasks) but was already active on a good server (ruby.social) when they started with Mastodon and didn't sign up.

As an amateur sysadmin, I've had a LOT of trouble with my own attempts at #Mastodon/#Akkoma/#Pleroma/#GoToSocial server installation, and I can't imagine running it at such a high level.

@timbray Neatly described.

My only beef is the passage “all local accounts in a Mastodon instance have a null value in their URI field, so they all matched”. But null ≠ null, so AccountMergingWorker isn’t matching correctly.

@denspier It's Ruby, no? In Ruby nil == nil.
@timbray I understood it was in PostgreSQL? Regardless, if null is the default value, and all local accounts have null, null and null cannot be regarded as a match, regardless of language.
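
The two notions of null in this exchange can be checked side by side. This is a minimal sketch under stated assumptions: SQLite stands in for PostgreSQL (both follow the SQL rule that `NULL = NULL` evaluates to NULL rather than true), Python's `None` stands in for Ruby's `nil`, and the `accounts` table with its columns is hypothetical, not Mastodon's actual schema. It does not claim to reproduce what AccountMergingWorker did; it only shows that an equality filter and an `IS NULL` filter behave differently on the same rows.

```python
import sqlite3

# Hypothetical table: two "local" accounts with a NULL uri, one remote account.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT, uri TEXT)")
conn.executemany(
    "INSERT INTO accounts (username, uri) VALUES (?, ?)",
    [("alice", None), ("bob", None), ("carol", "https://remote.example/users/carol")],
)

# In SQL, NULL = NULL is NULL (unknown); sqlite3 hands that back as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# An equality filter bound to NULL therefore matches no rows.
eq_matches = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE uri = ?", (None,)
).fetchone()[0]

# An IS NULL filter matches every row whose uri is NULL.
is_null_matches = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE uri IS NULL"
).fetchone()[0]

# In the application language, the null value compares equal to itself.
app_level_equal = (None == None)  # noqa: E711

print(null_equals_null, eq_matches, is_null_matches, app_level_equal)
```

So both replies can be right at once: a raw SQL `=` comparison never matches two nulls, while a lookup phrased as `IS NULL`, or a comparison done in application code, treats them all as the same value.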
@vitex @gandalf well, luckily we don't use Postgres replication. Otherwise it usually turns out to be some stupidity like this and not an attack..
@timbray that reminded me of this article about the exact same problem that caused the issue: https://brandur.org/job-drain
Transactionally Staged Job Drains in Postgres — brandur.org
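
A rough sketch of the idea named in that article title, as I understand it: the job is written to a staging table in the same transaction as the data it depends on, and a separate drain step moves committed rows into the real queue. Everything here is an assumption for illustration: SQLite stands in for Postgres, a Python list stands in for the job queue, and the table, function, and job names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute(
    "CREATE TABLE staged_jobs (id INTEGER PRIMARY KEY, job_name TEXT, account_id INTEGER)"
)
conn.commit()

queue = []  # stand-in for a real job queue


def create_account(username, fail=False):
    """Insert the account and stage its follow-up job in one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute("INSERT INTO accounts (username) VALUES (?)", (username,))
            conn.execute(
                "INSERT INTO staged_jobs (job_name, account_id) VALUES (?, ?)",
                ("send_welcome", cur.lastrowid),
            )
            if fail:
                raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        pass  # both inserts were rolled back together


def drain_staged_jobs():
    """Move committed staged jobs into the queue, then delete them."""
    with conn:
        rows = conn.execute(
            "SELECT id, job_name, account_id FROM staged_jobs ORDER BY id"
        ).fetchall()
        for job_id, job_name, account_id in rows:
            queue.append((job_name, account_id))
            conn.execute("DELETE FROM staged_jobs WHERE id = ?", (job_id,))
    return len(rows)


create_account("alice")
create_account("bob", fail=True)  # rolled back: no account, no staged job
drained = drain_staged_jobs()
remaining = conn.execute("SELECT COUNT(*) FROM staged_jobs").fetchone()[0]
account_count = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
print(drained, queue, remaining, account_count)
```

The point of the design is that a worker can never pick up a job whose data was rolled back or is not yet committed, because the job only becomes visible to the drain step once the transaction that created it has committed.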

@timbray this is why I have resisted starting my own server. That's quite a harrowing experience, taking days to fully resolve even with a quite technical team with upstream dev support, all putting in heroic hours. I ain't got that kind of time.

@timbray As much as I like the writeup, it again illustrates one issue I keep having with Devs as a Sysadmin: they rarely plan for scaling.

Running read queries against read-only replication targets that might have a delay is one of the best ways to increase performance, response times, and uptime, so it should be supported in some way.
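
One way "supported in some way" could look is a read path that defaults to the replica but lets a caller ask for the primary when it cannot tolerate stale data. This is a toy in-memory simulation, not a description of how Mastodon or any database driver routes queries; the `Node` and `Router` classes and the `require_fresh` flag are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)


@dataclass
class Router:
    primary: Node
    replica: Node
    pending: list = field(default_factory=list)  # writes not yet replicated

    def write(self, key, value):
        # Writes always go to the primary and reach the replica later.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Simulates the replica catching up after a delay.
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()

    def read(self, key, require_fresh=False):
        # Cheap reads use the replica; reads that must see the latest
        # write are sent to the primary instead.
        node = self.primary if require_fresh else self.replica
        return node.name, node.data.get(key)


router = Router(primary=Node("primary"), replica=Node("replica"))
router.write("user:1", "alice")

stale = router.read("user:1")                      # replica has not caught up
fresh = router.read("user:1", require_fresh=True)  # primary has the write
router.replicate()
caught_up = router.read("user:1")                  # replica now has it too
print(stale, fresh, caught_up)
```

Code that acts on a row it was just told about, such as a background worker, is the kind of caller that would pass `require_fresh=True` in this sketch, while timelines and other tolerant reads keep the performance benefit of the replica.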

@timbray Better write-up, more interesting and more suspenseful than half of the movies you can watch.