The Matrix.org Foundation Sep 2, 2025

the matrix.org homeserver is having problems: https://status.matrix.org/incidents/mm9hdm78svgv apologies for the inconvenience…

Database incident

Matrix's Status Page - Database incident.

Metron Sep 2, 2025

The Matrix.org Foundation

So: the matrix.org database secondary lost its FS due to a RAID failure earlier today (11:17 UTC). Then, we lost the primary at 17:26. We're trying to restore the primary DB FS (which could be fastish), while also doing a point-in-time backup restore from last night (which takes >10h). We believe the incremental DB traffic since last night is intact however. Apologies for the downtime; folks on their own homeserver are of course not impacted.

The Matrix.org Foundation Sep 2, 2025

Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic. Huge apologies for the outage. Again, folks using their own homeservers are not impacted.

The Matrix.org Foundation Sep 3, 2025

Status update: we’re 47TB through restoring the 55TB db snapshot of the matrix.org DB, but then have to rebuild the DB and replay the subsequent 17h of DB traffic, which will take several hours. Thank you for your patience, and apologies once again for the outage.

The Matrix.org Foundation Sep 3, 2025

Status update: we've restored the 55TB snapshot and subsequent incremental backups, and are about to replay the remaining traffic since the backup. There are still several unknowns, but if things go well the matrix.org instance should be back in 3-4 hours.

The Matrix.org Foundation Sep 3, 2025

Right, matrix.org is back online as of 17:00 UTC. The server is struggling a bit as it catches up. Huge apologies again for the outage; postmortem + ways to avoid a repeat will be forthcoming. See also https://www.theregister.com/2025/09/03/matrixorg_raid_failure/ & https://www.heise.de/en/news/Matrix-main-server-down-millions-of-users-affected-10630524.html. Thanks all for your patience.

Matrix.org homeserver grinds to a halt after RAID meltdown: Engineers wrangle 55 TB restore and traffic replay as millions of messages queue up (The Register)

kontrollierterWahnwitz Sep 3, 2025

@matrix I’m really interested in your post mortem from a professional point of view.

altf4 Sep 3, 2025

@matrix welcome back !

Vince (biometrically enrolled) Sep 3, 2025

@matrix I should really grab that funny domain name I've been eyeing and host my own instance.

지지 ᚠᚱᛖᛃᚨ Daniel 黄法官 CyReVolt Sep 3, 2025

@matrix 🥺

That must have been rough and tough.
We love you! 🧡

Cavallo Pazzo Sep 3, 2025

@matrix Thank you!

Thomas Frans 🇺🇦 Sep 3, 2025

@matrix Thanks to all the incredible people at Matrix who managed to fix this. This must have been a horrible, stressful day.

Jennifer Moore 😷 Sep 3, 2025

@matrix

Thanks for the updates as it went along!

And thanks to everyone who contributed to the fix!

С. Sep 3, 2025

@matrix On my end, I still have issues when trying to log in.

The Matrix.org Foundation Sep 3, 2025

@THB_STX we’re not aware of any issues - can you send details to [email protected] please?

Bart Sep 3, 2025

@matrix props for the transparency, and my well wishes and a good night's sleep to all engineers involved ❤️

Ivan Enderlin 🦀 Sep 4, 2025

@matrix Kudos for the fix 💪!

penguin42 Sep 4, 2025

@matrix Well at least it wasn't the xmas holidays this time 🙂

AJCxZ0 Sep 4, 2025

Congratulations on the recovery, @matrix

While the postmortem should focus on what went wrong and how any likely reoccurrence of failures can be mitigated at acceptable cost, be sure to celebrate the successful recovery from catastrophic failure in production *without loss of data*, including meaningful communication to us.
Many organisations with far more resources and responsibilities fail to achieve even a fraction of this.

Bart Sep 5, 2025

@AJCxZ0 @matrix much love from a sysadmin managing the Synapse server for blender.org (we don’t federate but still!)

Mr Creosote Sep 6, 2025

@matrix Thank you for taking this gargantuan effort of restoration! It seems the Afternet bridge is still down. Even the channel search answers with an error. Any chance this could be restored?

The Matrix.org Foundation Sep 6, 2025

@mr_creosote hm, we don’t run an afternet bridge as matrix.org; it must be run by someone else who you’ll need to nudge - sorry!

Kalos Sep 3, 2025

Not long to go now... what a pain.

PoLiTiPeT Sep 3, 2025

@matrix so you're back online it seems. Thanks 👍 😘

Ivan Enderlin 🦀 Sep 3, 2025

@matrix 💪

T_X Sep 3, 2025

@matrix weirdly this actually feels like a positive example reinforcing the idea of a decentralized fediverse, as other instances are unaffected. Also, we had been discussing running our own instance at the @chaotikumev just before the outage.
I just wish there were such an easy, neat account migration feature like @Mastodon has. (And I guess I can't just ex- and import chats + keys and use SRV records to have a seamless migration?)

T_X Sep 3, 2025

@matrix but thanks for working on this issue and sorry for whoever might have been working overtime now...

MoveFastAndFixThings Sep 2, 2025

@matrix @lloydw !!! 👀

Frisk Sep 2, 2025

@matrix This sounds to me like a stressful 24 hours for the infrastructure operators of matrix.org. Please accept complimentary hugs

Ben Sep 2, 2025

@matrix Thanks for keeping the updates coming, hopefully no more wrenches get thrown into the mess!

@matrix better to run your own; I have had very good experiences with the #conduit server https://conduit.rs

Conduit - Your own chat server

Conduit is a simple, fast and reliable chat server powered by Matrix. Conduit is an alternative to Synapse and tries to be lightweight and easy to install, but it is still in development.

Thibaultmol 🌈 Sep 2, 2025

@matrix best wishes for the team working on the recovery!

Simon Carpentier Sep 2, 2025

@matrix hugs for you all!

luca0N! Sep 3, 2025

@matrix I appreciate the transparency and in-depth explanation. Best of luck with the restoration.

AJCxZ0 Sep 3, 2025

@matrix Godspeed, admins!

iooioio Sep 3, 2025

@matrix Much love to the team. This incident is a reminder to me of how stable the service has been so far.

Bernie Sep 3, 2025

@matrix Any plans to migrate away from centralized RDBMS? There are so many blob stores which can scale to petabytes and can tolerate the loss of multiple nodes without going offline.

interru Sep 3, 2025

@codewiz @matrix So you advocate for A (availability) of the CAP-theorem. Now the question is do you choose consistency or partition-tolerance. You can't have both.

To answer your question: Probably no, because it would be counterproductive. And Matrix itself is a distributed database if you squint your eyes enough.

Bernie Sep 3, 2025

@interru Good point. Google's Bigtable picked P: when two replicas can't communicate for some time, the replication log grows on both sides, and will eventually get synced with some policy (e.g. highest timestamp wins).

While horrifying for a banking system, it's probably a fine compromise for an IM.
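
A minimal sketch of the "highest timestamp wins" policy described above, for illustration only. It does not reflect how Bigtable or Synapse is implemented; the `Write` record, the `merge_last_write_wins` helper, and the sample keys are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: float  # time assigned by the replica that accepted the write


def merge_last_write_wins(log_a, log_b):
    """Merge two replication logs: per key, keep the write with the highest timestamp.

    On an exact timestamp tie, the write seen first (log_a before log_b) is kept.
    """
    merged = {}
    for write in list(log_a) + list(log_b):
        current = merged.get(write.key)
        if current is None or write.timestamp > current.timestamp:
            merged[write.key] = write
    return merged


# Two replicas that accepted writes independently while partitioned.
replica_a = [
    Write("room:1/topic", "Outage chat", 100.0),
    Write("user:7/name", "Alice", 105.0),
]
replica_b = [Write("room:1/topic", "Recovery chat", 110.0)]

state = merge_last_write_wins(replica_a, replica_b)
```

The older write to `room:1/topic` is silently discarded, which is the trade-off the post calls horrifying for a bank but tolerable for an IM.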

Bernie Sep 3, 2025

@interru Also, I don't know much about how Matrix federation works, but doesn't matrix.org need to store all messages for all "xyx:matrix.org" rooms, and also cache messages of rooms hosted elsewhere if at least one local user has joined them?

Sounds like every large server will eventually process almost every large room in the entire network...

Marcos Dione Sep 3, 2025

@matrix just an idea to improve backups:

Make exponential-backoff-style backups: last month, 2-3 months ago, 4-6 months ago, 7-12 months ago, 2-3 years ago, etc. Or with N messages instead of N days.

Sounds like you could recover the fresher data first, then catch up, then restore backwards.

#backup #SysAdmin
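
A rough sketch of the tiering idea above, assuming each backup tier covers twice as many months as the one before it (the post's own tiers are looser than strict doubling). The `backup_tiers` helper and the 36-month horizon are hypothetical.

```python
def backup_tiers(total_months, first_width=1):
    """Return (start, end) age ranges in months, newest first.

    Each tier is twice as wide as the previous one; the last tier is
    truncated at total_months.
    """
    tiers = []
    start, width = 1, first_width
    while start <= total_months:
        end = min(start + width - 1, total_months)
        tiers.append((start, end))
        start = end + 1
        width *= 2
    return tiers


# Restoring in list order brings back the freshest data first.
tiers = backup_tiers(36)
for start, end in tiers:
    print(f"restore data aged {start}-{end} months")
```

With a 36-month horizon this yields six tiers, so the most recent month could be restored on its own before the older, larger tiers.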

Fabian N. T. Sep 3, 2025

@mdione afaik, Matrix’ data structures are a chain of events and to verify the current state you need all(?) historical data. There are probably workarounds and servers only expose a subset for performance reasons, but my *guess* is, the underlying data wants to be complete in general.

Marcos Dione Sep 3, 2025

@fabian uh, sounds like block chain for IM :) Maybe add some checkpoints from time to time?
