the matrix.org homeserver is having problems: https://status.matrix.org/incidents/mm9hdm78svgv apologies for the inconvenience…
So: the matrix.org database secondary lost its FS due to a RAID failure earlier today (11:17 UTC). Then, we lost the primary at 17:26. We're trying to restore the primary DB FS (which could be fastish), while also doing a point-in-time backup restore from last night (which takes >10h). We believe the incremental DB traffic since last night is intact however. Apologies for the downtime; folks on their own homeserver are of course not impacted.
Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic. Huge apologies for the outage. Again, folks using their own homeservers are not impacted.
Status update: we’re 47TB through restoring the 55TB db snapshot of the matrix.org DB, but then have to rebuild the DB and replay the subsequent 17h of DB traffic, which will take several hours. Thank you for your patience, and apologies once again for the outage.
Status update: we've restored the 55TB snapshot and subsequent incremental backups, and are about to replay the remaining traffic since the backup. There are still several unknowns, but if things go well the matrix.org instance should be back in 3-4 hours.
Right, matrix.org is back online as of 17:00 UTC. The server is struggling a bit as it catches up. Huge apologies again for the outage; postmortem + ways to avoid a repeat will be forthcoming. See also https://www.theregister.com/2025/09/03/matrixorg_raid_failure/ & https://www.heise.de/en/news/Matrix-main-server-down-millions-of-users-affected-10630524.html. Thanks all for your patience.
@matrix I’m really interested in your post mortem from a professional point of view.
@matrix I should really grab that funny domain name I've been eyeing and host my own instance.

@matrix 🥺

That must have been rough and tough.
We love you! 🧡

@matrix Thanks to all the incredible people at Matrix who managed to fix this. This must have been a horrible, stressful day.

@matrix

Thanks for the updates as it went along!

And thanks to everyone who contributed to the fix!

@matrix On my end, I still have issues when trying to log in.
@THB_STX we’re not aware of any issues - can you send details to [email protected] please?
@matrix props for the transparency, and my well wishes and a good night's sleep to all engineers involved ❤️
@matrix Well at least it wasn't the xmas holidays this time 🙂

Congratulations on the recovery, @matrix

While the postmortem should focus on what went wrong and how any likely reoccurrence of failures can be mitigated at acceptable cost, be sure to celebrate the successful recovery from catastrophic failure in production *without loss of data*, including meaningful communication to us.
Many organisations with far more resources and responsibilities fail to achieve even a fraction of this.

@AJCxZ0 @matrix much love from a sysadmin managing the Synapse server for blender.org (we don’t federate but still!)
@matrix Thank you for taking this gargantuan effort of restoration! It seems the Afternet bridge is still down. Even the channel search answers with an error. Any chance this could be restored?
@mr_creosote hm, we don’t run an afternet bridge as matrix.org; it must be run by someone else who you’ll need to nudge - sorry!
Not long to go now... what a pain.
@matrix so you're back online it seems. Thanks 👍 😘
@matrix weirdly this actually feels like a positive example reinforcing the idea of a decentralized fediverse, as other instances are unaffected. Also, we had been discussing running our own instance at the @chaotikumev just before the outage.
I just wish there were such an easy, neat account migration feature like @Mastodon has. (And I guess I can't just ex- and import chats + keys and use SRV records to have a seamless migration?)
@matrix but thanks for working on this issue and sorry for whoever might have been working overtime now...
@matrix This screams "stressful 24 hours" to me for the infrastructure operators of matrix.org. Please accept complimentary hugs
@matrix Thanks for keeping the updates coming, hopefully no more wrenches get thrown into the mess!
@matrix better run it ur own, i have very good experience with #conduit server https://conduit.rs
@matrix best wishes for the team working on the recovery!
@matrix I appreciate the transparency and in-depth explanation. Best of luck with the restoration.
@matrix Much love to the team. This incident is a reminder to me of how stable the service has been so far.
@matrix Any plans to migrate away from centralized RDBMS? There are so many blob stores which can scale to petabytes and can tolerate the loss of multiple nodes without going offline.
@codewiz @matrix So you advocate for A (availability) of the CAP-theorem. Now the question is do you choose consistency or partition-tolerance. You can't have both.

To answer your question: Probably not, because it would be counterproductive. And Matrix itself is a distributed database if you squint your eyes enough.

@interru Good point. Google's Bigtable picked P: when two replicas can't communicate for some time, the replication log grows on both sides, and will eventually get synced with some policy (e.g. highest timestamp wins).

While horrifying for a banking system, it's probably a fine compromise for an IM.
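To make the "highest timestamp wins" policy concrete, here is a minimal sketch of a last-write-wins merge between two replicas that diverged during a partition. It is only an illustration under stated assumptions: the function name and the `{key: (timestamp, value)}` layout are hypothetical, and it does not represent how Bigtable or Matrix actually reconcile data.

```python
# Hypothetical illustration of last-write-wins reconciliation.
# Each replica is modelled as {key: (timestamp, value)}; this layout is an
# assumption for the sketch, not any real system's storage format.

def merge_last_write_wins(replica_a, replica_b):
    """Merge two replicas; for each key the entry with the higher timestamp wins.

    On an exact timestamp tie the entry from replica_a is kept.
    """
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


a = {"topic": (10, "hello"), "name": (5, "room-a")}
b = {"topic": (12, "hello again"), "avatar": (7, "cat.png")}
merged = merge_last_write_wins(a, b)
```

Here the older `topic` write is silently discarded, which is the compromise described above: tolerable for a chat topic, unacceptable for an account balance.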

@interru Also, I don't know much about how Matrix federation works, but doesn't matrix.org need to store all messages for all "xyx:matrix.org" rooms, and also cache messages of rooms hosted elsewhere if at least a local user joined them?

Sounds like every large server will eventually process almost every large room in the entire network...

@matrix just an idea to improve backups:

Make backups in exponential-backoff-like tiers: the last month, 2-3 months ago, 4-6 months ago, 7-12 months ago, 2-3 years ago, etc. Or with N messages instead of N days.

Sounds like you could recover the fresher data first, then catch up, then restore backwards.

#backup #SysAdmin
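A rough sketch of the tiering idea above, assuming backup segments can be labelled by age in months. The tier boundaries follow the ones listed in the post; the function names are hypothetical, and whether such a scheme fits a database whose state depends on full history is a separate question.

```python
from bisect import bisect_left

# Upper bound of each tier in months, following the post:
# last month, months 2-3, 4-6, 7-12, then years 2-3.
# These boundaries are illustrative, not an actual matrix.org backup policy.
TIER_UPPER_BOUNDS_MONTHS = [1, 3, 6, 12, 36]


def tier_for_age(age_months):
    """Return the tier index (0 = freshest) for data of the given age.

    Ages beyond the last boundary fall into one extra, oldest tier.
    """
    return bisect_left(TIER_UPPER_BOUNDS_MONTHS, age_months)


def restore_order(segment_ages_months):
    """Group segments by tier and return the groups freshest tier first."""
    tiers = {}
    for age in segment_ages_months:
        tiers.setdefault(tier_for_age(age), []).append(age)
    return [sorted(tiers[t]) for t in sorted(tiers)]


plan = restore_order([0.5, 2, 8, 30, 5, 0.2])
```

Restoring the groups in this order brings the freshest data back first and leaves the oldest tiers for last.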

@mdione afaik, Matrix’ data structures are a chain of events and to verify the current state you need all(?) historical data. There are probably workarounds and servers only expose a subset for performance reasons, but my *guess* is, the underlying data wants to be complete in general.
@fabian uh, sounds like block chain for IM :) Maybe add some checkpoints from time to time?

@matrix good luck on the remediation actions 🫡

#matrixdown

@matrix This is why we need more decentralization which happens to be the goal of matrix.
@hisold @matrix when can we hope for account migration properly ? (the main problem in this case imho)
@olm_e @hisold @matrix
That would be a great feature!
@hisold @matrix The Matrix homeserver is so unwieldy and such a massive resource hog that very few people are willing to host it themselves.
@matrix jokes aside, RAID failures are NOT fun. Props for the quick reaction and godspeed!
@matrix never heard of a hot spare ?
@matrix What was the RAID failure? Have you considered using RAID-Z with ZFS?

@matrix as an advertisement for decentralization this is a bit harsh, but definitely effective!

(J/k, of course. Good luck with the recovery and thanks!)