Mastodawn

After 8 years of running nonstop and serving millions of users worldwide, I just shut down 2 of the 3 servers that powered Publer 😱

And this is exactly what it felt like 🫡

Emotions aside, these servers had reached their end of life. That meant we could no longer install modern libraries or deliver the latest features and security updates to our users.

Thank you to all our customers for your patience and understanding while I migrated nearly 1TB of data to the new infrastructure yesterday.

Ervin Kalemi Mar 9

The downtime was inevitable for the time being.

In the comments, I'll share the migration steps I took and how we'll plan better for the future 👇

1. A week prior, I added a big notice on the platform and notified all users via email, in-app and push notifications of the upcoming maintenance window and downtime.

2. On Saturday evening, I launched and set up two similar AWS EC2 instances, one for the app and one for the DB. Same configurations, same setup, but with the latest libraries and services (Mongo, Postgres, Ruby, etc.).

3. On Sunday morning, as announced, I set up a Cloudflare Worker to intercept traffic to our platform (web, mobile, and API) and redirected everything except my office IP to a maintenance landing page.

4. With traffic blocked and the database no longer receiving changes, the data was dumped, transferred, and restored to the new DB server. This step unfortunately took a long time.

5. I turned off the existing servers, and assigned their IPs to the new servers to avoid any firewall and DNS configuration changes.

6. Once I confirmed that the new server and the new DB were up and running like nothing had happened, I disabled the Cloudflare Worker, marked the posts that had been scheduled during the downtime as failed so that time-sensitive posts would not go out late, and resumed all background jobs.
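For anyone curious about the "mark as failed" part of step 6, here is a minimal sketch of the idea in TypeScript. The `ScheduledPost` shape, the status values, and the failure message are all hypothetical, not Publer's actual schema; the only thing taken from the step above is the rule that a post whose scheduled time fell inside the downtime window is failed instead of being published late.

```typescript
// Hypothetical post shape for illustration; not Publer's real data model.
interface ScheduledPost {
  id: string;
  scheduledAt: Date;
  status: "scheduled" | "published" | "failed";
  failureReason?: string;
}

interface DowntimeWindow {
  start: Date; // inclusive
  end: Date;   // exclusive
}

// Returns a new list where every still-scheduled post that was due during
// the downtime window is marked as failed. Everything else is left untouched.
function failMissedPosts(
  posts: ScheduledPost[],
  window: DowntimeWindow
): ScheduledPost[] {
  return posts.map((post) => {
    const due = post.scheduledAt.getTime();
    const missed =
      post.status === "scheduled" &&
      due >= window.start.getTime() &&
      due < window.end.getTime();
    return missed
      ? { ...post, status: "failed" as const, failureReason: "Missed during maintenance" }
      : post;
  });
}

// Example with made-up dates: only the post due inside the window is failed.
const exampleWindow: DowntimeWindow = {
  start: new Date("2000-01-02T08:00:00Z"),
  end: new Date("2000-01-02T20:00:00Z"),
};
const examplePosts: ScheduledPost[] = [
  { id: "a", scheduledAt: new Date("2000-01-02T10:00:00Z"), status: "scheduled" },
  { id: "b", scheduledAt: new Date("2000-01-02T21:00:00Z"), status: "scheduled" },
];
const exampleResult = failMissedPosts(examplePosts, exampleWindow);
```

Running this before resuming background jobs matters: otherwise the job queue would pick up the overdue posts and publish them hours after their intended time.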

A big thank you to Claude for preparing the landing page in seconds and being my DevOps VA on a Sunday.
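A rough sketch of what a maintenance-mode Worker like the one in step 3 can look like, in TypeScript. This is not the actual Worker used here: the allowlisted IP is a placeholder from the documentation range, the page text is made up, and instead of redirecting to a separate landing page this version serves the maintenance response inline with a 503 status and a `Retry-After` header, which is an assumption on my part. `CF-Connecting-IP` is the header Cloudflare uses to pass along the visitor's IP.

```typescript
// Placeholder values: swap in the real office IP and a realistic estimate.
const ALLOWED_IPS = new Set<string>(["203.0.113.10"]);
const RETRY_AFTER_SECONDS = 3600;

// Only allowlisted IPs are let through to the origin during maintenance.
function isAllowed(ip: string | null): boolean {
  return ip !== null && ALLOWED_IPS.has(ip);
}

// API clients get JSON, browsers get a small HTML page. Both get a 503 so
// clients and crawlers treat the outage as temporary.
function maintenanceResponse(wantsJson: boolean): Response {
  const headers = new Headers({
    "Retry-After": String(RETRY_AFTER_SECONDS),
    "Cache-Control": "no-store",
    "Content-Type": wantsJson ? "application/json" : "text/html; charset=utf-8",
  });
  const body = wantsJson
    ? JSON.stringify({ error: "maintenance" })
    : "<!doctype html><title>Maintenance</title><h1>We'll be back soon</h1>";
  return new Response(body, { status: 503, headers });
}

// passThrough defaults to fetch, which forwards the request to the origin
// when this runs inside a Worker.
async function handleRequest(
  request: Request,
  passThrough: (r: Request) => Promise<Response> = (r) => fetch(r)
): Promise<Response> {
  if (isAllowed(request.headers.get("CF-Connecting-IP"))) {
    return passThrough(request);
  }
  const accept = request.headers.get("Accept") ?? "";
  return maintenanceResponse(accept.includes("application/json"));
}

// In a Worker using module syntax, this would be wired up roughly as:
// export default { fetch: (request: Request) => handleRequest(request) };
```

Keeping the allowlist check in its own function makes it easy to test the cutover from the office before anyone else can reach the new servers.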

What could/should have gone better:

a) I should have estimated the downtime more accurately. The database size was a known variable, and the uncertainty around when services would be restored understandably frustrated customers.

b) I should have split the database migration into smaller pieces. Instead of blocking all services at once, parts of the system could have been migrated gradually to reduce the overall downtime.

c) While some downtime may have been inevitable due to architecture and database version changes, a different migration strategy (such as replication or a phased cutover) could likely have reduced the impact and shortened the final downtime window.

Given the urgency of the situation and the time constraints (the old database was already causing serious latency issues), I should have at least handled point a) better. No excuses.
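On point a), a back-of-the-envelope estimate could have looked something like the sketch below. It assumes the dump, transfer, and restore run one after the other, and that each one's throughput has been measured on a sample beforehand. Every throughput number and the safety factor in the example are invented for illustration, not measurements from this migration.

```typescript
// All inputs are assumptions to be measured or chosen beforehand.
interface MigrationInputs {
  sizeGB: number;        // total data to move
  dumpMBps: number;      // measured dump throughput
  transferMBps: number;  // measured network throughput between servers
  restoreMBps: number;   // measured restore throughput (often the slowest)
  safetyFactor: number;  // padding for index rebuilds, retries, surprises
}

// Estimates total downtime in hours for sequential dump -> transfer -> restore.
function estimateDowntimeHours(m: MigrationInputs): number {
  const rates = [m.dumpMBps, m.transferMBps, m.restoreMBps];
  if (m.sizeGB <= 0 || m.safetyFactor <= 0 || rates.some((r) => r <= 0)) {
    throw new RangeError("size, throughputs, and safety factor must be positive");
  }
  const sizeMB = m.sizeGB * 1000; // decimal units: 1 GB = 1000 MB
  const seconds = rates.reduce((total, rate) => total + sizeMB / rate, 0);
  return (seconds * m.safetyFactor) / 3600;
}

// Illustrative only: 1000 GB at a made-up 100 MB/s per phase, padded by 1.5x.
const exampleHours = estimateDowntimeHours({
  sizeGB: 1000,
  dumpMBps: 100,
  transferMBps: 100,
  restoreMBps: 100,
  safetyFactor: 1.5,
});
console.log(exampleHours.toFixed(1)); // 12.5
```

Even a crude number like this, announced up front with the padding included, would have given customers a realistic window to plan around.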

I know Publer is not a life-or-death service, but 12 hours of downtime, regardless of the day, is unacceptable.

If you were heavily impacted by the downtime or any issues caused by the migration, please reach out to our support team via the website chat or at [email protected], and we will make it right.

Moving forward, I can confidently say that Publer is now faster, stronger, and more reliable than ever 💚