Mastodawn

Caju Pereira Nov 5, 2024

Moral of the story: While monitoring IS essential, effective troubleshooting is key to handling production incidents. It's about systematic investigation and attention to detail. Even the simplest things can trip us up! #SRE #DevOps (5)

Caju Pereira Nov 5, 2024

The issue: The feature flag system expected a string to be exactly equal to "on", and we had mistyped a line break at the end of this string, which caused the feature flag to be interpreted as "off" 🤦 (4)
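
To make the failure mode concrete, here is a minimal sketch of how a trailing line break defeats an exact string comparison. This is not the actual code from the incident: the function names and the stored value are hypothetical, and normalizing the value before comparing is just one possible safeguard.

```python
def is_enabled_strict(raw_value: str) -> bool:
    # Exact comparison: any stray whitespace makes the flag read as "off".
    return raw_value == "on"


def is_enabled_normalized(raw_value: str) -> bool:
    # Strip surrounding whitespace (including line breaks) and ignore case
    # before comparing.
    return raw_value.strip().lower() == "on"


# Hypothetical stored value with an accidental trailing line break.
stored_value = "on\n"

print(is_enabled_strict(stored_value))      # False
print(is_enabled_normalized(stored_value))  # True
print(repr(stored_value))                   # 'on\n'
```

Printing the value with `repr()` is what exposes the problem: the line break is invisible in most UIs and plain log output, but shows up as `\n` in the repr.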

Caju Pereira Nov 5, 2024

All of our monitors and SLOs were healthy, and the feature flag system was behaving exactly as we expected and hadn't been changed in a long time. Now, take a guess! What do you think was the underlying cause of this incident? (3)

Caju Pereira Nov 5, 2024

The line of code that wasn't working was a single "if" statement. At one point, I was paged and we had 4 engineers on the call, some with master's degrees in computer science, debugging the issue. We got to the point of adding a breakpoint in production! (2)

Caju Pereira Nov 5, 2024

#SRE Story: A team was releasing a new feature to production, which had been scheduled with customers for a long time. The code was already released to production, but it was toggled off behind a feature flag. When they toggled it on, nothing happened. (1)

Caju Pereira Nov 5, 2024

Just released Week 4 of my #52WeeksOfSRE Series!
Go check it out and learn about the foundations of #IncidentManagement within #SRE and #DevOps teams

https://jpereira.me/week-4-incident-management-foundations/

Week 4: Incident Management: Key Strategies for SRE and DevOps Teams

Throughout this post, I hope to share essential site reliability engineering practices that can transform your incident management process.

J. Pereira

Caju Pereira Nov 5, 2024

Week 3 of "52 Weeks of SRE" is released!

Learn How to Define Effective Service Level Objectives (SLOs) for Your Organization
https://jpereira.me/week-3-service-level-objectives-slos/

#SRE #SLO #Observability #Monitoring #Grafana #Prometheus #SiteReliabilityEngineering

Week 3: How to Define Effective Service Level Objectives (SLOs) for Your Organization

Learn how to implement Service Level Objectives (SLOs) from the fundamentals to practice! Learn to set reliability targets on Prometheus and monitor them with Grafana.

J. Pereira

Caju Pereira Nov 5, 2024

As a companion to the Monitoring Fundamentals post, here's "Building and Deploying a Robust Monitoring Solution for your Applications":
https://jpereira.me/building-and-deploying-a-robust-monitoring-solution-for-your-applications/

#sre #monitoring #observability #docker #prometheus #grafana

Building and Deploying a Robust Monitoring Solution for your Applications

Step-by-step guide: Implement production-grade monitoring using Prometheus and Grafana with a Golang microservice.

J. Pereira

Caju Pereira Nov 5, 2024

Excited to announce the second post in my "52 Weeks of SRE" series. Here's Week 2: Monitoring Fundamentals
https://jpereira.me/week-2-monitoring-fundamentals/

#SRE #observability #monitoring #prometheus #grafana

Week 2: Monitoring Fundamentals

Learn monitoring fundamentals: Discover how to effectively use metrics, set up meaningful alerts, and build informative dashboards to keep your systems reliable and observable.

J. Pereira

Caju Pereira Nov 5, 2024

Just wrapped up a thorough article on effectively securing your applications, inspired by the recent #ShipFast events and the discussions they raised in the #indiehacker community.
https://jpereira.me/security-first-essential-best-practices-for-securing-your-application/

#SRE #CyberSecurity

Security First: Best Practices for Effectively Securing your Applications

Learn essential web security practices through real examples: from securing API endpoints and handling sensitive data to implementing proper authentication and webhook processing.

J. Pereira