What do fellow sysadmins do with regard to alert emails? Our system is to let all the systems email the one address. The hope is that between the 3 senior admins we'll catch the important stuff. But these days it's become very noisy, and with staff shortages this week some issues have been overlooked, causing a short outage.

Do you have elaborate email filters? Fancy triaging systems? Does every alert open a ticket?

#sysadmin #alerts #email #thesystemisdown #spidermanpointingmeme

@gnuplusmatt If alerts aren't actionable, shut them off. Otherwise, alert fatigue (yes, that's a thing, research it) will make sure that alerts stop being taken seriously in general.

@monospace we do have a lot of alerts that don't need action; classifying these better is probably the first task.

Alert fatigue is probably a very accurate description; I'll look it up.

@gnuplusmatt Yes, classifying those alerts and minimizing the amount of attention they take from you is a great first step. It might feel tedious, but it's a "once in a while tedious", as opposed to the "every day tedious" you're experiencing now.

I've been on call since I started my web hosting business in 2010. I can share lots more of my experience if you're interested.

@gnuplusmatt

There are three email addresses for production: alert, warn, and routine. Alert opens a ticket and makes the duty pager go off. Warn goes to the sysadmins' email but does not set off a pager. Routine goes straight to a folder that nobody reads unless they are looking for evidence that something went right.

If it goes to alert, it had better be important enough to wake someone up. If the ticket resolution is "this didn't need to wake anyone", the next ticket is to demote that particular condition.

If it goes to warn, it must be actionable, but not so urgent that it can't be handled the next business day. Warn-level email does not generate a ticket automatically.

Dev and test systems are not allowed to generate alert mail. Internal systems only get to generate alert mail if it is urgent to cure the issue before most employees start their work day.
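That routing can be sketched in a few lines. The addresses and the demotion rule below are illustrative placeholders, not anyone's actual setup:

```python
# Hypothetical three-tier routing: alert pages and opens a ticket, warn
# lands in the sysadmins' inbox, routine is filed away unread.
ADDRESSES = {
    "alert": "alert@example.com",
    "warn": "warn@example.com",
    "routine": "routine@example.com",
}

def route(severity: str, environment: str) -> str:
    # Dev and test systems are never allowed to page anyone: demote
    # anything alert-level from a non-production box to warn.
    if environment != "production" and severity == "alert":
        severity = "warn"
    return ADDRESSES[severity]
```

The nice property is that "demote that particular condition" after a bad wake-up is a one-line change in whichever check chose "alert".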

Does that help? Happy to talk this and similar things over, if you'd like.

@dashdsrdash we don't have official on-call, but I like the idea of different addresses for different classifications.

@gnuplusmatt

Pretty much everything can send email, so it's relatively easy to decide for each process which issues are routine, warnings or alerts. Everyone involved should know that they can argue for a change -- or, if you don't do formal change management, that they can build rough consensus.

Monitoring thresholds often need to be customized per system: 96% filesystem usage can be a deadly fix-immediately problem on one system and a boring buy-new-disks-next-month problem on another.
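A per-host threshold table makes that concrete; the hostnames and numbers here are invented for illustration:

```python
# Hypothetical per-host disk thresholds: the same 96% reading is an
# emergency on db01 but a purchasing note on backup01.
THRESHOLDS = {
    "db01": {"warn": 80, "alert": 90},      # fills fast, page early
    "backup01": {"warn": 97, "alert": 99},  # slow growth, plan new disks
}
DEFAULT = {"warn": 85, "alert": 95}

def disk_severity(host: str, pct_used: float) -> str:
    t = THRESHOLDS.get(host, DEFAULT)
    if pct_used >= t["alert"]:
        return "alert"
    if pct_used >= t["warn"]:
        return "warn"
    return "routine"
```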

@gnuplusmatt The biggest thing I've learned is that the "send it to a Slack channel or email and hope someone sees it" alerting method breaks down over time. Alert fatigue, missed alerts while everyone is asleep, and team members becoming resentful of perceived (or real) uneven distribution of responses are major risks.

I've found that a dedicated on-call rotation solves so many of these problems. The biggest thing, in general, is making sure that every alert has one single owner the second it's generated. We've all sat and stared and played chicken with our teammates, not wanting to respond to a Slack message or notification. That can't happen if the alert is automatically assigned to one single human. I've only used PagerDuty for this purpose, but there are lots of services that can handle it for you.
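The single-owner idea doesn't need a paid service to prototype; a simple rotation lookup (names here are made up) already guarantees every alert has exactly one assignee:

```python
import datetime

# Hypothetical weekly on-call rota; one person owns every alert.
ROTA = ["alice", "bob", "carol"]

def current_oncall(when: datetime.date, rota=ROTA) -> str:
    # The ISO week number walks round-robin through the rota, so
    # ownership is unambiguous the second an alert fires.
    week = when.isocalendar()[1]
    return rota[week % len(rota)]
```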
@gnuplusmatt I always try to use a unique alias address for each system, even if it points to the same distribution group. This allows for easier identification and filtering.
@gnuplusmatt Alerts that require no response don’t get generated.
Alerts that are informational include a hashtag so they can all be easily filtered to an archive folder and referred to later if ever needed.
Alerts that need non-urgent action generate a ticket with an appropriate priority.
Alerts needing immediate action trigger a call to the oncall person.
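Those four dispositions can be sketched as a dispatch table; the action names are placeholders:

```python
# Hypothetical mapping from alert class to disposition. The "needs no
# response" class is absent on purpose: it is never generated at all.
ACTIONS = {
    "informational": "archive",    # hashtag in subject, filter to folder
    "non_urgent": "open_ticket",   # ticket at the appropriate priority
    "urgent": "call_oncall",       # phone call, not just an email
}

def subject_for(kind: str, subject: str) -> str:
    # Tagging informational mail makes the archive filter a one-liner.
    return f"#info {subject}" if kind == "informational" else subject
```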