Mastodawn

Mar 14, 2024

What do fellow sysadmins do with regard to alert emails? Our system is to let all the systems email one address, in the hope that between the 3 senior admins we'll catch the important stuff. But these days it's become very noisy, and with staff shortages this week some issues were overlooked, causing a short outage.

Do you have elaborate email filters? Fancy triaging systems? Does every alert open a ticket?

#sysadmin #alerts #email #thesystemisdown #spidermanpointingmeme

-dsr- (hypoparenthetically)

Mar 14, 2024

@gnuplusmatt

There are three email addresses for production: alert, warn, and routine. Alert opens a ticket and makes the duty pager go off. Routine goes straight to a folder that nobody reads unless they are looking for evidence that something went right. Warning goes to sysadmins' email but does not set off a pager.

If it goes to alert, it had better be important enough to wake someone up. If the ticket resolution is "this didn't need to wake anyone", the next ticket is to demote that particular condition.

If it goes to warn, it must be actionable, but not so urgent that it can't be handled the next business day. Warn-level email does not generate a ticket automatically.

Dev and test systems are not allowed to generate alert mail. Internal systems only get to generate alert mail if it is urgent to cure the issue before most employees start their work day.
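
As a rough sketch of the routing rules above (not the actual setup, which is just three mail addresses), the policy could be written down like this. The action names, the environment labels, and the choice to demote a disallowed alert to warn are all assumptions made for illustration.

```python
# Hypothetical action names for each class of mail; the real system is
# three email addresses, so these strings only stand in for what happens.
ACTIONS = {
    "alert": ["open_ticket", "page_on_call"],  # ticket plus duty pager
    "warn": ["email_sysadmins"],               # actionable, next business day
    "routine": ["file_in_folder"],             # evidence that things went right
}


def effective_severity(severity, environment, urgent_before_workday=False):
    """Apply the policy limits on who may send alert mail.

    Assumption: a disallowed alert is demoted to warn. The thread only says
    such systems are not allowed to generate alert mail.
    """
    if severity not in ACTIONS:
        raise ValueError(f"unknown severity: {severity!r}")
    if severity == "alert":
        # Dev and test systems may never page anyone.
        if environment in ("dev", "test"):
            return "warn"
        # Internal systems may alert only if the fix cannot wait for the
        # start of the work day.
        if environment == "internal" and not urgent_before_workday:
            return "warn"
    return severity


def route(severity, environment, urgent_before_workday=False):
    """Return the list of actions taken for a message."""
    return ACTIONS[effective_severity(severity, environment, urgent_before_workday)]
```

For example, `route("alert", "production")` opens a ticket and pages, while `route("alert", "dev")` only emails the sysadmins.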

Does that help? Happy to talk this and similar things over, if you'd like.

GNU/Matt

Mar 14, 2024

@dashdsrdash we don't have official on-call, but I like the idea of different addresses for different classifications.

-dsr- (hypoparenthetically)

@gnuplusmatt

Pretty much everything can send email, so it's relatively easy to decide for each process which issues are routine, warnings or alerts. Everyone involved should know that they can argue for a change -- or, if you don't do formal change management, that they can get rough consensus.

Monitoring issues often need to be customized per-system: 96% filesystem usage can be a deadly fix-immediately problem on one system, and also a boring buy-new-disks-next-month problem on another.
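
A minimal sketch of what per-system customization could look like, assuming a simple lookup table of thresholds. The host names and percentages are made up for illustration; real values depend on how fast each filesystem fills and how long it takes to fix.

```python
# Hypothetical per-host thresholds (percent of filesystem used).
THRESHOLDS = {
    "db01": {"warn": 80, "alert": 90},       # fills fast, must be fixed at once
    "archive01": {"warn": 95, "alert": 99},  # grows slowly, buy disks next month
}

# Assumed fallback for hosts without their own entry.
DEFAULT_THRESHOLDS = {"warn": 85, "alert": 95}


def classify_disk_usage(host, percent_used):
    """Map a filesystem usage reading to routine, warn, or alert for a host."""
    limits = THRESHOLDS.get(host, DEFAULT_THRESHOLDS)
    if percent_used >= limits["alert"]:
        return "alert"
    if percent_used >= limits["warn"]:
        return "warn"
    return "routine"
```

With these made-up numbers, the same 96% reading is an alert on `db01` but only a warning on `archive01`.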