Server monitoring 101

I have been running a @yunohost server for ~5 years now, but there is one question I have never been able to answer: how loaded is my server? 👀

I know, I am a terrible sysad (actually, I am not a sysad, at all), because I have no idea how to determine:

  • if my server is running smoothly
  • if the server is under stress, and why
  • what applications are the heaviest
  • if there is the possibility of installing more apps
  • when peaks of stress are happening and what is causing them

In general, I would like to understand the fundamentals of server monitoring: what are the most critical metrics and what do they mean? What parameters do I have to keep an eye on?

I installed Prometheus and Grafana, but then I realized I have absolutely no idea what to do next… Do you have any suggestions?

I thought about watching some video tutorials, but I would not really know how to relate them to a YunoHost installation.

Please, if possible, reply in this thread on the YunoHost forum, so that this useful information stays available for others in the future. 🌻

Once I have learned the basics, I would be very happy to write some pointers about this in the documentation, or an essential YunoHost monitoring tutorial.

#sysAd #YunoHost #askFedi #help #systemAdministration #serverMonitoring #server #selfHosting #selfHost #Linux

Link: "YunoHost Monitoring 101" (YunoHost Forum)

@tommi @yunohost

When you don't know what you're looking for, Prometheus is overkill.

(Prometheus is usually overkill.)

Open a terminal window. SSH into your server. Install btop (most Linux distros package it).

Run btop.

You now have graphs in your terminal to show you CPU load, CPU temperature, RAM use, Swap use, disk space use, disk I/O, and network I/O.

If your RAM is full and your swap is in use, you don't have enough memory: either buy and install more, or stop running so much stuff. btop also shows a tree of processes, which can be sorted by CPU use or memory use.
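If you want the "sorted by memory use" view without btop (for example to log it from a cron job), a rough sketch is to read the resident memory (VmRSS) of every process from /proc. This assumes a Linux host; the function name and the default of five processes are illustrative choices, not anything YunoHost provides.

```python
import os


def top_processes_by_memory(n=5, proc_root="/proc"):
    """Return up to n (rss_kib, pid, name) tuples, largest resident memory first.

    Reads VmRSS from /proc/<pid>/status, so this only works on Linux.
    """
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # the process exited or is not readable: skip it
        rss = fields.get("VmRSS")
        if rss is None:
            continue  # kernel threads have no VmRSS line
        name = fields.get("Name", "?").strip()
        rows.append((int(rss.split()[0]), int(entry), name))  # VmRSS is in kB
    rows.sort(reverse=True)
    return rows[:n]


if __name__ == "__main__":
    for rss_kib, pid, name in top_processes_by_memory():
        print(f"{rss_kib / 1024:8.1f} MiB  {pid:>7}  {name}")
```

This is only a snapshot of "right now"; it says nothing about which app was heavy an hour ago.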

If your RAM is not full and some swap is in use, that's fine.
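The RAM/swap rules of thumb above can be checked with a small script reading /proc/meminfo. This is a sketch under assumptions: "RAM is full" is approximated as MemAvailable dropping below 10% of MemTotal (an arbitrary threshold, tune it to taste), MemAvailable requires Linux 3.14 or newer, and the extra "nearly full, no swap yet" case is an addition not mentioned in the reply.

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into {field: value}; most values are in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts:
                info[key] = int(parts[0])
    return info


def memory_verdict(info, low_ram_fraction=0.10):
    """Apply the rule of thumb: full RAM *and* swap in use means not enough memory.

    ASSUMPTION: "full" means MemAvailable < low_ram_fraction * MemTotal.
    """
    ram_full = info["MemAvailable"] < low_ram_fraction * info["MemTotal"]
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    if ram_full and swap_used > 0:
        return "short on memory: add RAM or run fewer apps"
    if ram_full:
        return "RAM nearly full, no swap in use yet: keep an eye on it"
    if swap_used > 0:
        return "some swap in use, RAM not full: fine"
    return "ok"


if __name__ == "__main__":
    print(memory_verdict(read_meminfo()))
```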

If your CPUs are all busy all the time, the temperature is high, or your network usage seems high, those are all things to be concerned about.
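A cheap way to put a number on "CPUs busy all the time" is the load average divided by the number of CPUs. A minimal sketch, assuming a Unix-like host (os.getloadavg is not available on Windows); note that on Linux the load average also counts tasks waiting on disk I/O, so a high value can mean slow disks rather than busy CPUs.

```python
import os


def cpu_saturation():
    """Return the 1-, 5- and 15-minute load averages divided by the CPU count.

    A ratio that stays around or above 1.0 means that, on average, every CPU
    has a task running or waiting (on Linux, waiting on disk I/O also counts).
    """
    cpus = os.cpu_count() or 1
    return tuple(load / cpus for load in os.getloadavg())


if __name__ == "__main__":
    one, five, fifteen = cpu_saturation()
    print(f"load per CPU: 1m={one:.2f}  5m={five:.2f}  15m={fifteen:.2f}")
```

Comparing the 1-minute and 15-minute values tells you whether you are looking at a short spike or sustained pressure.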

@dashdsrdash
"Prometheus is usually overkill"
Well, it depends. It's a somewhat different approach to monitoring. "Is my site working correctly?" is usually answered by "open an HTTP connection and see if I get a 200 code, and possibly a given string in the returned data". That's how Nagios-era monitoring does it.
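The black-box check described above fits in a few lines of standard-library Python. This is a sketch, not what Nagios itself runs; the function name and the optional expected_text parameter are illustrative. urllib follows redirects automatically, so a redirect that ends in a 200 counts as up.

```python
import urllib.request


def site_is_up(url, expected_text=None, timeout=10):
    """Black-box check: True if url answers HTTP 200 and, optionally, contains expected_text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            if expected_text is None:
                return True
            body = resp.read().decode("utf-8", errors="replace")
            return expected_text in body
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx), refused connections and timeouts.
        return False


if __name__ == "__main__":
    print(site_is_up("https://example.org/", expected_text="Example Domain"))
```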
Prometheus can also do that... But instead of such "black box monitoring" / behaviour-based monitoring, it takes the approach of "white box monitoring", i.e.
[1/3] @tommi @yunohost
each component exposes detailed information about its internal workings: number of requests and processing time of the web server, backend response times, database load, interrupts on CPU, latency on disk reads... This can give you very detailed information when you know what you're looking at/for, and be totally overwhelming when you don't.
[2/3]
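To make "each component exposes its internal workings" concrete: a component publishes plain text in the Prometheus exposition format (usually at an HTTP path such as /metrics), and Prometheus scrapes it periodically. The sketch below hand-renders that format so you can see what it looks like; real exporters normally use an official client library instead, and the metric names here are only examples.

```python
def _escape_label(value):
    """Escape a label value as the text format requires (backslash, quote, newline)."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def _escape_help(text):
    """Escape HELP text (backslash and newline)."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format.

    metrics: list of (name, type, help_text, [(labels_dict, value), ...]).
    """
    lines = []
    for name, mtype, help_text, samples in metrics:
        lines.append(f"# HELP {name} {_escape_help(help_text)}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            if labels:
                label_str = ",".join(
                    f'{k}="{_escape_label(v)}"' for k, v in sorted(labels.items())
                )
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics([
        ("http_requests_total", "counter", "Total HTTP requests handled.",
         [({"method": "get", "code": "200"}, 1027)]),
    ]), end="")
```

Every counter, label and gauge like this becomes a time series in Prometheus, which is exactly why it is powerful when you know what to look for and overwhelming when you don't.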
@tommi @yunohost @dashdsrdash
I would have to think about whether I would recommend it at this point in the conversation, since it has caveats, security implications, etc., and I *like* Prometheus, but Zabbix exists, and may be easier to start with.
[3/3]
@viq @tommi @yunohost @dashdsrdash But Zabbix is overkill too, in this case…
@breizh @tommi @dashdsrdash @yunohost uh, that heavily depends. From the stuff mentioned in the initial post, "server running smoothly/under stress", "when peaks of stress are happening", "is there possibility of installing more apps" are pretty easy graphs to generate (though "server running smoothly" is somewhat fractal in complexity). [1/n]
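"When peaks of stress are happening" is indeed the easy part once you record a metric at regular intervals. As a sketch, assuming you already collect (timestamp, value) pairs, for example the load average once a minute, grouping the samples above a threshold gives you the peak episodes; the threshold is yours to choose. It tells you *when*, not *why*.

```python
def find_peaks(samples, threshold):
    """Group consecutive samples above threshold into (start, end, max_value) episodes.

    samples: (timestamp, value) pairs sorted by timestamp.
    """
    episodes = []
    current = None  # [start_ts, end_ts, max_value] of the episode in progress
    for ts, value in samples:
        if value > threshold:
            if current is None:
                current = [ts, ts, value]
            else:
                current[1] = ts
                current[2] = max(current[2], value)
        elif current is not None:
            episodes.append(tuple(current))
            current = None
    if current is not None:  # close an episode still open at the end of the data
        episodes.append(tuple(current))
    return episodes
```

For example, find_peaks([(0, 0.5), (60, 2.0), (120, 3.5), (180, 0.8), (240, 1.5)], 1.0) returns [(60, 120, 3.5), (240, 240, 1.5)].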
@breizh @tommi @dashdsrdash @yunohost But "why is the server under stress", "which applications are the heaviest" and "what is causing peaks of stress" are again somewhat fractal in complexity. They require gathering *MUCH* more data, possibly including logs, a way to present both an abstracted version and the details, and the ability to search. Each of those requirements increases complexity and resource usage, and requires someone who knows what they're doing to set it up. [2/n]
@breizh @tommi @dashdsrdash @yunohost Where "someone who knows what they're doing" can be the user, meaning they have to understand the system, understand the tools, and figure out how to get what's useful to them out of it all. Or it can be someone who has set up such a monitoring system and made it generic enough for most cases, meaning it has all those features, which means it needs to be heavy enough to support them. [3/n]
@breizh @tommi @dashdsrdash @yunohost So for "does system usage look OK", simple Graphite metrics for CPU, network card and disk usage will do. Knowing *what* causes increased load, or even more so, what caused increased load 3 days ago, requires a lot of detail and increases complexity a lot.
That said, having read the description of Monitorix that someone mentioned: if it's as lightweight as claimed and does what it claims, with proper configuration it may fit the request. [4/4]
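For a feel of how simple "simple Graphite metrics" can be: Graphite's Carbon daemon accepts a plaintext protocol of one "metric.path value unix_timestamp" line per sample, conventionally on TCP port 2003. The sketch below reports disk usage that way; it assumes a Carbon listener is reachable, and the metric names under server.disk.root are hypothetical.

```python
import shutil
import socket
import time


def disk_usage_lines(path="/", prefix="server.disk.root", now=None):
    """Build Graphite plaintext lines ("<metric> <value> <timestamp>") for disk usage.

    ASSUMPTION: the prefix/metric names are illustrative, pick your own scheme.
    """
    usage = shutil.disk_usage(path)
    ts = int(now if now is not None else time.time())
    return [
        f"{prefix}.used_bytes {usage.used} {ts}",
        f"{prefix}.free_bytes {usage.free} {ts}",
        f"{prefix}.used_percent {100 * usage.used / usage.total:.1f} {ts}",  # used / total
    ]


def send_to_graphite(lines, host="localhost", port=2003):
    """Send lines to a Carbon plaintext listener over TCP (2003 is the usual port)."""
    payload = "".join(line + "\n" for line in lines).encode()
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    for line in disk_usage_lines():
        print(line)
    # send_to_graphite(disk_usage_lines())  # only if Carbon is running on localhost
```

Run from cron every minute, a few scripts like this already give the "does system usage look OK" graphs; they still won't tell you *what* caused a spike.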

@viq @tommi @dashdsrdash @yunohost I’m a sysadmin and I love monitoring everything so I can know exactly what’s going on.

But we're talking about a small personal server, managed with an easy-to-use administration interface, by someone who seems to be just starting out.

That's how I started out, too, and now you're getting into some issues that are certainly relevant to a more advanced use, but in my opinion far too advanced for the case we're interested in here. It just seems… off-topic to me.

@breizh @tommi @dashdsrdash @yunohost In my view/experience, being able to answer the "why / what causes it" question does lead to the advanced use.
I admit I may have a skewed perception, but for a single host, Prometheus and Grafana should not generate noticeable load. I have a host on which Docker reports that those containers together use 150 MB of RAM.
On yet another hand, in the list of applications I see Cockpit, which should let you find out what was asked.