So, I've been using Thanos to receive and store my Prometheus metrics long term in a self-hosted S3 bucket. Thanos also acts as a datasource for my dashboards in Grafana, and provides a Ruler, which evaluates alerting rules against my metrics and forwards the resulting alerts to my Alertmanager. It's ok. It's certainly got its downsides, which I can go into later, but I've been thinking... what about Mimir?

How do you all feel about Grafana's Mimir (source on GitHub)? It's AGPL and seems to literally be a replacement of Thanos, which is Apache 2.0.

Thanos description from their website:

Open source, highly available Prometheus setup with long term storage capabilities.

Mimir description from their website:

...open source software project that provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus and OpenTelemetry metrics.
Both will work with Alloy and Prometheus alike. Both require you to configure initially confusing hashrings and replication parameters. Both have a bunch of large companies adopting them, so... now I feel conflicted. Should I try Mimir? Poll in reply.

#thanos #prometheus #alloy #grafana #observability #monitoring #kubernetes #k8s #foss #sre

If you've tried both Thanos and Mimir, which do you prefer? Feel free to comment why below 

#thanos #prometheus #alloy #grafanaAlloy #grafana #observability #monitoring #kubernetes #k8s #foss #sre #mimir #grafanaMimir

Thanos
55.6%
Mimir AKA Grafana Mimir
44.4%
Poll ended.
@jessebot the fact that Mimir is AGPL licensed immediately disqualifies it for usage in many organizations

@aveiga Yeah, I know. You should work to fight that in your organization though. I know it's not easy because I've done it at big international companies with 10k+ employees and it's a pain, but you can get exceptions for it through your legal department. I know it takes months or even years. You should do it anyway though, because it's a very good license that fosters a good FOSS environment.

It's also worth noting that if the company doesn't allow for using any open source software that is GPL/AGPL, and they don't allow an exception request process to do so, then I probably can't use a significant portion of software I rely on daily and so I don't want to work there. I already don't work for defense/police, oil/gas/car, or banks/fintech companies though, so that helps narrow down a lot of potential headache on my part.

@jessebot the thing is, there is no jurisprudence, and as such it is irresponsible to push for it. No proper legal department should ever agree to put their company through that risk. And I sure don’t want that to be my responsibility. Historically, we’ve had other companies stating that our usage of their AGPL software would require paid licenses (MinIO, for example)

@aveiga

it is irresponsible to push for it

No proper legal department should ever agree to put their company through that risk.

We disagree here. It doesn't sound like you've used Mimir at scale though, given your response.

@[email protected] @[email protected] That's true, but so is Grafana, and I've managed to get that in everywhere I've asked aside from one large multi-national logistics company. It's not a sure thing, but if you're using/operating it for yourself and not as part of a product then I usually have hope. Worst they can do is say "No" 🤷

@cloudymax @jessebot the thing is, there is no jurisprudence, and as such it is irresponsible to push for it. No proper legal department should ever agree to put their company through that risk. And I sure don’t want that to be my responsibility. Historically, we’ve had other companies stating that our usage of their AGPL software would require paid licenses (MinIO, for example). For purely internal usage it’s a no brainer; if there is customer exposure, even embedded, then it’s a no go

@aveiga @jessebot > For purely internal usage it’s a no brainer; if there is customer exposure, even embedded, then it’s a no go

Yep, that's been my experience.

@jessebot I don't miss Thanos, not even slightly. While the slight impedances with Google Managed Prometheus are a PITA (e.g. the slightly different CRDs and messy-to-GitOps OperatorConfig stuff), the fact that federation and aggregation across projects are just a few clicks or a trivial Terraform stanza away, and that you automatically get 2 years of metric retention with no fuss, is pretty compelling.

@jcl To clarify, are you using Mimir? I do not want to use a managed service at this time.

@jessebot Nope, just a happy GMP user.

The reason Thanos feels just "okay" right now is that it was better when I was working at smaller scales. Now it feels clunky anytime I need to restart it, because it takes such a long time to catch up on all the metrics it missed after all my Thanos receivers restart. Scaling them in and out is really difficult and almost always results in metrics downtime (to be fair, the metrics are available again after 10-30 minutes, depending on how long the receivers were out and what was going on), and that doesn't feel very production ready. A user of Grafana, let's say @cloudymax, just sees a metrics outage. We both know Prometheus is still remote writing to Thanos and keeps all its data locally on a PVC until Thanos is up again, so the metrics WILL become visible again in Grafana, but it feels weird to allow that kind of outage as an SRE. It doesn't give me production vibes.

Maybe it's my replication factor. Maybe I need to use the Prometheus Thanos sidecar model instead, but why doesn't this replication work better out of the box? It's so delicate. It may be highly available when it's provisioned correctly and you know your spikes, so you have planned capacity and scaling configured, but it fails the "upgrade existing receivers" test. Maybe the Bitnami Thanos Helm chart just needs some work. Maybe I need to set a custom readiness probe, but this feels clunky?
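
A rough way to reason about the replication-factor question, as a sketch only: this assumes the receivers accept a write once a simple majority of replicas have it (an assumption on my part; check the Thanos Receive docs for the exact quorum rule). The function names here are made up for illustration and are not part of Thanos.

```python
def write_quorum(replication_factor: int) -> int:
    """Majority quorum (assumed rule): more than half of the replicas must accept a write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """How many replicas can be unavailable while writes still reach the assumed quorum."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3, 5):
        print(
            f"replication factor {rf}: "
            f"quorum={write_quorum(rf)}, "
            f"tolerated failures={tolerated_failures(rf)}"
        )
```

Under that assumption, a replication factor of 1 or 2 tolerates zero unavailable receivers, so any rolling restart would reject some writes until the receiver is back, while a factor of 3 tolerates one receiver being down at a time.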

Am I alone? Does Thanos feel clunky to anyone else operating at scale? I self host so much stuff. I have so many metrics. There has to be a way to auto-upgrade these receivers and have them not just absolutely panic when they restart. The logging is so unhelpful too, even when it's set to debug.

I'm appreciative of the software, but I feel like it is telling that I would consider switching to Mimir at all, given what a pain it was to even get this far and how prone I am to the sunk cost fallacy (I work on these things in my free time and I don't have a ton of that). It's been barely three quarters of a year since @cloudymax (mostly them) and I set the whole stack up. We didn't even get to our big announcement post on social media like we normally do for big complicated FOSS app stacks.

@jessebot I don't trust Grafana to maintain Mimir for long enough after how quickly they deprecated Cortex.

Have you considered VictoriaMetrics? It performs and compacts really well, and while it can't use S3 for warm storage it runs well even on spinning rust, making long term retention on block storage viable.

@claus No, I hadn't considered VictoriaMetrics, but I'll give it a looksie. Thanks for mentioning it. I really do love S3 as the universal storage backend though. It makes stateless a bit more achievable in a lot of instances.

But yeah, they also deprecated Promtail, but Alloy has been a fine replacement and I have no complaints. I don't care if you deprecate as long as you replace.

@jessebot I tried Cortex and Mimir, they are both nice, but I found them to be a lot more complicated than Thanos, and require a bit more hand holding.

My Thanos setup "just works", whereas with both Mimir and Cortex, I had to give it a lot more attention, so I switched back.