the_duke 7 months ago

I'm curious: are many people here actually still running mainline Prometheus rather than one of the numerous compatible solutions that are more scalable and have better storage? (Mimir, VictoriaMetrics, Cortex, OpenObserve, ...)

robinhoodexe 7 months ago | parent | next [-]

We’re running standard Prometheus on Kubernetes (14 onprem Talos clusters, total of 191 nodes, 1.1k cpu cores, 4.75TiB memory and 4k pods). We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

It works pretty well tbh. I'm excited about upgrading to version 3, as it does take a lot of resources to keep going, especially on clusters with a lot of pods being spawned all the time.
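For anyone curious what a "30 days retention, aggressively downsample" policy looks like in practice, here's a sketch of how it might be expressed with the Thanos compactor's retention flags. The exact values (3d raw, 30d downsampled) are assumptions based on the description above, not the poster's actual config; note that Thanos creates the 5m/1h downsampled blocks on its own fixed schedule, and these flags only control how long each resolution is kept:

```shell
# Hypothetical Thanos compactor invocation matching the setup described above.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=3d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=30d \
  --wait
```

Dropping raw resolution early is what makes long retention cheap: queries over older ranges are answered from the 5m/1h blocks instead.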

xgbi 7 months ago | parent | next [-]

Hey robinhood, any feedback on Talos?

We've been using Talos for our internal clusters for a while, but with quite small ones (3 control-plane nodes, 5 worker nodes).

Upgrading has generally been a non-event, and we're quite happy with them.

How do you deploy Thanos? In one of the clusters?

robinhoodexe 7 months ago | parent [-]

We're extremely pleased with Talos. Much more secure than Azure (our cloud of choice, unfortunately), which runs a full-blown Ubuntu underneath. We haven't run into any issues with Talos, and upgrading both the Kubernetes and Talos versions is super easy with the talosctl tool.

We currently have a Thanos instance in each cluster. We could move it to a separate cluster to reduce some overhead, but the current approach works. We're ingesting about 60Gi of metrics per day into the S3 bucket, so we might have to optimise that.
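One common way to cut that kind of ingestion volume is to drop series you never query at scrape time, before they ever reach storage. A sketch using Prometheus's `metric_relabel_configs` (the job name and metric regex here are illustrative examples, not anything from the setup above):

```yaml
# Hypothetical scrape config: drop high-cardinality histogram buckets
# that are never queried, so they are never written to the TSDB or S3.
scrape_configs:
  - job_name: kubelet
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "apiserver_request_duration_seconds_bucket|etcd_request_duration_seconds_bucket"
        action: drop
```

Unlike `relabel_configs` (which acts on targets before scraping), `metric_relabel_configs` runs on each scraped sample, so it can filter individual series.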

ChocolateGod 7 months ago | parent | prev [-]

> We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

Any reason to not just use Mimir for this?

robinhoodexe 7 months ago | parent [-]

I can't recall the reason for choosing Thanos over Mimir, to be honest. I think Thanos seemed like a good choice given that it's part of the kube-prometheus-stack community Helm charts.
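For context, the kube-prometheus-stack chart can wire up the Thanos sidecar with a few values. A rough sketch (the Secret name and key are hypothetical, and the exact field layout under `objectStorageConfig` has changed between chart versions, so check the values reference for your version):

```yaml
# Hypothetical kube-prometheus-stack values snippet enabling the Thanos sidecar.
prometheus:
  thanosService:
    enabled: true          # headless Service so a Thanos Querier can discover the sidecar
  prometheusSpec:
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore   # assumed Secret holding the bucket config
          key: objstore.yml
```

This is likely part of the appeal mentioned above: the sidecar, service, and object-storage wiring come from the same chart that deploys Prometheus itself.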

aorth 7 months ago | parent | prev | next [-]

Using Victoria Metrics here. Very easy to set up and run. I monitor under 100 hosts and resource usage is low, performance is good.

One gripe is that they recently stopped publishing tarballs for LTS versions, which caused some grumbling in the community. Fair enough, since they're developing it for free, but it felt like a bait and switch.

yla92 7 months ago | parent [-]

Same. It's rock solid, low resource usage and good performance. The "single" version mostly just works.

never_inline 7 months ago | parent | prev | next [-]

I am curious to hear from people on this forum: at what point do you practically hit the limits of Prometheus, where straightforward division (e.g. separate Prometheus instances per cluster and environment) no longer works?

hagen1778 7 months ago | parent [-]

It usually comes with growth in active series and churn rate. Of course, you can scale Prometheus horizontally by adding more replicas and by sharding scrape targets. But at some point you'd want to achieve the following:

1. Global query view. The ability to get metrics from all Prometheus instances with one request, or simply not having to think about which Prometheus has the data you're looking for.

2. Resource usage management. No matter how you try, scrape targets can't be sharded perfectly, so you'll end up with some Prometheus instances using more resources than others. This can backfire in weird ways down the road, reducing the stability of the whole system.
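The target-sharding approach mentioned above is usually done with Prometheus's `hashmod` relabel action: each replica hashes every discovered target into N buckets and keeps only its own bucket. A sketch for one replica of a 3-way shard (job name is illustrative):

```yaml
# Hypothetical sharded scrape config: this replica keeps only bucket 0 of 3.
scrape_configs:
  - job_name: node
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
```

The other replicas use `regex: "1"` and `regex: "2"`. This is also exactly where the global query view problem comes from: once targets are split this way, something like Thanos Query or Mimir has to fan queries back out across the shards.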

majewsky 7 months ago | parent | prev | next [-]

Regular Prometheus inside clusters for collection and alerting, Thanos for cross-cluster aggregation and long retention.

rad_gruchalski 7 months ago | parent | prev | next [-]

Mimir

rapphil 7 months ago | parent [-]

There is also Cortex, from which Mimir was forked.

rad_gruchalski 7 months ago | parent [-]

Mimir, Loki, Tempo and Alloy.

raffraffraff 7 months ago | parent | prev [-]

Nope. Mimir. Before that, Thanos.