the_duke 9 hours ago

I'm curious: are many people here actually still running mainline Prometheus rather than one of the numerous compatible solutions that are more scalable and have better storage? (Mimir, Victoria, Cortex, OpenObserve, ...)

robinhoodexe 8 hours ago | parent | next [-]

We’re running standard Prometheus on Kubernetes (14 onprem Talos clusters, total of 191 nodes, 1.1k cpu cores, 4.75TiB memory and 4k pods). We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

It works pretty well tbh. I’m excited about upgrading to version 3, as it does take a lot of resources to keep going, especially on clusters with a lot of pods being spawned all the time.

ChocolateGod 7 hours ago | parent [-]

> We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

Any reason to not just use Mimir for this?

aorth 7 hours ago | parent | prev | next [-]

Using VictoriaMetrics here. Very easy to set up and run. I monitor under 100 hosts and resource usage is low, performance is good.

One gripe is that they recently stopped publishing tarballs for LTS versions, which caused some grumbling in the community. Fair enough since they are developing for free, but felt like a bait and switch.

majewsky 6 hours ago | parent | prev | next [-]

Regular Prometheus inside clusters for collection and alerting, Thanos for cross-cluster aggregation and long retention.

never_inline 7 hours ago | parent | prev | next [-]

I am curious to hear from people on this forum: at what point do you practically hit the limits of Prometheus, such that straightforward division (e.g. a different Prometheus per cluster and environment) no longer works?

hagen1778 2 hours ago | parent [-]

It usually comes with an increase in active series and churn rate. Of course, you can scale Prometheus horizontally by adding more replicas and by sharding scrape targets. But at some point you'd like to achieve the following:

1. Global query view. The ability to get metrics from all Prometheis with one request. Or simply not having to think about which Prometheus has the data you're looking for.

2. Resource usage management. No matter how you try, scrape targets can't be sharded perfectly. So you'll end up with some Prometheis using more resources than others. This could backfire in the future in weird ways, reducing the stability of the whole system.
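
To make the second point concrete, here is a rough sketch of why hash-based sharding ends up uneven. It is similar in spirit to Prometheus's `hashmod` relabel action, but the hashing details, the target names and the series counts are all made up for illustration, so treat it as a toy model rather than how Prometheus actually does it.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Pick a shard for a target by hashing its address.

    Illustrative only: not Prometheus's exact hashmod implementation.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


def series_per_shard(targets: dict[str, int], num_shards: int) -> dict[int, int]:
    """Sum the active series each shard would end up scraping."""
    load = {shard: 0 for shard in range(num_shards)}
    for target, series in targets.items():
        load[shard_for(target, num_shards)] += series
    return load


# Hypothetical targets: 20 small exporters and one very heavy one.
targets = {f"app-{i}:9100": 500 for i in range(20)}
targets["kube-state-metrics:8080"] = 50_000

load = series_per_shard(targets, num_shards=3)
for shard, series in sorted(load.items()):
    print(f"shard {shard}: {series} active series")
```

The hash spreads targets fairly evenly by count, but it knows nothing about how many series each target exposes, so whichever shard gets the heavy target carries most of the load.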

raffraffraff 9 hours ago | parent | prev | next [-]

Nope. Mimir. Before that, Thanos.

rad_gruchalski 8 hours ago | parent | prev [-]

Mimir