robinhoodexe 8 months ago

We’re running standard Prometheus on Kubernetes (14 on-prem Talos clusters, a total of 191 nodes, 1.1k CPU cores, 4.75 TiB memory and 4k pods). We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

It works pretty well tbh. I’m excited about upgrading to version 3, as it does take a lot of resources to keep going, especially on clusters with a lot of pods being spawned all the time.

xgbi 7 months ago

Hey robinhood, any feedback on Talos?

We've been using Talos for our internal clusters for a while, but with quite small ones (3 kube nodes, 5 worker nodes).

Upgrading has generally been a non-event, and we're quite happy with them.

How do you deploy Thanos? In one of the clusters?

robinhoodexe 7 months ago

We’re extremely pleased with Talos. Much more secure than Azure (our cloud of choice, unfortunately), which runs a full-blown Ubuntu underneath. We haven’t run into any issues with Talos, and upgrading is super easy with the talosctl tool, for both the Kubernetes and Talos versions.

We currently have a Thanos instance in each cluster. We could move it to a separate cluster to reduce some overhead, but the current approach works. We’re ingesting about 60 GiB of metrics per day into the S3 bucket, so we might have to optimise that.
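
For a rough sense of scale, here is a back-of-the-envelope sketch of what that ingest rate implies for the bucket. It assumes the roughly 60 GiB/day figure stays constant and that everything is kept for the 30 days of retention mentioned earlier in the thread; it ignores compaction and downsampling effects, so treat the result as an order-of-magnitude estimate only. The function name is hypothetical.

```python
def steady_state_gib(daily_ingest_gib: float, retention_days: int) -> float:
    """Approximate bucket size once retention starts deleting old blocks.

    Assumption: constant daily ingest, no compaction or downsampling effects.
    """
    return daily_ingest_gib * retention_days


# Figures taken from the thread: ~60 GiB/day ingest, 30 days retention.
size_gib = steady_state_gib(60, 30)
print(f"{size_gib:.0f} GiB (~{size_gib / 1024:.2f} TiB)")
```

Under those assumptions the bucket would level off at around 1800 GiB.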

ChocolateGod 8 months ago

> We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

Any reason to not just use Mimir for this?

robinhoodexe 7 months ago

I can’t recall the reason for using Thanos over Mimir, to be honest. I think Thanos seemed like a good choice given it’s part of the kube-prometheus-stack community Helm charts.