We’re running standard Prometheus on Kubernetes (14 on-prem Talos clusters, with a total of 191 nodes, 1.1k CPU cores, 4.75 TiB of memory and 4k pods). We use Thanos to store metrics in self-hosted S3 (SeaweedFS) with 30 days of retention, downsampling aggressively after 3 days.

It works pretty well, tbh. I’m excited about upgrading to version 3, as it does take a lot of resources to keep going, especially on clusters with a lot of pods being spawned all the time.

xgbi 7 months ago

Hey robinhood, any feedback on Talos?

We've been using Talos for our internal clusters for a while, but only on quite small ones (3 kube nodes, 5 worker nodes).

Upgrading has generally been a non-event, and we're quite happy with them.

How do you deploy Thanos? In one of the clusters?

	robinhoodexe 7 months ago
		We’re extremely pleased with Talos. Much more secure than Azure (our cloud of choice, unfortunately), which runs a full-blown Ubuntu underneath. We haven’t run into any issues with Talos, and upgrading both Kubernetes and Talos itself is super easy with the talosctl tool. We currently have a Thanos instance in each cluster. We could move it to a separate cluster to reduce some overhead, but the current approach works. We’re ingesting about 60 GiB of metrics per day into the S3 bucket, so we might have to optimise that.
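
For a rough sense of scale, here is a back-of-the-envelope sketch of what that ingest rate implies once the retention window is full. It assumes the ~60 GiB/day figure is the total for the bucket and stays constant, and it ignores any savings from downsampling or compaction, so treat the result as an upper bound rather than a measurement. The function and variable names are made up for illustration.

```python
# Rough upper bound on bucket size; ignores downsampling and compaction savings.
DAILY_INGEST_GIB = 60   # "about 60Gi per day", taken from the comment above
RETENTION_DAYS = 30     # retention stated at the top of the thread


def steady_state_gib(daily_gib: float, retention_days: int) -> float:
    """Data held once the retention window is full, assuming constant ingest."""
    return daily_gib * retention_days


total_gib = steady_state_gib(DAILY_INGEST_GIB, RETENTION_DAYS)
print(f"{total_gib:.0f} GiB (~{total_gib / 1024:.2f} TiB)")
# prints "1800 GiB (~1.76 TiB)"
```

If raw samples are dropped earlier than the downsampled ones, the real footprint would be lower than this estimate.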

ChocolateGod 8 months ago

> We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.

Any reason to not just use Mimir for this?

	robinhoodexe 7 months ago
		I can’t recall the reason for using Thanos over Mimir, to be honest. I think Thanos seemed like a good choice given it’s part of the kube-prometheus-stack community Helm charts.