Remix.run Logo
wvh 11 hours ago

Many companies are having trouble to even keep Prometheus running without it getting OOM killed though.

I understand and agree with the problem this is trying to solve; but the solution will rival the actual business software it is observing in cost and resource usage. And hence, just like in quantum mechanics, observing it will drastically impact the event.

Drahflow 8 hours ago | parent | next [-]

> in cost and resource usage

Nah, it's fine. Storage of raw logs is pretty cheap (and I think this is widely assumed). For querying, two problems arise:

1. Query latency, i.e. we need enough CPUs to quickly return a result. This is solved by horizontal scaling. All the idle time can be amortized across customers in the SaaS setting (not everyone is looking at the same time).

2. Query cost, i.e. the total amount of CPU time (and other resources) spent per data scanned must be reasonable. This ultimately depends on the speed of the regex engine. We're currently at $0.05/TB scanned. And metric queries on multi-TB datasets can usually be sampled without impacting result quality much.

thewisenerd 10 hours ago | parent | prev [-]

> observing it will drastically impact the event

this presumes 'metrics' are 'cheaper' than 'traces' / observability 2.0 from a setup standpoint; purely from an implementation perspective?