The point that the trinity of logs, metrics and traces wastes a lot of engineering effort on pre-selecting the right metrics (and labels), and wastes storage by keeping much of the information in triplicate, is a good one.

> We believe raw data based approach will transform how we use observability data and extract value from it.

Yep. We have built quuxLogging on the same premise, but with more emphasis on "raw": instead of parsing events (wide or not), we treat the data fundamentally as a very large set of (usually text) lines and have optimized hard for the querying-lots-of-text part. Basically, it's a horizontally scaled (extremely fast) regex engine with data aggregation support.

Having a decent way to get metrics from logs ad-hoc completely solves the metric cardinality explosion.
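To make "metrics from logs ad hoc" concrete, here is a toy sketch. It is not how quuxLogging is implemented; the log format, `PATTERN` and `count_by` are made up for illustration. The point is only that the labels are chosen at query time, so nothing has to be pre-aggregated per label combination.

```python
import re
from collections import Counter

# Hypothetical raw log lines; the format is an assumption for this example.
LOG_LINES = [
    "2024-05-01T12:00:01Z GET /api/users 200 12ms",
    "2024-05-01T12:00:02Z GET /api/orders 500 340ms",
    "2024-05-01T12:00:02Z POST /api/users 201 48ms",
    "2024-05-01T12:00:03Z GET /api/orders 500 290ms",
]

# The "metric definition" is just a regex written when the question comes up.
PATTERN = re.compile(
    r"(?P<method>GET|POST) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)ms"
)


def count_by(lines, pattern, *labels):
    """Count matching lines, grouped by the named capture groups in `labels`."""
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[tuple(match.group(label) for label in labels)] += 1
    return counts


if __name__ == "__main__":
    # Any label combination can be requested after the fact.
    print(count_by(LOG_LINES, PATTERN, "path", "status"))
    print(count_by(LOG_LINES, PATTERN, "method"))
```

Grouping by a different label set is just another call over the same raw lines, which is why the cardinality problem moves from ingest time to query time.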

wvh 2 months ago | parent | next [-]

Many companies have trouble even keeping Prometheus running without it getting OOM-killed, though.

I understand and agree with the problem this is trying to solve, but the solution will rival the actual business software it is observing in cost and resource usage. Hence, just like in quantum mechanics, observing it will drastically impact the event.

Drahflow 2 months ago | parent | next [-]

> in cost and resource usage

Nah, it's fine. Storage of raw logs is pretty cheap (and I think this is widely assumed). For querying, two problems arise:

1. Query latency, i.e. we need enough CPUs to quickly return a result. This is solved by horizontal scaling. All the idle time can be amortized across customers in the SaaS setting (not everyone is looking at the same time).

2. Query cost, i.e. the total amount of CPU time (and other resources) spent per unit of data scanned must be reasonable. This ultimately depends on the speed of the regex engine. We're currently at $0.05/TB scanned. And metric queries on multi-TB datasets can usually be sampled without impacting result quality much.
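A rough sketch of the arithmetic behind point 2. Only the $0.05/TB figure comes from the comment above; the function names are hypothetical, and the sketch assumes that cost scales linearly with the fraction of data actually scanned, which may not hold for every storage layout.

```python
import random
import re

PRICE_PER_TB_USD = 0.05  # figure quoted above


def scan_cost_usd(terabytes, sample_rate=1.0):
    """Cost of a query, assuming cost is linear in the data actually scanned."""
    return terabytes * sample_rate * PRICE_PER_TB_USD


def estimate_match_count(lines, pattern, sample_rate, rng):
    """Scan a random fraction of the lines and scale the hit count back up."""
    regex = re.compile(pattern)
    hits = sum(
        1
        for line in lines
        if rng.random() < sample_rate and regex.search(line)
    )
    return hits / sample_rate


if __name__ == "__main__":
    print(scan_cost_usd(10))        # full scan of 10 TB
    print(scan_cost_usd(10, 0.01))  # 1% sample of the same data
    lines = ["ok"] * 9000 + ["error"] * 1000
    print(estimate_match_count(lines, "error", 0.1, random.Random(0)))
```

With a sample rate of 1.0 the estimate is exact; at lower rates it is an estimate whose error shrinks as the number of matching lines grows, which is why sampling tends to work for aggregate metrics on large datasets.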

wvh 2 months ago | parent [-]

It's not the storage cost; it's the computational load (memory, CPU, sometimes network) of gathering thousands and thousands of metrics by default, most of which go unused.

thewisenerd 2 months ago | parent | prev [-]

> observing it will drastically impact the event

this presumes 'metrics' are 'cheaper' than 'traces' / observability 2.0 from a setup standpoint; purely from an implementation perspective?

wvh 2 months ago | parent [-]

Wide events seem like they would require more memory and CPU to combine, and more bandwidth due to their size. I've implemented services with loggers that gather data and statistics and write out just one combined log line at the end. It's certainly more economical in regard to dev time, though I'm not sure how "one large" compares to "many small" in reality resource-wise.
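For reference, the "one combined log line at the end" pattern can be sketched roughly like this. It is a minimal illustration, not the actual logger described above; the `WideEvent` class and its field names are hypothetical.

```python
import json
import sys
import time


class WideEvent:
    """Collects fields during a unit of work and emits one JSON line at the end."""

    def __init__(self, name, out=sys.stdout):
        self.fields = {"event": name}
        self.out = out
        self._start = time.monotonic()

    def set(self, **fields):
        """Attach or overwrite arbitrary key/value context."""
        self.fields.update(fields)

    def incr(self, key, amount=1):
        """Keep a running counter inside the event instead of emitting a metric."""
        self.fields[key] = self.fields.get(key, 0) + amount

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        self.fields["error"] = exc_type.__name__ if exc_type else None
        self.out.write(json.dumps(self.fields) + "\n")
        return False  # never swallow exceptions


if __name__ == "__main__":
    with WideEvent("checkout") as ev:
        ev.set(user_id=42, cart_items=3)
        ev.incr("db_queries")
        ev.incr("db_queries")
```

The memory cost is one dictionary per in-flight request, and the I/O cost is one write per request, so the "one large" versus "many small" trade-off mostly comes down to how many fields accumulate and how long requests live.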

thewisenerd 2 months ago | parent | prev [-]

> having a decent way to get metrics from logs ad-hoc completely solves the metric cardinality explosion.

last i checked, the span metrics connector[1] was supposed to "solve" this in otel; but i'm not particularly inclined, as configurations are fixed.

any data analytics platform worth its money should be able to do this at runtime (for specified data volume constraints, in reasonable time).

in general, structured logging should also help with this; as much as i love regex, i do not think extracting "data" from raw logs is lossless.
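a contrived sketch of what "not lossless" can look like in practice. the log line, field names and values here are made up; the only point is that free-text values can contain something that looks like another field, while a structured encoding keeps the field boundaries.

```python
import json
import re

# Hypothetical user-supplied value that happens to contain "status=".
user_agent = "curl/8.0 status=200"

# The same event, once as a raw text line and once as structured JSON.
raw_line = f"request done agent={user_agent} status=503"
structured_line = json.dumps(
    {"msg": "request done", "agent": user_agent, "status": 503}
)

# The regex picks up the first "status=", which belongs to the user agent.
regex_status = int(re.search(r"status=(\d+)", raw_line).group(1))

# Structured parsing keeps field boundaries, so the value is unambiguous.
json_status = json.loads(structured_line)["status"]

if __name__ == "__main__":
    print("regex:", regex_status)
    print("json: ", json_status)
```

a more careful regex can fix this particular case, but only if the writer of the query knows the ambiguity exists.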

[1] https://github.com/open-telemetry/opentelemetry-collector-co...