> Most sysadmins at this point would just configure a new filtered metric and start collecting data… for a month. While the system is broken. Wrong needle? Start looking through the haystack again with another new custom metric for another month.

In this example, I feel like it is treating metrics as the only telemetry signal that operators have access to. Once the metrics indicate an issue, we can pull existing logs, traces, and profiles to dig into it, and eventually capture dumps.

I'm totally on board with the idea of rich trace metadata, but it seems more evolutionary than revolutionary.

jiggawatts 2 months ago | parent [-]

If the logs and traces contain enough info to reproduce the metric, then you don't need a separate metric! That's basically the point here: you can derive arbitrary metrics from wide logs.
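To make that concrete, here is a rough sketch of deriving a metric after the fact from wide events. The event fields (`route`, `region`, `duration_ms`) and the `derive_metric` helper are made up for illustration, not taken from any particular tool:

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, carrying many fields.
EVENTS = [
    {"route": "/checkout", "status": 500, "duration_ms": 900, "region": "eu"},
    {"route": "/checkout", "status": 200, "duration_ms": 100, "region": "eu"},
    {"route": "/cart", "status": 200, "duration_ms": 50, "region": "us"},
    {"route": "/checkout", "status": 200, "duration_ms": 300, "region": "us"},
]


def derive_metric(events, value_key, group_by, where=lambda e: True):
    """Compute the mean of `value_key` per `group_by` value, after filtering."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event[group_by]].append(event[value_key])
    return {key: sum(values) / len(values) for key, values in groups.items()}


# A "new metric" is just a new query over data that was already collected.
by_region = derive_metric(EVENTS, "duration_ms", "region")
eu_by_route = derive_metric(
    EVENTS, "duration_ms", "route", where=lambda e: e["region"] == "eu"
)
```

Picking a different filter or grouping key means rerunning the query, not waiting another month for a new counter to fill up.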

valyala 2 months ago | parent [-]

You can't derive system metrics, such as CPU, RAM, disk I/O, disk space, and network usage, from wide events.

algorithmsRcool 2 months ago | parent [-]

Well, I could just enrich the trace/event with sample data from CPU, RAM, disk I/O, etc.
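
Something like this minimal sketch, assuming a snapshot at the moment the event is emitted is good enough. The `sys.*` field names and both helpers are hypothetical, and it only uses what the Python standard library exposes (process CPU time and disk usage); RAM or network counters would need a platform-specific source:

```python
import shutil
import time


def sample_system(path="/"):
    """Take a cheap snapshot of process/host state using only the stdlib."""
    disk = shutil.disk_usage(path)
    return {
        "sys.cpu_process_seconds": time.process_time(),  # CPU time used by this process
        "sys.disk_used_bytes": disk.used,
        "sys.disk_free_bytes": disk.free,
    }


def enrich(event, sampler=sample_system):
    """Return a copy of the event with the system sample attached."""
    enriched = dict(event)
    enriched.update(sampler())
    return enriched


event = enrich({"route": "/checkout", "duration_ms": 120})
```

The sample only reflects the instant the event was written, so a host that is busy but emitting no events would still leave gaps.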