Ask HN: How do you expose telemetry data to non-tech?
10 points by mehdig10 14 hours ago | 6 comments
Hey, we have a ton of logging/metrics in Kibana/Prometheus, so whenever there is an issue with some client data feed or whatever, support asks us to check what happened. Since the problem might not be on our side, developers started giving non-tech teams access to those tools. It feels a bit weird to have a non-tech person build a Grafana dashboard and do some sort of debugging, but even if it's clumsy they do seem to enjoy the visibility they get into the system, and developers are no longer in charge of debugging every little thing. Is this specific to our business, or is this something you have experienced as well? If so, how were you able to expose this data to non-tech people?
iKlsR 13 hours ago
Here's what worked for the last two teams I managed. We used Slack as a hub for everything. We had some brittle [1] services, so there were cron scripts that ran every minute and would alert to Slack when one was down, @ing someone. Likewise, we rolled our own dumb logger that initially just existed as a middleware to capture requests and responses; certain events were logged to Slack and to disk in the code with `event(enum, json)`. The Slack bot could also dump certain info for users or events via slash commands, provided you had an id. From these logs and other bits added over time we could see when execs connected, when someone had difficulty with auth, when a job or method took abnormally long, the current active sessions, etc.

This grew to support the CEO, marketing and other devs and got pretty involved: at some point we had small services tied in that could visualize GeoJSON over a map for a completed trip, dump a replay session, or get all stack traces for the day. Also, for 3rd-party services we couldn't tie into directly, we used a proxy setup where we didn't call them directly but through a wrapper where we could capture data, so a call to `api.somesite.com/v1/events` became `mysite.com/proxy?site=api.somesite.com/v1/events` in our apps; when our clients called this we knew and could again log it. Rough sketches of both ideas are below.

Since this seems close enough to the problem I had: you could take a look at this approach, start with what's being requested or the repeating problems, and have a central hub where others ingest these via Discord or Slack with appropriate channels (#3rd-party-uptimes, #backups, #raw-logs, #events). With this in place we rarely used our dashboards or Bugsnag, or needed to ssh into any server to pull access or error logs.

- [1] This one was particularly so because they had an org policy to randomly reset VPN passwords, and the only way to change it was using a desktop client to basically set the same password again.
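A rough sketch of the `event(enum, json)` logger, for concreteness (Python; the webhook URL, log path and event names are illustrative placeholders, not the original setup — Slack incoming webhooks accept a simple JSON "text" payload, which is all this uses):

    import json
    import enum
    import datetime
    import requests  # pip install requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
    LOG_PATH = "/var/log/app/events.jsonl"                              # placeholder

    class Event(enum.Enum):
        AUTH_FAILED = "auth_failed"
        JOB_SLOW = "job_slow"
        FEED_DOWN = "feed_down"

    def event(kind: Event, payload: dict) -> None:
        """Write the event to disk and mirror it into a Slack channel."""
        record = {
            "ts": datetime.datetime.utcnow().isoformat() + "Z",
            "event": kind.value,
            "data": payload,
        }
        # Append-only JSON lines on disk for later grepping/replay.
        with open(LOG_PATH, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        # Human-readable mirror in Slack so non-tech folks see it immediately.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"[{kind.value}] {json.dumps(payload)}"},
            timeout=5,
        )

    # Usage from middleware or anywhere in the code:
    # event(Event.FEED_DOWN, {"client": "acme", "feed": "orders", "detail": "timeout after 30s"})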
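And a minimal sketch of the proxy wrapper, assuming Flask and requests; logging goes to the standard Python logger here (rather than Slack) just to keep the example self-contained:

    import logging
    import requests
    from flask import Flask, Response, request  # pip install flask

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("proxy")
    app = Flask(__name__)

    @app.route("/proxy", methods=["GET", "POST"])
    def proxy():
        # Clients call mysite.com/proxy?site=api.somesite.com/v1/events
        # instead of the third party directly, so every call is observable.
        upstream = "https://" + request.args["site"]
        resp = requests.request(
            method=request.method,
            url=upstream,
            params={k: v for k, v in request.args.items() if k != "site"},
            data=request.get_data(),
            headers={"Content-Type": request.headers.get("Content-Type", "")},
            timeout=10,
        )
        # Log what actually happened so support can see third-party failures too.
        log.info("proxied %s -> %s in %dms", upstream, resp.status_code,
                 int(resp.elapsed.total_seconds() * 1000))
        return Response(resp.content, status=resp.status_code,
                        content_type=resp.headers.get("Content-Type"))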
garettmd 13 hours ago
I'd argue this is an issue not just with "non-tech" folks, but even with engineers who don't have experience with Prometheus and other time-series databases. Learning PromQL always seemed like a hard thing to ask of other engineers. Grafana has made it easier to explore and build queries over time, but there are still quirks and nuances that can be difficult to explain to people whose role doesn't typically involve scouring through metrics.
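For example, even a basic "are we erroring right now?" question turns into something like sum(rate(http_requests_total{status=~"5.."}[5m])) by (service), which is a lot to ask of someone who doesn't live in metrics. One option is to hide a handful of canned queries behind plain-English summaries via the Prometheus HTTP API. A hedged sketch, assuming a typical http_requests_total counter with status/service labels (metric and label names are assumptions):

    import requests  # pip install requests

    PROM_URL = "http://prometheus:9090"  # placeholder
    QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)'

    def error_rates() -> list[str]:
        """Run one canned PromQL instant query and return human-readable lines."""
        resp = requests.get(f"{PROM_URL}/api/v1/query",
                            params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        lines = []
        for series in resp.json()["data"]["result"]:
            service = series["metric"].get("service", "unknown")
            per_sec = float(series["value"][1])
            lines.append(f"{service}: {per_sec:.2f} errors/sec over the last 5 minutes")
        return lines or ["No services reporting errors right now."]

    if __name__ == "__main__":
        print("\n".join(error_rates()))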
toomuchtodo 14 hours ago
https://www.statuspage.io/ or an open equivalent (https://github.com/oneuptime/oneuptime) perhaps? In my experience it's a straightforward shim to distill engineering telemetry into a non-tech-facing interface. Think of this as scaling engineering comms to the non-engineering parts of the org.
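A hedged sketch of what that shim can look like, assuming Prometheus Alertmanager is the telemetry source and its standard webhook payload; the status-page call is stubbed out since each product's API differs:

    from flask import Flask, request  # pip install flask

    app = Flask(__name__)

    @app.route("/alertmanager", methods=["POST"])
    def alertmanager_webhook():
        """Turn Alertmanager notifications into plain-English status updates."""
        payload = request.get_json()
        for alert in payload.get("alerts", []):
            name = alert["labels"].get("alertname", "unknown alert")
            summary = alert.get("annotations", {}).get("summary", "")
            state = "resolved" if alert["status"] == "resolved" else "ongoing"
            publish_to_status_page(f"{name}: {summary} ({state})")
        return "", 204

    def publish_to_status_page(message: str) -> None:
        # Placeholder: wire this to Statuspage, OneUptime, or even a Slack channel.
        print(message)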