raincole 8 hours ago

Original title: Benchmarking OpenTelemetry: Can AI trace your failed login?

HN Editorialized: OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

The task:

> Your task is: Add OTEL tracing to all microservices.

> Requirements:

> Instrumentation should match conventions and well-known good practices.

> Instrumentation must match the business domain of the microservices.

> Traces must be sent to the endpoint defined by a standard OTEL environment variable.

> Use the recent version of the OTEL SDK.
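For scale, even the bare-minimum version of that task for a single service looks roughly like the Python sketch below (assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed; the service name and span attribute are made-up examples). The exporter reads the endpoint from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable when none is passed explicitly:

    # Minimal sketch: manual OTLP tracing setup with the Python OTEL SDK.
    # The exporter falls back to OTEL_EXPORTER_OTLP_ENDPOINT (the standard
    # env var) when no endpoint argument is given.
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # "checkout-service" is just an example service name
    resource = Resource.create({"service.name": "checkout-service"})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)

    # A domain-named span with an example business attribute
    with tracer.start_as_current_span("checkout.place_order") as span:
        span.set_attribute("order.id", "12345")

And that still leaves the "match the business domain" and "conventions and well-known good practices" parts, repeated across every microservice.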

I really don't think anything involving multiple microservices can be called 'simple', even for humans. Perhaps it is for an expert who already knows the specific business's domain.

pixl97 7 hours ago | parent | next [-]

As someone whose job is support more than SWE, I agree with this.

I've had to work in systems where events didn't share correlation IDs, so I had to go in and filter entries down to microsecond windows to get a small enough set that I could trace what actually happened between a set of services.
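That's exactly the problem trace context propagation solves: a shared ID travels with each request, so downstream entries can be joined instead of reconstructed by timestamp. A minimal sketch, assuming the Python OTEL API plus the requests library (the downstream URL is made up):

    # Sketch: carry a shared correlation ID between services via the
    # W3C 'traceparent' header, using the OTEL propagation API.
    import requests
    from opentelemetry import trace
    from opentelemetry.propagate import inject

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("frontend.request"):
        headers = {}
        inject(headers)  # adds 'traceparent' so the downstream service joins the same trace
        requests.get("http://inventory-service/stock", headers=headers)  # example URL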

From what I've seen on the enterprise software side of the world, a lot of companies are particularly bad at SRE, and there isn't a great amount of standardization.

formerly_proven 7 hours ago | parent [-]

Top 20 company globally by revenue

Enterprise app observability is purely the responsibility of each individual application/project manager. There is virtually no standardization or even shared infra; a team just stuffing plaintext logs into an unconfigured Elasticsearch instance is probably above the median already. There is no visibility into anything across departments and, more often than not, not even across apps within a department.

chaps 7 hours ago | parent | prev [-]

Having done app support across many environments, um - yes, multiple microservices are usually pretty simple. Just look at the open file/network handles and go from there. It's absolutely maddening to watch these models flail at something as basic as "check if the port is open" or "check if the process is running... and don't kill Firefox this time".

These aren't challenging things for an experienced human at all, but they're such a huge pain point for these models! It's hard for me to wrap my head around how these models can write surprisingly excellent code yet fall down on these sorts of relatively simple troubleshooting paths.
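For reference, both checks are one-liners for a person; a minimal Python sketch (psutil is an assumption, and the port and process names are just examples):

    # Sketch of the two checks mentioned above, assuming a host with
    # psutil installed (parsing /proc directly would also work).
    import socket
    import psutil

    def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def process_is_running(name: str) -> bool:
        """Return True if any running process has exactly this name."""
        return any(p.info["name"] == name for p in psutil.process_iter(["name"]))

    print(port_is_open("127.0.0.1", 4317))   # e.g. the default OTLP gRPC port
    print(process_is_running("firefox"))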

jmalicki 5 hours ago | parent [-]

They have code in the training data, and you have e.g. git where you can see how the code evolved, and they can train on PR reviews and comments.

There isn't much posted on the web in the way of "bash history and terminal output of successful sysadminning".

chaps 2 hours ago | parent [-]

I'm not sure that finding and killing the correct process is something I'd consider a "sysadmin task". It's something you learn on the first day of just about any Linux course/primer, and there are plenty of examples of it online.

It's more that the default is to overuse tools that cast too wide a net, like pgrep and pkill, and the model doesn't know how to use the output well enough. Like, when these systems run ps, they pick out random processes in the list instead of identifying the most recent process that they, themselves, started.

It's as if some SRE-type person decided to hard-code pgrep and pkill because it's their personal preference.
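The narrower approach described above, keeping the PID of the process you started rather than pattern-matching by name, looks roughly like this sketch using Python's subprocess (the child command is just an example):

    # Sketch: remember the exact PID of the child you launched and signal
    # only that PID, instead of pkill-by-name across the whole machine.
    import signal
    import subprocess

    proc = subprocess.Popen(["sleep", "300"])   # example child process
    print("started pid", proc.pid)

    # ... later: act only on that exact process, nothing else matches by accident
    proc.send_signal(signal.SIGTERM)
    proc.wait(timeout=10)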