the_duke 7 hours ago

This is very confusingly written.

From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!

Some of the instructions don't give any guidance on how to do it, while others specify which libraries to use.

"Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation....

I'd be very curious HOW exactly the models fail.

Are the test sets just incredibly specific about what output they expect, and you get a lot of failures because of tiny, subtle mismatches? Or do they just get the instrumentation categorically wrong?

Also important: do the models have access to a web search tool to read the library docs? OTel libraries are often complicated to use... without reading the latest docs or source code this would be quite tricky.

Some models have gotten better at adding dependencies, installing them, and then reading the code from the directory where dependencies get stored, but many don't do well with this.

All in all, I'm skeptical that this is very useful as a benchmark as-is.

I'd be much more interested in tasks like:

Here are trace/log outputs, here is the source code; find and fix the bug.

sathish316 3 hours ago | parent | next [-]

+1. I'd argue tasks like "Add OTel instrumentation" belong more in a coding bench than an SRE bench. I came here expecting to see things like how models perform at finding the root cause in 50 complicated microservice failure scenarios.

For AI-SRE tasks like finding the root cause of bugs and errors, I believe the key is to give the agent tools to query metrics, logs, and traces and understand the problem. I'm working on a similar OSS framework and benchmark (work in progress, using metrics and logs; demo: https://youtube.com/playlist?list=PLKWJ03cHcPr3Od1rwL7ErHW1p...). The context is a semantic layer plus Text2SQL for querying the right metrics and logs, and the benchmark is a set of skills that Claude Code or other agents can run with these tools to find the root cause of errors:

Codd Semantic/Text2SQL engine: https://github.com/sathish316/codd_query_engine

PreCogs skills and simulated scenarios: https://github.com/sathish316/precogs_sre_oncall_skills
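
To make the tool shape concrete (a hypothetical sketch only, not the actual Codd or PreCogs API): the agent is handed a function that runs SQL over a logs table and reasons over the rows it gets back.

  import sqlite3

  # Hypothetical illustration: a tiny "query logs" tool an agent could call.
  # The table name, columns, and database file are all assumptions.
  conn = sqlite3.connect("observability.db")

  def query_logs(sql: str, limit: int = 50) -> list[tuple]:
      """Run a read-only SQL query over the logs table and return rows."""
      cur = conn.execute(sql)
      return cur.fetchmany(limit)

  # Example: the agent asks for recent errors from one service.
  rows = query_logs(
      "SELECT timestamp, service, message FROM logs "
      "WHERE level = 'ERROR' AND service = 'checkout' "
      "ORDER BY timestamp DESC"
  )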

ambicapter 5 hours ago | parent | prev | next [-]

> "Use standard OTEL patterns" ... that's about as useful as saying "go write some code".

People tell you to put things like "Use best practices" in your prompts all the time, and chide people who don't.

ndriscoll 4 hours ago | parent | next [-]

Are these the same people who say it doesn't work well? I've been experimenting with writing what I actually mean by that (with the help of an LLM, funny enough), and it seems to be giving me much better code than the typical AI soup. e.g.

  - functional core, imperative shell. prefer pure helpers.
  - avoid methods when a standalone function suffices
  - use typed errors. avoid stringly errors.
  - when writing functions, create a "spine" for orchestration
  - spine rules: one dominant narrative, one concept per line, named values.
  - orchestration states what happens and in what order
  - implementation handles branching, retries, parsing, loops, concurrency, etc.
  - apply recursively: each function stays at one abstraction level
  - names describe why something exists, not how it is computed
etc.

This is no different from writing a style guide for your team/org. You don't just say "write clean code" and expect that you'll get something you like.
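
To make the "spine" rule concrete, here's a rough sketch of the shape I'm after (names invented for illustration):

  import csv

  # Orchestration "spine": states what happens and in what order,
  # one concept per line, named intermediate values.
  def import_report(path: str) -> dict:
      raw_rows = read_rows(path)
      valid_rows = keep_valid(raw_rows)
      total = sum_amounts(valid_rows)
      return {"rows": len(valid_rows), "total": total}

  # Implementation below the spine handles parsing, filtering, and loops.
  def read_rows(path: str) -> list[dict]:
      with open(path, newline="") as f:
          return list(csv.DictReader(f))

  def keep_valid(rows: list[dict]) -> list[dict]:
      return [r for r in rows if r.get("id") and r.get("amount")]

  def sum_amounts(rows: list[dict]) -> float:
      return sum(float(r["amount"]) for r in rows)

The spine reads as a narrative; anything that can branch or loop gets pushed down into the helpers.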

dudeinhawaii 3 hours ago | parent [-]

To play devil's advocate, why do we have to lay out a simple task in PAINSTAKING DETAIL for an AI model which is "PHD LEVEL" and going to take our jobs in 6-12 months?

Why am I still holding its hand like it has the intellect and experience of a new-hire intern that's coded one project in college?

I would never expect to have to lay out every detail about "how to write code" for someone I hired to code on my team at the SWE II and above level (i.e., sub-senior but beyond junior).

In fact, oftentimes backlog items are "fix bug in X where Y is happening" or "add instrumentation to X so that we can see why it's crashing at runtime".

ndriscoll 2 hours ago | parent | next [-]

I find that generally it does alright picking up the style of what exists on its own, so this is more important if it's writing something completely from scratch.

I also think "how to write code" is a matter of taste. E.g., in many ways a Laravel or Rails developer and I would each think the other person's code is bad. As a small-ish thing, I think test-driven development sounds like a massive waste of time, but type-driven development is a huge productivity multiplier and makes the code a lot clearer. I'm sure I have massive disagreements with e.g. the Go maintainers about what is straightforward.

ronsor 3 hours ago | parent | prev | next [-]

> PHD LEVEL

It is PhD level. Most PhD students write awful code that's worse than AI.

simonw 3 hours ago | parent | prev | next [-]

Because the models aren't PhD level and aren't going to take our jobs in 6-12 months.

That's hype. If you want to use these things effectively you need to ignore the hype and focus on what they can actually do.

fragmede an hour ago | parent | prev [-]

Don't worry about devil's advocate: if < 100 words feels like a gargantuan amount of documentation effort ("PAINSTAKING DETAIL"), well, certain stereotypes about developers (not) writing comments or documentation come to mind. Whoever coined the term "prompt engineering" may have the last laugh (before the robots take over) after all.

noitpmeder 2 hours ago | parent | prev [-]

I hate that it's true, but things like this make outputs night-and-day for me. This is the difference between, e.g., a model writing appropriate test harnesses or pushing back on requirements, and one writing the most absolutely horrible code and test/dependency injection I've ever seen in pursuit of the stated goals.

Like adjacent commenters, I've tried to get better at enumerating what I consider best practice, and I couldn't argue in good faith that instructions like these produce no noticeable improvement.

(As with all things AI, it could all be perception on my end, so YMMV. I wish there were a better way to concretely evaluate how different rule sets / instructions / ... affect outcomes.)

pixl97 7 hours ago | parent | prev | next [-]

> Some of the instructions don't give any guidance on how to do it, while others specify which libraries to use.

Supporting a piece of cloud software with a lot of microservices, I think this is a more general problem for humans too. The app I work on dictated some logging requirements, like which library to use, but that was it; different parts built by different teams ended up with all kinds of different behaviors.

As for the AI side, this is something where I see our limited context sizes causing issues when developing architecture across multiple products.

chaps 6 hours ago | parent | next [-]

This is definitely not a context problem. Very simple things like checking for running processes and killing the correct one are something that models like Opus 4.5 can't do consistently correctly, instead of recognizing that it needs to systematize that sort of thing once and be done. Probably 50% of the time it kills the wrong thing. About 25% of the time after that, it recognizes that it didn't kill the correct thing, rewrites the ps or lsof invocation from scratch, and has the same problem again. Then if I kill the process myself out of frustration, it checks whether the process is running, sees that it's not, gets confused, and sets its new task to rewriting the ps or lsof... again. It does the same thing with tests, where it decides, without any doubt in its rock brain, to delete the test and replace it with a print statement.
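
The kind of "systematize it once" helper I mean is tiny. A rough sketch (psutil and matching on the command line are assumptions about the setup):

  import psutil

  # One-and-done helper: find processes whose command line contains a marker
  # string and terminate exactly those, instead of re-deriving ps/lsof each time.
  def kill_matching(marker: str) -> list[int]:
      killed = []
      me = psutil.Process().pid
      for proc in psutil.process_iter(["pid", "cmdline"]):
          cmdline = " ".join(proc.info["cmdline"] or [])
          if marker in cmdline and proc.info["pid"] != me:
              try:
                  proc.terminate()
                  killed.append(proc.info["pid"])
              except psutil.Error:
                  pass
      return killed

  # e.g. kill_matching("uvicorn app:app")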

bob1029 7 hours ago | parent | prev [-]

> limited context sizes

Context size isn't the issue. You couldn't effectively leverage an infinite context even if you had one. The general solution is to recursively decompose the problem into smaller ones, solve them independently of each other, and return the results back up the stack. Recursion is the key here. A bunch of parallel agents on separate call stacks that don't block on their logical callees is a slop factory.
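
A minimal sketch of what I mean by recursive decomposition, with the model calls stubbed out (all names are illustrative):

  # Illustrative only: recursively decompose a task, solve the leaves
  # independently, and return results back up the stack to the caller.
  def solve(task: str, depth: int = 0, max_depth: int = 3) -> str:
      subtasks = decompose(task) if depth < max_depth else []
      if not subtasks:
          return run_agent(task)  # leaf: solve directly
      results = [solve(sub, depth + 1, max_depth) for sub in subtasks]
      return run_agent(f"Combine these results for '{task}': {results}")

  # Stubs standing in for real LLM calls.
  def decompose(task: str) -> list[str]:
      return []  # a real implementation would ask the model to split the task

  def run_agent(prompt: str) -> str:
      return f"result({prompt})"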

YetAnotherNick 5 hours ago | parent | prev | next [-]

I looked into some of the tests, and the tasks are definitely AI-written. I think a separate AI call then generated the tests.

julienfr112 5 hours ago | parent | prev [-]

Like with robotaxis: OK, the thing is not perfect, but how does it compare to a human? I'm interviewing Ops/SRE candidates at the moment, and I'm not so happy with what I see...

esseph 5 hours ago | parent [-]

If you're interviewing Ops, don't expect them to know anything about OTEL. Ops is about the platforms, systems, and operations surrounding and supporting the application.

Integrating OTEL into an application stack requires explicit knowledge of the code, which means the developers.