| ▲ | Benchmarking OpenTelemetry: Can AI trace your failed login? (quesma.com) |
| 136 points by stared 7 hours ago | 77 comments |
| |
|
| ▲ | dang an hour ago | parent | next [-] |
| Submitters: "Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so... (Submitted title was "OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)") |
|
| ▲ | the_duke 6 hours ago | parent | prev | next [-] |
| This is very confusingly written. From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code! Some of the instructions don't give any guidance on how to do it; some specify which libraries to use. "Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation. I'd be very curious HOW exactly the models fail. Are the test sets just incredibly specific about what output they expect, so you get a lot of failures from tiny, subtle mismatches? Or do they get the instrumentation categorically wrong? Also important: do the models have access to a web search tool to read the library docs?
OTel libraries are often complicated to use... without reading the latest docs or source code this would be quite tricky. Some models have gotten better at adding dependencies, installing them and then reading the code from the directory where dependencies get stored, but many don't do well with this. All in all, I'm very skeptical that this is useful as a benchmark as is. I'd be much more interested in tasks like: here are the trace/log outputs, here is the source code, find and fix the bug. |
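To make that ambiguity concrete, here is a hedged sketch (service and function names are hypothetical, SDK/exporter setup is assumed to happen elsewhere) of two approaches a model could reasonably pick when told to "use standard OTEL patterns" in Python:

```python
from opentelemetry import trace

# Works against the opentelemetry-api package; without a configured SDK this
# returns a no-op tracer, so the snippet runs either way.
tracer = trace.get_tracer("payments-service")

# Option A: explicit spans around business operations.
def charge_card(order_id: str, amount_cents: int) -> None:
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        # ... call the payment provider here ...

# Option B: add no manual spans at all and rely on library instrumentors
# (Flask/FastAPI/requests/etc.) or the opentelemetry-instrument wrapper.
```

Both are defensible "standard patterns", which is why a benchmark that asserts on specific span names can fail otherwise reasonable solutions.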
| |
| ▲ | sathish316 2 hours ago | parent | next [-] | | +1 I'm not sure tasks like "Add OTel instrumentation" belong in an SRE bench; they seem more like coding-bench tasks. I came here expecting to see things like: this is how models perform at finding the root cause in 50 complicated microservice failure scenarios. For AI-SRE tasks like finding the root cause of bugs and errors, I believe the key is to provide the agent with tools to query metrics, logs and traces and understand the problem. I'm working on a similar OSS framework and benchmark (work in progress, using metrics and logs - demo: https://youtube.com/playlist?list=PLKWJ03cHcPr3Od1rwL7ErHW1p...), where the context is semantics and Text2SQL to query the right metrics and logs, and the benchmark is a set of skills that Claude Code or other agents can run using these tools to find the root cause of errors: Codd Semantic/Text2SQL engine: https://github.com/sathish316/codd_query_engine PreCogs skills and simulated scenarios: https://github.com/sathish316/precogs_sre_oncall_skills | |
| ▲ | ambicapter 3 hours ago | parent | prev | next [-] | | > "Use standard OTEL patterns" ... that's about as useful as saying "go write some code". People say to say things like "Use best practices" in your prompts all the time, and chide people who don't. | | |
| ▲ | ndriscoll 3 hours ago | parent | next [-] | | Are these the same people who say it doesn't work well? I've been experimenting with writing what I actually mean by that (with the help of an LLM, funny enough), and it seems to be giving me much better code than the typical AI soup. e.g. - functional core, imperative shell. prefer pure helpers.
- avoid methods when a standalone function suffices
- use typed errors. avoid stringly errors.
- when writing functions, create a "spine" for orchestration
- spine rules: one dominant narrative, one concept per line, named values.
- orchestration states what happens and in what order
- implementation handles branching, retries, parsing, loops, concurrency, etc.
- apply recursively: each function stays at one abstraction level
- names describe why something exists, not how it is computed
etc. This is no different from writing a style guide for your team/org. You don't just say "write clean code" and expect that you'll get something you like. (A minimal sketch of the "spine" idea is shown below.) | | |
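As a concrete illustration of the "spine" bullets above, here is a minimal, hypothetical Python sketch (all names invented): the top-level function reads as one narrative while helpers absorb parsing and branching.

```python
# Orchestration "spine": one dominant narrative, one concept per line.
def sync_invoices(raw_records: list[dict]) -> dict:
    parsed = parse_records(raw_records)
    valid = keep_valid(parsed)
    return summarize(valid)

# Implementation level: coercion, missing fields, filtering details live here.
def parse_records(records: list[dict]) -> list[dict]:
    return [{"id": r.get("id"), "total": float(r.get("total", 0))} for r in records]

def keep_valid(records: list[dict]) -> list[dict]:
    return [r for r in records if r["id"] is not None and r["total"] > 0]

def summarize(records: list[dict]) -> dict:
    return {"count": len(records), "grand_total": sum(r["total"] for r in records)}

print(sync_invoices([{"id": "a1", "total": "19.99"}, {"total": "5"}]))
# -> {'count': 1, 'grand_total': 19.99}
```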
| ▲ | dudeinhawaii 2 hours ago | parent [-] | | To play devil's advocate, why do we have to lay out a simple task in PAINSTAKING DETAIL to an AI model which is "PHD LEVEL" and going to take our jobs in 6-12 months? Why am I still holding its hand like it has the intellect and experience of a new-hire intern that's coded one project in college? I would never expect to have to lay out every detail about "how to write code" to someone I hired to code on my team at the SWE II and above level (i.e., sub-senior but beyond junior). In fact, oftentimes backlog items are "fix bug in x where y is happening" or "add instrumentation to X so that we can see why it's crashing at runtime". | | |
| ▲ | ndriscoll an hour ago | parent | next [-] | | I find that generally it does alright picking up the style of what exists on its own, so this is more important if it's writing something completely from scratch. I think also "how to write code" is a matter of taste. e.g. in many ways I think I and a Laravel or Rails developer would each think that the other person's code is bad. e.g. as a small-ish thing, I think test-driven development sounds like a massive waste of time, but type-driven development is a huge productivity multiplier and makes the code a lot clearer. I'm sure that I have massive disagreements with e.g. the Go maintainers about what is straightforward. | |
| ▲ | ronsor 2 hours ago | parent | prev | next [-] | | > PHD LEVEL It is PhD level. Most PhD students write awful code that's worse than AI. | |
| ▲ | simonw an hour ago | parent | prev [-] | | Because the models aren't PhD level and aren't going to take our jobs in 6-12 months. That's hype. If you want to use these things effectively you need to ignore the hype and focus on what they can actually do. |
|
| |
| ▲ | noitpmeder an hour ago | parent | prev [-] | | I hate that it's true, but things like this make outputs night-and-day for me. This is the difference between, e.g., a model writing appropriate test harnesses or pushing back on requirements, vs. writing the most absolutely horrible code and test/dependency injection I've ever seen in pursuit of the listed goals. Like the adjacent commenters, I've tried to be better at enumerating what I consider to be best practice, but I couldn't argue in good faith that instructions like these produce no noticeable improvement. (As with all things AI, it could all be perception on my end, so YMMV; I wish there was a better way to concretely evaluate the effects of different rule sets / instructions / ... on outcomes) |
| |
| ▲ | pixl97 6 hours ago | parent | prev | next [-] | | >Some of the instructions don't give any guidance on how to do it, some specify which libraries to use. In supporting a piece of cloud software with a lot of microservices, I think this is a more general problem for humans too. The app I work with mandated some logging requirements, like which library to use, but that was it; different parts built by different teams ended up with all kinds of different behaviors. As for the AI side, this is where I see our limited context sizes causing issues when developing architecture across multiple products. | | |
| ▲ | chaps 5 hours ago | parent | next [-] | | This is definitely not a context problem. Very simple things like checking for running processes and killing the correct one are something that models like Opus 4.5 can't do consistently correctly... instead of recognizing that it needs to systematize that sort of thing -- one and done. Like, probably 50% of the time it kills the wrong thing. About 25% of the time after that, it recognizes that it didn't kill the correct thing and then rewrites the ps or lsof from scratch and has the problem again. Then if I kill the process myself out of frustration, it checks to see if the process is running, sees that it's not, then gets confused and sets its new task to rewriting the ps or lsof... again. It does the same thing with tests, where it decides to just, without any doubt in its rock brain, delete the test and replace it with a print statement. |
| ▲ | bob1029 6 hours ago | parent | prev [-] | | > limited context sizes Context size isn't the issue. You couldn't effectively leverage an infinite context even if you had one. The general solution is to recursively decompose the problem into smaller ones and solve them independently of each other, returning the results back up the stack. Recursion is the key here. A bunch of parallel agents on separate call stacks that don't block on their logical callees is a slop factory. |
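A rough sketch of the recursive shape being described, with trivial placeholder helpers (nothing here is a real agent framework), just to show results flowing back up the call stack rather than being fanned out to detached parallel workers:

```python
def is_small_enough(task: str) -> bool:
    return " and " not in task

def decompose(task: str) -> list[str]:
    return task.split(" and ")

def solve_directly(task: str) -> str:
    return f"done: {task}"  # stand-in for a single focused agent/LLM call

def solve(task: str, depth: int = 0) -> str:
    # Recursive decomposition: each level blocks on its callees and merges results.
    if is_small_enough(task) or depth > 3:
        return solve_directly(task)
    results = [solve(sub, depth + 1) for sub in decompose(task)]
    return "; ".join(results)

print(solve("add tracing and add logging and wire the exporter"))
# -> done: add tracing; done: add logging; done: wire the exporter
```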
| |
| ▲ | YetAnotherNick 3 hours ago | parent | prev | next [-] | | I looked into some of the tests and the tasks are definitely AI-written. I think a separate AI call then generated the tests. | |
| ▲ | julienfr112 4 hours ago | parent | prev [-] | | Like with robotaxis: OK, the thing is not perfect, but how does it compare to a human? I'm interviewing ops/SRE candidates at the moment, and I'm not so happy with what I see... | | |
| ▲ | esseph 4 hours ago | parent [-] | | If you're interviewing ops, don't expect them to know anything about OTEL. Ops is about platforms, systems, and the operations surrounding and supporting the application. Integrating OTEL into an application stack requires explicit knowledge of the code - that's the developers. |
|
|
|
| ▲ | raincole 6 hours ago | parent | prev | next [-] |
| Original title: Benchmarking OpenTelemetry: Can AI trace your failed login?
HN editorialized: OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
The task:
> Your task is: Add OTEL tracing to all microservices.
> Requirements:
> Instrumentation should match conventions and well-known good practices.
> Instrumentation must match the business domain of the microservices.
> Traces must be sent to the endpoint defined by a standard OTEL environment variable.
> Use the recent version of the OTEL SDK.
I really don't think anything involving multiple microservices can be called 'simple', even to humans. Perhaps to an expert who knows the specific business's domain knowledge it is. |
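For the endpoint requirement specifically, the "standard OTEL environment variable" is OTEL_EXPORTER_OTLP_ENDPOINT (or its _TRACES_ variant), and the stock Python OTLP exporters already read it; the explicit lookup in this sketch is only to make the convention visible and is otherwise redundant:

```python
import os
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# 4318 is the default OTLP/HTTP port (4317 is the gRPC default).
endpoint = (os.environ.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
            or os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318"))
print("traces will go to:", endpoint)

exporter = OTLPSpanExporter()  # honors the same variables on its own
```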
| |
| ▲ | pixl97 6 hours ago | parent | next [-] | | As someone whose job is support more than SWE, I agree with this. I've had to work in systems where events didn't share correlation IDs; I had to go in and filter entries down to microsecond windows to get a small enough number of entries that I could trace what actually happened between a set of services. From what I've seen on the enterprise software side of the world, a lot of companies are particularly bad at SRE and there isn't a great amount of standardization. | |
| ▲ | formerly_proven 6 hours ago | parent [-] | | Top 20 company globally by revenue here: enterprise app observability is purely the responsibility of each individual application/project manager. There is virtually no standardization or even shared infra; a team just stuffing plaintext logs into an unconfigured Elasticsearch instance is probably above the median already. There is no visibility into anything across departments and, more often than not, not even across apps in a department. |
| |
| ▲ | chaps 6 hours ago | parent | prev [-] | | Having done app support across many environments, um - yes, multiple microservices are usually pretty simple. Just look at the open file/network handles and go from there. It's absolutely maddening to watch these models flail in trying to do something as basic as "check if the port is open" or "check if the process is running... and don't kill Firefox this time". These aren't challenging things for an experienced human at all. But they're such a huge pain point for these models! It's hard for me to wrap my head around how these models can write surprisingly excellent code but fall down in these sorts of relatively simple troubleshooting paths. | |
| ▲ | jmalicki 4 hours ago | parent [-] | | They have code in training data, and you have e.g. git where you can see how the code evolved, and they can train on PR reviews and comments. There isn't much posted on the web in the way of "bash history and terminal output of successful sysadminning" | |
| ▲ | chaps an hour ago | parent [-] | | I'm not sure that finding and killing the correct process is something I'd consider to be a "sysadmin task". That's something you learn on the first day of just about any Linux course/primer, and there are many examples of its use online. It's more that the default is to overuse tools that cast too-wide nets, like pgrep and pkill, and it doesn't know how to use the output well enough. Like, when these systems run ps, they identify random processes in the list instead of the most recent process that they, themselves, started. It's as if some SRE-type person decided to hard-code pgrep and pkill because it's their personal preference. |
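A hedged sketch of the "systematize it once" idea (the dev_server.py script is hypothetical): keep a handle on the process you started instead of hunting for it afterwards with pgrep/pkill and hoping the name match is unique.

```python
import signal
import subprocess

# Start the process and remember exactly which one it is.
proc = subprocess.Popen(["python", "dev_server.py"])
print("started pid", proc.pid)

# ... later: stop that specific process, not whatever happens to match a name.
proc.send_signal(signal.SIGTERM)
try:
    proc.wait(timeout=10)
except subprocess.TimeoutExpired:
    proc.kill()
```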
|
|
|
|
| ▲ | whynotminot 6 hours ago | parent | prev | next [-] |
| I would wager the main reason for this is the same reason it's also hard to teach these skills to people: there's not a lot of high-quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires. Very few people start their careers as SREs; it's generally something they migrate into after enjoying it and showing aptitude for it. With that said, I wouldn't expect this wall to hold up for too long. There has been a lot of low-hanging fruit in teaching models how to code. When that is saturated, the frontier companies will likely turn their attention to honing training environments for SRE-style debugging. |
| |
| ▲ | tetha 2 hours ago | parent | next [-] | | > I would wager the main reason for this is the same reason it's also hard to teach these skills to people: there's not a lot of high-quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires. The search space for a cause beyond a certain size can also be big. Very big. Like, at work we're at the beginning of where the power law starts going nuts: somewhere around 700-1000 services in production, across several datacenters, with a few dozen infrastructure clusters behind it. For each bug, if you looked into it, there'd probably be 20-30 changes, 10-20 anomalies, and 5 weird things someone noticed in the 30 minutes around it. People already struggle at triaging the relevance of everything in this context. That's something I can see AI start helping with, and there were some talks about Meta doing just that - ranking changes and anomalies in order of relevance to a bug ticket so people don't chase other things. That's however just the reactive part of ops and SRE work. The proactive part is much harder and oftentimes not technical. What if most negatively rated support cases run into a dark hole in a certain service, but the responsible team never allocates time to improve monitoring because sales is on their butt for features? LLMs can maybe identify this, or help them implement the tracing faster, but those 10 minutes could also be spent on features for money. And what AI model told you to collect metrics about support cases and their resolution to even be able to ask that question? | |
| ▲ | hosh 4 hours ago | parent | prev | next [-] | | I disagree. AI works better as a tool for teaching humans than as something to do the work itself. While someone experienced in fighting fires can take intuitive leaps, the basic idea is still to synthesize a hypothesis from signals, validate the hypothesis, and come up with mitigations and longer-term fixes. This is a learned skill, and a team of people/AI will work better than someone solo. https://hazelweakly.me/blog/stop-building-ai-tools-backwards... | |
| ▲ | heliumtera 5 hours ago | parent | prev | next [-] | | There is definitely more to why models can't perform well at SRE. One, it is not engineering, it is next-token prediction, it is vibes. They could call it Site Reliability Vibing or something like that. When we ask it to generate an image, any image will do. We couldn't care less. Try to sculpt it, try to rotate it 45 degrees, and all hell breaks loose: the image would be rotated, but the hair color could change as well. Pure vibes! When you ask it to refactor your code, any pattern will do. You could rearrange the code in infinite ways, rename variables in infinite ways, without fundamentally breaking the logic. You could make any number of arbitrary bullshit abstractions and call it good, as people have done for years with OOP. It does not matter at all; any result will do in these cases. When you want to hit a specific gRPC endpoint, you need a specific address and the method expects a specific contract to be honored. This either matches or it doesn't.
When you want the LLM to implement a solution that captures specific syscalls from specific hosts and sends traces to a specific platform, using a specific protocol, consolidating records in a specific bucket... you have one state that satisfies your needs and 100 requirements that all have to be fulfilled. It either meets all the requirements or it's no good. That truly is different from vibing, and LLMs will never be able to do this on their own. Maybe agents will, depending on the harness and the systems in place, but a bare model just generates words, words, words with no care about anything else | |
| ▲ | lysace 6 hours ago | parent | prev [-] | | > With that said, I wouldn’t expect this wall to hold up for too long. The models are already so good at the traditionally hard stuff: collecting that insane amount of detailed knowledge across so many different domains, languages and software stacks. |
|
|
| ▲ | asyncadventure 7 hours ago | parent | prev | next [-] |
| This aligns with my experience trying to automate observability tasks - AI excels at individual coding patterns but struggles with the holistic understanding needed for distributed tracing. The 29% success rate actually seems optimistic considering how OpenTelemetry requires deep context about service boundaries and business logic, not just syntactic correctness. |
| |
| ▲ | jakozaur 6 hours ago | parent [-] | | In this benchmark, the microservices are really small (~300 lines), and sometimes there are just two of them. More realistic tasks (large codebases, more microservices) would have a lower success rate. | |
| ▲ | ndriscoll 5 hours ago | parent [-] | | I'd expect it to actually do better in a large codebase. e.g. you'd already have an HTTP middleware stack, so it'd know that it can just add a layer to that for traces (and in fact there might already be off-the-shelf layers for whatever framework) vs. having to invent that on its own for the bare microservice. |
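For example, in a Python codebase that already runs FastAPI, adding traces can come down to one call into the existing middleware stack (this assumes the opentelemetry-instrumentation-fastapi package and a configured SDK; the route itself is hypothetical):

```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

@app.get("/orders/{order_id}")
def get_order(order_id: str):
    return {"order_id": order_id, "status": "shipped"}

# Attaches tracing middleware to every route; no per-handler spans required.
FastAPIInstrumentor.instrument_app(app)
```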
|
|
|
| ▲ | dgxyz 7 hours ago | parent | prev | next [-] |
| Our humans struggle with them too. It's the only domain where you actually need to know everything. I wouldn't touch this with a pole if our MTTR depended on it being successful, though. |
| |
| ▲ | vasco 7 hours ago | parent [-] | | As someone who has done this for a job for a while, I can say it's starting to be useful in many domains related to SRE and makes parts of the job easier. MCP servers for monitoring tools are making our developers more competent at finding metrics and issues. It'll get there, but nobody is going to type "fix my incident" in production and have a nice time today, outside of the simplest things which, if they could be fixed like this, could've been automated already anyway. But going from writing a runbook to automating it sometimes takes time, so those use cases will grow. |
|
|
| ▲ | jedberg 2 hours ago | parent | prev | next [-] |
| We've been experimenting with combining durable execution with debugging tasks, and it's working incredibly well! With the added context of actual execution data, defined by the developer in terms of which functions are important (instead of individual calls), it gives the LLM the data it needs. I know there are AI SRE companies that have discovered the same thing -- that you can't just throw a bunch of data at a regular LLM and have it "do SRE things". It needs more structured context, and their value add is knowing what context and what structure are necessary. |
|
| ▲ | nyellin 2 hours ago | parent | prev | next [-] |
| HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers - including Fortune 500 companies using SRE agents in incredibly complex production environments. We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet even wins sometimes. [1] https://holmesgpt.dev/development/evaluations/history/ |
|
| ▲ | dirtytoken7 3 hours ago | parent | prev | next [-] |
| The 29% score tells us more about benchmark design than model capability IMO. These benchmarks conflate two very different problems: (1) understanding what needs to be done, and (2) correctly implementing it in a specific library ecosystem. A human SRE who's never touched OTel would also struggle initially - not because they can't reason about traces, but because the library APIs have quirks that take time to learn. The more interesting question is whether giving the model access to relevant docs/examples during the task significantly changes the scores. If it does, that suggests the bottleneck is recall not reasoning. If it doesn't, the reasoning gap is real. FWIW I've found that models do much better on ops tasks when you can give them concrete examples of working instrumentation in the same codebase rather than asking them to generate from scratch. |
|
| ▲ | 0xferruccio 4 hours ago | parent | prev | next [-] |
| To be fair, I remember spending almost two weeks implementing OTel at my startup; the infrastructure-as-code setup of getting collectors running within a Kubernetes cluster using Terraform was a nightmare two years ago. I just kept running into issues: the docs were really poor and the configuration had endless options. |
|
| ▲ | srijanshukla18 5 hours ago | parent | prev | next [-] |
| Humans can't do much better on OTelBench.
Try finding even good documentation for it. That's just misleading phrasing on this post. I'm an SRE, and AI does NOT struggle with 'simple SRE tasks'.
OTel instrumentation is by no measure a 'simple SRE task'. |
|
| ▲ | mellosouls 3 hours ago | parent | prev | next [-] |
| Related discussion the other day: The future of software engineering is SRE (257 points, 139 comments) https://news.ycombinator.com/item?id=46759063 |
|
| ▲ | jcims 7 hours ago | parent | prev | next [-] |
| I've been building an 'SRE agent' with LangGraph for the past couple of weeks, and honestly I've been incredibly impressed with the ability of frontier models, when properly equipped with useful tools and context, to quickly diagnose issues and suggest reasonable steps to remediate. Primary tooling for me is access to source code, the CI/CD environment and the infrastructure control plane. Some cues in the context to inform basic conventions really help. Even when it's not particularly effective, the additional information provided tends to be quite useful. |
|
| ▲ | hakanderyal 5 hours ago | parent | prev | next [-] |
| Anyone who has spent serious time with agents knows that you cannot expect out-of-the-box success without good context management, despite what the hyping crowd claims. Have the AI document the services first into a concise document. Then give it proper instructions about what you expect, along with the documentation it created. Opus would pass that. We are not there yet; the agents are not ready to replace the driver. |
| |
| ▲ | parliament32 5 hours ago | parent [-] | | Sounds like it'd be faster to just do it yourself. | | |
| ▲ | hakanderyal 5 hours ago | parent | next [-] | | If you are not going all in with agents, yes, it would. On the other hand, the documentation & workflows need to be created only once. You need to invest a bit upfront to get positive RoI. | |
| ▲ | pixl97 4 hours ago | parent | prev [-] | | Until you have a whole team doing it differently because of no spec. |
|
|
|
| ▲ | ripped_britches 5 hours ago | parent | prev | next [-] |
| Maybe I haven’t dug in enough, but why is the second GET request a different trace? Is it clicking a different result from the same search? It’s possible that the requirements here are not clear, given that the instructions don’t detail how to handle such a situation and it’s not obvious to me as a human. |
| |
| ▲ | fragmede 41 minutes ago | parent [-] | | Why wouldn't it be? It's a different request. If you've got an entire distributed system, the same GET request a millisecond later could get routed entirely differently, and succeed or fail. Even the caching layer is suspect. |
|
|
| ▲ | winton 7 hours ago | parent | prev | next [-] |
| So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome |
| |
|
| ▲ | smithclay 6 hours ago | parent | prev | next [-] |
| We need more rigorous benchmarks for SRE tasks, which is much easier said than done. The only other benchmark I've come across is https://sreben.ch/ ... surely there must be others by now? |
| |
|
| ▲ | jp57 3 hours ago | parent | prev | next [-] |
| Which have longer lifecycles, LLM model versions, or trends in SRE practices? |
|
| ▲ | esafak 4 hours ago | parent | prev | next [-] |
| This is a good idea. It makes sense that they would struggle because there is not much training data. |
|
| ▲ | yomismoaqui 6 hours ago | parent | prev | next [-] |
| I'm a human with 20+ years of experience, and making OTEL work in Go was painful. It made me remember when I was working in the J2EE ecosystem (shudder). |
|
| ▲ | AnotherGoodName 7 hours ago | parent | prev | next [-] |
| This is a little damning of the way Google does things, honestly. >When an app runs on a single machine, you can often trace an error by scrolling through a log file. But when it runs across 50 microservices, that single request gets scattered into a chaotic firehose of disconnected events. Yep, this is about Google. It's painful for humans to debug, and it's also an extremely bespoke issue to deal with. No one else has quite the same level of clusterfuck, and there's going to be no training data for LLMs on this. |
| |
| ▲ | youknownothing 6 hours ago | parent | next [-] | | isn't that what trace IDs are for? | | |
| ▲ | belval 6 hours ago | parent | next [-] | | Yeah I don't know their stack but I have a service that is a collection of microservices and Opus can debug them fine by aggregating the logs tied to the same faulty request ID. In general for those tasks though the question is more "How would a human do it". If it's impossible for a human because your tooling is so bad you can't even get the logs across services for a single ID, that seems like a pretty serious design issue. In general looking at the prompt though, this is also not very representative. You don't have an SOP that you can share with your agent? How do you expect new hires to onboard? | | |
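One common way to get that correlation, sketched here with the OTel Python API plus stdlib logging (assumes tracing is already configured; the exact setup will vary by stack): stamp the active trace ID onto every log record so logs from different services can be grouped per request.

```python
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s trace=%(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)
```

(The opentelemetry-instrumentation-logging package does roughly this for you.)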
| ▲ | pixl97 5 hours ago | parent [-] | | >How do you expect new hires to onboard? I've seen some places that pretty much say "Good luck, we hope you can swim. Life preserver not provided" |
| |
| ▲ | pixl97 6 hours ago | parent | prev [-] | | Much like nested errors, management of trace IDs becomes difficult at scale, as you will start getting multiple correlation references in complex systems. | |
| |
| ▲ | tayo42 6 hours ago | parent | prev [-] | | It's bespoke to debug across multiple services? This seems like typical work in any business that isn't trivial. | | |
| ▲ | AnotherGoodName 6 hours ago | parent [-] | | Not to the same extent. Microservices aren't actually about making things better for developers in any way; they're simply a way to address a scaling issue. E.g. Facebook (I've worked at Meta and Google, amongst others, so it's a good way to compare extremes) is entirely a monolith. You type a line of code, hit refresh and you see it, running fully in the context of everything else your dev server does. It's still statically typed, so a type error is seen quickly in the full context of everything that the server can do, and in general there's just no impetus to move to microservices since deployment of the monolith takes no time. Every server running Facebook runs the exact same image. That's not to say Hack is a perfect language or anything. It's basically PHP made to look and act like Java, which isn't great, but the fact is you never ever think about how the code runs and interacts in the context of a microservice environment. You don't need to. Everyone who's worked at both Meta and Google has the opinion that Meta moves faster, and this is part of the reason. Some companies have architectures that can't deploy like this. That's the reason you move to microservices. It's not at all a developer-velocity win. It's just needed if you have frameworks that don't allow you to run and deploy "all the code ever written in the company" in a reasonable way. You need to break it up into modular pieces that have defined boundaries so that you only run the parts you need as you develop (defined boundaries are a dev win, sure, but that can be done without microservices). Google has gotten to the point where things are getting really fine-grained and honestly chaotic. Moving a portion of code to its own microservice is basically a promo-bait 6-month project, often done without justification other than "everything should be its own microservice". In my time at Google I never heard "what benefit do we get if this is a microservice?" - it's just assumed to always be a good thing. 50 interacting microservices to go through in a trace is at the point where the only place I've seen such a thing is Google. |
|
|
|
| ▲ | derfurth 5 hours ago | parent | prev | next [-] |
| In my experience the approach matters a lot. I recently implemented OTel with Claude Code in a medium-sized ~200k LOC project:
- initially it wasn't working; plenty of parent/child relationship problems like those described in the post
- so I designed a thin wrapper and used sealed classes for events instead of dynamic spans + some light documentation
It took me about a day to implement tracing on the existing codebase, and for new features it works out of the box using the documentation. At the end of the day, leveraging typing + documentation dramatically constrains LLMs into doing a better job. |
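A loose Python adaptation of that idea (the original presumably used a language with sealed classes; all type names here are invented): a closed set of typed events, with one small wrapper that turns them into spans, so neither the LLM nor a human gets to invent free-form span names.

```python
from dataclasses import asdict, dataclass
from opentelemetry import trace

tracer = trace.get_tracer("app")  # assumes the SDK/exporter is configured elsewhere

@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    total_cents: int

@dataclass(frozen=True)
class PaymentFailed:
    order_id: str
    reason: str

TraceEvent = OrderPlaced | PaymentFailed  # the "sealed" set of allowed events

def record_event(event: TraceEvent) -> None:
    # Span name comes from the type, attributes from its fields.
    with tracer.start_as_current_span(type(event).__name__) as span:
        for key, value in asdict(event).items():
            span.set_attribute(key, value)

record_event(OrderPlaced(order_id="o-42", total_cents=1999))
```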
|
| ▲ | 0xbadcafebee 4 hours ago | parent | prev | next [-] |
| Is it just me or is that prompt... not ideal? There are no concrete, simple goals, no mention of testing, no loop. No description of the problem space or what success should look like. One-shotting might work for this with frontier models, but they often need more to succeed. Saying "any SRE should be able to do this" is already problematic because, regardless of title, there are smarter people and dumber people. You're taking a gamble giving a human SRE this prompt. Whether it's AI or human, give it more context and instruction, or failure is likely. (And more importantly: use a loop so it can fix itself!) (Also: SRE is too generic... there are a dozen kinds of SRE) |
|
| ▲ | elAhmo 4 hours ago | parent | prev | next [-] |
| Key is "for now". |
|
| ▲ | NitpickLawyer 6 hours ago | parent | prev | next [-] |
| I'm always interested in new benchmarks, so this is cool. I only had a brief look at [1] and [2]; a few quick things I noticed:
For [1]: instruction.md is very brief, quite vague and "assumes" a lot of things.
- Your task is: Add OTEL tracing to all microservices. Add OTEL logging to all microservices. (this is good)
- 6. I want to know if the microservice has OTEL instrumentation and where the data is being sent. (??? I have no idea what this means)
- 9. Use the recent version of the OTEL SDK. (yeah, this won't work unless you also use an MCP like context7 or provide local docs)
What's weird here is that instruction.md has zero content regarding conventions, specifically how to name things. Yet in tests_outputs you have this "expected_patterns = ["order", "stock", "gateway"]" and you assert on it. I guess that makes some sense, but being specific in the task.md is a must. Otherwise you're benching assumptions, and those don't even work with meatbags :)
For [2]: instruction.md is more detailed, but has some weird issues:
- "You should only be very minimal and instrument only the critical calls like request handlers without adding spans for business calls \n The goal is to get business kind of transaction" (??? this is confusing, even skipping over the weird grammar)
- "Draw ascii trace diagram into /workdir/traces.txt" (????)
- "When modifying Python files, use Python itself to write files or use sed for targeted changes" (why are you giving it harness-specific instructions in your instruction.md? this is so dependent on the agentic loop used that it makes no sense here)
- "Success Criteria: Demonstrate proper distributed tracing \n Include essential operations without over-instrumenting (keep it focused) \n Link operations correctly \n Analyze the code to determine which operations are essential to trace and how they relate to each other." (I mean... yes and no. These are not success criteria IMO. It's like saying "do good on the task, not bad". This could definitely be improved.)
----
Also, I noticed that every folder has a summary_claude... that looks like a Claude-written summary of a run. I hope that's not what's used in actually computing the benchmark scores; if it is, you're adding another layer of uncertainty in checking the results...
The idea is nice, but tbf some of the tests seem contrived, your instructions are not that clear, you expect static naming values while not providing any instructions about naming conventions, and so on. It feels like a lot of this was "rushed"? I peeked a bit at the commit history and saw some mentions of vibe-coding a viewer for this. I hope that's the only thing that was vibe-coded :)
[1] - https://github.com/QuesmaOrg/otel-bench/tree/main/datasets/o...
[2] - https://github.com/QuesmaOrg/otel-bench/blob/main/datasets/o... |
|
| ▲ | lenerdenator 3 hours ago | parent | prev | next [-] |
| This just reinforces the notion that if you don't have someone who at least roughly knows what they're doing giving a very detailed prompt and checking the output, you're wasting tokens. Plan mode is your friend. |
|
| ▲ | benatkin 4 hours ago | parent | prev | next [-] |
| > AI SRE in 2026 is what DevOps Anomaly Detection was in 2015 — bold claims backed by huge marketing budgets, but lacking independent verification. There are stories of SaaS vendors abruptly killing the observability stack. Our results mirror ClickHouse’s findings: while LLMs can assist, they lack the capabilities of a skilled SRE. The key is that LLMs can assist. It would be nice if they had gone further into this and seen how much more quickly a human who wrote a complex prompt, or went back and forth with a coding agent, could do the tasks compared to an unassisted human. I'm confident that it's at a level that already has profound implications for SRE. And the current level of getting it right with a simple prompt is still impressive. |
|
| ▲ | heliumtera 6 hours ago | parent | prev | next [-] |
| Standard SRE tasks are bad benchmarks. First of all, familiarity with OpenTelemetry APIs is not knowledge; they are arbitrary constructs. We are implying that conforming to a standard is the only way, the right way. I would challenge that. Assuming models were good at these tasks, we could only conclude that the tasks were trivial AND sufficiently documented.
Assuming they were good at this type of task (they can be trained to be good at it cheaply; we know that based on similarly acquired capabilities), making a benchmark out of it would be less useful. But I am sure nobody really cares and the author just had to SEO a little bit, regardless of reality |
|
| ▲ | linuxftw 6 hours ago | parent | prev | next [-] |
| The prompts for this are pretty sparse. This could 100% be accomplished with better prompting. Even with the current prompts, it's likely I could complete the task with a follow up request specifying what it did correctly and incorrectly. In fact, this could probably be entirely automated with multiple agents checking each other. |
|
| ▲ | vachina 5 hours ago | parent | prev | next [-] |
| LLM is AI now, wow. Also LLM is a very advanced autocomplete algorithm. And autocomplete isn’t designed to write for you, you have to write first. |
|
| ▲ | whalesalad 7 hours ago | parent | prev | next [-] |
| If everyone else is the problem... maybe you are the problem. To me this says more about OTel than AI. |
| |
| ▲ | apercu 7 hours ago | parent | next [-] | | Can you help me understand where you are coming from? Is it that you think the benchmark is flawed or overly harsh? Or that you interpret the tone as blaming AI for failing a task that is inherently tricky or poorly specified? My takeaway was more "maybe AI coding assistants today aren’t yet good at this specific, realistic engineering task".... | | |
| ▲ | hobofan 6 hours ago | parent | next [-] | | In my experience many OTEL libraries are awful to use, and most of the "official" ones are the worst offenders, as they are largely codegened. That typically makes them feel clunky to use, and they exhibit code patterns that are non-native to the language used, which would be an explanation of why AI systems struggle with the benchmark. I think you would see similar results if tasking an AI to e.g. write gRPC/Protobuf systems using only the builtin/official protobuf codegen for each language. Where I think the benchmark is quite fair is in the solutions. It looks like for each of the languages (at least the ones I'm familiar with), the "better" options were chosen, e.g. using `tracing-opentelemetry` rather than `opentelemetry-sdk` directly in Rust. However, the one-shot nature of the benchmark also isn't that reflective of actual utility. In my experience, if you have the initial framework setup done in your repo plus a handful of examples, they do a great job of applying OTEL tracing to the majority of your project. | |
| ▲ | pixl97 6 hours ago | parent | prev [-] | | Where I work, we are looking at the parts of our documentation and implementations where AI has a hard time. This almost always correlates with customers having similar issues getting things working. It has led us to rewrite a lot of documentation to be more consistent and clear. In addition, we set out a series of examples from simple to complex. This shows up as fewer tickets later, and more complex implementations being set up by customers without the need for support. |
| |
| ▲ | vimda 7 hours ago | parent | prev [-] | | But not everyone else is the problem? OTel works fine for humans. Sometimes AIs are just shit | | |
| ▲ | devin 6 hours ago | parent | next [-] | | It's not a new thing to bring up that OTel is difficult to get correct. This was a criticism levied before the AI era. | |
| ▲ | heliumtera 6 hours ago | parent | prev [-] | | That is a wild claim my dude. Some of the comments here would challenge the claim that otel has worked pretty well for humans. |
|
|
|
| ▲ | rapsacnz 4 hours ago | parent | prev [-] |
| I'd argue that this is just another reason not to use microservices. |