This aligns with my experience trying to automate observability tasks - AI excels at individual coding patterns but struggles with the holistic understanding needed for distributed tracing. The 29% success rate actually seems optimistic considering how OpenTelemetry requires deep context about service boundaries and business logic, not just syntactic correctness.

jakozaur 7 hours ago

In this benchmark, the microservices are really small (~300 lines each), and sometimes there are just two of them. More realistic tasks (larger codebases, more microservices) would have a lower success rate.

	ndriscoll 7 hours ago
		I'd expect it to actually do better in a large codebase. E.g., you'd already have an HTTP middleware stack, so it'd know that it can just add a layer to that for traces (and in fact there might already be off-the-shelf layers for whatever framework), vs. having to invent that on its own for the bare microservice.
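
To make the middleware point concrete, here is a minimal sketch, assuming a Python WSGI service (the thread doesn't say what stack the benchmark services use). `tracing_layer`, `FINISHED_SPANS`, and `hello_app` are hypothetical names, and the span is a plain dict rather than a real OpenTelemetry span; an actual service would more likely pull in its framework's existing instrumentation package than hand-roll this.

```python
import secrets
import time

# Hypothetical in-memory sink standing in for a real span exporter.
FINISHED_SPANS = []


def tracing_layer(app):
    """WSGI middleware (illustrative): record one span per request."""

    def wrapped(environ, start_response):
        # W3C traceparent header: version-traceid-parentid-flags
        parts = environ.get("HTTP_TRACEPARENT", "").split("-")
        if len(parts) == 4 and len(parts[1]) == 32:
            trace_id, parent_id = parts[1], parts[2]
        else:
            # No usable incoming context: start a new trace.
            trace_id, parent_id = secrets.token_hex(16), None

        method = environ.get("REQUEST_METHOD", "GET")
        path = environ.get("PATH_INFO", "/")
        span = {
            "trace_id": trace_id,
            "span_id": secrets.token_hex(8),
            "parent_id": parent_id,
            "name": f"{method} {path}",
        }
        start = time.perf_counter()

        def traced_start_response(status, headers, exc_info=None):
            # Capture the response status on the span, then delegate.
            span["status"] = status
            return start_response(status, headers, exc_info)

        try:
            return app(environ, traced_start_response)
        finally:
            # Covers the handler call only, not streaming of the body.
            span["duration_s"] = time.perf_counter() - start
            FINISHED_SPANS.append(span)

    return wrapped


def hello_app(environ, start_response):
    """Stand-in for an existing handler that knows nothing about tracing."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# With a middleware stack already in place, tracing is one more wrapper.
app = tracing_layer(hello_app)
```

The handler itself is untouched, which is the argument for the larger codebase: the extension point already exists, so the change is one line where the stack is assembled.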