dirtytoken7 4 hours ago

The 29% score tells us more about benchmark design than model capability IMO.

These benchmarks conflate two very different problems: (1) understanding what needs to be done, and (2) correctly implementing it in a specific library ecosystem.

A human SRE who's never touched OTel would also struggle initially - not because they can't reason about traces, but because the library APIs have quirks that take time to learn.
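To make "quirks" concrete: the classic OpenTelemetry trap in Python is the API/SDK split. Rough sketch below (my own illustration, not from any benchmark; the span and attribute names are made up) of how code that forgets to configure a TracerProvider still runs fine but silently exports nothing:

  # Illustrative only: the OTel API/SDK split is the kind of quirk that trips
  # up newcomers, human or model. Calling get_tracer() without configuring a
  # TracerProvider gives you no-op spans and no error.
  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

  # Without these three lines the code below still "runs": it just exports nothing.
  provider = TracerProvider()
  provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
  trace.set_tracer_provider(provider)

  tracer = trace.get_tracer(__name__)
  with tracer.start_as_current_span("checkout") as span:
      span.set_attribute("order.id", "123")  # attribute name is just an example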

The more interesting question is whether giving the model access to relevant docs/examples during the task significantly changes the scores. If it does, that suggests the bottleneck is recall, not reasoning. If it doesn't, the reasoning gap is real.

FWIW I've found that models do much better on ops tasks when you can give them a concrete example of working instrumentation from the same codebase rather than asking them to generate it from scratch.
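As a rough sketch of what I mean (function and attribute names here are hypothetical, just to show the shape of a reference snippet you'd paste into the prompt alongside the task):

  # Hypothetical example of existing, working instrumentation handed to the
  # model as a reference; process_order and charge are made-up names.
  from opentelemetry import trace

  tracer = trace.get_tracer("billing-service")

  def process_order(order):
      # Follow the team's existing conventions: one span per handler,
      # attributes for the IDs you actually query on later.
      with tracer.start_as_current_span("process_order") as span:
          span.set_attribute("order.id", order.id)
          span.set_attribute("order.total_cents", order.total_cents)
          return charge(order)  # charge() assumed to already exist in the codebase

With a snippet like that in context, the model mostly has to pattern-match the team's conventions instead of guessing at the library from memory.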