nyrikki 3 days ago

> Larger models (e.g., Opus 4.1) often fail on semantic or algorithmic correctness in large, multi-file edits, whereas smaller models (e.g., Qwen 3 32B) more frequently fail due to issues in syntax and formatting, tool use, or context management.

While I haven’t dug into the details of this benchmark, this absolutely matches my personal experience.

Assuming “semantic correctness” here means correctness in the sense of Rice’s theorem, i.e., runtime behavior.

While syntactic correctness has dramatically improved, security, architectural erosion, and other long-term issues have not.

Unfortunately, Rice’s theorem bites even for finite programs run in finite time.

In the general case it applies to total functions as well.
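To make the Rice’s theorem point concrete, here is a minimal sketch (illustrative names, not from any real library) of the classic reduction: if you could decide any nontrivial semantic property of programs, such as “returns 42”, you could decide halting.

```python
# Sketch: deciding the semantic property "returns 42" would decide halting.
# `make_gadget`, `program`, and `x` are hypothetical names for illustration.

def make_gadget(program, x):
    """Build a function with the property 'returns 42' iff program(x) halts."""
    def gadget(_ignored):
        program(x)   # runs forever exactly when program does not halt on x
        return 42
    return gadget

# If a decider returns_42(f) existed, then for any program p and input x,
# returns_42(make_gadget(p, x)) would equal halts(p, x) -- a contradiction.

# Concretely: for a program that does halt, the gadget returns 42.
halting_prog = lambda n: n + 1
g = make_gadget(halting_prog, 7)
assert g(None) == 42
```

The gadget itself is trivially constructible; it is the hypothetical decider that cannot exist, which is why semantic properties of generated code resist static checking.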

I am still optimistic that coding agents will provide value long term in some fashion.

But the open-domain frame problem simply reduces to the halting problem, and yes, humans struggle with it too.

But fundamentally, PAC learning requires reducing problems to _trivial_ ones: binary true/false decisions.

We have found clever ways to work within these limitations, but they still exist.

Hopefully we find clever ways to keep humans engaged with the code, while gaining the potential force multiplier that ML may offer.

The long-tail problems are particularly important. Human SREs make mistakes, and organizations often impose constraints that compound the problem, but SREs do far more to avoid those long-tail failures than they are given credit for.

IMHO that has always been one of the hardest parts of the industry and a true measure of what makes great team members.

Unfortunately the metrics and incentives often don’t capture that value.