How would one set this sort of test up? I surely have example domains where LLMs routinely do poorly (for example, custom bazel rules and workspaces), but what would constitute a "showcase" here?

lijok 3 hours ago | parent [-]

To change my mind, I’d be satisfied with a thorough description of the domain and, ideally, a theory on why it does poorly in that domain. But we’re not talking about LLMs in general here; we’re talking about opus4.5 specifically.

j2kun 3 hours ago | parent | next [-]

A theory besides... not enough training data? Is it even possible to formulate a coherent theory about this? I'm talking about customizing a widely-used build system, not exactly state-of-the-art cryptography. What could I possibly say that you wouldn't counter with "skill issue" (which goes back to the author's point)?

If you claim it's demonstrably impossible for someone not to be made more productive with opus4.5, then it should probably be up to you to demonstrate that impossibility.

lijok 3 hours ago | parent [-]

How could it possibly be a skill issue? Have you tried in earnest to use opus4.5 for the problem you’re trying to solve? Not enough training data couldn’t be the problem - Bazel is not an esoteric domain. Unless you’re trying to do something esoteric.
