j2kun 3 hours ago
How would one set this sort of test up? I certainly have example domains where LLMs routinely do poorly (for example, custom Bazel rules and workspaces), but what would constitute a "showcase" here?
lijok 3 hours ago
To change my mind, I'd be satisfied with a thorough description of the domain and, ideally, a theory of why it does poorly there. But we're not talking about LLMs in general here; we're talking about Opus 4.5 specifically.