Remix clone Hacker News


	▲	PUSH_AX 16 hours ago
		They set themselves up for flak when they use whatever these evals are. They did the same for Composer 2, which was evaled as being in close competition with frontier models. Spoiler alert: it wasn’t even close in practice. So now 2.5 is supposed to compete with Opus 4.7? Sure…
	▲	tuo-lei 15 hours ago
		they say it themselves in the post: behavior dimensions "not well captured by existing benchmarks". that was the exact problem with Composer 2. it wasn't dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, and when to re-read a file vs assume. you don't catch any of that in an isolated eval.
	▲	infecto 11 hours ago
		As I have said in prior Composer threads, the proof is in the usage. I am inclined to somewhat believe the results, since I use Composer and take the results in the context given. It’s not a general-purpose SOTA model. It’s a model that runs inexpensively in their coding workflow and produces results similar to Opus or GPT.
	▲	criemen 15 hours ago
		Well, is that a statement about the quality of Opus 4.7 or about Composer 2.5? :P