"astounding how much the harness matters" is the right read, and it should be the lasting one. the model is rentable, the prompts are rentable, and the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness produces a smaller benchmark delta than swapping the harness around the same model. the cheating-agents post you linked is the same observation through a different lens: the harness is what's being measured, and the model is just the substrate.

that said, context management seems to be solving today's model problems rather than being a universal property, and it will probably be obsoleted a few model generations down the road, just as tool use obsoleted RAG context injection from question embeddings.

▲

himata4113 4 hours ago | parent | next [-]

That's why ARC-AGI-3 doesn't allow the use of a harness. The model has to create the harness instead.

	▲	grzracz 35 minutes ago | parent | next [-]
		Seems completely backwards to me. This is like judging Formula 1 just by the raw power of the engine. The rest of the car has just as much engineering, if not more.
	▲	vova_hn2 an hour ago | parent | prev [-]
		The model is not allowed to create a harness either, I think.
