What did you do around cross-harness testing? I don't see anything in the blog post about what harnesses were used in evaluation. SOTA benchmarks have consistently shown that frontier model performance is quite sensitive to what tools are exposed (e.g. str_replace vs. apply_patch) as the labs are RLing on their own harnesses. Did you do testing of the models in a standard setup or in their native harnesses?

swyx 8 hours ago | parent [-]

yes well aware :) numbers shown are on "house" harnesses eg codex with gpt and claude code with opus.

fwiw we have examples of each model doing better on NON-house harnesses too - speaking just for myself, i think the "the labs are RLing on their own harnesses" narrative is kinda overstated if you think through the labs wanting to have any meaningful api business (often, eg, the labs will give guidance on what is preferred and the agent labs can easily match their tool contract to that, which is to say, the "home turf advantage" isn't as large as you think it is if you try a little bit)

	▲	Bolwin 5 hours ago | parent | next [-]
		What is the "house" harness for minimax? They haven't released any
	▲	chris_st 7 hours ago | parent | prev [-]
		What "non-house" harnesses have you found to work best?