Reproducing experimental results across models and vendors is trivial and cheap nowadays.

Not if anthropic goes further in obfuscating the output of claude code.

	▲	vntok 7 minutes ago \| parent [-]
		Why would you test implementation details? Test what's delivered, not how it's delivered. The thinking portion, synthetized or not, is merely implementation. The resulting artefact, that's what is worth testing.