Using a single custom benchmark as a metric seems pretty unreliable to me.

Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

prodigycorp 3 hours ago | parent [-]

After taking a walk for a bit, I decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.

This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.

While I still believe in the importance of a personalized suite of benchmarks, my Python one needs to be down-weighted or supplanted.

My bad to the Google team for the cursory brush-off.

	chermi an hour ago | parent [-]
		Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.