dekhn | 5 hours ago
Using a single custom benchmark as a metric seems pretty unreliable to me. Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion. | ||||||||
prodigycorp | 3 hours ago
After taking a walk for a bit, I decided you're right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I've run, which probably means my test is a little too niche. The fact that it didn't pass one of my tests doesn't speak to the broader intelligence of the model per se. While I still believe in the importance of a personalized suite of benchmarks, my Python one needs to be downweighted or supplanted. My apologies to the Google team for the cursory brush-off.
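(For anyone curious what "downweighting" a benchmark might look like in practice: this is a minimal sketch, not the commenter's actual code. The benchmark names, weights, and scores are all made up for illustration.)

    # Hypothetical sketch of a weighted personal benchmark suite.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Benchmark:
        name: str
        run: Callable[[str], float]  # takes a model id, returns a score in [0, 1]
        weight: float                # relative importance in the overall score

    def overall_score(model: str, suite: list[Benchmark]) -> float:
        # Weighted mean of per-benchmark scores; downweighting a niche test
        # reduces its influence without discarding its signal entirely.
        total_weight = sum(b.weight for b in suite)
        return sum(b.weight * b.run(model) for b in suite) / total_weight

    # Example: the niche Python test gets a low weight after proving too narrow.
    # The lambda scores here are placeholders, not real results.
    suite = [
        Benchmark("niche_python_test", run=lambda m: 0.0, weight=0.1),
        Benchmark("general_reasoning", run=lambda m: 0.9, weight=1.0),
    ]
    print(overall_score("some-model", suite))  # dominated by the higher-weight benchmark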