XCSme 5 hours ago
Also check mine[0]: basically random private tests/questions and an ok-ish methodology, testing more for general intelligence than for coding-specific tasks. I built it for myself, to test which models to use via OpenRouter for my n8n agents.

Currently I'm actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works). Gemini models still have the best intelligence (when asked any question, they are the most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com
BoorishBears 2 hours ago
Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've shared are intentional).

It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash, Opus 4.5 - Opus 4.8, and gpt-5.5.

At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.