| ▲ | XCSme 3 hours ago | ||||||||||||||||
On some it does, yes, and in real usage too. It declined to answer 2 of the 21 tests in this specific benchmark, which already caps its maximum score at about 90%. | |||||||||||||||||
| ▲ | margalabargala 3 hours ago | parent [-] | ||||||||||||||||
I'm glad those tests apparently work out for you, but a benchmark where three of the top five models are different flavors of Gemini Flash and none are from Anthropic is so wildly divergent from my personal experience with the models that it's not useful to me. Whatever it is you're measuring, it's not anything related to what I use models for. | |||||||||||||||||