pertymcpert a day ago
I really don't understand how you don't understand how your site is completely misleading. Everyone here is telling you that including API reliability in with actual model performance is nonsense.
XCSme a day ago
I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on. I have already re-tested DeepSeek v4, so it no longer has any API error issues. API errors are quite rare: most models tested have at most one failure attributed to an API error, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/ I will try to retest every model that had API errors, so that the score is determined only by correct/wrong answers, and the reliability score will be an extra metric that simply indicates how the API performs. That being said, API reliability is still a huge factor for production use cases.
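A minimal sketch of that split, assuming each test result is labeled as correct, wrong, or an API error. The `TestResult` class, the `score_model` function, and the outcome labels are hypothetical names for illustration, not the site's actual code:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TestResult:
    # Hypothetical outcome labels: "correct", "wrong", or "api_error".
    outcome: str


def score_model(results: List[TestResult]) -> Tuple[Optional[float], Optional[float]]:
    """Return (accuracy, reliability) as two separate metrics.

    accuracy    = correct / (correct + wrong), ignoring API errors
    reliability = tests that got an answer / all tests attempted
    """
    total = len(results)
    api_errors = sum(1 for r in results if r.outcome == "api_error")
    correct = sum(1 for r in results if r.outcome == "correct")
    answered = total - api_errors  # tests where the API returned an answer

    # None signals "not enough data" rather than a misleading 0.
    accuracy = correct / answered if answered else None
    reliability = answered / total if total else None
    return accuracy, reliability
```

With 8 correct answers, 1 wrong answer, and 1 API error, this gives an accuracy of 8/9 and a reliability of 9/10, so a flaky API lowers only the second number and leaves the ranking score untouched.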