| ▲ | XCSme a day ago | |
I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on. I have already re-tested DeepSeek v4, so it no longer has any API error issues. API errors are quite rare: most tested models have at most one failure with the API Error reason, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/ I will try to retest every model that had API errors, so that the score is determined only by correct/wrong answers, and the reliability score will be an extra metric that just indicates how the API performs. That being said, API reliability is still a huge factor for production use cases. | ||
| ▲ | XCSme a day ago | parent [-] | |
Some API errors are actually not about reliability; they happen because that specific API doesn't support some common features (e.g. specific structured_output formats). | ||
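To make the split between the two metrics concrete, here is a minimal sketch of how an answer score and a separate reliability score could be computed from per-test outcomes. This is not the benchmark's actual implementation: the outcome labels, the `Scores` type, and the `score` function are all hypothetical, and the choice to exclude unsupported-feature failures from both metrics and count them separately is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical outcome labels; the real benchmark's categories may differ.
CORRECT = "correct"
WRONG = "wrong"
API_ERROR = "api_error"      # call failed, e.g. timeout or server error
UNSUPPORTED = "unsupported"  # API lacks a requested feature


@dataclass
class Scores:
    accuracy: Optional[float]     # correct / (correct + wrong)
    reliability: Optional[float]  # answered / (answered + api errors)
    unsupported: int              # reported separately (assumption)


def score(outcomes: List[str]) -> Scores:
    counts = Counter(outcomes)
    answered = counts[CORRECT] + counts[WRONG]
    attempted = answered + counts[API_ERROR]
    # Accuracy ignores API errors, so it reflects only correct/wrong answers.
    accuracy = counts[CORRECT] / answered if answered else None
    # Reliability only asks whether the call produced an answer at all.
    reliability = answered / attempted if attempted else None
    return Scores(accuracy, reliability, counts[UNSUPPORTED])


if __name__ == "__main__":
    runs = [CORRECT] * 6 + [WRONG] * 2 + [API_ERROR] * 2 + [UNSUPPORTED]
    print(score(runs))
```

Keeping the two numbers apart means a flaky API lowers reliability without dragging down the ranking score, while unsupported-feature failures stay visible without being counted as flakiness.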