XCSme a day ago

I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on.

I have already re-tested DeepSeek v4, so it doesn't have any API error issues.

API errors are quite rare: most tested models have at most one failure with the "API Error" reason, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/

I will try to retest all models that had API errors, so that the score is determined only by correct/wrong answers. The reliability score will then be an extra metric, just an indication of how the API performs.
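To make the split concrete, here is a rough sketch of how the two numbers could be kept apart. This is only an illustration, not the actual benchmark code: the `TestResult` type, the outcome labels, and the `accuracy_and_reliability` function are hypothetical names, and it assumes accuracy is computed over answered tests only while reliability is the share of tests that did not hit an API error.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TestResult:
    # Hypothetical outcome labels: "correct", "wrong", or "api_error".
    outcome: str


def accuracy_and_reliability(
    results: List[TestResult],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (accuracy, reliability) as two separate metrics.

    Accuracy only looks at tests that produced an answer, so API errors
    cannot lower it. Reliability is the fraction of tests that completed
    without an API error. Either value is None when it is undefined.
    """
    answered = [r for r in results if r.outcome != "api_error"]
    correct = sum(1 for r in answered if r.outcome == "correct")

    accuracy = correct / len(answered) if answered else None
    reliability = len(answered) / len(results) if results else None
    return accuracy, reliability


if __name__ == "__main__":
    sample = [
        TestResult("correct"),
        TestResult("correct"),
        TestResult("wrong"),
        TestResult("api_error"),
    ]
    acc, rel = accuracy_and_reliability(sample)
    print(f"accuracy={acc:.2f} reliability={rel:.2f}")
```

With this kind of split, a single API error lowers only the reliability number and leaves the correct/wrong score untouched.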

That being said, the reliability of the API is still a huge factor for production use-cases.

XCSme a day ago

Some API errors are actually not about reliability; they happen because that specific API doesn't support some common features (e.g. specific structured_output formats).