embedding-shape 2 days ago

Sounds like you're mixing two very different things and measuring them under the same category. One is the model itself, evaluated under reference conditions, where there is no such thing as an "API failure". The other is the reliability and uptime of a remote API endpoint for LLM inference.

If you want to measure their API, do so, but don't place it under the same category as testing the model itself, as they're two different metrics.

XCSme 2 days ago | parent | next [-]

But how would you test a closed model independently of its API? For example, the speed score (tokens/s) is also variable and changes over time.

pertymcpert a day ago | parent [-]

I really don't understand how you don't understand how your site is completely misleading. Everyone here is telling you that including API reliability in with actual model performance is nonsense.

XCSme a day ago | parent [-]

I agree that it's confusing. I have already implemented a separate reliability score, but it will only apply to new tests from now on.

I have already re-tested DeepSeek v4, so it doesn't have any API error issues.

API errors are quite rare; most tested models have at most one "API Error" failure, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/

I will try to retest all models that had API errors, so the score is determined only by correct/wrong answers, and the reliability score will be an extra metric indicating how the API performs.
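One way the split described above could be computed (the function name and the sample counts are illustrative, not the site's actual implementation):

```python
# Hypothetical sketch: separating answer accuracy from API reliability,
# assuming per-model tallies of correct, wrong, and API-error outcomes.

def score_model(correct: int, wrong: int, api_errors: int) -> dict:
    """Accuracy counts only graded answers; reliability is reported separately."""
    graded = correct + wrong
    attempts = graded + api_errors
    return {
        "accuracy": correct / graded if graded else 0.0,        # drives rankings
        "reliability": graded / attempts if attempts else 0.0,  # API health only
    }

print(score_model(correct=90, wrong=9, api_errors=1))
```

With this split, a flaky endpoint lowers reliability but no longer drags the accuracy ranking down.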

That being said, the reliability of the API is still a huge factor for production use-cases.

XCSme a day ago | parent [-]

Some API errors are actually not about reliability; they occur because that specific API doesn't support some common feature (e.g. certain structured_output formats).
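That distinction could be made mechanically by inspecting the error payload. A minimal sketch, assuming illustrative error strings (these markers are not any provider's actual messages):

```python
# Hypothetical sketch: classifying an API error as a capability gap
# (unsupported feature) versus a transient reliability failure.

CAPABILITY_MARKERS = ("structured_output", "response_format", "not supported")

def classify_api_error(message: str) -> str:
    msg = message.lower()
    if any(marker in msg for marker in CAPABILITY_MARKERS):
        return "capability"   # the endpoint lacks a feature; retrying won't help
    return "reliability"      # timeouts, 5xx, rate limits: counts against uptime

print(classify_api_error("response_format json_schema is not supported"))
print(classify_api_error("upstream timeout after 60s"))
```

Capability errors could then be excluded from the reliability metric entirely, since they reflect feature support rather than uptime.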
