XCSme 2 days ago
It depends... If the conclusion is "DeepSeek v4 is this good when you use it through DeepSeek's own API" (which is how most people would use it anyway), then it makes sense to count API errors as failures. But if the conclusion must be "the DeepSeek v4 model is this good when self-hosted and run under ideal conditions", then the model should be tested locally and all invalid calls skipped.

I am still debating what I should do here, because showing a model as #1, only for people to try it through the official provider and have it fail half the time, doesn't make for a good leaderboard either.

I am considering adding a "reliability" column: retry API errors until the test completes, BUT track how many retries were needed and compute a separate reliability score (rough sketch below). The catch is that reliability varies over time and across providers, so that's tougher to test.
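Roughly what I have in mind, as a sketch only (`call_api`, `ApiError`, and the grading step are placeholders, not the actual leaderboard code):

```python
import time

class ApiError(Exception):
    """Placeholder for whatever transient error the provider's API raises."""

def run_with_retries(task, call_api, max_retries=5, backoff_s=2.0):
    """Run one benchmark task, retrying on API errors.

    Returns (result, attempts) so the caller can score reliability
    separately from the quality of the completed answer.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return call_api(task), attempts
        except ApiError:
            if attempts > max_retries:
                raise  # give up; count this task as an outright failure
            time.sleep(backoff_s * attempts)  # simple linear backoff

def score_run(tasks, call_api):
    """Complete every task (with retries), then report two separate numbers."""
    results, total_attempts = [], 0
    for task in tasks:
        result, attempts = run_with_retries(task, call_api)
        results.append(result)
        total_attempts += attempts
    # Quality is computed from the completed results only (grading omitted here);
    # reliability is 1.0 when every call succeeded on the first try.
    reliability = len(tasks) / total_attempts
    return results, reliability
```

The quality score then only ever sees completed answers, while the reliability column captures how much retrying it took to get them.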
embedding-shape 2 days ago
Sounds like you're mixing two very different things and measuring them in the same category. One is the model itself, tested under reference conditions, where there is no such thing as an "API failure". The other is the reliability and uptime of a remote API endpoint for LLM inference. If you want to measure their API, do so, but don't place it in the same category as testing the model itself; they're two different metrics.