XCSme 2 days ago
It depends... If the conclusion is "DeepSeek v4 is this good when you use it through DeepSeek's own API" (which is how most people would use it anyway), then it makes sense to count API errors as failures. But if the conclusion must be "the DeepSeek v4 model is this good when self-hosted and run under ideal conditions", then the model should be tested locally and all invalid calls skipped.

I am still debating what I should do here, because showing a model as #1, only for people to try it through the official provider and have it fail half the time, doesn't make for a good leaderboard either.

I am considering adding a "reliability" column: retry API errors until the test completes, BUT track how many retries were needed and compute a separate reliability score (rough sketch below). The catch is that reliability varies over time and across providers, so that's tougher to test.
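Roughly what I have in mind, as a sketch only (`call_api`, `ApiError`, and the grading step are placeholders, not the actual leaderboard code):

```python
import time

class ApiError(Exception):
    """Placeholder for whatever transient error the provider's API raises."""

def run_with_retries(task, call_api, max_retries=5, backoff_s=2.0):
    """Run one benchmark task, retrying on API errors.

    Returns (result, attempts) so the caller can score reliability
    separately from the quality of the completed answer.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return call_api(task), attempts
        except ApiError:
            if attempts > max_retries:
                raise  # give up; count this task as an outright failure
            time.sleep(backoff_s * attempts)  # simple linear backoff

def score_run(tasks, call_api):
    """Complete every task (with retries), then report two separate numbers."""
    results, total_attempts = [], 0
    for task in tasks:
        result, attempts = run_with_retries(task, call_api)
        results.append(result)
        total_attempts += attempts
    # Quality is computed from the completed results only (grading omitted here);
    # reliability is 1.0 when every call succeeded on the first try.
    reliability = len(tasks) / total_attempts
    return results, reliability
```

The quality score then only ever sees completed answers, while the reliability column captures how much retrying it took to get them.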
embedding-shape 2 days ago
Sounds like you're mixing two very different things and measuring them in the same category. One is the model itself, tested under reference conditions, where there is no such thing as an "API failure". The other is the reliability and uptime of a remote API endpoint for LLM inference. If you want to measure their API, do so, but don't place it in the same category as testing the model itself; they're two different metrics.