XCSme 2 days ago

Something is odd with this model: their blog post shows REALLY good results, but in most third-party benchmarks people find it's not really SOTA, even below Kimi K2.6 and GLM-5/5.1.

In my tests too[0], it doesn't reach the top 10. One issue, which they also mention in their post, is that they can't really serve the model well at the moment, so V4-Pro is heavily rate-limited and gives a lot of timeout errors when I try to test it. This shouldn't be an issue in the long run, considering the model is open-source, but it makes it hard to test accurately right now.

[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...

Oras 2 days ago | parent | next [-]

I used Pro via API (DeepSeek API, not OpenRouter) with Claude Code, and the planning, visual solution, and understanding were fantastic.

I would say I wouldn't have noticed this wasn't Opus 4.6. What I asked for was to look at a recently implemented feature and how it could be improved. It consumed 3.3 million tokens and produced a much better flow.

It did hit a bug related to the API when I started the implementation, though, which I suppose is something they didn't catch when making their API compatible with CC.

dannyw 2 days ago | parent | prev | next [-]

Hmm, the Flash performs significantly better than Pro in the benchmark? That's very strange; could rate limiting cause that?

XCSme 2 days ago | parent [-]

Yes, Flash doesn't seem to have the same rate limits as Pro.

I expect that once the API issues are fixed, V4-Pro will land around the same level as GLM-5.

wolttam 2 days ago | parent [-]

Why would your test be including scores of failed responses/runs? That seems confusing.

(I am confused by the results your website is presenting)

XCSme 2 days ago | parent [-]

Because the idea of those benchmarks is to see how well a model performs in real-world scenarios, as most models are served via APIs, not self-hosted.

So, for example, if a hypothetical GPT-5.5 were super intelligent but failed 50% of the time when used via the API, then using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
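
To put purely illustrative numbers on it: with a 50% per-request failure rate and no retries, a workflow chaining five API calls completes only about 3% of the time (0.5^5 ≈ 0.03), while the same workflow on a weaker but 99%-reliable model completes roughly 95% of the time (0.99^5 ≈ 0.95).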

My plan is also to re-test models over time, which should account for infrastructure improvements and also catch model "nerfing".

seanw265 2 days ago | parent | next [-]

I take some issue with that testing methodology. It seems to me that you're conflating the model's performance with the reliability of whatever provider you're using to run the benchmark.

Many models, especially open-weight ones, are served by a variety of providers over their lifetime. Each provider has its own reliability characteristics, which can vary throughout a model's lifetime as well as day to day and hour to hour.

Not to mention that there are plenty of gateways that track provider uptime and can intelligently route to the one most likely to complete your request.

BoorishBears 2 days ago | parent [-]

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

That's not even the tip of the iceberg in how useless their benchmark is.

XCSme 2 days ago | parent | prev | next [-]

@seanw265 Yes, that's a problem. It can be solved for open-source models by running them myself, but then the TPS will depend on the hardware used.

All models are tested through OpenRouter. The providers on OpenRouter vary drastically in quality, to the point where some simply serve broken models.

That being said, I usually test models a few hours after release, at which point the only provider is the "official" one (e.g. DeepSeek for their models, Alibaba for theirs, etc.).

I don't really have a good solution for testing model reliability for closed-source models, BUT the point still holds: a model/provider that is more reliable is statistically more likely to also give better results at any given time.

A solution would be to regularly test models (e.g. every week), but I don't have the budget for that, as this is a hobby project for now.

wolttam 2 days ago | parent [-]

If you don't have the budget to test regularly, then including this kind of metric is questionable. You've essentially sampled the infrastructure's reliability at only a few points, which doesn't provide a very meaningful signal. It could mislead future readers about the performance of the overall system (either for the better or the worse).

I'd personally just try to test the model on the model's merits, not the infrastructure. The infrastructure is a constantly changing variable. Many infrastructure failures can be worked around by simply re-submitting the failed request automatically.

XCSme 2 days ago | parent [-]

> You've essentially sampled the infrastructure's reliability at only a few points, which doesn't provide a very meaningful signal

Well, sampling is still somewhat meaningful, but I agree with you. I am considering adding a separate "reliability" score that counts how many times requests failed/timed out before completing.

XCSme 2 days ago | parent | prev | next [-]

@dannyw, we reached max comment thread depth.

Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" includes 3 executions of each test. So, "bad luck" is already sort of solved in each run, by running each test 3 times.

dannyw 2 days ago | parent | prev [-]

Wouldn't you need to re-run across lots of samples (even for a single eval/bench) to avoid outsized impacts from just bad luck?

coder543 2 days ago | parent | prev | next [-]

Your “benchmark” is invalid. Penalizing the model because the hosting environment is being DDoSed by users a few hours after launch is utter nonsense.

I see that you tried to justify this lower in the thread, but no: it completely invalidates your benchmark. You are not testing the model. You are conflating one specific model host with model performance, and then claiming you are benchmarking the model. All major models are hosted by multiple different services.

In the real world, clients will just retry if there is a server error; that will not impact response quality at all, and the workflow the model is being used in will not fail. If a workflow is so poorly coded that it doesn't even have retry logic, then that workflow is doomed no matter which host you use. But again, reliability of the host is separate from the model.
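
Even a trivially simple wrapper handles this. A rough sketch in Python (the exception type and callable are placeholders, not any particular SDK):

    import random
    import time

    class TransientServerError(Exception):
        """Placeholder for whatever 429/5xx/timeout error a real client raises."""

    def call_with_retries(make_request, max_retries=3, base_delay=1.0):
        # Retry transient host failures with exponential backoff plus jitter;
        # the quality of the model's answer is unaffected by how many retries it took.
        for attempt in range(max_retries + 1):
            try:
                return make_request()
            except TransientServerError:
                if attempt == max_retries:
                    raise
                time.sleep(base_delay * 2 ** attempt + random.random())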

You can make your benchmark valid by having separate leaderboards for model quality and host reliability. I’m not saying to throw the whole thing away. But the current claim is not valid.

And you’re also making an unsourced claim that everyone else has already determined this model sucks? Nah. The first result from Artificial Analysis shows good things: https://x.com/ArtificialAnlys/status/2047547434809880611

But I am still waiting to see the results from the full suite of AA benchmarks.

BoorishBears 2 days ago | parent [-]

Their benchmark is full of nonsense like this, and I'm amazed that the account hasn't been banned for spam given that most of their interactions on the site are promoting it.

They have Gemini 2.5 Flash ahead of Opus 4.6: https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Absolutely worthless benchmark but every release has a comment linking to this nonsense.

embedding-shape 2 days ago | parent | prev [-]

> V4-Pro is heavily rate-limited and gives a lot of timeout errors when I try to test it. This shouldn't be an issue though, considering the model is open-source

Why does it matter if the model/architecture/weights are open source or not, given it's their proprietary inference hardware they're currently having issues with? Proprietary or not, the same issue would still be there on their platform.

XCSme 2 days ago | parent [-]

It depends...

If the conclusion is: "DeepSeek v4 is this good, if you use it from DeepSeek" (which is how most people would use it anyway), then it makes sense to count API errors as failures.

But if the conclusion must be "the DeepSeek V4 model is this good when self-hosted and run under ideal conditions", then the model should be tested locally, with all invalid calls skipped.

I am still debating what I should do in this case, because showing a model as #1 and then having people try it from its official provider and watch it fail half the time isn't a good leaderboard either.

I am considering adding a "reliability" column: retry API errors until the test completes, BUT track how many retries were needed and compute a separate reliability score. That introduces a different problem, though: reliability varies over time and across providers, so it's tougher to test.
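
Roughly what I have in mind, as a sketch (the helper names are made up, this isn't the actual harness code): grade the answer only on correctness, and fold the retries into a separate number:

    def run_with_retries(execute_once, max_retries=5):
        # Keep retrying until the request completes, counting retries separately;
        # correctness is judged only on the response that finally came back.
        retries = 0
        while True:
            try:
                return execute_once(), retries
            except Exception:
                retries += 1
                if retries > max_retries:
                    return None, retries  # the test genuinely failed to complete

    def reliability_score(retry_counts):
        # Completed tests per attempt made: retry_counts [0, 0, 2] -> 3 tests / 5 attempts = 0.6
        attempts = sum(1 + r for r in retry_counts)
        return len(retry_counts) / attempts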

embedding-shape 2 days ago | parent [-]

Sounds like you're trying to measure two very different things but placing them in the same category. One is the model itself, tested under reference conditions where there is no such thing as an "API failure". The other is the reliability and uptime of a remote API endpoint for LLM inference.

If you want to measure their API, do so, but don't place it under the same category as testing the model itself, as they're two different metrics.

XCSme 2 days ago | parent | next [-]

But how would you test a closed model independently of its API? For example, the speed score (tokens/s) is also variable and changes over time.

pertymcpert a day ago | parent [-]

I really don't understand how you don't understand how your site is completely misleading. Everyone here is telling you that including API reliability in with actual model performance is nonsense.

XCSme a day ago | parent [-]

I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on.

I have already re-tested DeepSeek v4, so it doesn't have any API error issues.

API errors are quite rare; most models tested have at most one API-error failure, so fixing them won't change rankings much: https://aibenchy.com/fail/api-error/

I will try to retest all models that had API errors, so the score comes only from correct/wrong answers, and the reliability score will be an extra metric indicating how the API performs.

That being said, the reliability of the API is still a huge factor for production use-cases.

XCSme a day ago | parent [-]

Some API errors are actually not about reliability; they happen because that specific API doesn't support some common features (e.g. specific structured_output formats).
