| ▲ | gertlabs 2 days ago |
| Objective, detailed benchmark results at https://gertlabs.com. Early takeaways from this release: DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast. The Pro model is slow, not much better at coding reasoning so far when it works, and honestly too unreliable and rate-limited to be of much use currently. Hopefully that improves as new providers host the model. Flash is working fine and is currently performing competitively with recent releases, but only on agentic workflows. Check back in 24 hours for full combined scoring with tool use and long context for both models. Many of the frontier Chinese AI labs have released near-frontier models that are just a little bit behind Opus 4.6 in terms of speed, tool-use ability, or long-context handling. Open weights are winning the AI race, led by China. Crazy couple of weeks of releases. Mimo V2.5 Pro by Xiaomi (not open weights) is actually the best performer of the latest string of Chinese releases in our combined, comprehensive benchmarks, despite getting less attention. Kimi K2.6 is still the most interesting open-weights release. DeepSeek is no longer the leader in the space. An interesting pattern in the latest string of Chinese releases is a much larger agentic boost: the models are not as smart out of the box, but their ability to iterate in a loop with tools makes up most of the difference. DeepSeek V4 Flash exemplifies this -- not a smart model on the first try, but it makes up for it over the course of a session. |
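(For readers unfamiliar with what "iterate in a loop with tools" means in practice, here is a minimal sketch of a generic agentic loop. The `call_model` and `run_tool` callables are hypothetical stand-ins for a real provider API and tool executor -- this is not gertlabs' harness or any particular lab's implementation.)

```python
# Minimal sketch of a generic agentic tool loop. call_model and
# run_tool are supplied by the caller and are hypothetical stand-ins
# for a real provider API and a tool executor.
from typing import Any, Callable

def agentic_loop(task: str,
                 call_model: Callable[[list[dict]], Any],
                 run_tool: Callable[[str, dict], str],
                 max_turns: int = 20) -> str:
    history: list[dict] = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(history)            # assumed to return .text and .tool_calls
        history.append({"role": "assistant", "content": reply.text})
        if not reply.tool_calls:               # no tool call -> model considers itself done
            return reply.text
        for call in reply.tool_calls:
            result = run_tool(call.name, call.arguments)
            # Feeding tool output back into context is what lets a
            # weaker model recover from a bad first attempt.
            history.append({"role": "tool", "content": result})
    return str(history[-1]["content"])
```

The upshot of the pattern: one-shot quality matters less when a model gets many observe-and-correct cycles per task, which seems to be what the agentic numbers reflect.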
|
| ▲ | Squarex 2 days ago | parent | next [-] |
| I would say all benchmarks are inherently subjective. How is yours better? It seems to produce somewhat strange results: Opus 4.6 scoring worse than 4.5, for example, or the Chinese models being rated too high. Kimi, DeepSeek, and GLM are all great in the open-source world, but I don't believe they are ahead of the SOTA models from Anthropic, OpenAI, or Google. |
| |
| ▲ | gertlabs 2 days ago | parent | next [-] | | No, some benchmarks are definitely objective, but most can be easily gamed. For example, most of the benchmarks on the model cards have measurable answers that don't rely on a human judge (a human wrote the question, but the answers measure some uncontroversial knowledge or capability). But because there is a single correct answer, and those answers leak (or are randomly discovered and optimized for in training), they lose value over time; regardless, they have a ceiling on the intelligence they can measure. Others are purely subjective, like LMArena, which at this point really only measures the personality and style preferences of the masses, because frontier LLM technical answers are too hard for the average person to judge. Then there are some interesting one-off benchmarks, but they lack the rigor, breadth, and sample sizes to draw larger conclusions from. So we designed our benchmark with three goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and sufficient variety across important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants. And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality. |
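(To make "relative performance with objective scoring" concrete, here is a minimal Elo-style sketch: models play head-to-head matches with a measurable win condition, and ratings update from outcomes alone, with no human or LLM judge in the loop. The model names and match results below are made up, and this illustrates the general technique, not gertlabs' actual scoring code.)

```python
# Minimal sketch of relative, objective scoring: pairwise matches with a
# measurable outcome (win/loss/draw) drive Elo-style rating updates.
# Illustrative only -- hypothetical models and results, not gertlabs' code.

K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict[str, float], a: str, b: str, score_a: float) -> None:
    """score_a: 1.0 if A won, 0.0 if A lost, 0.5 for a draw."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))  # zero-sum counterpart

# Hypothetical match log: (model_a, model_b, outcome_for_a)
matches = [("flash", "pro", 1.0), ("flash", "opus", 0.5), ("pro", "opus", 0.0)]
ratings = {"flash": 1500.0, "pro": 1500.0, "opus": 1500.0}
for a, b, s in matches:
    update(ratings, a, b, s)
print(ratings)  # relative standings emerge from objective match outcomes
```

Because scores are relative, there is no fixed answer key to leak into training data, and the ceiling rises with the strength of the other participants.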
| ▲ | segmondy 2 days ago | parent | prev | next [-] | | You are arguing from your beliefs instead of from objective truth. A benchmark is more objective; if you don't agree with it, come up with a better one. What you believe doesn't matter. |
| ▲ | Squarex 2 days ago | parent [-] | | It was not meant as a confrontational take. But all benchmarks are designed by humans, and we are not that great at measuring intelligence, so they are somewhat subjective. I was just quibbling with the word "objective", not with the results per se. |
| ▲ | swiftcoder a day ago | parent [-] | | If the benchmark has a correct answer, the benchmark itself is an objective measure -- but an objective measure of what? The "of what" may well be subjective. |
|
| |
| ▲ | tw1984 a day ago | parent | prev [-] | | I agree that benchmarks are inherently subjective. But the fact that you cite your own belief as your main argument is funny -- you don't even have any inherently subjective numbers to justify what you believe; you only have "I don't believe". |
| ▲ | Squarex 15 hours ago | parent [-] | | Sure, I mixed two things together. I don't think this benchmark is bad; I just didn't like that it is presented as the ultimate objective truth. The other thing I mentioned is that it delivers different results from other benchmarks, so the "belief" stems from those other benchmarks. |
|
|
|
| ▲ | dandaka 2 days ago | parent | prev | next [-] |
| Interesting that you rate Claude Opus 4.6 lower than 4.5 and 4.7, while community consensus puts it on top. |
| |
| ▲ | nostrebored a day ago | parent [-] | | I think most hardcore people I know are still sticking with 4.5 for coding workflows |
|
|
| ▲ | kamranjon 2 days ago | parent | prev | next [-] |
| I'm particularly interested in it being REALLY fast -- do you have any rough tok/s numbers for the Flash model? I'm excited for unsloth to drop some quants that I can try running locally, but I'm really curious how it's been performing speed-wise. In general I actually over-index on speed over intelligence: I'd rather a model make mistakes quickly and correct them in a follow-up than take forever to produce a slightly better initial result. |
| |
| ▲ | gertlabs 2 days ago | parent [-] | | Take a look at the Time column in https://gertlabs.com/?mode=oneshot_coding -- this is the total end-to-end time to complete a solution for a reasonably complex problem (you would have to divide the average submission size by that time to estimate tok/s). It's fast in the sense that most of the smart, recent Chinese releases are quite slow, especially the DeepSeek Pro variant. Opus 4.7 is also quite fast. If pure speed is most important for your use case, GPT-5.3 Chat is the fastest model we've tested, and it's still reasonably smart. It's not meant for agentic tool usage / long context, though, so it might be more useful for business applications or non-engineering usage where you don't need exceptional intelligence but do want fast, cheap responses. |
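(As a quick worked example of that estimate: divide the average submission size in tokens by the end-to-end time. Both numbers below are made-up placeholders, not figures from the benchmark.)

```python
# Rough tok/s estimate from the benchmark's Time column.
# Hypothetical numbers for illustration only.
avg_submission_tokens = 4_000   # assumed average size of a completed solution
total_time_seconds = 80.0       # assumed value from the Time column

tokens_per_second = avg_submission_tokens / total_time_seconds
print(f"~{tokens_per_second:.0f} tok/s")  # -> ~50 tok/s
```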
|
|
| ▲ | Lord_Zero 2 days ago | parent | prev [-] |
| Why no mention of GPT-5.5? |
| |
| ▲ | gertlabs 2 days ago | parent [-] | | Waiting on public API release. Once it drops, results will be up within 24 hours. | | |
| ▲ | gertlabs a day ago | parent [-] | | Results are up. GPT 5.5 is a beast. | | |
| ▲ | wahnfrieden a day ago | parent [-] | | Have you considered running models like GPT 5.5 inside their agent harness (Codex)? | | |
| ▲ | gertlabs a day ago | parent [-] | | I see the value in that, but there are a few reasons it isn't on the immediate roadmap -- mainly, it shifts the focus from measuring the model to measuring the harness. The agentic benchmark section you see on the site is comparable to how an agent would perform using an open harness like Pi. And the latest tool-using models are pretty well adapted to any harness, so I think the harness is less of a factor in overall model performance. |
|
|
|
|