ozgune 2 days ago

I reviewed how DeepSeek V4-Pro, Kimi K2.6, Opus 4.6, and Opus 4.7 perform across the same AI benchmarks. All results are for the Max editions, except for Kimi.

Summary: Opus 4.6 forms the baseline all three are trying to beat. DeepSeek V4-Pro roughly matches it across the board, Kimi K2.6 edges it on agentic/coding benchmarks, and Opus 4.7 surpasses it on nearly everything except web search.

DeepSeek V4-Pro Max shines on competitive coding benchmarks, but it trails both Opus models on software engineering. Kimi K2.6 is remarkably competitive for an open-weight model; its main weakness is pure reasoning (GPQA, HMMT), where it trails Opus.

Speculation: The DeepSeek team wanted to come out with a model that surpassed the proprietary ones. However, OpenAI dropped GPT-5.4 and 5.5, and Anthropic released Opus 4.6 and 4.7. So they chose to just release V4 and iterate on it.

Basis for speculation? (i) The original reported timeline for the model was February. (ii) Their Hugging Face model card starts with "We present a preview version of DeepSeek-V4 series". (iii) V4 isn't multimodal yet (unlike the others) and their technical report states "We are also working on incorporating multimodal capabilities to our models."

solenoid0937 2 days ago | parent | next [-]

I feel like people suck at prompting Opus. Baseline, it's pretty much on par with GPT 5.5.

But if you prompt it well - give it the reasoning behind why you're asking it to do something - it pulls far ahead.
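
A contrived example of what I mean (the function name is made up): instead of

  Refactor parse_log() to be faster.

something like

  Refactor parse_log() to be faster. It runs once per line over multi-GB log
  files, so allocations in the hot loop dominate. Keep the public signature
  stable; other tools import it.

gives it enough context to make the right tradeoffs instead of guessing.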

hodgehog11 2 days ago | parent | next [-]

That's fine for procedural tasks, and I understand its value there. But these particular tasks I'm referring to occur on the front lines of research. You can't expect the prompts to be incredibly detailed, since those details are the whole challenge of the problem. I think there is value in having models that are capable of making really good preliminary insights to help guide the research.

adastra22 9 hours ago | parent [-]

really depends on your area of research

cultofmetatron a day ago | parent | prev [-]

I really wanted to get excited about Opus, but in my own real-world usage I wasn't getting much out of it before hitting my limits. Meanwhile, I can abuse Codex on 5.5 for hours and get a whole lot of work done. Plus, open code and PI are much more fun and interesting harnesses to work from than Claude Code, imho.

I will say, however, that Claude's work and design are really great up until I blow its limit.

arcanemachiner a day ago | parent | prev | next [-]

Would love to know how GLM 5.1 stacks up in this ranking. Seems like it's on par with Kimi K2.6.

bbertelsen 2 days ago | parent | prev [-]

I'd be interested to know when that Opus 4.6 baseline is from, given their recent acknowledgment of performance issues. Do you have a paper or post on this review?

ozgune a day ago | parent [-]

Ack. I took the benchmark results that the AI labs themselves published for their models. So the Opus 4.6 baseline would be from the time Anthropic released the model.