Stitch4223 6 hours ago

It’s four poorly constructed arbitrary experiments which say very little about the competency of either model.

The article reads like thin, auto-generated AI clickbait, written for nerd sniping or shilling a model.

Consider the lead:

> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.

“Where it matters”, “cleanly”, “is still strong”: vague phrases, when it could simply have said that DeepSeek yielded more concise results in 3 out of 4 tests.

1 star.

monooso 2 minutes ago

I think you've misunderstood the purpose of a lead (sic).

Per Merriam-Webster [^1], a lede is:

> the introductory section of a news story that is intended to entice the reader to read the full story

(Emphasis mine)

You may prefer more matter-of-fact phrasing, of course, but criticising a lede for attempting to achieve its goal is unjustified.

[^1]: https://www.merriam-webster.com/dictionary/lede

jampekka an hour ago

(Three out of) four experiments is anecdotal for sure, but the result meshes with more established instruction-following benchmarks (although DeepSeek V4 Pro does not top these): https://artificialanalysis.ai/evaluations/ifbench

I found the writing clear and quite even-handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM-generated are quite low-effort.

zozbot234 an hour ago

It's picking strange tasks that don't really play to GPT-Pro's strengths (that model is roughly comparable to Mythos, intended for very hard reasoning and research-level problems), and then completely ignoring quite a few cases where GPT-Pro actually got things more correct than DeepSeek did. The auto-AI ranking is just not reliable for this stuff.