Alifatisk 6 hours ago

Have you all noticed that the latest releases from Chinese companies (Qwen3 Max Thinking, and now Kimi K2.5) are benchmarking against Claude Opus now and not Sonnet? Are they truly catching up, at almost the same pace?

conception 2 hours ago

https://clocks.brianmoore.com

K2 is one of the only models to nail the clock face test as well. It’s a great model.

DJBunnies an hour ago

Cool comparison, but none of them get both the face and the time correct when I look at it.

esafak 35 minutes ago

They are, in benchmarks. In practice Anthropic's models are ahead of where their benchmarks suggest.

WarmWash an hour ago

They distill the major Western models, so any time a new SOTA model drops, you can expect the Chinese labs to update their models within a few months.

zozbot234 an hour ago

This is just a conspiracy theory/urban legend. How do you "distill" a proprietary model with no access to the original weights? Just doing the equivalent of training on chat/API logs has terrible effectiveness (you're trying to drink from a giant firehose with a tiny straw) and gives you no underlying improvements.

zozbot234 5 hours ago

The benching is sus; it's way more important to look at real usage scenarios.